* [PATCH 0/2] KVM: arm64: Destroy the stage-2 page-table periodically
@ 2025-07-24 23:51 Raghavendra Rao Ananta
  2025-07-24 23:51 ` [PATCH 2/2] " Raghavendra Rao Ananta
       [not found] ` <20250724235144.2428795-2-rananta@google.com>

From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC (permalink / raw)
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

Hello,

When destroying a fully-mapped 128G VM abruptly, the following scheduler
warning is observed:

 sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
 CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
 Tainted: [O]=OOT_MODULE
 Call trace:
  show_stack+0x20/0x38 (C)
  dump_stack_lvl+0x3c/0xb8
  dump_stack+0x18/0x30
  resched_latency_warn+0x7c/0x88
  sched_tick+0x1c4/0x268
  update_process_times+0xa8/0xd8
  tick_nohz_handler+0xc8/0x168
  __hrtimer_run_queues+0x11c/0x338
  hrtimer_interrupt+0x104/0x308
  arch_timer_handler_phys+0x40/0x58
  handle_percpu_devid_irq+0x8c/0x1b0
  generic_handle_domain_irq+0x48/0x78
  gic_handle_irq+0x1b8/0x408
  call_on_irq_stack+0x24/0x30
  do_interrupt_handler+0x54/0x78
  el1_interrupt+0x44/0x88
  el1h_64_irq_handler+0x18/0x28
  el1h_64_irq+0x84/0x88
  stage2_free_walker+0x30/0xa0 (P)
  __kvm_pgtable_walk+0x11c/0x258
  __kvm_pgtable_walk+0x180/0x258
  __kvm_pgtable_walk+0x180/0x258
  __kvm_pgtable_walk+0x180/0x258
  kvm_pgtable_walk+0xc4/0x140
  kvm_pgtable_stage2_destroy+0x5c/0xf0
  kvm_free_stage2_pgd+0x6c/0xe8
  kvm_uninit_stage2_mmu+0x24/0x48
  kvm_arch_flush_shadow_all+0x80/0xa0
  kvm_mmu_notifier_release+0x38/0x78
  __mmu_notifier_release+0x15c/0x250
  exit_mmap+0x68/0x400
  __mmput+0x38/0x1c8
  mmput+0x30/0x68
  exit_mm+0xd4/0x198
  do_exit+0x1a4/0xb00
  do_group_exit+0x8c/0x120
  get_signal+0x6d4/0x778
  do_signal+0x90/0x718
  do_notify_resume+0x70/0x170
  el0_svc+0x74/0xd8
  el0t_64_sync_handler+0x60/0xc8
  el0t_64_sync+0x1b0/0x1b8

The host kernel was running with CONFIG_PREEMPT_NONE=y, and since the
page-table walk takes a considerable amount of time for a VM with such a
large number of PTEs mapped, the warning is seen. To mitigate this, split
the walk into smaller ranges and call cond_resched() between each range.
Since the path is executed during VM destruction, after the page-table
structure is unlinked from the KVM MMU, relying on
cond_resched_rwlock_write() isn't necessary.

Patch-1 splits the kvm_pgtable_stage2_destroy() function into separate
'walk' and 'free PGD' parts.

Patch-2 leverages the split, performs the walk over smaller ranges, and
calls cond_resched() between them.

Thank you.
Raghavendra

Raghavendra Rao Ananta (2):
  KVM: arm64: Split kvm_pgtable_stage2_destroy()
  KVM: arm64: Destroy the stage-2 page-table periodically

 arch/arm64/include/asm/kvm_pgtable.h | 19 ++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 23 ++++++++++++--
 arch/arm64/kvm/mmu.c                 | 46 +++++++++++++++++++++++++---
 3 files changed, 80 insertions(+), 8 deletions(-)

base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
--
2.50.1.470.g6ba607880d-goog
* [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC (permalink / raw)
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

When a large VM, specifically one that holds a significant number of PTEs,
gets abruptly destroyed, the following warning is seen during the
page-table walk:

 sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
 CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
 Tainted: [O]=OOT_MODULE
 Call trace:
  show_stack+0x20/0x38 (C)
  dump_stack_lvl+0x3c/0xb8
  dump_stack+0x18/0x30
  resched_latency_warn+0x7c/0x88
  sched_tick+0x1c4/0x268
  update_process_times+0xa8/0xd8
  tick_nohz_handler+0xc8/0x168
  __hrtimer_run_queues+0x11c/0x338
  hrtimer_interrupt+0x104/0x308
  arch_timer_handler_phys+0x40/0x58
  handle_percpu_devid_irq+0x8c/0x1b0
  generic_handle_domain_irq+0x48/0x78
  gic_handle_irq+0x1b8/0x408
  call_on_irq_stack+0x24/0x30
  do_interrupt_handler+0x54/0x78
  el1_interrupt+0x44/0x88
  el1h_64_irq_handler+0x18/0x28
  el1h_64_irq+0x84/0x88
  stage2_free_walker+0x30/0xa0 (P)
  __kvm_pgtable_walk+0x11c/0x258
  __kvm_pgtable_walk+0x180/0x258
  __kvm_pgtable_walk+0x180/0x258
  __kvm_pgtable_walk+0x180/0x258
  kvm_pgtable_walk+0xc4/0x140
  kvm_pgtable_stage2_destroy+0x5c/0xf0
  kvm_free_stage2_pgd+0x6c/0xe8
  kvm_uninit_stage2_mmu+0x24/0x48
  kvm_arch_flush_shadow_all+0x80/0xa0
  kvm_mmu_notifier_release+0x38/0x78
  __mmu_notifier_release+0x15c/0x250
  exit_mmap+0x68/0x400
  __mmput+0x38/0x1c8
  mmput+0x30/0x68
  exit_mm+0xd4/0x198
  do_exit+0x1a4/0xb00
  do_group_exit+0x8c/0x120
  get_signal+0x6d4/0x778
  do_signal+0x90/0x718
  do_notify_resume+0x70/0x170
  el0_svc+0x74/0xd8
  el0t_64_sync_handler+0x60/0xc8
  el0t_64_sync+0x1b0/0x1b8

The warning is seen mostly on host kernels that are configured not to
force-preempt, such as with CONFIG_PREEMPT_NONE=y. To avoid this, instead
of walking the entire page-table in one go, split the walk into smaller
ranges and call cond_resched() between each range. Since the path is
executed during VM destruction, after the page-table structure is
unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't
necessary.

Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
---
 arch/arm64/kvm/mmu.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2942ec92c5a4..6c4b9fb1211b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -387,6 +387,40 @@ static void stage2_flush_vm(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+/*
+ * Assume that @pgt is valid and unlinked from the KVM MMU to free the
+ * page-table without taking the kvm_mmu_lock and without performing any
+ * TLB invalidations.
+ *
+ * Also, the range of addresses can be large enough to cause need_resched
+ * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
+ * cond_resched() periodically to prevent hogging the CPU for a long time
+ * and schedule something else, if required.
+ */
+static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
+				 phys_addr_t end)
+{
+	u64 next;
+
+	do {
+		next = stage2_range_addr_end(addr, end);
+		kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
+
+		if (next != end)
+			cond_resched();
+	} while (addr = next, addr != end);
+}
+
+static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
+{
+	if (!is_protected_kvm_enabled()) {
+		stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
+		kvm_pgtable_stage2_destroy_pgd(pgt);
+	} else {
+		pkvm_pgtable_stage2_destroy(pgt);
+	}
+}
+
 /**
  * free_hyp_pgds - free Hyp-mode page tables
  */
@@ -984,7 +1018,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	return 0;
 
 out_destroy_pgtable:
-	KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+	kvm_destroy_stage2_pgt(pgt);
 out_free_pgtable:
 	kfree(pgt);
 	return err;
@@ -1081,7 +1115,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 	write_unlock(&kvm->mmu_lock);
 
 	if (pgt) {
-		KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+		kvm_destroy_stage2_pgt(pgt);
 		kfree(pgt);
 	}
 }
--
2.50.1.470.g6ba607880d-goog
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: Sean Christopherson @ 2025-07-25 14:59 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Oliver Upton, Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

Heh, without full context, the shortlog reads like "destroy stage-2 page tables
from time to time". Something like this would be more appropriate:

  KVM: arm64: Reschedule as needed when destroying stage-2 page-tables
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: ChaosEsque Team @ 2025-07-25 15:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: Raghavendra Rao Ananta, Oliver Upton, Marc Zyngier, Mingwei Zhang,
    linux-arm-kernel, kvmarm, linux-kernel, kvm

Linux License of PaX/OpenSourceSecurity should be canceled

PaX/OpenSourceSecurity(R)(C) has prevented redistribution of its derivative
work of the Linux(R) kernel successfully. There have been no leaks of this
derivative work.

They have also added a term to their derivative work of the Linux(R) kernel
stating that any redistribution of the derivative work will incur a penalty:
no more derivative work and withholding of paid funds. To avoid this penalty
no one has leaked the proprietary work of PaX/OpenSourceSecurity, which is a
derivative work of the Linux Kernel.

OpenSource licenses are now considered a joke, and opensource programmers
who work for free as clowns. You can cancel the license
PaX/OpenSourceSecurity has to the Linux Kernel code. Any of you who have
contributed to the Linux Kernel own your code: you can cancel
PaX/OpenSourceSecurity's license to your code.

1) PaX/OpenSourceSecurity never paid you anything (nor did you ask for
   anything). They cannot rely on contract law to enforce their license.
2) PaX/OpenSourceSecurity is required to follow copyright law. They cannot
   claim following the terms of the license is "performance" under contract
   law: they have a duty to not violate Federal Copyright.
3) PaX/OpenSourceSecurity is violating the copyright license (which they got
   for free) of your code. Your permission states that additional
   restrictive terms are not permitted to be attached to the corpus of the
   copyrighted Work.

PaX/OpenSourceSecurity have added additional restrictive terms: they have
abrogated the "no additional restrictions" term by adding a new term: no
redistribution of the derivative work. That new term has been successfully
enforced by self-help by PaX/OpenSourceSecurity.

You will laugh and claim that "THE GPL ALLOWS PEOPLE TO FORBID
REDISTRIBUTION LOL!" and "THE GPL IS JUST TALKING ABOUT NOT CHANGING THE
TEXT OF THE GPL ITSELF LOL!" and "ITS A PATCH, SO THEY CAN PUT IT UNDER
WHATEVER LICENSE THEY WANT SINCE THEY DONT INCLUDE LINUX WITH IT". And there
is no way for anyone to convince you otherwise. But you are wrong.

Dear Bruce Perens: Please. Please sue Open Source Security. Before it's too
late.

Dear RMS: Open Source Security has defeated you.

On Fri, Jul 25, 2025 at 10:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Heh, without full context, the shortlog reads like "destroy stage-2 page tables
> from time to time". Something like this would be more appropriate:
>
>   KVM: arm64: Reschedule as needed when destroying stage-2 page-tables
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: Raghavendra Rao Ananta @ 2025-07-25 16:22 UTC (permalink / raw)
To: Sean Christopherson
Cc: Oliver Upton, Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

On Fri, Jul 25, 2025 at 7:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Heh, without full context, the shortlog reads like "destroy stage-2 page tables
> from time to time". Something like this would be more appropriate:
>
>   KVM: arm64: Reschedule as needed when destroying stage-2 page-tables

This definitely sounds better :)
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: Oliver Upton @ 2025-07-29 16:01 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

On Thu, Jul 24, 2025 at 11:51:44PM +0000, Raghavendra Rao Ananta wrote:
> +/*
> + * Assume that @pgt is valid and unlinked from the KVM MMU to free the
> + * page-table without taking the kvm_mmu_lock and without performing any
> + * TLB invalidations.
> + *
> + * Also, the range of addresses can be large enough to cause need_resched
> + * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
> + * cond_resched() periodically to prevent hogging the CPU for a long time
> + * and schedule something else, if required.
> + */
> +static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
> +				 phys_addr_t end)
> +{
> +	u64 next;
> +
> +	do {
> +		next = stage2_range_addr_end(addr, end);
> +		kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
> +
> +		if (next != end)
> +			cond_resched();
> +	} while (addr = next, addr != end);
> +}
> +
> +static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
> +{
> +	if (!is_protected_kvm_enabled()) {
> +		stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
> +		kvm_pgtable_stage2_destroy_pgd(pgt);
> +	} else {
> +		pkvm_pgtable_stage2_destroy(pgt);
> +	}
> +}
> +

Protected mode is affected by the same problem, potentially even worse
due to the overheads of calling into EL2. Both protected and
non-protected flows should use stage2_destroy_range().

Thanks,
Oliver
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: Raghavendra Rao Ananta @ 2025-08-07 18:58 UTC (permalink / raw)
To: Oliver Upton
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

Hi Oliver,

>
> Protected mode is affected by the same problem, potentially even worse
> due to the overheads of calling into EL2. Both protected and
> non-protected flows should use stage2_destroy_range().
>
I experimented with this (see diff below), and it looks like it takes
significantly longer to finish the destruction even for a very small
VM. For instance, it takes ~140 seconds on an Ampere Altra machine.
This is probably because we run cond_resched() for every breakup in
the entire sweep of the possible address range, 0 to ~(0ULL), even
though there are no actual mappings there, and we context switch out
more often.

--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
+static void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+{
+	u64 end = is_protected_kvm_enabled() ? ~(0ULL) : BIT(pgt->ia_bits);
+	u64 next, addr = 0;
+
+	do {
+		next = stage2_range_addr_end(addr, end);
+		KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr,
+							     next - addr);
+
+		if (next != end)
+			cond_resched();
+	} while (addr = next, addr != end);
+
+	KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}

--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -316,9 +316,13 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 	return 0;
 }
 
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	__pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
+}
+
+void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
+{
+}

Without cond_resched() in place, it takes about half the time.

I also tried moving cond_resched() to __pkvm_pgtable_stage2_unmap(), as
per the below diff, and calling pkvm_pgtable_stage2_destroy_range() for
the entire 0 to ~(0ULL) range (instead of breaking it up). Even for a
fully 4K mapped 128G VM, I see it taking ~65 seconds, which is close to
the baseline (no cond_resched()).

--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -311,8 +311,11 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 			return ret;
 		pkvm_mapping_remove(mapping, &pgt->pkvm_mappings);
 		kfree(mapping);
+		cond_resched();
 	}

Does it make sense to call cond_resched() only when we actually unmap
pages?

Thank you.
Raghavendra
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically

From: Oliver Upton @ 2025-08-08 18:56 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

On Thu, Aug 07, 2025 at 11:58:01AM -0700, Raghavendra Rao Ananta wrote:
> Hi Oliver,
>
> >
> > Protected mode is affected by the same problem, potentially even worse
> > due to the overheads of calling into EL2. Both protected and
> > non-protected flows should use stage2_destroy_range().
> >
> I experimented with this (see diff below), and it looks like it takes
> significantly longer to finish the destruction even for a very small
> VM. For instance, it takes ~140 seconds on an Ampere Altra machine.
> This is probably because we run cond_resched() for every breakup in
> the entire sweep of the possible address range, 0 to ~(0ULL), even
> though there are no actual mappings there, and we context switch out
> more often.

This seems more like an issue with the upper bound on a pKVM walk rather
than a problem with the suggestion. The information in pgt->ia_bits is
actually derived from the VTCR value of the owning MMU. Even though we
never use the VTCR value in hardware, pKVM MMUs have a valid VTCR value
that encodes the size of the IPA space and we use that in the common
stage-2 abort path.

I'm attaching some fixups that I have on top of your series that'd allow
the resched logic to remain common, like it is in other MMU flows.

From 421468dcaa4692208c3f708682b058cfc072a984 Mon Sep 17 00:00:00 2001
From: Oliver Upton <oliver.upton@linux.dev>
Date: Fri, 8 Aug 2025 11:43:12 -0700
Subject: [PATCH 4/4] fixup! KVM: arm64: Destroy the stage-2 page-table
 periodically

---
 arch/arm64/kvm/mmu.c | 60 ++++++++++++++++++--------------------------
 1 file changed, 25 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b82412323054..fc93cc256bd8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -383,40 +383,6 @@ static void stage2_flush_vm(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-/*
- * Assume that @pgt is valid and unlinked from the KVM MMU to free the
- * page-table without taking the kvm_mmu_lock and without performing any
- * TLB invalidations.
- *
- * Also, the range of addresses can be large enough to cause need_resched
- * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
- * cond_resched() periodically to prevent hogging the CPU for a long time
- * and schedule something else, if required.
- */
-static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
-				 phys_addr_t end)
-{
-	u64 next;
-
-	do {
-		next = stage2_range_addr_end(addr, end);
-		kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
-
-		if (next != end)
-			cond_resched();
-	} while (addr = next, addr != end);
-}
-
-static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
-{
-	if (!is_protected_kvm_enabled()) {
-		stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
-		kvm_pgtable_stage2_destroy_pgd(pgt);
-	} else {
-		pkvm_pgtable_stage2_destroy(pgt);
-	}
-}
-
 /**
  * free_hyp_pgds - free Hyp-mode page tables
  */
@@ -938,11 +904,35 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
 	return 0;
 }
 
+/*
+ * Assume that @pgt is valid and unlinked from the KVM MMU to free the
+ * page-table without taking the kvm_mmu_lock and without performing any
+ * TLB invalidations.
+ *
+ * Also, the range of addresses can be large enough to cause need_resched
+ * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
+ * cond_resched() periodically to prevent hogging the CPU for a long time
+ * and schedule something else, if required.
+ */
+static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
+				 phys_addr_t end)
+{
+	u64 next;
+
+	do {
+		next = stage2_range_addr_end(addr, end);
+		KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr, next - addr);
+
+		if (next != end)
+			cond_resched();
+	} while (addr = next, addr != end);
+}
+
 static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
 {
 	unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
 
-	KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits));
+	stage2_destroy_range(pgt, 0, BIT(ia_bits));
 	KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
 }
-- 
2.39.5
[parent not found: <20250724235144.2428795-2-rananta@google.com>]
* Re: [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy()

From: Oliver Upton @ 2025-07-29 15:57 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

On Thu, Jul 24, 2025 at 11:51:43PM +0000, Raghavendra Rao Ananta wrote:
> Split kvm_pgtable_stage2_destroy() into two:
> - kvm_pgtable_stage2_destroy_range(), that performs the
>   page-table walk and free the entries over a range of addresses.
> - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
>
> This refactoring enables subsequent patches to free large page-tables
> in chunks, calling cond_resched() between each chunk, to yield the CPU
> as necessary.
>
> Direct callers of kvm_pgtable_stage2_destroy() will continue to walk
> the entire range of the VM as before, ensuring no functional changes.
>
> Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1
> mapping of the page-table functions.

Uhh... We can't stub these functions out for protected mode, we already
have a load-bearing implementation of pkvm_pgtable_stage2_destroy().
Just reuse what's already there and provide a NOP for
pkvm_pgtable_stage2_destroy_pgd().

> +void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
> +{
> +	/*
> +	 * We aren't doing a pgtable walk here, but the walker struct is needed
> +	 * for kvm_dereference_pteref(), which only looks at the ->flags.
> +	 */
> +	struct kvm_pgtable_walker walker = {0};

This feels subtle and prone to error. I'd rather we have something that
boils down to rcu_dereference_raw() (with the appropriate n/hVHE
awareness) and add a comment why it is safe.

> +void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> +{
> +	kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
> +	kvm_pgtable_stage2_destroy_pgd(pgt);
> +}
> +

Move this to mmu.c as a static function and use KVM_PGT_FN()

Thanks,
Oliver
* Re: [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy()

From: Oliver Upton @ 2025-08-08 18:57 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
    linux-kernel, kvm

On Thu, Jul 24, 2025 at 11:51:43PM +0000, Raghavendra Rao Ananta wrote:
> Split kvm_pgtable_stage2_destroy() into two:
> - kvm_pgtable_stage2_destroy_range(), that performs the
>   page-table walk and free the entries over a range of addresses.
> - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
>
> This refactoring enables subsequent patches to free large page-tables
> in chunks, calling cond_resched() between each chunk, to yield the CPU
> as necessary.
>
> Direct callers of kvm_pgtable_stage2_destroy() will continue to walk
> the entire range of the VM as before, ensuring no functional changes.
>
> Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1
> mapping of the page-table functions.
>
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>

Here's the other half of my fixups

From 7d3e948357d0d2568afc136906e1b973ed39deeb Mon Sep 17 00:00:00 2001
From: Oliver Upton <oliver.upton@linux.dev>
Date: Fri, 8 Aug 2025 11:35:43 -0700
Subject: [PATCH 2/4] fixup! KVM: arm64: Split kvm_pgtable_stage2_destroy()

---
 arch/arm64/include/asm/kvm_pgtable.h  |  4 ++--
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  2 +-
 arch/arm64/kvm/hyp/pgtable.c          |  2 +-
 arch/arm64/kvm/mmu.c                  | 12 ++++++++++--
 arch/arm64/kvm/pkvm.c                 | 12 ++++--------
 5 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 20aea58eca18..fdae4685b9ac 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -562,13 +562,13 @@ void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
 void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt);
 
 /**
- * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
+ * __kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
  * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
  *
  * The page-table is assumed to be unreachable by any hardware walkers prior
  * to freeing and therefore no TLB invalidation is performed.
  */
-void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
+void __kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
 
 /**
  * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 95d7534c9679..5eb8d6e29ac4 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -297,7 +297,7 @@ void reclaim_pgtable_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc)
 	/* Dump all pgtable pages in the hyp_pool */
 	guest_lock_component(vm);
-	kvm_pgtable_stage2_destroy(&vm->pgt);
+	__kvm_pgtable_stage2_destroy(&vm->pgt);
 	vm->kvm.arch.mmu.pgd_phys = 0ULL;
 	guest_unlock_component(vm);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 7fad791cf40b..aa735ffe8d49 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1577,7 +1577,7 @@ void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
 	pgt->pgd = NULL;
 }
 
-void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void __kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
 {
 	kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
 	kvm_pgtable_stage2_destroy_pgd(pgt);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9a45daf817bf..6330a02c8418 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -904,6 +904,14 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
 	return 0;
 }
 
+static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
+{
+	unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
+
+	KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits));
+	KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}
+
 /**
  * kvm_init_stage2_mmu - Initialise a S2 MMU structure
  * @kvm:	The pointer to the KVM structure
@@ -980,7 +988,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	return 0;
 
 out_destroy_pgtable:
-	KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+	kvm_stage2_destroy(pgt);
 out_free_pgtable:
 	kfree(pgt);
 	return err;
@@ -1077,7 +1085,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 	write_unlock(&kvm->mmu_lock);
 
 	if (pgt) {
-		KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+		kvm_stage2_destroy(pgt);
 		kfree(pgt);
 	}
 }
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index bf737717ccb4..3be208449bd7 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -316,11 +316,6 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 	return 0;
 }
 
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
-{
-	__pkvm_pgtable_stage2_unmap(pgt, 0, ~(0ULL));
-}
-
 int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 			    u64 phys, enum kvm_pgtable_prot prot,
 			    void *mc, enum kvm_pgtable_walk_flags flags)
@@ -452,12 +447,13 @@ int pkvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
 }
 
 void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
-					u64 addr, u64 size)
+				       u64 addr, u64 size)
 {
-	WARN_ON_ONCE(1);
+	__pkvm_pgtable_stage2_unmap(pgt, addr, size);
 }
 
 void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
 {
-	WARN_ON_ONCE(1);
+	/* Expected to be called after all pKVM mappings have been released. */
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&pgt->pkvm_mappings.rb_root));
 }
-- 
2.39.5