* [PATCH 0/2] KVM: arm64: Destroy the stage-2 page-table periodically
@ 2025-07-24 23:51 Raghavendra Rao Ananta
  2025-07-24 23:51 ` [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy() Raghavendra Rao Ananta
  2025-07-24 23:51 ` [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically Raghavendra Rao Ananta
  0 siblings, 2 replies; 10+ messages in thread

From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC (permalink / raw)
  To: Oliver Upton, Marc Zyngier
  Cc: Raghavendra Rao Ananta, Mingwei Zhang, linux-arm-kernel, kvmarm,
      linux-kernel, kvm

Hello,

When a fully-mapped 128G VM is destroyed abruptly, the following
scheduler warning is observed:

sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
Tainted: [O]=OOT_MODULE
Call trace:
 show_stack+0x20/0x38 (C)
 dump_stack_lvl+0x3c/0xb8
 dump_stack+0x18/0x30
 resched_latency_warn+0x7c/0x88
 sched_tick+0x1c4/0x268
 update_process_times+0xa8/0xd8
 tick_nohz_handler+0xc8/0x168
 __hrtimer_run_queues+0x11c/0x338
 hrtimer_interrupt+0x104/0x308
 arch_timer_handler_phys+0x40/0x58
 handle_percpu_devid_irq+0x8c/0x1b0
 generic_handle_domain_irq+0x48/0x78
 gic_handle_irq+0x1b8/0x408
 call_on_irq_stack+0x24/0x30
 do_interrupt_handler+0x54/0x78
 el1_interrupt+0x44/0x88
 el1h_64_irq_handler+0x18/0x28
 el1h_64_irq+0x84/0x88
 stage2_free_walker+0x30/0xa0 (P)
 __kvm_pgtable_walk+0x11c/0x258
 __kvm_pgtable_walk+0x180/0x258
 __kvm_pgtable_walk+0x180/0x258
 __kvm_pgtable_walk+0x180/0x258
 kvm_pgtable_walk+0xc4/0x140
 kvm_pgtable_stage2_destroy+0x5c/0xf0
 kvm_free_stage2_pgd+0x6c/0xe8
 kvm_uninit_stage2_mmu+0x24/0x48
 kvm_arch_flush_shadow_all+0x80/0xa0
 kvm_mmu_notifier_release+0x38/0x78
 __mmu_notifier_release+0x15c/0x250
 exit_mmap+0x68/0x400
 __mmput+0x38/0x1c8
 mmput+0x30/0x68
 exit_mm+0xd4/0x198
 do_exit+0x1a4/0xb00
 do_group_exit+0x8c/0x120
 get_signal+0x6d4/0x778
 do_signal+0x90/0x718
 do_notify_resume+0x70/0x170
 el0_svc+0x74/0xd8
 el0t_64_sync_handler+0x60/0xc8
 el0t_64_sync+0x1b0/0x1b8

The host kernel was running with CONFIG_PREEMPT_NONE=y, and the
page-table walk takes a considerable amount of time for a VM with such
a large number of PTEs mapped, hence the warning.

To mitigate this, split the walk into smaller ranges and call
cond_resched() between them. Since the path is executed during VM
destruction, after the page-table structure is unlinked from the KVM
MMU, relying on cond_resched_rwlock_write() isn't necessary.

Patch 1 splits kvm_pgtable_stage2_destroy() into separate 'walk' and
'free PGD' parts.

Patch 2 leverages the split to perform the walk over smaller ranges,
calling cond_resched() between them.

Thank you.
Raghavendra

Raghavendra Rao Ananta (2):
  KVM: arm64: Split kvm_pgtable_stage2_destroy()
  KVM: arm64: Destroy the stage-2 page-table periodically

 arch/arm64/include/asm/kvm_pgtable.h | 19 ++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 23 ++++++++++++--
 arch/arm64/kvm/mmu.c                 | 46 +++++++++++++++++++++++++---
 3 files changed, 80 insertions(+), 8 deletions(-)

base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
-- 
2.50.1.470.g6ba607880d-goog

^ permalink raw reply	[flat|nested] 10+ messages in thread
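For readers who stop at the cover letter, the fix boils down to the chunked
walk below. This is only a preview of the loop patch 2 adds to
arch/arm64/kvm/mmu.c: stage2_range_addr_end() is the existing helper that
bounds each chunk, and kvm_pgtable_stage2_destroy_range() is the walk half
introduced by patch 1.

static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
                                 phys_addr_t end)
{
        u64 next;

        do {
                next = stage2_range_addr_end(addr, end);
                kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);

                /* The page-table is already unlinked, so no lock is held here. */
                if (next != end)
                        cond_resched();
        } while (addr = next, addr != end);
}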
* [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy() 2025-07-24 23:51 [PATCH 0/2] KVM: arm64: Destroy the stage-2 page-table periodically Raghavendra Rao Ananta @ 2025-07-24 23:51 ` Raghavendra Rao Ananta 2025-07-29 15:57 ` Oliver Upton 2025-08-08 18:57 ` Oliver Upton 2025-07-24 23:51 ` [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically Raghavendra Rao Ananta 1 sibling, 2 replies; 10+ messages in thread From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC (permalink / raw) To: Oliver Upton, Marc Zyngier Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm, linux-kernel, kvm Split kvm_pgtable_stage2_destroy() into two: - kvm_pgtable_stage2_destroy_range(), that performs the page-table walk and free the entries over a range of addresses. - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. This refactoring enables subsequent patches to free large page-tables in chunks, calling cond_resched() between each chunk, to yield the CPU as necessary. Direct callers of kvm_pgtable_stage2_destroy() will continue to walk the entire range of the VM as before, ensuring no functional changes. Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1 mapping of the page-table functions. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> --- arch/arm64/include/asm/kvm_pgtable.h | 19 +++++++++++++++++++ arch/arm64/include/asm/kvm_pkvm.h | 3 +++ arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++--- arch/arm64/kvm/pkvm.c | 11 +++++++++++ 4 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index 2888b5d03757..20aea58eca18 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -542,6 +542,25 @@ static inline int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2 return __kvm_pgtable_stage2_init(pgt, mmu, mm_ops, 0, NULL); } +/** + * kvm_pgtable_stage2_destroy_range() - Destroy the unlinked range of addresses. + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). + * @addr: Intermediate physical address at which to place the mapping. + * @size: Size of the mapping. + * + * The page-table is assumed to be unreachable by any hardware walkers prior + * to freeing and therefore no TLB invalidation is performed. + */ +void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, + u64 addr, u64 size); +/** + * kvm_pgtable_stage2_destroy_pgd() - Destroy the PGD of guest stage-2 page-table. + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). + * + * It is assumed that the rest of the page-table is freed before this operation. + */ +void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt); + /** * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table. * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). 
diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h index ea58282f59bb..ad32ea90639c 100644 --- a/arch/arm64/include/asm/kvm_pkvm.h +++ b/arch/arm64/include/asm/kvm_pkvm.h @@ -197,4 +197,7 @@ void pkvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void * kvm_pte_t *pkvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt, u64 phys, s8 level, enum kvm_pgtable_prot prot, void *mc, bool force_pte); +void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, + u64 addr, u64 size); +void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt); #endif /* __ARM64_KVM_PKVM_H__ */ diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index c351b4abd5db..7fad791cf40b 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -1551,21 +1551,38 @@ static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx, return 0; } -void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt) +void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, + u64 addr, u64 size) { - size_t pgd_sz; struct kvm_pgtable_walker walker = { .cb = stage2_free_walker, .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST, }; - WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker)); + WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker)); +} + +void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) +{ + /* + * We aren't doing a pgtable walk here, but the walker struct is needed + * for kvm_dereference_pteref(), which only looks at the ->flags. + */ + struct kvm_pgtable_walker walker = {0}; + size_t pgd_sz; + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE; pgt->mm_ops->free_pages_exact(kvm_dereference_pteref(&walker, pgt->pgd), pgd_sz); pgt->pgd = NULL; } +void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt) +{ + kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits)); + kvm_pgtable_stage2_destroy_pgd(pgt); +} + void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level) { kvm_pteref_t ptep = (kvm_pteref_t)pgtable; diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index fcd70bfe44fb..bf737717ccb4 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -450,3 +450,14 @@ int pkvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, WARN_ON_ONCE(1); return -EINVAL; } + +void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, + u64 addr, u64 size) +{ + WARN_ON_ONCE(1); +} + +void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) +{ + WARN_ON_ONCE(1); +} -- 2.50.1.470.g6ba607880d-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy()
  2025-07-24 23:51 ` [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy() Raghavendra Rao Ananta
@ 2025-07-29 15:57   ` Oliver Upton
  2025-08-08 18:57   ` Oliver Upton
  1 sibling, 0 replies; 10+ messages in thread

From: Oliver Upton @ 2025-07-29 15:57 UTC (permalink / raw)
  To: Raghavendra Rao Ananta
  Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
      linux-kernel, kvm

On Thu, Jul 24, 2025 at 11:51:43PM +0000, Raghavendra Rao Ananta wrote:
> Split kvm_pgtable_stage2_destroy() into two:
> - kvm_pgtable_stage2_destroy_range(), that performs the
>   page-table walk and free the entries over a range of addresses.
> - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
>
> This refactoring enables subsequent patches to free large page-tables
> in chunks, calling cond_resched() between each chunk, to yield the CPU
> as necessary.
>
> Direct callers of kvm_pgtable_stage2_destroy() will continue to walk
> the entire range of the VM as before, ensuring no functional changes.
>
> Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1
> mapping of the page-table functions.

Uhh... We can't stub these functions out for protected mode; we already
have a load-bearing implementation of pkvm_pgtable_stage2_destroy().
Just reuse what's already there and provide a NOP for
pkvm_pgtable_stage2_destroy_pgd().

> +void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
> +{
> +        /*
> +         * We aren't doing a pgtable walk here, but the walker struct is needed
> +         * for kvm_dereference_pteref(), which only looks at the ->flags.
> +         */
> +        struct kvm_pgtable_walker walker = {0};

This feels subtle and prone to error. I'd rather we have something that
boils down to rcu_dereference_raw() (with the appropriate n/hVHE
awareness) and add a comment why it is safe.

> +void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> +{
> +        kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
> +        kvm_pgtable_stage2_destroy_pgd(pgt);
> +}
> +

Move this to mmu.c as a static function and use KVM_PGT_FN()

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 10+ messages in thread
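One possible shape for the rcu_dereference_raw() suggestion, as a sketch only:
the helper below is hypothetical and not part of this series, and the exact
n/hVHE split is assumed (kvm_pteref_t is RCU-annotated for the kernel proper
and a plain pointer in the EL2 object).

/* Hypothetical raw dereference; name and placement are assumptions. */
static kvm_pte_t *kvm_pteref_raw(kvm_pteref_t pteref)
{
#ifdef __KVM_NVHE_HYPERVISOR__
        return pteref;                          /* no RCU at hyp */
#else
        return rcu_dereference_raw(pteref);     /* no walker, no lockdep checks */
#endif
}

void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
{
        size_t pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;

        /*
         * Safe without RCU protection: the page-table has already been
         * unlinked from the KVM MMU, so no walker can still reach the PGD.
         */
        pgt->mm_ops->free_pages_exact(kvm_pteref_raw(pgt->pgd), pgd_sz);
        pgt->pgd = NULL;
}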
* Re: [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy() 2025-07-24 23:51 ` [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy() Raghavendra Rao Ananta 2025-07-29 15:57 ` Oliver Upton @ 2025-08-08 18:57 ` Oliver Upton 1 sibling, 0 replies; 10+ messages in thread From: Oliver Upton @ 2025-08-08 18:57 UTC (permalink / raw) To: Raghavendra Rao Ananta Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm, linux-kernel, kvm On Thu, Jul 24, 2025 at 11:51:43PM +0000, Raghavendra Rao Ananta wrote: > Split kvm_pgtable_stage2_destroy() into two: > - kvm_pgtable_stage2_destroy_range(), that performs the > page-table walk and free the entries over a range of addresses. > - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. > > This refactoring enables subsequent patches to free large page-tables > in chunks, calling cond_resched() between each chunk, to yield the CPU > as necessary. > > Direct callers of kvm_pgtable_stage2_destroy() will continue to walk > the entire range of the VM as before, ensuring no functional changes. > > Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1 > mapping of the page-table functions. > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Here's the other half of my fixups From 7d3e948357d0d2568afc136906e1b973ed39deeb Mon Sep 17 00:00:00 2001 From: Oliver Upton <oliver.upton@linux.dev> Date: Fri, 8 Aug 2025 11:35:43 -0700 Subject: [PATCH 2/4] fixup! KVM: arm64: Split kvm_pgtable_stage2_destroy() --- arch/arm64/include/asm/kvm_pgtable.h | 4 ++-- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +- arch/arm64/kvm/hyp/pgtable.c | 2 +- arch/arm64/kvm/mmu.c | 12 ++++++++++-- arch/arm64/kvm/pkvm.c | 12 ++++-------- 5 files changed, 18 insertions(+), 14 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index 20aea58eca18..fdae4685b9ac 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -562,13 +562,13 @@ void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt); /** - * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table. + * __kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table. * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). * * The page-table is assumed to be unreachable by any hardware walkers prior * to freeing and therefore no TLB invalidation is performed. */ -void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt); +void __kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt); /** * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure. 
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 95d7534c9679..5eb8d6e29ac4 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -297,7 +297,7 @@ void reclaim_pgtable_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc) /* Dump all pgtable pages in the hyp_pool */ guest_lock_component(vm); - kvm_pgtable_stage2_destroy(&vm->pgt); + __kvm_pgtable_stage2_destroy(&vm->pgt); vm->kvm.arch.mmu.pgd_phys = 0ULL; guest_unlock_component(vm); diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index 7fad791cf40b..aa735ffe8d49 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -1577,7 +1577,7 @@ void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) pgt->pgd = NULL; } -void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt) +void __kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt) { kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits)); kvm_pgtable_stage2_destroy_pgd(pgt); diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 9a45daf817bf..6330a02c8418 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -904,6 +904,14 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type) return 0; } +static void kvm_stage2_destroy(struct kvm_pgtable *pgt) +{ + unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr); + + KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits)); + KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt); +} + /** * kvm_init_stage2_mmu - Initialise a S2 MMU structure * @kvm: The pointer to the KVM structure @@ -980,7 +988,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t return 0; out_destroy_pgtable: - KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt); + kvm_stage2_destroy(pgt); out_free_pgtable: kfree(pgt); return err; @@ -1077,7 +1085,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu) write_unlock(&kvm->mmu_lock); if (pgt) { - KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt); + kvm_stage2_destroy(pgt); kfree(pgt); } } diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index bf737717ccb4..3be208449bd7 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -316,11 +316,6 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e return 0; } -void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt) -{ - __pkvm_pgtable_stage2_unmap(pgt, 0, ~(0ULL)); -} - int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys, enum kvm_pgtable_prot prot, void *mc, enum kvm_pgtable_walk_flags flags) @@ -452,12 +447,13 @@ int pkvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, } void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, - u64 addr, u64 size) + u64 addr, u64 size) { - WARN_ON_ONCE(1); + __pkvm_pgtable_stage2_unmap(pgt, addr, size); } void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) { - WARN_ON_ONCE(1); + /* Expected to be called after all pKVM mappings have been released. */ + WARN_ON_ONCE(!RB_EMPTY_ROOT(&pgt->pkvm_mappings.rb_root)); } -- 2.39.5 ^ permalink raw reply related [flat|nested] 10+ messages in thread
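For context on the KVM_PGT_FN() usage above: it is the existing dispatcher in
arch/arm64/kvm/mmu.c that selects the pKVM flavour of a page-table helper when
protected mode is enabled. From memory (not part of this series), it is
roughly:

#define KVM_PGT_FN(fn)  (!is_protected_kvm_enabled() ? fn : p ## fn)

so KVM_PGT_FN(kvm_pgtable_stage2_destroy_range) resolves to either
kvm_pgtable_stage2_destroy_range() or pkvm_pgtable_stage2_destroy_range(),
which is why patch 1 has to introduce the split on both the pgtable.c and
pkvm.c sides.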
* [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically 2025-07-24 23:51 [PATCH 0/2] KVM: arm64: Destroy the stage-2 page-table periodically Raghavendra Rao Ananta 2025-07-24 23:51 ` [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy() Raghavendra Rao Ananta @ 2025-07-24 23:51 ` Raghavendra Rao Ananta 2025-07-25 14:59 ` Sean Christopherson 2025-07-29 16:01 ` Oliver Upton 1 sibling, 2 replies; 10+ messages in thread From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC (permalink / raw) To: Oliver Upton, Marc Zyngier Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm, linux-kernel, kvm When a large VM, specifically one that holds a significant number of PTEs, gets abruptly destroyed, the following warning is seen during the page-table walk: sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE Tainted: [O]=OOT_MODULE Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x3c/0xb8 dump_stack+0x18/0x30 resched_latency_warn+0x7c/0x88 sched_tick+0x1c4/0x268 update_process_times+0xa8/0xd8 tick_nohz_handler+0xc8/0x168 __hrtimer_run_queues+0x11c/0x338 hrtimer_interrupt+0x104/0x308 arch_timer_handler_phys+0x40/0x58 handle_percpu_devid_irq+0x8c/0x1b0 generic_handle_domain_irq+0x48/0x78 gic_handle_irq+0x1b8/0x408 call_on_irq_stack+0x24/0x30 do_interrupt_handler+0x54/0x78 el1_interrupt+0x44/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x84/0x88 stage2_free_walker+0x30/0xa0 (P) __kvm_pgtable_walk+0x11c/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 kvm_pgtable_walk+0xc4/0x140 kvm_pgtable_stage2_destroy+0x5c/0xf0 kvm_free_stage2_pgd+0x6c/0xe8 kvm_uninit_stage2_mmu+0x24/0x48 kvm_arch_flush_shadow_all+0x80/0xa0 kvm_mmu_notifier_release+0x38/0x78 __mmu_notifier_release+0x15c/0x250 exit_mmap+0x68/0x400 __mmput+0x38/0x1c8 mmput+0x30/0x68 exit_mm+0xd4/0x198 do_exit+0x1a4/0xb00 do_group_exit+0x8c/0x120 get_signal+0x6d4/0x778 do_signal+0x90/0x718 do_notify_resume+0x70/0x170 el0_svc+0x74/0xd8 el0t_64_sync_handler+0x60/0xc8 el0t_64_sync+0x1b0/0x1b8 The warning is seen majorly on the host kernels that are configured not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this, instead of walking the entire page-table in one go, split it into smaller ranges, by checking for cond_resched() between each range. Since the path is executed during VM destruction, after the page-table structure is unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't necessary. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> --- arch/arm64/kvm/mmu.c | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 2942ec92c5a4..6c4b9fb1211b 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -387,6 +387,40 @@ static void stage2_flush_vm(struct kvm *kvm) srcu_read_unlock(&kvm->srcu, idx); } +/* + * Assume that @pgt is valid and unlinked from the KVM MMU to free the + * page-table without taking the kvm_mmu_lock and without performing any + * TLB invalidations. + * + * Also, the range of addresses can be large enough to cause need_resched + * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke + * cond_resched() periodically to prevent hogging the CPU for a long time + * and schedule something else, if required. 
+ */ +static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr, + phys_addr_t end) +{ + u64 next; + + do { + next = stage2_range_addr_end(addr, end); + kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr); + + if (next != end) + cond_resched(); + } while (addr = next, addr != end); +} + +static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt) +{ + if (!is_protected_kvm_enabled()) { + stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits)); + kvm_pgtable_stage2_destroy_pgd(pgt); + } else { + pkvm_pgtable_stage2_destroy(pgt); + } +} + /** * free_hyp_pgds - free Hyp-mode page tables */ @@ -984,7 +1018,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t return 0; out_destroy_pgtable: - KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt); + kvm_destroy_stage2_pgt(pgt); out_free_pgtable: kfree(pgt); return err; @@ -1081,7 +1115,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu) write_unlock(&kvm->mmu_lock); if (pgt) { - KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt); + kvm_destroy_stage2_pgt(pgt); kfree(pgt); } } -- 2.50.1.470.g6ba607880d-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
  2025-07-24 23:51 ` [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically Raghavendra Rao Ananta
@ 2025-07-25 14:59   ` Sean Christopherson
  2025-07-25 16:22     ` Raghavendra Rao Ananta
  0 siblings, 1 reply; 10+ messages in thread

From: Sean Christopherson @ 2025-07-25 14:59 UTC (permalink / raw)
  To: Raghavendra Rao Ananta
  Cc: Oliver Upton, Marc Zyngier, Mingwei Zhang, linux-arm-kernel,
      kvmarm, linux-kernel, kvm

Heh, without full context, the shortlog reads like "destroy stage-2 page
tables from time to time". Something like this would be more appropriate:

  KVM: arm64: Reschedule as needed when destroying stage-2 page-tables

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
  2025-07-25 14:59   ` Sean Christopherson
@ 2025-07-25 16:22     ` Raghavendra Rao Ananta
  0 siblings, 0 replies; 10+ messages in thread

From: Raghavendra Rao Ananta @ 2025-07-25 16:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Marc Zyngier, Mingwei Zhang, linux-arm-kernel,
      kvmarm, linux-kernel, kvm

On Fri, Jul 25, 2025 at 7:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Heh, without full context, the shortlog reads like "destroy stage-2 page
> tables from time to time". Something like this would be more appropriate:
>
>   KVM: arm64: Reschedule as needed when destroying stage-2 page-tables

This definitely sounds better :)

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
  2025-07-24 23:51 ` [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically Raghavendra Rao Ananta
  2025-07-25 14:59   ` Sean Christopherson
@ 2025-07-29 16:01   ` Oliver Upton
  2025-08-07 18:58     ` Raghavendra Rao Ananta
  1 sibling, 1 reply; 10+ messages in thread

From: Oliver Upton @ 2025-07-29 16:01 UTC (permalink / raw)
  To: Raghavendra Rao Ananta
  Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
      linux-kernel, kvm

On Thu, Jul 24, 2025 at 11:51:44PM +0000, Raghavendra Rao Ananta wrote:
> +/*
> + * Assume that @pgt is valid and unlinked from the KVM MMU to free the
> + * page-table without taking the kvm_mmu_lock and without performing any
> + * TLB invalidations.
> + *
> + * Also, the range of addresses can be large enough to cause need_resched
> + * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
> + * cond_resched() periodically to prevent hogging the CPU for a long time
> + * and schedule something else, if required.
> + */
> +static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
> +                                 phys_addr_t end)
> +{
> +        u64 next;
> +
> +        do {
> +                next = stage2_range_addr_end(addr, end);
> +                kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
> +
> +                if (next != end)
> +                        cond_resched();
> +        } while (addr = next, addr != end);
> +}
> +
> +static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
> +{
> +        if (!is_protected_kvm_enabled()) {
> +                stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
> +                kvm_pgtable_stage2_destroy_pgd(pgt);
> +        } else {
> +                pkvm_pgtable_stage2_destroy(pgt);
> +        }
> +}
> +

Protected mode is affected by the same problem, potentially even worse
due to the overheads of calling into EL2. Both protected and
non-protected flows should use stage2_destroy_range().

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
  2025-07-29 16:01   ` Oliver Upton
@ 2025-08-07 18:58     ` Raghavendra Rao Ananta
  2025-08-08 18:56       ` Oliver Upton
  0 siblings, 1 reply; 10+ messages in thread

From: Raghavendra Rao Ananta @ 2025-08-07 18:58 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
      linux-kernel, kvm

Hi Oliver,

> Protected mode is affected by the same problem, potentially even worse
> due to the overheads of calling into EL2. Both protected and
> non-protected flows should use stage2_destroy_range().

I experimented with this (see diff below), and it looks like it takes
significantly longer to finish the destruction, even for a very small
VM. For instance, it takes ~140 seconds on an Ampere Altra machine.
This is probably because we run cond_resched() for every chunk of the
entire sweep of the possible address range, 0 to ~(0ULL), even though
there are no actual mappings there, and we context switch out more
often.

--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
+static void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+{
+        u64 end = is_protected_kvm_enabled() ? ~(0ULL) : BIT(pgt->ia_bits);
+        u64 next, addr = 0;
+
+        do {
+                next = stage2_range_addr_end(addr, end);
+                KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr,
+                                                             next - addr);
+
+                if (next != end)
+                        cond_resched();
+        } while (addr = next, addr != end);
+
+        KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}

--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -316,9 +316,13 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 	return 0;
 }

-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+        __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
+}
+
+void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
+{
+}

Without cond_resched() in place, it takes about half the time.

I also tried moving cond_resched() to __pkvm_pgtable_stage2_unmap(), as
per the below diff, and calling pkvm_pgtable_stage2_destroy_range() for
the entire 0 to ~(0ULL) range (instead of breaking it up). Even for a
fully 4K-mapped 128G VM, I see it taking ~65 seconds, which is close to
the baseline (no cond_resched()).

--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -311,8 +311,11 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 			return ret;
 		pkvm_mapping_remove(mapping, &pgt->pkvm_mappings);
 		kfree(mapping);
+		cond_resched();
 	}

Does it make sense to call cond_resched() only when we actually unmap
pages?

Thank you.
Raghavendra

^ permalink raw reply	[flat|nested] 10+ messages in thread
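If a per-mapping yield turns out to be too frequent, one variant worth
considering is to batch it. The fragment below is a sketch only; the loop
shape inside __pkvm_pgtable_stage2_unmap() and the batch size are assumptions,
not code from this series:

        /* Inside the per-mapping loop of __pkvm_pgtable_stage2_unmap(): */
        unsigned long freed = 0;
        ...
                pkvm_mapping_remove(mapping, &pgt->pkvm_mappings);
                kfree(mapping);

                /* Yield only after a batch of real work; 512 is arbitrary. */
                if (!(++freed % 512))
                        cond_resched();

This keeps sweeps of empty address space cheap while still yielding regularly
on densely mapped ranges.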
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically 2025-08-07 18:58 ` Raghavendra Rao Ananta @ 2025-08-08 18:56 ` Oliver Upton 0 siblings, 0 replies; 10+ messages in thread From: Oliver Upton @ 2025-08-08 18:56 UTC (permalink / raw) To: Raghavendra Rao Ananta Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm, linux-kernel, kvm On Thu, Aug 07, 2025 at 11:58:01AM -0700, Raghavendra Rao Ananta wrote: > Hi Oliver, > > > > > Protected mode is affected by the same problem, potentially even worse > > due to the overheads of calling into EL2. Both protected and > > non-protected flows should use stage2_destroy_range(). > > > I experimented with this (see diff below), and it looks like it takes > significantly longer to finish the destruction even for a very small > VM. For instance, it takes ~140 seconds on an Ampere Altra machine. > This is probably because we run cond_resched() for every breakup in > the entire sweep of the possible address range, 0 to ~(0ULL), even > though there are no actual mappings there, and we context switch out > more often. This seems more like an issue with the upper bound on a pKVM walk rather than a problem with the suggestion. The information in pgt->ia_bits is actually derived from the VTCR value of the owning MMU. Even though we never use the VTCR value in hardware, pKVM MMUs have a valid VTCR value that encodes the size of the IPA space and we use that in the common stage-2 abort path. I'm attaching some fixups that I have on top of your series that'd allow the resched logic to remain common, like it is in other MMU flows. From 421468dcaa4692208c3f708682b058cfc072a984 Mon Sep 17 00:00:00 2001 From: Oliver Upton <oliver.upton@linux.dev> Date: Fri, 8 Aug 2025 11:43:12 -0700 Subject: [PATCH 4/4] fixup! KVM: arm64: Destroy the stage-2 page-table periodically --- arch/arm64/kvm/mmu.c | 60 ++++++++++++++++++-------------------------- 1 file changed, 25 insertions(+), 35 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index b82412323054..fc93cc256bd8 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -383,40 +383,6 @@ static void stage2_flush_vm(struct kvm *kvm) srcu_read_unlock(&kvm->srcu, idx); } -/* - * Assume that @pgt is valid and unlinked from the KVM MMU to free the - * page-table without taking the kvm_mmu_lock and without performing any - * TLB invalidations. - * - * Also, the range of addresses can be large enough to cause need_resched - * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke - * cond_resched() periodically to prevent hogging the CPU for a long time - * and schedule something else, if required. 
- */ -static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr, - phys_addr_t end) -{ - u64 next; - - do { - next = stage2_range_addr_end(addr, end); - kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr); - - if (next != end) - cond_resched(); - } while (addr = next, addr != end); -} - -static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt) -{ - if (!is_protected_kvm_enabled()) { - stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits)); - kvm_pgtable_stage2_destroy_pgd(pgt); - } else { - pkvm_pgtable_stage2_destroy(pgt); - } -} - /** * free_hyp_pgds - free Hyp-mode page tables */ @@ -938,11 +904,35 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type) return 0; } +/* + * Assume that @pgt is valid and unlinked from the KVM MMU to free the + * page-table without taking the kvm_mmu_lock and without performing any + * TLB invalidations. + * + * Also, the range of addresses can be large enough to cause need_resched + * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke + * cond_resched() periodically to prevent hogging the CPU for a long time + * and schedule something else, if required. + */ +static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr, + phys_addr_t end) +{ + u64 next; + + do { + next = stage2_range_addr_end(addr, end); + KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr, next - addr); + + if (next != end) + cond_resched(); + } while (addr = next, addr != end); +} + static void kvm_stage2_destroy(struct kvm_pgtable *pgt) { unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr); - KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits)); + stage2_destroy_range(pgt, 0, BIT(ia_bits)); KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt); } -- 2.39.5 ^ permalink raw reply related [flat|nested] 10+ messages in thread
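For reference, the VTCR_EL2_IPA() macro used in the fixup derives the IPA size
from the T0SZ field of the cached VTCR value; from memory (see
arch/arm64/include/asm/kvm_arm.h for the authoritative definition) it is
roughly:

#define VTCR_EL2_IPA(vtcr)	(64 - ((vtcr) & VTCR_EL2_T0SZ_MASK))

so even for pKVM MMUs, whose VTCR is never programmed into hardware by the
host, BIT(VTCR_EL2_IPA(vtcr)) bounds the destroy walk to the guest's actual
IPA space rather than the full 64-bit sweep that made the experiment above so
slow.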