* [PATCH 1/3] KVM: arm64: Only drop references on empty tables in stage2_free_walker
2025-11-13 5:24 [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
@ 2025-11-13 5:24 ` Raghavendra Rao Ananta
2025-11-13 5:24 ` [PATCH 2/3] KVM: arm64: Split kvm_pgtable_stage2_destroy() Raghavendra Rao Ananta
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Raghavendra Rao Ananta @ 2025-11-13 5:24 UTC (permalink / raw)
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm, Oliver Upton
From: Oliver Upton <oliver.upton@linux.dev>
A subsequent change to the way KVM frees stage-2s will invoke the free
walker on sub-ranges of the VM's IPA space, meaning there's potential
for only partially visiting a table's PTEs.
Split the leaf and table visitors and only drop references on a table
when the page count reaches 1, implying there are no valid PTEs that
need to be visited. Invalidate the table PTE to avoid traversing the
stale reference.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
arch/arm64/kvm/hyp/pgtable.c | 38 ++++++++++++++++++++++++++++++------
1 file changed, 32 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c351b4abd5dbf..6d6a23f7dedb6 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1535,20 +1535,46 @@ size_t kvm_pgtable_stage2_pgd_size(u64 vtcr)
return kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
}
-static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
- enum kvm_pgtable_walk_flags visit)
+static int stage2_free_leaf(const struct kvm_pgtable_visit_ctx *ctx)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
- if (!stage2_pte_is_counted(ctx->old))
+ mm_ops->put_page(ctx->ptep);
+ return 0;
+}
+
+static int stage2_free_table_post(const struct kvm_pgtable_visit_ctx *ctx)
+{
+ struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+ kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
+
+ if (mm_ops->page_count(childp) != 1)
return 0;
+ /*
+ * Drop references and clear the now stale PTE to avoid rewalking the
+ * freed page table.
+ */
mm_ops->put_page(ctx->ptep);
+ mm_ops->put_page(childp);
+ kvm_clear_pte(ctx->ptep);
+ return 0;
+}
- if (kvm_pte_table(ctx->old, ctx->level))
- mm_ops->put_page(kvm_pte_follow(ctx->old, mm_ops));
+static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ if (!stage2_pte_is_counted(ctx->old))
+ return 0;
- return 0;
+ switch (visit) {
+ case KVM_PGTABLE_WALK_LEAF:
+ return stage2_free_leaf(ctx);
+ case KVM_PGTABLE_WALK_TABLE_POST:
+ return stage2_free_table_post(ctx);
+ default:
+ return -EINVAL;
+ }
}
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 6+ messages in thread

* [PATCH 2/3] KVM: arm64: Split kvm_pgtable_stage2_destroy()
2025-11-13 5:24 [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
2025-11-13 5:24 ` [PATCH 1/3] KVM: arm64: Only drop references on empty tables in stage2_free_walker Raghavendra Rao Ananta
@ 2025-11-13 5:24 ` Raghavendra Rao Ananta
2025-11-13 5:24 ` [PATCH 3/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Raghavendra Rao Ananta @ 2025-11-13 5:24 UTC (permalink / raw)
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm, Oliver Upton
Split kvm_pgtable_stage2_destroy() into two:
- kvm_pgtable_stage2_destroy_range(), which performs the
page-table walk and frees the entries over a range of addresses.
- kvm_pgtable_stage2_destroy_pgd(), which frees the PGD.
This refactoring enables subsequent patches to free large page-tables
in chunks, calling cond_resched() between each chunk to yield the
CPU as necessary.
Existing callers of kvm_pgtable_stage2_destroy() that probably cannot
take advantage of this (such as nVHE) will continue to function as is.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250820162242.2624752-2-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
arch/arm64/include/asm/kvm_pgtable.h | 30 ++++++++++++++++++++++++++++
arch/arm64/include/asm/kvm_pkvm.h | 4 +++-
arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++----
arch/arm64/kvm/mmu.c | 12 +++++++++--
arch/arm64/kvm/pkvm.c | 11 ++++++++--
5 files changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 2888b5d037573..1246216616b51 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -355,6 +355,11 @@ static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walke
return pteref;
}
+static inline kvm_pte_t *kvm_dereference_pteref_raw(kvm_pteref_t pteref)
+{
+ return pteref;
+}
+
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
{
/*
@@ -384,6 +389,11 @@ static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walke
return rcu_dereference_check(pteref, !(walker->flags & KVM_PGTABLE_WALK_SHARED));
}
+static inline kvm_pte_t *kvm_dereference_pteref_raw(kvm_pteref_t pteref)
+{
+ return rcu_dereference_raw(pteref);
+}
+
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
{
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
@@ -551,6 +561,26 @@ static inline int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2
*/
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
+/**
+ * kvm_pgtable_stage2_destroy_range() - Destroy the unlinked range of addresses.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @addr: Intermediate physical address at which to place the mapping.
+ * @size: Size of the mapping.
+ *
+ * The page-table is assumed to be unreachable by any hardware walkers prior
+ * to freeing and therefore no TLB invalidation is performed.
+ */
+void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+ u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_destroy_pgd() - Destroy the PGD of guest stage-2 page-table.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
+ *
+ * It is assumed that the rest of the page-table is freed before this operation.
+ */
+void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt);
+
/**
* kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
* @mm_ops: Memory management callbacks.
diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h
index 08be89c95466e..0aecd4ac5f45d 100644
--- a/arch/arm64/include/asm/kvm_pkvm.h
+++ b/arch/arm64/include/asm/kvm_pkvm.h
@@ -180,7 +180,9 @@ struct pkvm_mapping {
int pkvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops);
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
+void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+ u64 addr, u64 size);
+void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt);
int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
enum kvm_pgtable_prot prot, void *mc,
enum kvm_pgtable_walk_flags flags);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 6d6a23f7dedb6..0882896dbf8f2 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1577,21 +1577,38 @@ static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
}
}
-void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+ u64 addr, u64 size)
{
- size_t pgd_sz;
struct kvm_pgtable_walker walker = {
.cb = stage2_free_walker,
.flags = KVM_PGTABLE_WALK_LEAF |
KVM_PGTABLE_WALK_TABLE_POST,
};
- WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
+ WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+}
+
+void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
+{
+ size_t pgd_sz;
+
pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
- pgt->mm_ops->free_pages_exact(kvm_dereference_pteref(&walker, pgt->pgd), pgd_sz);
+
+ /*
+ * Since the pgtable is unlinked at this point, and not shared with
+ * other walkers, safely dereference pgd with kvm_dereference_pteref_raw()
+ */
+ pgt->mm_ops->free_pages_exact(kvm_dereference_pteref_raw(pgt->pgd), pgd_sz);
pgt->pgd = NULL;
}
+void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+{
+ kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
+ kvm_pgtable_stage2_destroy_pgd(pgt);
+}
+
void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level)
{
kvm_pteref_t ptep = (kvm_pteref_t)pgtable;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7cc964af8d305..c2bc1eba032cd 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -904,6 +904,14 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
return 0;
}
+static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
+{
+ unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
+
+ KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits));
+ KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}
+
/**
* kvm_init_stage2_mmu - Initialise a S2 MMU structure
* @kvm: The pointer to the KVM structure
@@ -980,7 +988,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
return 0;
out_destroy_pgtable:
- KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+ kvm_stage2_destroy(pgt);
out_free_pgtable:
kfree(pgt);
return err;
@@ -1081,7 +1089,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
write_unlock(&kvm->mmu_lock);
if (pgt) {
- KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+ kvm_stage2_destroy(pgt);
kfree(pgt);
}
}
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 24f0f8a8c943c..d7a0f69a99821 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -344,9 +344,16 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
return 0;
}
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+ u64 addr, u64 size)
{
- __pkvm_pgtable_stage2_unmap(pgt, 0, ~(0ULL));
+ __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
+}
+
+void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
+{
+ /* Expected to be called after all pKVM mappings have been released. */
+ WARN_ON_ONCE(!RB_EMPTY_ROOT(&pgt->pkvm_mappings.rb_root));
}
int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 6+ messages in thread

* [PATCH 3/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
2025-11-13 5:24 [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
2025-11-13 5:24 ` [PATCH 1/3] KVM: arm64: Only drop references on empty tables in stage2_free_walker Raghavendra Rao Ananta
2025-11-13 5:24 ` [PATCH 2/3] KVM: arm64: Split kvm_pgtable_stage2_destroy() Raghavendra Rao Ananta
@ 2025-11-13 5:24 ` Raghavendra Rao Ananta
2025-11-19 22:35 ` [PATCH 0/3] " Oliver Upton
2026-01-28 16:47 ` Marc Zyngier
4 siblings, 0 replies; 6+ messages in thread
From: Raghavendra Rao Ananta @ 2025-11-13 5:24 UTC (permalink / raw)
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Anata, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm, Oliver Upton
When a large VM, specifically one that holds a significant number of PTEs,
gets abruptly destroyed, the following warning is seen during the
page-table walk:
sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
Tainted: [O]=OOT_MODULE
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x3c/0xb8
dump_stack+0x18/0x30
resched_latency_warn+0x7c/0x88
sched_tick+0x1c4/0x268
update_process_times+0xa8/0xd8
tick_nohz_handler+0xc8/0x168
__hrtimer_run_queues+0x11c/0x338
hrtimer_interrupt+0x104/0x308
arch_timer_handler_phys+0x40/0x58
handle_percpu_devid_irq+0x8c/0x1b0
generic_handle_domain_irq+0x48/0x78
gic_handle_irq+0x1b8/0x408
call_on_irq_stack+0x24/0x30
do_interrupt_handler+0x54/0x78
el1_interrupt+0x44/0x88
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x84/0x88
stage2_free_walker+0x30/0xa0 (P)
__kvm_pgtable_walk+0x11c/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
kvm_pgtable_walk+0xc4/0x140
kvm_pgtable_stage2_destroy+0x5c/0xf0
kvm_free_stage2_pgd+0x6c/0xe8
kvm_uninit_stage2_mmu+0x24/0x48
kvm_arch_flush_shadow_all+0x80/0xa0
kvm_mmu_notifier_release+0x38/0x78
__mmu_notifier_release+0x15c/0x250
exit_mmap+0x68/0x400
__mmput+0x38/0x1c8
mmput+0x30/0x68
exit_mm+0xd4/0x198
do_exit+0x1a4/0xb00
do_group_exit+0x8c/0x120
get_signal+0x6d4/0x778
do_signal+0x90/0x718
do_notify_resume+0x70/0x170
el0_svc+0x74/0xd8
el0t_64_sync_handler+0x60/0xc8
el0t_64_sync+0x1b0/0x1b8
The warning is mostly seen on host kernels that are not configured
for forced preemption, such as CONFIG_PREEMPT_NONE=y. To avoid this,
instead of walking the entire page-table in one go, split the walk
into smaller ranges and call cond_resched() between each range.
Since the path is executed during VM destruction, after the
page-table structure is unlinked from the KVM MMU, relying on
cond_resched_rwlock_write() isn't necessary.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250820162242.2624752-3-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
arch/arm64/kvm/mmu.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c2bc1eba032cd..f86d17ad50a7f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -904,11 +904,35 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
return 0;
}
+/*
+ * Assume that @pgt is valid and unlinked from the KVM MMU to free the
+ * page-table without taking the kvm_mmu_lock and without performing any
+ * TLB invalidations.
+ *
+ * Also, the range of addresses can be large enough to cause need_resched
+ * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
+ * cond_resched() periodically to avoid hogging the CPU for long
+ * stretches and to let other work be scheduled, if required.
+ */
+static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
+ phys_addr_t end)
+{
+ u64 next;
+
+ do {
+ next = stage2_range_addr_end(addr, end);
+ KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr,
+ next - addr);
+ if (next != end)
+ cond_resched();
+ } while (addr = next, addr != end);
+}
+
static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
{
unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
- KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits));
+ stage2_destroy_range(pgt, 0, BIT(ia_bits));
KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
}
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 6+ messages in thread

* Re: [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
2025-11-13 5:24 [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
` (2 preceding siblings ...)
2025-11-13 5:24 ` [PATCH 3/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
@ 2025-11-19 22:35 ` Oliver Upton
2026-01-28 16:47 ` Marc Zyngier
4 siblings, 0 replies; 6+ messages in thread
From: Oliver Upton @ 2025-11-19 22:35 UTC (permalink / raw)
To: Marc Zyngier, Raghavendra Rao Ananta
Cc: Oliver Upton, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
On Thu, 13 Nov 2025 05:24:49 +0000, Raghavendra Rao Ananta wrote:
> When destroying a fully-mapped 128G VM abruptly, the following scheduler
> warning is observed:
>
> sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
> CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
> Tainted: [O]=OOT_MODULE
> Call trace:
> show_stack+0x20/0x38 (C)
> dump_stack_lvl+0x3c/0xb8
> dump_stack+0x18/0x30
> resched_latency_warn+0x7c/0x88
> sched_tick+0x1c4/0x268
> update_process_times+0xa8/0xd8
> tick_nohz_handler+0xc8/0x168
> __hrtimer_run_queues+0x11c/0x338
> hrtimer_interrupt+0x104/0x308
> arch_timer_handler_phys+0x40/0x58
> handle_percpu_devid_irq+0x8c/0x1b0
> generic_handle_domain_irq+0x48/0x78
> gic_handle_irq+0x1b8/0x408
> call_on_irq_stack+0x24/0x30
> do_interrupt_handler+0x54/0x78
> el1_interrupt+0x44/0x88
> el1h_64_irq_handler+0x18/0x28
> el1h_64_irq+0x84/0x88
> stage2_free_walker+0x30/0xa0 (P)
> __kvm_pgtable_walk+0x11c/0x258
> __kvm_pgtable_walk+0x180/0x258
> __kvm_pgtable_walk+0x180/0x258
> __kvm_pgtable_walk+0x180/0x258
> kvm_pgtable_walk+0xc4/0x140
> kvm_pgtable_stage2_destroy+0x5c/0xf0
> kvm_free_stage2_pgd+0x6c/0xe8
> kvm_uninit_stage2_mmu+0x24/0x48
> kvm_arch_flush_shadow_all+0x80/0xa0
> kvm_mmu_notifier_release+0x38/0x78
> __mmu_notifier_release+0x15c/0x250
> exit_mmap+0x68/0x400
> __mmput+0x38/0x1c8
> mmput+0x30/0x68
> exit_mm+0xd4/0x198
> do_exit+0x1a4/0xb00
> do_group_exit+0x8c/0x120
> get_signal+0x6d4/0x778
> do_signal+0x90/0x718
> do_notify_resume+0x70/0x170
> el0_svc+0x74/0xd8
> el0t_64_sync_handler+0x60/0xc8
> el0t_64_sync+0x1b0/0x1b8
>
> [...]
Applied to next, thanks!
[1/3] KVM: arm64: Only drop references on empty tables in stage2_free_walker
https://git.kernel.org/kvmarm/kvmarm/c/156f70afcfec
[2/3] KVM: arm64: Split kvm_pgtable_stage2_destroy()
https://git.kernel.org/kvmarm/kvmarm/c/d68d66e57e2b
[3/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
https://git.kernel.org/kvmarm/kvmarm/c/4ddfab5436b6
--
Best,
Oliver
^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
2025-11-13 5:24 [PATCH 0/3] KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables Raghavendra Rao Ananta
` (3 preceding siblings ...)
2025-11-19 22:35 ` [PATCH 0/3] " Oliver Upton
@ 2026-01-28 16:47 ` Marc Zyngier
4 siblings, 0 replies; 6+ messages in thread
From: Marc Zyngier @ 2026-01-28 16:47 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Oliver Upton, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
On Thu, 13 Nov 2025 05:24:49 +0000,
Raghavendra Rao Ananta <rananta@google.com> wrote:
>
> Hello,
>
> When destroying a fully-mapped 128G VM abruptly, the following scheduler
> warning is observed:
>
> sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
> CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
> Tainted: [O]=OOT_MODULE
> Call trace:
> show_stack+0x20/0x38 (C)
> dump_stack_lvl+0x3c/0xb8
> dump_stack+0x18/0x30
> resched_latency_warn+0x7c/0x88
> sched_tick+0x1c4/0x268
> update_process_times+0xa8/0xd8
> tick_nohz_handler+0xc8/0x168
> __hrtimer_run_queues+0x11c/0x338
> hrtimer_interrupt+0x104/0x308
> arch_timer_handler_phys+0x40/0x58
> handle_percpu_devid_irq+0x8c/0x1b0
> generic_handle_domain_irq+0x48/0x78
> gic_handle_irq+0x1b8/0x408
> call_on_irq_stack+0x24/0x30
> do_interrupt_handler+0x54/0x78
> el1_interrupt+0x44/0x88
> el1h_64_irq_handler+0x18/0x28
> el1h_64_irq+0x84/0x88
> stage2_free_walker+0x30/0xa0 (P)
> __kvm_pgtable_walk+0x11c/0x258
> __kvm_pgtable_walk+0x180/0x258
> __kvm_pgtable_walk+0x180/0x258
> __kvm_pgtable_walk+0x180/0x258
> kvm_pgtable_walk+0xc4/0x140
> kvm_pgtable_stage2_destroy+0x5c/0xf0
> kvm_free_stage2_pgd+0x6c/0xe8
> kvm_uninit_stage2_mmu+0x24/0x48
> kvm_arch_flush_shadow_all+0x80/0xa0
> kvm_mmu_notifier_release+0x38/0x78
> __mmu_notifier_release+0x15c/0x250
> exit_mmap+0x68/0x400
> __mmput+0x38/0x1c8
> mmput+0x30/0x68
> exit_mm+0xd4/0x198
> do_exit+0x1a4/0xb00
> do_group_exit+0x8c/0x120
> get_signal+0x6d4/0x778
> do_signal+0x90/0x718
> do_notify_resume+0x70/0x170
> el0_svc+0x74/0xd8
> el0t_64_sync_handler+0x60/0xc8
> el0t_64_sync+0x1b0/0x1b8
>
> The host kernel was running with CONFIG_PREEMPT_NONE=y, and since the
> page-table walk operation takes considerable amount of time for a VM
> with such a large number of PTEs mapped, the warning is seen.
>
> To mitigate this, split the walk into smaller ranges and call
> cond_resched() between each range. Since the path is executed during
> VM destruction, after the page-table structure is unlinked from the
> KVM MMU, relying on cond_resched_rwlock_write() isn't necessary.
>
> Patch-1 kills the assumption that the page-table hierarchy under the
> table is free (in stage2_free_walker()). Instead, drop and clear the
> references only on empty tables.
>
> Patch-2 splits the kvm_pgtable_stage2_destroy() function into separate
> 'walk' and 'free PGD' parts.
>
> Patch-3 leverages the split and performs the walk periodically over
> smaller ranges and calls cond_resched() between them.
>
> The series was originally posted and merged [1], but was later reverted
> due to syzkaller catching a UAF bug [2]. This series fixes the issue, and
> the original need_resched warning is addressed.
>
> [1]: https://lore.kernel.org/all/175582091313.1266576.4329884314263043118.b4-ty@linux.dev/
> [2]: https://lore.kernel.org/all/20250910180930.3679473-1-oliver.upton@linux.dev/
>
> Oliver Upton (1):
> KVM: arm64: Only drop references on empty tables in stage2_free_walker
>
> Raghavendra Rao Ananta (2):
> KVM: arm64: Split kvm_pgtable_stage2_destroy()
> KVM: arm64: Reschedule as needed when destroying the stage-2
> page-tables
>
> arch/arm64/include/asm/kvm_pgtable.h | 30 +++++++++++++
> arch/arm64/include/asm/kvm_pkvm.h | 4 +-
> arch/arm64/kvm/hyp/pgtable.c | 63 +++++++++++++++++++++++-----
> arch/arm64/kvm/mmu.c | 36 +++++++++++++++-
> arch/arm64/kvm/pkvm.c | 11 ++++-
> 5 files changed, 129 insertions(+), 15 deletions(-)
>
>
> base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
As a heads-up: I am suspecting this series to break my NV guests in a
pretty bad way. L2 and L3 guests are getting stuck, L0 and L1 barf on
S2 PTs that are being destroyed. This stinks of TLB invalidation going
very wrong, which would result in S2 management going similarly
sideways. I still need to work out whether that is just triggering
something bad somewhere else.
For what it is worth, this reproduces on both M2 and QC machines.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 6+ messages in thread