* [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations @ 2026-02-09 16:15 shaikh.kamal 2026-02-11 12:09 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 10+ messages in thread From: shaikh.kamal @ 2026-02-09 16:15 UTC (permalink / raw) To: kvm, linux-kernel, linux-rt-devel; +Cc: shaikh.kamal mmu_notifier_invalidate_range_start() may be invoked via mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(), where sleeping is explicitly forbidden. KVM's mmu_notifier invalidate_range_start currently takes mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps to rt_mutex and may sleep, triggering: BUG: sleeping function called from invalid context This violates the MMU notifier contract regardless of PREEMPT_RT; RT kernels merely make the issue deterministic. Fix by converting mn_invalidate_lock to a raw spinlock so that invalidate_range_start() remains non-sleeping while preserving the existing serialization between invalidate_range_start() and invalidate_range_end(). Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com> --- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 18 +++++++++--------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index d93f75b05ae2..77a6d4833eda 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -797,7 +797,7 @@ struct kvm { atomic_t nr_memslots_dirty_logging; /* Used to wait for completion of MMU notifiers. */ - spinlock_t mn_invalidate_lock; + raw_spinlock_t mn_invalidate_lock; unsigned long mn_active_invalidate_count; struct rcuwait mn_memslots_update_rcuwait; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 5fcd401a5897..7a9c33f01a37 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * * Pairs with the decrement in range_end(). 
*/ - spin_lock(&kvm->mn_invalidate_lock); + raw_spin_lock(&kvm->mn_invalidate_lock); kvm->mn_active_invalidate_count++; - spin_unlock(&kvm->mn_invalidate_lock); + raw_spin_unlock(&kvm->mn_invalidate_lock); /* * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e. @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, kvm_handle_hva_range(kvm, &hva_range); /* Pairs with the increment in range_start(). */ - spin_lock(&kvm->mn_invalidate_lock); + raw_spin_lock(&kvm->mn_invalidate_lock); if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count)) --kvm->mn_active_invalidate_count; wake = !kvm->mn_active_invalidate_count; - spin_unlock(&kvm->mn_invalidate_lock); + raw_spin_unlock(&kvm->mn_invalidate_lock); /* * There can only be one waiter, since the wait happens under @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) mutex_init(&kvm->irq_lock); mutex_init(&kvm->slots_lock); mutex_init(&kvm->slots_arch_lock); - spin_lock_init(&kvm->mn_invalidate_lock); + raw_spin_lock_init(&kvm->mn_invalidate_lock); rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id) * progress, otherwise the locking in invalidate_range_start and * invalidate_range_end will be unbalanced. 
*/ - spin_lock(&kvm->mn_invalidate_lock); + raw_spin_lock(&kvm->mn_invalidate_lock); prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait); while (kvm->mn_active_invalidate_count) { set_current_state(TASK_UNINTERRUPTIBLE); - spin_unlock(&kvm->mn_invalidate_lock); + raw_spin_unlock(&kvm->mn_invalidate_lock); schedule(); - spin_lock(&kvm->mn_invalidate_lock); + raw_spin_lock(&kvm->mn_invalidate_lock); } finish_rcuwait(&kvm->mn_memslots_update_rcuwait); rcu_assign_pointer(kvm->memslots[as_id], slots); - spin_unlock(&kvm->mn_invalidate_lock); + raw_spin_unlock(&kvm->mn_invalidate_lock); /* * Acquired in kvm_set_memslot. Must be released before synchronize -- 2.43.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal @ 2026-02-11 12:09 ` Sebastian Andrzej Siewior 2026-02-11 15:34 ` Sean Christopherson 0 siblings, 1 reply; 10+ messages in thread From: Sebastian Andrzej Siewior @ 2026-02-11 12:09 UTC (permalink / raw) To: shaikh.kamal; +Cc: kvm, linux-kernel, linux-rt-devel On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote: > mmu_notifier_invalidate_range_start() may be invoked via > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(), > where sleeping is explicitly forbidden. > > KVM's mmu_notifier invalidate_range_start currently takes > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps > to rt_mutex and may sleep, triggering: > > BUG: sleeping function called from invalid context > > This violates the MMU notifier contract regardless of PREEMPT_RT; RT > kernels merely make the issue deterministic. > > Fix by converting mn_invalidate_lock to a raw spinlock so that > invalidate_range_start() remains non-sleeping while preserving the > existing serialization between invalidate_range_start() and > invalidate_range_end(). > > Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> I don't see any down side doing this, but… > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 5fcd401a5897..7a9c33f01a37 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > * > * Pairs with the decrement in range_end(). 
> */ > - spin_lock(&kvm->mn_invalidate_lock); > + raw_spin_lock(&kvm->mn_invalidate_lock); > kvm->mn_active_invalidate_count++; > - spin_unlock(&kvm->mn_invalidate_lock); > + raw_spin_unlock(&kvm->mn_invalidate_lock); atomic_inc(mn_active_invalidate_count) > > /* > * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e. > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > kvm_handle_hva_range(kvm, &hva_range); > > /* Pairs with the increment in range_start(). */ > - spin_lock(&kvm->mn_invalidate_lock); > + raw_spin_lock(&kvm->mn_invalidate_lock); > if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count)) > --kvm->mn_active_invalidate_count; > wake = !kvm->mn_active_invalidate_count; wake = atomic_dec_return_safe(mn_active_invalidate_count); WARN_ON_ONCE(wake < 0); wake = !wake; > - spin_unlock(&kvm->mn_invalidate_lock); > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > /* > * There can only be one waiter, since the wait happens under > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id) > * progress, otherwise the locking in invalidate_range_start and > * invalidate_range_end will be unbalanced. > */ > - spin_lock(&kvm->mn_invalidate_lock); > + raw_spin_lock(&kvm->mn_invalidate_lock); > prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait); > while (kvm->mn_active_invalidate_count) { > set_current_state(TASK_UNINTERRUPTIBLE); > - spin_unlock(&kvm->mn_invalidate_lock); > + raw_spin_unlock(&kvm->mn_invalidate_lock); > schedule(); And this I don't understand. The lock protects the rcuwait assignment which would be needed if multiple waiters are possible. But this goes away after the unlock and schedule() here. So these things could be moved outside of the locked section which limits it only to the mn_active_invalidate_count value. 
> - spin_lock(&kvm->mn_invalidate_lock); > + raw_spin_lock(&kvm->mn_invalidate_lock); > } > finish_rcuwait(&kvm->mn_memslots_update_rcuwait); > rcu_assign_pointer(kvm->memslots[as_id], slots); > - spin_unlock(&kvm->mn_invalidate_lock); > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > /* > * Acquired in kvm_set_memslot. Must be released before synchronize Sebastian ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-02-11 12:09 ` Sebastian Andrzej Siewior @ 2026-02-11 15:34 ` Sean Christopherson 2026-03-03 18:49 ` shaikh kamaluddin 0 siblings, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2026-02-11 15:34 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: shaikh.kamal, kvm, linux-kernel, linux-rt-devel On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote: > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote: > > mmu_notifier_invalidate_range_start() may be invoked via > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(), > > where sleeping is explicitly forbidden. > > > > KVM's mmu_notifier invalidate_range_start currently takes > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps > > to rt_mutex and may sleep, triggering: > > > > BUG: sleeping function called from invalid context > > > > This violates the MMU notifier contract regardless of PREEMPT_RT; I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking that in invalidate_range_start() since e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map") which was a full decade before mmu_notifiers even added the blockable concept in 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") and even predates the current concept of a "raw" spinlock introduced by c2f21ce2e312 ("locking: Implement new raw_spinlock") > > RT kernels merely make the issue deterministic. No, RT kernels change the rules, because suddenly a non-sleeping lock becomes sleepable. > > Fix by converting mn_invalidate_lock to a raw spinlock so that > > invalidate_range_start() remains non-sleeping while preserving the > > existing serialization between invalidate_range_start() and > > invalidate_range_end(). This is insufficient.
To actually "fix" this in KVM mmu_lock would need to be turned into a raw lock on all KVM architectures. I suspect the only reason there haven't been bug reports is because no one trips an OOM kill on VM while running with CONFIG_DEBUG_ATOMIC_SLEEP=y. That combination is required because since commit 8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot") KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot, i.e. affects memory that may be mapped into the guest. E.g. this hack to simulate a non-blockable invalidation diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 7015edce5bd8..7a35a83420ec 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, .handler = kvm_mmu_unmap_gfn_range, .on_lock = kvm_mmu_invalidate_begin, .flush_on_ret = true, - .may_block = mmu_notifier_range_blockable(range), + .may_block = false,//mmu_notifier_range_blockable(range), }; trace_kvm_unmap_hva_range(range->start, range->end); @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, */ gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end); + non_block_start(); /* * If one or more memslots were found and thus zapped, notify arch code * that guest memory has been reclaimed. 
This needs to be done *after* @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, */ if (kvm_handle_hva_range(kvm, &hva_range).found_memslot) kvm_arch_guest_memory_reclaimed(kvm); + non_block_end(); return 0; } immediately triggers BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241 in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu preempt_count: 0, expected: 0 RCU nest depth: 0, expected: 0 CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x51/0x60 __might_resched+0x10e/0x160 rt_write_lock+0x49/0x310 kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm] __mmu_notifier_invalidate_range_start+0x9b/0x230 do_wp_page+0xce1/0xf30 __handle_mm_fault+0x380/0x3a0 handle_mm_fault+0xde/0x290 __get_user_pages+0x20d/0xbe0 get_user_pages_unlocked+0xf6/0x340 hva_to_pfn+0x295/0x420 [kvm] __kvm_faultin_pfn+0x5d/0x90 [kvm] kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm] kvm_tdp_page_fault+0xb6/0x160 [kvm] kvm_mmu_do_page_fault+0xee/0x1f0 [kvm] kvm_mmu_page_fault+0x8d/0x600 [kvm] vmx_handle_exit+0x18c/0x5a0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm] kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm] __x64_sys_ioctl+0x8a/0xd0 do_syscall_64+0x5e/0x11b0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> kvm: emulating exchange as write It's not at all clear to me that switching mmu_lock to a raw lock would be a net positive for PREEMPT_RT. OOM-killing a KVM guest in a PREEMPT_RT seems like a comically rare scenario. Whereas contending mmu_lock in normal operation is relatively common (assuming there are even use cases for running VMs with a PREEMPT_RT host kernel). 
In fact, the only reason the splat happens is because mmu_notifiers somewhat artificially forces an atomic context via non_block_start() since commit ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable") Given the massive amount of churn in KVM that would be required to fully eliminate the splat, and that it's not at all obvious that it would be a good change overall, at least for now: NAK I'm not fundamentally opposed to such a change, but there needs to be a _lot_ more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y". > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 5fcd401a5897..7a9c33f01a37 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > * > > * Pairs with the decrement in range_end(). > > */ > > - spin_lock(&kvm->mn_invalidate_lock); > > + raw_spin_lock(&kvm->mn_invalidate_lock); > > kvm->mn_active_invalidate_count++; > > - spin_unlock(&kvm->mn_invalidate_lock); > > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > atomic_inc(mn_active_invalidate_count) > > > > /* > > * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e. > > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > > kvm_handle_hva_range(kvm, &hva_range); > > > > /* Pairs with the increment in range_start(). 
*/ > > - spin_lock(&kvm->mn_invalidate_lock); > > + raw_spin_lock(&kvm->mn_invalidate_lock); > > if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count)) > > --kvm->mn_active_invalidate_count; > > wake = !kvm->mn_active_invalidate_count; > > wake = atomic_dec_return_safe(mn_active_invalidate_count); > WARN_ON_ONCE(wake < 0); > wake = !wake; > > > - spin_unlock(&kvm->mn_invalidate_lock); > > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > > > /* > > * There can only be one waiter, since the wait happens under > > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id) > > * progress, otherwise the locking in invalidate_range_start and > > * invalidate_range_end will be unbalanced. > > */ > > - spin_lock(&kvm->mn_invalidate_lock); > > + raw_spin_lock(&kvm->mn_invalidate_lock); > > prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait); > > while (kvm->mn_active_invalidate_count) { > > set_current_state(TASK_UNINTERRUPTIBLE); > > - spin_unlock(&kvm->mn_invalidate_lock); > > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > schedule(); > > And this I don't understand. The lock protects the rcuwait assignment > which would be needed if multiple waiters are possible. But this goes > away after the unlock and schedule() here. So these things could be > moved outside of the locked section which limits it only to the > mn_active_invalidate_count value. The implementation is essentially a deliberately unfair rwsem. The "write" side in kvm_swap_active_memslots() subtly protects this code: rcu_assign_pointer(kvm->memslots[as_id], slots); and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range(). KVM optimizes its mmu_notifier invalidation path to only take action if the to-be-invalidated range overlaps one or more memslots, i.e. affects memory that can be mapped into the guest.
The wrinkle with those optimizations is that KVM needs to prevent changes to the memslots between invalidation start() and end(), otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress will underflow or be left elevated and essentially hang the VM (among other bad things). So simply making mn_active_invalidate_count an atomic won't suffice, because KVM needs to block start() to ensure start()+end() see the exact same set of memslots. ^ permalink raw reply related [flat|nested] 10+ messages in thread
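The unfair-rwsem accounting described in the message above can be modelled in userspace. The sketch below is illustrative only: a pthread mutex and condition variable stand in for mn_invalidate_lock and the rcuwait, and the function names are simplifications of the kernel code, not KVM's actual API. Readers (start/end) only bump and drop a counter; the single writer drains it before publishing the new memslots.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace model of the "deliberately unfair rwsem"; names are
 * hypothetical stand-ins for the kernel's primitives. */
static pthread_mutex_t mn_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t mn_cv = PTHREAD_COND_INITIALIZER;
static unsigned long mn_active_invalidate_count;

/* "Read" side, as in invalidate_range_start(): only bump the count. */
static void invalidate_start(void)
{
	pthread_mutex_lock(&mn_lock);
	mn_active_invalidate_count++;
	pthread_mutex_unlock(&mn_lock);
}

/* Paired decrement, as in invalidate_range_end(); wake the single
 * possible waiter once no invalidation is in flight. */
static void invalidate_end(void)
{
	pthread_mutex_lock(&mn_lock);
	assert(mn_active_invalidate_count > 0);
	if (--mn_active_invalidate_count == 0)
		pthread_cond_signal(&mn_cv);
	pthread_mutex_unlock(&mn_lock);
}

/* "Write" side, as in kvm_swap_active_memslots(): wait for all
 * start()/end() pairs to drain, then publish the new memslots while
 * still holding the lock, so a concurrent start() sees either the old
 * or the new set, never a half-switched state. */
static void swap_memslots(void **active_slots, void *new_slots)
{
	pthread_mutex_lock(&mn_lock);
	while (mn_active_invalidate_count)
		pthread_cond_wait(&mn_cv, &mn_lock);
	*active_slots = new_slots; /* models rcu_assign_pointer() */
	pthread_mutex_unlock(&mn_lock);
}
```

Note that the writer publishes the new pointer while still holding the lock; that is the property a bare atomic counter cannot provide, since a start()/end() pair must observe one consistent set of memslots.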
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-02-11 15:34 ` Sean Christopherson @ 2026-03-03 18:49 ` shaikh kamaluddin 2026-03-06 16:42 ` Sean Christopherson 2026-03-06 18:14 ` Paolo Bonzini 0 siblings, 2 replies; 10+ messages in thread From: shaikh kamaluddin @ 2026-03-03 18:49 UTC (permalink / raw) To: Sean Christopherson Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote: > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote: > > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote: > > > mmu_notifier_invalidate_range_start() may be invoked via > > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(), > > > where sleeping is explicitly forbidden. > > > > > > KVM's mmu_notifier invalidate_range_start currently takes > > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps > > > to rt_mutex and may sleep, triggering: > > > > > > BUG: sleeping function called from invalid context > > > > > > This violates the MMU notifier contract regardless of PREEMPT_RT; > > I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking > that in invalidate_range_start() since > > e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map") > > which was a full decade before mmu_notifiers even added the blockable concept in > > 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") > > and even predate the current concept of a "raw" spinlock introduced by > > c2f21ce2e312 ("locking: Implement new raw_spinlock") > > > > RT kernels merely make the issue deterministic. > > No, RT kernels change the rules, because suddenly a non-sleeping locking becomes > sleepable. 
> > > > Fix by converting mn_invalidate_lock to a raw spinlock so that > > > invalidate_range_start() remains non-sleeping while preserving the > > > existing serialization between invalidate_range_start() and > > > invalidate_range_end(). > > This is insufficient. To actually "fix" this in KVM mmu_lock would need to be > turned into a raw lock on all KVM architectures. I suspect the only reason there > haven't been bug reports is because no one trips an OOM kill on VM while running > with CONFIG_DEBUG_ATOMIC_SLEEP=y. > > That combination is required because since commit > > 8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot") > > KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot, > i.e. affects memory that may be mapped into the guest. > > E.g. this hack to simulate a non-blockable invalidation > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 7015edce5bd8..7a35a83420ec 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > .handler = kvm_mmu_unmap_gfn_range, > .on_lock = kvm_mmu_invalidate_begin, > .flush_on_ret = true, > - .may_block = mmu_notifier_range_blockable(range), > + .may_block = false,//mmu_notifier_range_blockable(range), > }; > > trace_kvm_unmap_hva_range(range->start, range->end); > @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > */ > gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end); > > + non_block_start(); > /* > * If one or more memslots were found and thus zapped, notify arch code > * that guest memory has been reclaimed. 
This needs to be done *after* > @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > */ > if (kvm_handle_hva_range(kvm, &hva_range).found_memslot) > kvm_arch_guest_memory_reclaimed(kvm); > + non_block_end(); > > return 0; > } > > immediately triggers > > BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241 > in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu > preempt_count: 0, expected: 0 > RCU nest depth: 0, expected: 0 > CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > Call Trace: > <TASK> > dump_stack_lvl+0x51/0x60 > __might_resched+0x10e/0x160 > rt_write_lock+0x49/0x310 > kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm] > __mmu_notifier_invalidate_range_start+0x9b/0x230 > do_wp_page+0xce1/0xf30 > __handle_mm_fault+0x380/0x3a0 > handle_mm_fault+0xde/0x290 > __get_user_pages+0x20d/0xbe0 > get_user_pages_unlocked+0xf6/0x340 > hva_to_pfn+0x295/0x420 [kvm] > __kvm_faultin_pfn+0x5d/0x90 [kvm] > kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm] > kvm_tdp_page_fault+0xb6/0x160 [kvm] > kvm_mmu_do_page_fault+0xee/0x1f0 [kvm] > kvm_mmu_page_fault+0x8d/0x600 [kvm] > vmx_handle_exit+0x18c/0x5a0 [kvm_intel] > kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm] > kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm] > __x64_sys_ioctl+0x8a/0xd0 > do_syscall_64+0x5e/0x11b0 > entry_SYSCALL_64_after_hwframe+0x4b/0x53 > </TASK> > kvm: emulating exchange as write > > > It's not at all clear to me that switching mmu_lock to a raw lock would be a net > positive for PREEMPT_RT. OOM-killing a KVM guest in a PREEMPT_RT seems like a > comically rare scenario. Whereas contending mmu_lock in normal operation is > relatively common (assuming there are even use cases for running VMs with a > PREEMPT_RT host kernel). 
> > In fact, the only reason the splat happens is because mmu_notifiers somewhat > artificially forces an atomic context via non_block_start() since commit > > ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable") > > Given the massive amount of churn in KVM that would be required to fully eliminate > the splat, and that it's not at all obvious that it would be a good change overall, > at least for now: > > NAK > > I'm not fundamentally opposed to such a change, but there needs to be a _lot_ > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y". > Hi Sean, Thanks for the detailed explanation and for spelling out the broader issue. Understood on both points: 1. The changelog wording was too strong; PREEMPT_RT changes spin_lock() semantics, and the splat is fundamentally due to spinlocks becoming sleepable there. 2. Converting only mn_invalidate_lock to raw is insufficient since KVM can still take the mmu_lock (and other sleeping locks on RT) in invalidate_range_start() when the invalidation hits a memslot. Given the above, it sounds like "convert locks to raw" is not the right direction without significant rework and justification. Would an acceptable direction be to handle the !blockable notifier case by deferring the heavyweight invalidation work (anything that takes mmu_lock/may sleep on RT) to a context that may block (e.g. queued work), while keeping start()/end() accounting consistent with memslot changes? If so, I can prototype a patch along those lines and share for feedback. Alternatively, if you think this needs to be addressed in mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to redirect my efforts there; please advise.
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 5fcd401a5897..7a9c33f01a37 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > > * > > > * Pairs with the decrement in range_end(). > > > */ > > > - spin_lock(&kvm->mn_invalidate_lock); > > > + raw_spin_lock(&kvm->mn_invalidate_lock); > > > kvm->mn_active_invalidate_count++; > > > - spin_unlock(&kvm->mn_invalidate_lock); > > > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > > > atomic_inc(mn_active_invalidate_count) > > > > > > /* > > > * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e. > > > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, > > > kvm_handle_hva_range(kvm, &hva_range); > > > > > > /* Pairs with the increment in range_start(). */ > > > - spin_lock(&kvm->mn_invalidate_lock); > > > + raw_spin_lock(&kvm->mn_invalidate_lock); > > > if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count)) > > > --kvm->mn_active_invalidate_count; > > > wake = !kvm->mn_active_invalidate_count; > > > > wake = atomic_dec_return_safe(mn_active_invalidate_count); > > WARN_ON_ONCE(wake < 0); > > wake = !wake; > > > > > - spin_unlock(&kvm->mn_invalidate_lock); > > > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > > > > > /* > > > * There can only be one waiter, since the wait happens under > > > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > > > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id) > > > * progress, otherwise the locking in invalidate_range_start and > > > * invalidate_range_end will be unbalanced. 
> > > */ > > > - spin_lock(&kvm->mn_invalidate_lock); > > > + raw_spin_lock(&kvm->mn_invalidate_lock); > > > prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait); > > > while (kvm->mn_active_invalidate_count) { > > > set_current_state(TASK_UNINTERRUPTIBLE); > > > - spin_unlock(&kvm->mn_invalidate_lock); > > > + raw_spin_unlock(&kvm->mn_invalidate_lock); > > > schedule(); > > > > And this I don't understand. The lock protects the rcuwait assignment > > which would be needed if multiple waiters are possible. But this goes > > away after the unlock and schedule() here. So these things could be > > moved outside of the locked section which limits it only to the > > mn_active_invalidate_count value. > > The implementation is essentially a deliberately unfair rwsem. The "write" side > in kvm_swap_active_memslots() subtly protects this code: > > rcu_assign_pointer(kvm->memslots[as_id], slots); > > and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range(). > > KVM optimizes its mmu_notifier invalidation path to only take action if the > to-be-invalidated range overlaps one or more memslots, i.e. affects memory that > can be mapped into the guest. The wrinkle with those optimizations is that > KVM needs to prevent changes to the memslots between invalidation start() and end(), > otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress > will underflow or be left elevated and essentially hang the VM (among other bad > things). > > So simply making mn_active_invalidate_count an atomic won't suffice, because KVM > needs to block start() to ensure start()+end() see the exact same set of memslots. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-03-03 18:49 ` shaikh kamaluddin @ 2026-03-06 16:42 ` Sean Christopherson 2026-03-06 18:14 ` Paolo Bonzini 1 sibling, 0 replies; 10+ messages in thread From: Sean Christopherson @ 2026-03-06 16:42 UTC (permalink / raw) To: shaikh kamaluddin Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel On Wed, Mar 04, 2026, shaikh kamaluddin wrote: > On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote: > > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote: > > It's not at all clear to me that switching mmu_lock to a raw lock would be a net > > positive for PREEMPT_RT. OOM-killing a KVM guest in a PREEMPT_RT seems like a > > comically rare scenario. Whereas contending mmu_lock in normal operation is > > relatively common (assuming there are even use cases for running VMs with a > > PREEMPT_RT host kernel). > > > > In fact, the only reason the splat happens is because mmu_notifiers somewhat > > artificially forces an atomic context via non_block_start() since commit > > > > ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable") > > > > Given the massive amount of churn in KVM that would be required to fully eliminate > > the splat, and that it's not at all obvious that it would be a good change overall, > > at least for now: > > > > NAK > > > > I'm not fundamentally opposed to such a change, but there needs to be a _lot_ > > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y". > > > Hi Sean, > Thanks for the detailed explanation and for spelling out the border > issue. > Understood on both points: > 1. The changelog wording was too strong; PREEMPT_RT changes > spin_lock() semantics, and the splat is fundamentally due to > spinlocks becoming sleepable there. > 2. 
Converting only mn_invalidate_lock to raw is insufficient > since KVM can still take the mmu_lock (and other sleeping locks on > RT) in invalidate_range_start() when the invalidation hits a > memslot. > Given the above, it sounds like "convert locks to raw" is not the right > direction without significant rework and justification. > Would an acceptable direction be to handle the !blockable notifier case > by deferring the heavyweight invalidation work (anything that takes > mmu_lock/may sleep on RT) to a context that may block (e.g. queued work), > while keeping start()/end() accounting consistent with memslot changes? No, because the _only_ case where the invalidation is non-blockable is when the kernel is OOM-killing. Deferring the invalidations when we're OOM is likely to make the problem *worse*. That's the crux of my NAK. We'd be making KVM and kernel behavior worse to "fix" a largely hypothetical issue (OOM-killing a KVM guest in a RT kernel). ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-03-03 18:49 ` shaikh kamaluddin 2026-03-06 16:42 ` Sean Christopherson @ 2026-03-06 18:14 ` Paolo Bonzini 2026-03-12 19:24 ` shaikh kamaluddin 1 sibling, 1 reply; 10+ messages in thread From: Paolo Bonzini @ 2026-03-06 18:14 UTC (permalink / raw) To: shaikh kamaluddin, Sean Christopherson Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel On 3/3/26 19:49, shaikh kamaluddin wrote: > On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote: >> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote: >>> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote: >>>> mmu_notifier_invalidate_range_start() may be invoked via >>>> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(), >>>> where sleeping is explicitly forbidden. >>>> >>>> KVM's mmu_notifier invalidate_range_start currently takes >>>> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps >>>> to rt_mutex and may sleep, triggering: >>>> >>>> BUG: sleeping function called from invalid context >>>> >>>> This violates the MMU notifier contract regardless of PREEMPT_RT; >> >> I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking >> that in invalidate_range_start() since >> >> e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map") >> >> which was a full decade before mmu_notifiers even added the blockable concept in >> >> 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") >> >> and even predate the current concept of a "raw" spinlock introduced by >> >> c2f21ce2e312 ("locking: Implement new raw_spinlock") >> >>>> RT kernels merely make the issue deterministic. >> >> No, RT kernels change the rules, because suddenly a non-sleeping locking becomes >> sleepable. 
>> >>>> Fix by converting mn_invalidate_lock to a raw spinlock so that >>>> invalidate_range_start() remains non-sleeping while preserving the >>>> existing serialization between invalidate_range_start() and >>>> invalidate_range_end(). >> >> This is insufficient. To actually "fix" this in KVM mmu_lock would need to be >> turned into a raw lock on all KVM architectures. I suspect the only reason there >> haven't been bug reports is because no one trips an OOM kill on VM while running >> with CONFIG_DEBUG_ATOMIC_SLEEP=y. >> >> That combination is required because since commit >> >> 8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot") >> >> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot, >> i.e. affects memory that may be mapped into the guest. >> >> E.g. this hack to simulate a non-blockable invalidation >> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >> index 7015edce5bd8..7a35a83420ec 100644 >> --- a/virt/kvm/kvm_main.c >> +++ b/virt/kvm/kvm_main.c >> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, >> .handler = kvm_mmu_unmap_gfn_range, >> .on_lock = kvm_mmu_invalidate_begin, >> .flush_on_ret = true, >> - .may_block = mmu_notifier_range_blockable(range), >> + .may_block = false,//mmu_notifier_range_blockable(range), >> }; >> >> trace_kvm_unmap_hva_range(range->start, range->end); >> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, >> */ >> gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end); >> >> + non_block_start(); >> /* >> * If one or more memslots were found and thus zapped, notify arch code >> * that guest memory has been reclaimed. 
This needs to be done *after* >> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, >> */ >> if (kvm_handle_hva_range(kvm, &hva_range).found_memslot) >> kvm_arch_guest_memory_reclaimed(kvm); >> + non_block_end(); >> >> return 0; >> } >> >> immediately triggers >> >> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241 >> in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu >> preempt_count: 0, expected: 0 >> RCU nest depth: 0, expected: 0 >> CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >> Call Trace: >> <TASK> >> dump_stack_lvl+0x51/0x60 >> __might_resched+0x10e/0x160 >> rt_write_lock+0x49/0x310 >> kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm] >> __mmu_notifier_invalidate_range_start+0x9b/0x230 >> do_wp_page+0xce1/0xf30 >> __handle_mm_fault+0x380/0x3a0 >> handle_mm_fault+0xde/0x290 >> __get_user_pages+0x20d/0xbe0 >> get_user_pages_unlocked+0xf6/0x340 >> hva_to_pfn+0x295/0x420 [kvm] >> __kvm_faultin_pfn+0x5d/0x90 [kvm] >> kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm] >> kvm_tdp_page_fault+0xb6/0x160 [kvm] >> kvm_mmu_do_page_fault+0xee/0x1f0 [kvm] >> kvm_mmu_page_fault+0x8d/0x600 [kvm] >> vmx_handle_exit+0x18c/0x5a0 [kvm_intel] >> kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm] >> kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm] >> __x64_sys_ioctl+0x8a/0xd0 >> do_syscall_64+0x5e/0x11b0 >> entry_SYSCALL_64_after_hwframe+0x4b/0x53 >> </TASK> >> kvm: emulating exchange as write >> >> >> It's not at all clear to me that switching mmu_lock to a raw lock would be a net >> positive for PREEMPT_RT. OOM-killing a KVM guest in a PREEMPT_RT seems like a >> comically rare scenario. Whereas contending mmu_lock in normal operation is >> relatively common (assuming there are even use cases for running VMs with a >> PREEMPT_RT host kernel). 
>> >> In fact, the only reason the splat happens is because mmu_notifiers somewhat >> artificially forces an atomic context via non_block_start() since commit >> >> ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable") >> >> Given the massive amount of churn in KVM that would be required to fully eliminate >> the splat, and that it's not at all obvious that it would be a good change overall, >> at least for now: >> >> NAK >> >> I'm not fundamentally opposed to such a change, but there needs to be a _lot_ >> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y". >> > Hi Sean, > Thanks for the detailed explanation and for spelling out the broader > issue. > Understood on both points: > 1. The changelog wording was too strong; PREEMPT_RT changes > spin_lock() semantics, and the splat is fundamentally due to > spinlocks becoming sleepable there. > 2. Converting only mn_invalidate_lock to raw is insufficient > since KVM can still take the mmu_lock (and other sleeping locks on > RT) in invalidate_range_start() when the invalidation hits a > memslot. > Given the above, it sounds like "convert locks to raw" is not the right > direction without significant rework and justification. > Would an acceptable direction be to handle the !blockable notifier case > by deferring the heavyweight invalidation work (anything that takes > mmu_lock/may sleep on RT) to a context that may block (e.g. queued work), > while keeping start()/end() accounting consistent with memslot changes? > If so, I can prototype a patch along those lines and share for > feedback. > > Alternatively, if you think this needs to be addressed in > mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to > redirect my efforts there. Please advise. Have you considered a "OOM entered" callback for MMU notifiers? KVM's MMU notifier can just remove itself for example, in fact there is code in kvm_destroy_vm() to do that even if invalidations are unbalanced. 
Paolo ^ permalink raw reply [flat|nested] 10+ messages in thread
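The start()/end() accounting that this thread keeps returning to can be illustrated with a small userspace model (plain Python threads; purely a sketch, not kernel code — names mirror KVM's mn_invalidate_lock and mn_active_invalidate_count only for readability): invalidate_range_start() bumps a count under the lock, invalidate_range_end() drops it and wakes a waiter once nothing is in flight, and the memslot-update path waits for the count to hit zero.

```python
# Userspace model (NOT kernel code) of KVM's mn_active_invalidate_count
# accounting: range_start() increments under the lock, range_end()
# decrements and wakes the memslot updater when the count reaches zero.
import threading

class InvalidateTracker:
    def __init__(self):
        self.lock = threading.Lock()            # stands in for mn_invalidate_lock
        self.active = 0                          # mn_active_invalidate_count
        self.idle = threading.Condition(self.lock)

    def range_start(self):
        with self.lock:
            self.active += 1

    def range_end(self):
        with self.lock:
            assert self.active > 0, "unbalanced range_end()"
            self.active -= 1
            if self.active == 0:
                self.idle.notify()               # wake the memslot updater

    def wait_idle(self):
        # Analogue of kvm_swap_active_memslots() waiting until no
        # invalidation is in progress before swapping memslots.
        with self.lock:
            while self.active:
                self.idle.wait()

t = InvalidateTracker()
t.range_start()
t.range_start()
waiter_done = threading.Event()

def updater():
    t.wait_idle()
    waiter_done.set()

threading.Thread(target=updater, daemon=True).start()
t.range_end()
assert not waiter_done.wait(0.1)   # one invalidation still in flight
t.range_end()
assert waiter_done.wait(1)         # count hit zero, waiter released
print("ok")
```

The model shows why the lock itself only ever protects a counter update: the waiting happens in the memslot-update path, never inside the notifier callbacks themselves.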
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-03-06 18:14 ` Paolo Bonzini @ 2026-03-12 19:24 ` shaikh kamaluddin 2026-03-14 7:47 ` Paolo Bonzini 0 siblings, 1 reply; 10+ messages in thread From: shaikh kamaluddin @ 2026-03-12 19:24 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel On Fri, Mar 06, 2026 at 07:14:40PM +0100, Paolo Bonzini wrote: > On 3/3/26 19:49, shaikh kamaluddin wrote: > > On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote: > > > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote: > > > > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote: > > > > > mmu_notifier_invalidate_range_start() may be invoked via > > > > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(), > > > > > where sleeping is explicitly forbidden. > > > > > > > > > > KVM's mmu_notifier invalidate_range_start currently takes > > > > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps > > > > > to rt_mutex and may sleep, triggering: > > > > > > > > > > BUG: sleeping function called from invalid context > > > > > > > > > > This violates the MMU notifier contract regardless of PREEMPT_RT; > > > > > > I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking > > > that in invalidate_range_start() since > > > > > > e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map") > > > > > > which was a full decade before mmu_notifiers even added the blockable concept in > > > > > > 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") > > > > > > and even predate the current concept of a "raw" spinlock introduced by > > > > > > c2f21ce2e312 ("locking: Implement new raw_spinlock") > > > > > > > > RT kernels merely make the issue deterministic. 
> > > > > > No, RT kernels change the rules, because suddenly a non-sleeping locking becomes > > > sleepable. > > > > > > > > Fix by converting mn_invalidate_lock to a raw spinlock so that > > > > > invalidate_range_start() remains non-sleeping while preserving the > > > > > existing serialization between invalidate_range_start() and > > > > > invalidate_range_end(). > > > > > > This is insufficient. To actually "fix" this in KVM mmu_lock would need to be > > > turned into a raw lock on all KVM architectures. I suspect the only reason there > > > haven't been bug reports is because no one trips an OOM kill on VM while running > > > with CONFIG_DEBUG_ATOMIC_SLEEP=y. > > > > > > That combination is required because since commit > > > > > > 8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot") > > > > > > KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot, > > > i.e. affects memory that may be mapped into the guest. > > > > > > E.g. this hack to simulate a non-blockable invalidation > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 7015edce5bd8..7a35a83420ec 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > > .handler = kvm_mmu_unmap_gfn_range, > > > .on_lock = kvm_mmu_invalidate_begin, > > > .flush_on_ret = true, > > > - .may_block = mmu_notifier_range_blockable(range), > > > + .may_block = false,//mmu_notifier_range_blockable(range), > > > }; > > > trace_kvm_unmap_hva_range(range->start, range->end); > > > @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > > */ > > > gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end); > > > + non_block_start(); > > > /* > > > * If one or more memslots were found and thus zapped, notify arch code > > > * that guest memory has been reclaimed. 
This needs to be done *after* > > > @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > > */ > > > if (kvm_handle_hva_range(kvm, &hva_range).found_memslot) > > > kvm_arch_guest_memory_reclaimed(kvm); > > > + non_block_end(); > > > return 0; > > > } > > > > > > immediately triggers > > > > > > BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241 > > > in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu > > > preempt_count: 0, expected: 0 > > > RCU nest depth: 0, expected: 0 > > > CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > > > Call Trace: > > > <TASK> > > > dump_stack_lvl+0x51/0x60 > > > __might_resched+0x10e/0x160 > > > rt_write_lock+0x49/0x310 > > > kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm] > > > __mmu_notifier_invalidate_range_start+0x9b/0x230 > > > do_wp_page+0xce1/0xf30 > > > __handle_mm_fault+0x380/0x3a0 > > > handle_mm_fault+0xde/0x290 > > > __get_user_pages+0x20d/0xbe0 > > > get_user_pages_unlocked+0xf6/0x340 > > > hva_to_pfn+0x295/0x420 [kvm] > > > __kvm_faultin_pfn+0x5d/0x90 [kvm] > > > kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm] > > > kvm_tdp_page_fault+0xb6/0x160 [kvm] > > > kvm_mmu_do_page_fault+0xee/0x1f0 [kvm] > > > kvm_mmu_page_fault+0x8d/0x600 [kvm] > > > vmx_handle_exit+0x18c/0x5a0 [kvm_intel] > > > kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm] > > > kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm] > > > __x64_sys_ioctl+0x8a/0xd0 > > > do_syscall_64+0x5e/0x11b0 > > > entry_SYSCALL_64_after_hwframe+0x4b/0x53 > > > </TASK> > > > kvm: emulating exchange as write > > > > > > > > > It's not at all clear to me that switching mmu_lock to a raw lock would be a net > > > positive for PREEMPT_RT. OOM-killing a KVM guest in a PREEMPT_RT seems like a > > > comically rare scenario. 
Whereas contending mmu_lock in normal operation is > > > relatively common (assuming there are even use cases for running VMs with a > > > PREEMPT_RT host kernel). > > > > > > In fact, the only reason the splat happens is because mmu_notifiers somewhat > > > artificially forces an atomic context via non_block_start() since commit > > > > > > ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable") > > > > > > Given the massive amount of churn in KVM that would be required to fully eliminate > > > the splat, and that it's not at all obvious that it would be a good change overall, > > > at least for now: > > > > > > NAK > > > > > > I'm not fundamentally opposed to such a change, but there needs to be a _lot_ > > > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y". > > > > > Hi Sean, > > Thanks for the detailed explanation and for spelling out the broader > > issue. > > Understood on both points: > > 1. The changelog wording was too strong; PREEMPT_RT changes > > spin_lock() semantics, and the splat is fundamentally due to > > spinlocks becoming sleepable there. > > 2. Converting only mn_invalidate_lock to raw is insufficient > > since KVM can still take the mmu_lock (and other sleeping locks on > > RT) in invalidate_range_start() when the invalidation hits a > > memslot. > > Given the above, it sounds like "convert locks to raw" is not the right > > direction without significant rework and justification. > > Would an acceptable direction be to handle the !blockable notifier case > > by deferring the heavyweight invalidation work (anything that takes > > mmu_lock/may sleep on RT) to a context that may block (e.g. queued work), > > while keeping start()/end() accounting consistent with memslot changes? > > If so, I can prototype a patch along those lines and share for > > feedback. > > > > Alternatively, if you think this needs to be addressed in > > mmu_notifiers (e.g. 
how non_block_start() is applied), I'm happy to > > redirect my efforts there. Please advise. > > Have you considered a "OOM entered" callback for MMU notifiers? KVM's MMU > notifier can just remove itself for example, in fact there is code in > kvm_destroy_vm() to do that even if invalidations are unbalanced. > > Paolo > Thanks for the suggestion! That's a much cleaner approach than what I was considering. If I understand correctly, the idea would be: 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom) 2. Have KVM implement it to unregister the notifier when the OOM reaper starts 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations This avoids the whole "convert locks to raw" problem and the complexity of deferring work. I have a question on the testing part: ------------------------------------ I tried to reproduce the bug scenario by booting with virtme-ng and then running stress-ng to put memory pressure on the VM, but was not able to reproduce it. I tried this way: vng -v -r ./arch/x86/boot/bzImage Once the VM is up, run stress-ng as below: stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function" The OOM killer is triggered, but the exact bug does not reproduce. Please suggest how to reproduce it; we will also need a reproducer to verify the code changes you suggested. Shaikh Kamal ^ permalink raw reply [flat|nested] 10+ messages in thread
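The three-step plan above can be sketched as a toy registry model (Python; every name here is a hypothetical stand-in, not a real kernel API — the kernel's mmu_notifier_ops is a C struct of function pointers walked under SRCU). The one subtlety worth modeling is that a callback which unregisters its own notifier must not break the traversal of the notifier list:

```python
# Toy model (NOT kernel code) of an "oom_enter" notifier callback that
# removes its own notifier, as proposed in the thread. All names are
# illustrative sketches of the design.
class Notifier:
    def __init__(self, name, oom_enter=None):
        self.name = name
        self.oom_enter = oom_enter     # optional hook, like an ops callback

class Registry:
    def __init__(self):
        self.notifiers = []

    def register(self, n):
        self.notifiers.append(n)

    def unregister(self, n):
        self.notifiers.remove(n)

    def fire_oom_enter(self):
        # Iterate over a snapshot so a callback may unregister its own
        # notifier without corrupting the walk (the kernel's SRCU-protected
        # hlist walk provides an analogous guarantee).
        for n in list(self.notifiers):
            if n.oom_enter:
                n.oom_enter(self, n)

reg = Registry()
kvm = Notifier("kvm", oom_enter=lambda r, n: r.unregister(n))
other = Notifier("other")
reg.register(kvm)
reg.register(other)
reg.fire_oom_enter()
# KVM removed itself; later invalidations would skip it entirely.
assert [n.name for n in reg.notifiers] == ["other"]
print("ok")
```

Self-removal during the walk is exactly what "KVM's MMU notifier can just remove itself" requires, which is why the iteration strategy matters.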
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-03-12 19:24 ` shaikh kamaluddin @ 2026-03-14 7:47 ` Paolo Bonzini 2026-03-25 5:19 ` shaikh kamaluddin 0 siblings, 1 reply; 10+ messages in thread From: Paolo Bonzini @ 2026-03-14 7:47 UTC (permalink / raw) To: shaikh kamaluddin Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel On 3/12/26 20:24, shaikh kamaluddin wrote: >>> Alternatively, if you think this needs to be addressed in >>> mmu_notifiers(eg. how non_block_start() is applied), I'm happy to >>> redirect my efforts there-Please advise. >> >> Have you considered a "OOM entered" callback for MMU notifiers? KVM's MMU >> notifier can just remove itself for example, in fact there is code in >> kvm_destroy_vm() to do that even if invalidations are unbalanced. >> >> Paolo >> > Thanks for the suggestion! That's a much cleaner approach than what I was considering. > > If I understand correctly, the idea would be: > 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom) > 2. Have KVM implement it to unregister the notifier when OOM reaper starts > 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations Yes pretty much. Essentially, move the existing logic to the new callback and invoke it from kvm_destroy_vm(). > This avoids the whole "convert locks to raw" problem and the complexity of deferring work. > > I have questions on Testing part: > ------------------------------------ > I tried to reproduce the bug scenario using the virtme-ng then running > the stress-ng putting memory pressure on VM, but not able to reproduce > the scenario. > I tried this way .. 
> vng -v -r ./arch/x86/boot/bzImage > VM is up, then running the stress-ng as below > stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function" > OOM Killer is triggered, but exact bug not able to reproduce, Please > suggest how to reproduce this bug, even we need to verify after code > changes which you have suggested. I don't know, sorry. But with this new approach there will always be a call to the new callback from the OOM killer, so it's easier to test. Thanks, Paolo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-03-14 7:47 ` Paolo Bonzini @ 2026-03-25 5:19 ` shaikh kamaluddin 2026-03-26 18:23 ` Paolo Bonzini 0 siblings, 1 reply; 10+ messages in thread From: shaikh kamaluddin @ 2026-03-25 5:19 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel, skhan, me On Sat, Mar 14, 2026 at 08:47:40AM +0100, Paolo Bonzini wrote: > On 3/12/26 20:24, shaikh kamaluddin wrote: > > > > Alternatively, if you think this needs to be addressed in > > > > mmu_notifiers(eg. how non_block_start() is applied), I'm happy to > > > > redirect my efforts there-Please advise. > > > > > > Have you considered a "OOM entered" callback for MMU notifiers? KVM's MMU > > > notifier can just remove itself for example, in fact there is code in > > > kvm_destroy_vm() to do that even if invalidations are unbalanced. > > > > > > Paolo > > > > > Thanks for the suggestion! That's a much cleaner approach than what I was considering. > > > > If I understand correctly, the idea would be: > > 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom) > > 2. Have KVM implement it to unregister the notifier when OOM reaper starts > > 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations > > Yes pretty much. Essentially, move the existing logic to the new callback > and invoke it from kvm_destroy_vm(). > Hi Paolo, Thank you for the suggestion to use an oom_enter callback approach. I've implemented v2 based on your guidance and have successfully validated it. Implementation Summary: ------------------------------------- Following your recommendation, I've added a new oom_enter callback to the mmu_notifier_ops structure. The implementation: 1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h 2. 
Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks 3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process() in mm/oom_kill.c before any invalidations 4. As per your suggestion, moved the existing kvm_destroy_vm() logic that already handles unbalanced invalidations into a new helper, kvm_mmu_notifier_detach(), and invoked it from kvm_destroy_vm() Key Design Decision: ------------------------------ While testing implementation point 4, I encountered a recursive locking problem with the SRCU lock, which is acquired twice in the same context. This happens across the __mmu_notifier_oom_enter() and __synchronize_srcu() calls, leading to a potential deadlock. Please find below a log snippet from launching the guest VM ------------------------------------------------------------------------------------------------ [ 399.841599][T10882] OOM_REAPER: START reaping:func:__mmu_notifier_oom_enter [ 399.841608][T10882] KVM: oom_enter callback invoked for VM:kvm_mmu_notifier_oom_enter [ 399.841962][T10882] ============================================ [ 399.841964][T10882] WARNING: possible recursive locking detected [ 399.841966][T10882] 7.0.0-rc2-00467-g4ae12d8bd9a8-dirty #12 Not tainted [ 399.841969][T10882] -------------------------------------------- [ 399.841971][T10882] qemu-system-x86/10882 is trying to acquire lock: [ 399.841974][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x83/0x380 [ 399.841991][T10882] but task is already holding lock: [ 399.841992][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __mmu_notifier_oom_enter+0x93/0x1f0 [ 399.842005][T10882] other info that might help us debug this: [ 399.842006][T10882] Possible unsafe locking scenario: [ 399.842008][T10882] CPU0 [ 399.842009][T10882] ---- [ 399.842010][T10882] lock(srcu); [ 399.842014][T10882] lock(srcu); [ 399.842017][T10882] *** DEADLOCK *** [ 399.842018][T10882] May be due to missing lock nesting notation ------------------------------------------------------------------------------------------------------------------- Deferring kvm_mmu_notifier_detach() to a workqueue then fixed the issue. Testing: ------------- I've validated the v2 approach with: Kernel: v7.0-rc2 with PREEMPT_RT and DEBUG_ATOMIC_SLEEP enabled Test: Triggered OOM conditions that killed a QEMU process with an active KVM VM Use these commands to generate the scenario: 1. vng -v -r ./arch/x86/boot/bzImage --qemu-opts='-m 2G -cpu EPYC,+svm,+npt,+tsc,+invtsc -s ' After successfully booting virtme-ng (QEMU), which acts as the host VM: 2. 
chmod 666 /dev/kvm 3. dmesg -c > /dev/null 4. Launch the guest VM using this command: $qemu-system-x86_64 -enable-kvm -m 1000M -mem-prealloc \ -monitor none -serial none -display none -nographic & sleep 10 Results: ------------------- 1. oom_enter callback was successfully invoked 2. No SRCU deadlock warnings 3. No "sleeping function called from invalid context" warnings 4. OOM reaper completed successfully 5. Process was reaped without errors Question: Before I send the v2 patch series, I want to confirm this approach aligns with your expectations. Specifically: Is deferring the common helper kvm_mmu_notifier_detach() (used for mmu_notifier_unregister() and the unbalanced-invalidation cleanup) to a workqueue a good design? Are there any specific test cases or scenarios you'd like me to validate? I can send the complete v2 patch series once you confirm this approach is on the right track. Thanks again for the guidance! Shaikh Kamal > > This avoids the whole "convert locks to raw" problem and the complexity of deferring work. > > > > I have questions on Testing part: > > ------------------------------------ > > I tried to reproduce the bug scenario using the virtme-ng then running > > the stress-ng putting memory pressure on VM, but not able to reproduce > > the scenario. > > I tried this way .. > > vng -v -r ./arch/x86/boot/bzImage > > VM is up, then running the stress-ng as below > > stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function" > > OOM Killer is triggered, but exact bug not able to reproduce, Please > > suggest how to reproduce this bug, even we need to verify after code > > changes which you have suggested. > > I don't know, sorry. But with this new approach there will always be a call > to the new callback from the OOM killer, so it's easier to test. > > Thanks, > > Paolo > ^ permalink raw reply [flat|nested] 10+ messages in thread
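The recursive-locking report above, and the workqueue deferral tried as a fix, can be modeled in userspace (plain Python threads standing in for SRCU; purely illustrative — and note Paolo's reply below questions whether deferral is the right fix at all). The shape of the bug: the callback runs inside a read-side critical section, so it cannot itself wait for all readers to drain; a worker that synchronizes only after the read section exits can.

```python
# Toy model (NOT kernel code, NOT real SRCU) of the reported deadlock:
# synchronize() waits for all read-side critical sections, so calling it
# from inside one (as the inline detach did) can never make progress.
# Handing the wait off to a worker thread lets the read section exit first.
import threading

readers = 0
lock = threading.Lock()
no_readers = threading.Condition(lock)

def read_lock():
    global readers
    with lock:
        readers += 1

def read_unlock():
    global readers
    with lock:
        readers -= 1
        if readers == 0:
            no_readers.notify_all()

def synchronize():
    # Analogue of synchronize_srcu(): wait until no reader is active.
    with lock:
        while readers:
            no_readers.wait()

def detach_via_worker():
    done = threading.Event()
    read_lock()                      # we are inside the notifier's read section
    # Calling synchronize() inline here would block forever: we ARE a
    # reader (this is the lockdep "lock(srcu); lock(srcu);" scenario).
    # Instead, queue the wait, as the workqueue deferral does:
    w = threading.Thread(target=lambda: (synchronize(), done.set()))
    w.start()
    read_unlock()                    # read section ends, worker can proceed
    w.join(timeout=2)
    return done.is_set()

assert detach_via_worker()
print("deferred detach completed")
```

The model captures only the ordering constraint; as the follow-up in the thread notes, whether deferring the detach preserves the non-blocking guarantee the OOM path needs is a separate question.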
* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations 2026-03-25 5:19 ` shaikh kamaluddin @ 2026-03-26 18:23 ` Paolo Bonzini 0 siblings, 0 replies; 10+ messages in thread From: Paolo Bonzini @ 2026-03-26 18:23 UTC (permalink / raw) To: shaikh kamaluddin Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, Kernel Mailing List, Linux, linux-rt-devel, Shuah Khan, me Il mer 25 mar 2026, 06:19 shaikh kamaluddin <shaikhkamal2012@gmail.com> ha scritto: > > 1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h > 2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks > 3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations > 4. As per your suggestion, move existing kvm_destroy_vm() logic that already handles unbalanced invalidation to the new helper function kvm_mmu_notifier_detach() and invoke it from the kvm_destroy_vm() This is not fully clear to me... It could be caused by a recursive locking, or also a false positive. It's hard to say without seeing the full backtrace, but seeing "lock(srcu)" is suspicious. I wouldn't have expected deferral to be necessary; and it seems to me that, if you defer removal to some time after the OOM reaper starts, you'd have the same problem as before with sleeping spinlocks. Can you post the original patch without deferral? Paolo > > Key Design Decision: > ------------------------------ > Implementation point no 4, while testing, Issue I was encountering is a recursive locking problem with the srcu lock, which is being acquired twice in the same context. This happens during the __mmu_notifier_oom_enter() and __synchronize_srcu() calls, leading to a potential deadlock. > Please find below log snippet while launching the Guest VM ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-03-26 18:24 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal 2026-02-11 12:09 ` Sebastian Andrzej Siewior 2026-02-11 15:34 ` Sean Christopherson 2026-03-03 18:49 ` shaikh kamaluddin 2026-03-06 16:42 ` Sean Christopherson 2026-03-06 18:14 ` Paolo Bonzini 2026-03-12 19:24 ` shaikh kamaluddin 2026-03-14 7:47 ` Paolo Bonzini 2026-03-25 5:19 ` shaikh kamaluddin 2026-03-26 18:23 ` Paolo Bonzini