public inbox for linux-kernel@vger.kernel.org
* [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
@ 2026-02-09 16:15 shaikh.kamal
  2026-02-11 12:09 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 10+ messages in thread
From: shaikh.kamal @ 2026-02-09 16:15 UTC
  To: kvm, linux-kernel, linux-rt-devel; +Cc: shaikh.kamal

mmu_notifier_invalidate_range_start() may be invoked via
mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
where sleeping is explicitly forbidden.

KVM's mmu_notifier invalidate_range_start currently takes
mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
to rt_mutex and may sleep, triggering:

  BUG: sleeping function called from invalid context

This violates the MMU notifier contract regardless of PREEMPT_RT; RT
kernels merely make the issue deterministic.

Fix by converting mn_invalidate_lock to a raw spinlock so that
invalidate_range_start() remains non-sleeping while preserving the
existing serialization between invalidate_range_start() and
invalidate_range_end().

Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>
---
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 18 +++++++++---------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..77a6d4833eda 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -797,7 +797,7 @@ struct kvm {
 	atomic_t nr_memslots_dirty_logging;
 
 	/* Used to wait for completion of MMU notifiers.  */
-	spinlock_t mn_invalidate_lock;
+	raw_spinlock_t mn_invalidate_lock;
 	unsigned long mn_active_invalidate_count;
 	struct rcuwait mn_memslots_update_rcuwait;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5fcd401a5897..7a9c33f01a37 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 *
 	 * Pairs with the decrement in range_end().
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	kvm->mn_active_invalidate_count++;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
@@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
 		--kvm->mn_active_invalidate_count;
 	wake = !kvm->mn_active_invalidate_count;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * There can only be one waiter, since the wait happens under
@@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	mutex_init(&kvm->slots_arch_lock);
-	spin_lock_init(&kvm->mn_invalidate_lock);
+	raw_spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
@@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 	 * progress, otherwise the locking in invalidate_range_start and
 	 * invalidate_range_end will be unbalanced.
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
 	while (kvm->mn_active_invalidate_count) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_unlock(&kvm->mn_invalidate_lock);
+		raw_spin_unlock(&kvm->mn_invalidate_lock);
 		schedule();
-		spin_lock(&kvm->mn_invalidate_lock);
+		raw_spin_lock(&kvm->mn_invalidate_lock);
 	}
 	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
 	rcu_assign_pointer(kvm->memslots[as_id], slots);
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * Acquired in kvm_set_memslot. Must be released before synchronize
-- 
2.43.0



* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal
@ 2026-02-11 12:09 ` Sebastian Andrzej Siewior
  2026-02-11 15:34   ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-11 12:09 UTC
  To: shaikh.kamal; +Cc: kvm, linux-kernel, linux-rt-devel

On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> mmu_notifier_invalidate_range_start() may be invoked via
> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> where sleeping is explicitly forbidden.
> 
> KVM's mmu_notifier invalidate_range_start currently takes
> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> to rt_mutex and may sleep, triggering:
> 
>   BUG: sleeping function called from invalid context
> 
> This violates the MMU notifier contract regardless of PREEMPT_RT; RT
> kernels merely make the issue deterministic.
> 
> Fix by converting mn_invalidate_lock to a raw spinlock so that
> invalidate_range_start() remains non-sleeping while preserving the
> existing serialization between invalidate_range_start() and
> invalidate_range_end().
> 
> Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I don't see any downside to doing this, but…

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5fcd401a5897..7a9c33f01a37 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	 *
>  	 * Pairs with the decrement in range_end().
>  	 */
> -	spin_lock(&kvm->mn_invalidate_lock);
> +	raw_spin_lock(&kvm->mn_invalidate_lock);
>  	kvm->mn_active_invalidate_count++;
> -	spin_unlock(&kvm->mn_invalidate_lock);
> +	raw_spin_unlock(&kvm->mn_invalidate_lock);

	atomic_inc(mn_active_invalidate_count)
>  
>  	/*
>  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(kvm, &hva_range);
>  
>  	/* Pairs with the increment in range_start(). */
> -	spin_lock(&kvm->mn_invalidate_lock);
> +	raw_spin_lock(&kvm->mn_invalidate_lock);
>  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
>  		--kvm->mn_active_invalidate_count;
>  	wake = !kvm->mn_active_invalidate_count;

	wake = atomic_dec_return_safe(mn_active_invalidate_count);
	WARN_ON_ONCE(wake < 0);
	wake = !wake;

> -	spin_unlock(&kvm->mn_invalidate_lock);
> +	raw_spin_unlock(&kvm->mn_invalidate_lock);
>  
>  	/*
>  	 * There can only be one waiter, since the wait happens under
> @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
>  	 * progress, otherwise the locking in invalidate_range_start and
>  	 * invalidate_range_end will be unbalanced.
>  	 */
> -	spin_lock(&kvm->mn_invalidate_lock);
> +	raw_spin_lock(&kvm->mn_invalidate_lock);
>  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	while (kvm->mn_active_invalidate_count) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> -		spin_unlock(&kvm->mn_invalidate_lock);
> +		raw_spin_unlock(&kvm->mn_invalidate_lock);
>  		schedule();

And this I don't understand. The lock protects the rcuwait assignment
which would be needed if multiple waiters are possible. But this goes
away after the unlock and schedule() here. So these things could be
moved outside of the locked section which limits it only to the
mn_active_invalidate_count value.

> -		spin_lock(&kvm->mn_invalidate_lock);
> +		raw_spin_lock(&kvm->mn_invalidate_lock);
>  	}
>  	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> -	spin_unlock(&kvm->mn_invalidate_lock);
> +	raw_spin_unlock(&kvm->mn_invalidate_lock);
>  
>  	/*
>  	 * Acquired in kvm_set_memslot. Must be released before synchronize

Sebastian


* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-02-11 12:09 ` Sebastian Andrzej Siewior
@ 2026-02-11 15:34   ` Sean Christopherson
  2026-03-03 18:49     ` shaikh kamaluddin
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-02-11 15:34 UTC
  To: Sebastian Andrzej Siewior; +Cc: shaikh.kamal, kvm, linux-kernel, linux-rt-devel

On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > mmu_notifier_invalidate_range_start() may be invoked via
> > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > where sleeping is explicitly forbidden.
> > 
> > KVM's mmu_notifier invalidate_range_start currently takes
> > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> > to rt_mutex and may sleep, triggering:
> > 
> >   BUG: sleeping function called from invalid context
> > 
> > This violates the MMU notifier contract regardless of PREEMPT_RT;

I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
that in invalidate_range_start() since

  e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")

which was a full decade before mmu_notifiers even added the blockable concept in

  93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")

and even predates the current concept of a "raw" spinlock introduced by

  c2f21ce2e312 ("locking: Implement new raw_spinlock")

> > RT kernels merely make the issue deterministic.

No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
sleepable.

> > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > invalidate_range_start() remains non-sleeping while preserving the
> > existing serialization between invalidate_range_start() and
> > invalidate_range_end().

This is insufficient.  To actually "fix" this in KVM, mmu_lock would need to be
turned into a raw lock on all KVM architectures.  I suspect the only reason there
haven't been bug reports is that no one trips an OOM kill on a VM while running
with CONFIG_DEBUG_ATOMIC_SLEEP=y.

That combination is required because since commit

  8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")

KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
i.e. affects memory that may be mapped into the guest.

E.g. this hack to simulate a non-blockable invalidation

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7015edce5bd8..7a35a83420ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                .handler        = kvm_mmu_unmap_gfn_range,
                .on_lock        = kvm_mmu_invalidate_begin,
                .flush_on_ret   = true,
-               .may_block      = mmu_notifier_range_blockable(range),
+               .may_block      = false,//mmu_notifier_range_blockable(range),
        };
 
        trace_kvm_unmap_hva_range(range->start, range->end);
@@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
         */
        gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
 
+       non_block_start();
        /*
         * If one or more memslots were found and thus zapped, notify arch code
         * that guest memory has been reclaimed.  This needs to be done *after*
@@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
         */
        if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
                kvm_arch_guest_memory_reclaimed(kvm);
+       non_block_end();
 
        return 0;
 }

immediately triggers

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
  in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
  preempt_count: 0, expected: 0
  RCU nest depth: 0, expected: 0
  CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT 
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x51/0x60
   __might_resched+0x10e/0x160
   rt_write_lock+0x49/0x310
   kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
   __mmu_notifier_invalidate_range_start+0x9b/0x230
   do_wp_page+0xce1/0xf30
   __handle_mm_fault+0x380/0x3a0
   handle_mm_fault+0xde/0x290
   __get_user_pages+0x20d/0xbe0
   get_user_pages_unlocked+0xf6/0x340
   hva_to_pfn+0x295/0x420 [kvm]
   __kvm_faultin_pfn+0x5d/0x90 [kvm]
   kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
   kvm_tdp_page_fault+0xb6/0x160 [kvm]
   kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
   kvm_mmu_page_fault+0x8d/0x600 [kvm]
   vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
   kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0x5e/0x11b0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  kvm: emulating exchange as write


It's not at all clear to me that switching mmu_lock to a raw lock would be a net
positive for PREEMPT_RT.  OOM-killing a KVM guest on a PREEMPT_RT kernel seems like
a comically rare scenario, whereas contending mmu_lock in normal operation is
relatively common (assuming there are even use cases for running VMs with a
PREEMPT_RT host kernel).

In fact, the only reason the splat happens is that mmu_notifiers somewhat
artificially force an atomic context via non_block_start() since commit

  ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")

Given the massive amount of churn in KVM that would be required to fully eliminate
the splat, and that it's not at all obvious that it would be a good change overall,
at least for now:

NAK

I'm not fundamentally opposed to such a change, but there needs to be a _lot_
more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
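
As an illustration of the debug check behind the splat, here is a minimal
userspace sketch (not kernel code).  It assumes the check boils down to a
per-task counter that non_block_start() raises and that the might-sleep path
consults, which is what the "non_block: 1" field in the splat suggests.  All
model_* names and the splats counter are hypothetical and exist only for this
example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Single-threaded model of per-task debug state.  Both variables are
 * hypothetical stand-ins, loosely modelled on a per-task non_block counter.
 */
static unsigned int non_block_count;	/* > 0 while blocking is forbidden */
static unsigned int splats;		/* number of simulated "BUG: sleeping..." reports */

/* Enter a region in which blocking is forbidden. */
void model_non_block_start(void)
{
	non_block_count++;
}

/* Leave the region; unbalanced calls are a bug in the caller. */
void model_non_block_end(void)
{
	assert(non_block_count > 0);
	non_block_count--;
}

/* Check run before an operation that may sleep; false means "would splat". */
bool model_might_sleep(void)
{
	if (non_block_count) {
		splats++;
		return false;
	}
	return true;
}

/*
 * Assumption: with PREEMPT_RT, taking a spinlock_t or rwlock_t can sleep,
 * so the acquisition path runs the might-sleep check first.
 */
void model_rt_spin_lock(void)
{
	model_might_sleep();
}

/* A raw spinlock never sleeps, so no check is run. */
void model_raw_spin_lock(void)
{
}
```

In this model a raw lock taken between model_non_block_start() and
model_non_block_end() is silent, while an RT-style sleeping lock is reported,
which mirrors why converting a single lock does not help as long as mmu_lock
is still taken in the same region.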

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 5fcd401a5897..7a9c33f01a37 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	 *
> >  	 * Pairs with the decrement in range_end().
> >  	 */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	kvm->mn_active_invalidate_count++;
> > -	spin_unlock(&kvm->mn_invalidate_lock);
> > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> 
> 	atomic_inc(mn_active_invalidate_count)
> >  
> >  	/*
> >  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(kvm, &hva_range);
> >  
> >  	/* Pairs with the increment in range_start(). */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
> >  		--kvm->mn_active_invalidate_count;
> >  	wake = !kvm->mn_active_invalidate_count;
> 
> 	wake = atomic_dec_return_safe(mn_active_invalidate_count);
> 	WARN_ON_ONCE(wake < 0);
> 	wake = !wake;
> 
> > -	spin_unlock(&kvm->mn_invalidate_lock);
> > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> >  
> >  	/*
> >  	 * There can only be one waiter, since the wait happens under
> > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
> >  	 * progress, otherwise the locking in invalidate_range_start and
> >  	 * invalidate_range_end will be unbalanced.
> >  	 */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> >  	while (kvm->mn_active_invalidate_count) {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > -		spin_unlock(&kvm->mn_invalidate_lock);
> > +		raw_spin_unlock(&kvm->mn_invalidate_lock);
> >  		schedule();
> 
> And this I don't understand. The lock protects the rcuwait assignment
> which would be needed if multiple waiters are possible. But this goes
> away after the unlock and schedule() here. So these things could be
> moved outside of the locked section which limits it only to the
> mn_active_invalidate_count value.

The implementation is essentially a deliberately unfair rwsem.  The "write" side
in kvm_swap_active_memslots() subtly protects this code:

  rcu_assign_pointer(kvm->memslots[as_id], slots);

and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range().

KVM optimizes its mmu_notifier invalidation path to only take action if the
to-be-invalidated range overlaps one or more memslots, i.e. affects memory that
can be mapped into the guest.  The wrinkle with those optimizations is that
KVM needs to prevent changes to the memslots between invalidation start() and end(),
otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress
will underflow or be left elevated and essentially hang the VM (among other bad
things).

So simply making mn_active_invalidate_count an atomic won't suffice, because KVM
needs to block start() to ensure start()+end() see the exact same set of memslots.
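
The scheme described above can be sketched in userspace as follows.  This is
only a model under stated assumptions, not KVM's implementation: a pthread
mutex stands in for mn_invalidate_lock, a condition variable stands in for the
rcuwait, and struct vm_model, struct slots and every function name are
hypothetical.  The point it tries to show is that the count check and the
pointer swap happen under one lock hold, so no start() can slip in between
them, which a bare atomic counter would not guarantee.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a set of memslots. */
struct slots {
	int generation;
};

/* Hypothetical stand-in for the relevant parts of struct kvm. */
struct vm_model {
	pthread_mutex_t lock;	/* role of mn_invalidate_lock */
	pthread_cond_t idle;	/* role of mn_memslots_update_rcuwait */
	unsigned long active;	/* role of mn_active_invalidate_count */
	struct slots *slots;	/* role of kvm->memslots[as_id] */
};

void vm_model_init(struct vm_model *vm, struct slots *initial)
{
	pthread_mutex_init(&vm->lock, NULL);
	pthread_cond_init(&vm->idle, NULL);
	vm->active = 0;
	vm->slots = initial;
}

/*
 * "Read" side entry.  It never waits for a pending writer, which is what
 * makes the scheme unfair.  Returns the slots visible at start().
 */
struct slots *invalidate_start(struct vm_model *vm)
{
	struct slots *seen;

	pthread_mutex_lock(&vm->lock);
	vm->active++;
	seen = vm->slots;
	pthread_mutex_unlock(&vm->lock);
	return seen;
}

/*
 * "Read" side exit.  Returns the slots visible at end() and wakes the
 * writer once the last in-flight invalidation has finished.
 */
struct slots *invalidate_end(struct vm_model *vm)
{
	struct slots *seen;

	pthread_mutex_lock(&vm->lock);
	seen = vm->slots;
	assert(vm->active > 0);
	if (--vm->active == 0)
		pthread_cond_signal(&vm->idle);
	pthread_mutex_unlock(&vm->lock);
	return seen;
}

/*
 * "Write" side.  Waits until no invalidation is in flight, then installs
 * the new slots while still holding the lock.
 */
void swap_slots(struct vm_model *vm, struct slots *next)
{
	pthread_mutex_lock(&vm->lock);
	while (vm->active)
		pthread_cond_wait(&vm->idle, &vm->lock);
	vm->slots = next;
	pthread_mutex_unlock(&vm->lock);
}
```

Because the writer can only install new slots while the count is zero, the
pointer returned by invalidate_start() is the same one a matching
invalidate_end() observes.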


* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-02-11 15:34   ` Sean Christopherson
@ 2026-03-03 18:49     ` shaikh kamaluddin
  2026-03-06 16:42       ` Sean Christopherson
  2026-03-06 18:14       ` Paolo Bonzini
  0 siblings, 2 replies; 10+ messages in thread
From: shaikh kamaluddin @ 2026-03-03 18:49 UTC
  To: Sean Christopherson
  Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel

On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > > mmu_notifier_invalidate_range_start() may be invoked via
> > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > > where sleeping is explicitly forbidden.
> > > 
> > > KVM's mmu_notifier invalidate_range_start currently takes
> > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> > > to rt_mutex and may sleep, triggering:
> > > 
> > >   BUG: sleeping function called from invalid context
> > > 
> > > This violates the MMU notifier contract regardless of PREEMPT_RT;
> 
> I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
> that in invalidate_range_start() since
> 
>   e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
> 
> which was a full decade before mmu_notifiers even added the blockable concept in
> 
>   93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
> 
> and even predates the current concept of a "raw" spinlock introduced by
> 
>   c2f21ce2e312 ("locking: Implement new raw_spinlock")
> 
> > > RT kernels merely make the issue deterministic.
> 
> No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
> sleepable.
> 
> > > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > > invalidate_range_start() remains non-sleeping while preserving the
> > > existing serialization between invalidate_range_start() and
> > > invalidate_range_end().
> 
> This is insufficient.  To actually "fix" this in KVM, mmu_lock would need to be
> turned into a raw lock on all KVM architectures.  I suspect the only reason there
> haven't been bug reports is that no one trips an OOM kill on a VM while running
> with CONFIG_DEBUG_ATOMIC_SLEEP=y.
> 
> That combination is required because since commit
> 
>   8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
> 
> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
> i.e. affects memory that may be mapped into the guest.
> 
> E.g. this hack to simulate a non-blockable invalidation
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7015edce5bd8..7a35a83420ec 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .handler        = kvm_mmu_unmap_gfn_range,
>                 .on_lock        = kvm_mmu_invalidate_begin,
>                 .flush_on_ret   = true,
> -               .may_block      = mmu_notifier_range_blockable(range),
> +               .may_block      = false,//mmu_notifier_range_blockable(range),
>         };
>  
>         trace_kvm_unmap_hva_range(range->start, range->end);
> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>          */
>         gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
>  
> +       non_block_start();
>         /*
>          * If one or more memslots were found and thus zapped, notify arch code
>          * that guest memory has been reclaimed.  This needs to be done *after*
> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>          */
>         if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>                 kvm_arch_guest_memory_reclaimed(kvm);
> +       non_block_end();
>  
>         return 0;
>  }
> 
> immediately triggers
> 
>   BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
>   in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
>   preempt_count: 0, expected: 0
>   RCU nest depth: 0, expected: 0
>   CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT 
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x51/0x60
>    __might_resched+0x10e/0x160
>    rt_write_lock+0x49/0x310
>    kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
>    __mmu_notifier_invalidate_range_start+0x9b/0x230
>    do_wp_page+0xce1/0xf30
>    __handle_mm_fault+0x380/0x3a0
>    handle_mm_fault+0xde/0x290
>    __get_user_pages+0x20d/0xbe0
>    get_user_pages_unlocked+0xf6/0x340
>    hva_to_pfn+0x295/0x420 [kvm]
>    __kvm_faultin_pfn+0x5d/0x90 [kvm]
>    kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
>    kvm_tdp_page_fault+0xb6/0x160 [kvm]
>    kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
>    kvm_mmu_page_fault+0x8d/0x600 [kvm]
>    vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
>    kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
>    kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
>    __x64_sys_ioctl+0x8a/0xd0
>    do_syscall_64+0x5e/0x11b0
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   kvm: emulating exchange as write
> 
> 
> It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> positive for PREEMPT_RT.  OOM-killing a KVM guest on a PREEMPT_RT kernel seems like
> a comically rare scenario, whereas contending mmu_lock in normal operation is
> relatively common (assuming there are even use cases for running VMs with a
> PREEMPT_RT host kernel).
> 
> In fact, the only reason the splat happens is that mmu_notifiers somewhat
> artificially force an atomic context via non_block_start() since commit
> 
>   ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> 
> Given the massive amount of churn in KVM that would be required to fully eliminate
> the splat, and that it's not at all obvious that it would be a good change overall,
> at least for now:
> 
> NAK
> 
> I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
>
Hi Sean,

Thanks for the detailed explanation and for spelling out the broader
issue.

Understood on both points:
	1. The changelog wording was too strong; PREEMPT_RT changes
	spin_lock() semantics, and the splat is fundamentally due to
	spinlocks becoming sleepable there.
	2. Converting only mn_invalidate_lock to raw is insufficient,
	since KVM can still take mmu_lock (and other locks that sleep
	on RT) in invalidate_range_start() when the invalidation hits a
	memslot.

Given the above, it sounds like "convert locks to raw" is not the right
direction without significant rework and justification.

Would an acceptable direction be to handle the !blockable notifier case
by deferring the heavyweight invalidation work (anything that takes
mmu_lock or may sleep on RT) to a context that may block (e.g. queued
work), while keeping start()/end() accounting consistent with memslot
changes? If so, I can prototype a patch along those lines and share it
for feedback.

Alternatively, if you think this needs to be addressed in
mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to
redirect my efforts there. Please advise.
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 5fcd401a5897..7a9c33f01a37 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >  	 *
> > >  	 * Pairs with the decrement in range_end().
> > >  	 */
> > > -	spin_lock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> > >  	kvm->mn_active_invalidate_count++;
> > > -	spin_unlock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> > 
> > 	atomic_inc(mn_active_invalidate_count)
> > >  
> > >  	/*
> > >  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> > > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > >  	kvm_handle_hva_range(kvm, &hva_range);
> > >  
> > >  	/* Pairs with the increment in range_start(). */
> > > -	spin_lock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> > >  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
> > >  		--kvm->mn_active_invalidate_count;
> > >  	wake = !kvm->mn_active_invalidate_count;
> > 
> > 	wake = atomic_dec_return_safe(mn_active_invalidate_count);
> > 	WARN_ON_ONCE(wake < 0);
> > 	wake = !wake;
> > 
> > > -	spin_unlock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> > >  
> > >  	/*
> > >  	 * There can only be one waiter, since the wait happens under
> > > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
> > >  	 * progress, otherwise the locking in invalidate_range_start and
> > >  	 * invalidate_range_end will be unbalanced.
> > >  	 */
> > > -	spin_lock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> > >  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> > >  	while (kvm->mn_active_invalidate_count) {
> > >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > > -		spin_unlock(&kvm->mn_invalidate_lock);
> > > +		raw_spin_unlock(&kvm->mn_invalidate_lock);
> > >  		schedule();
> > 
> > And this I don't understand. The lock protects the rcuwait assignment
> > which would be needed if multiple waiters are possible. But this goes
> > away after the unlock and schedule() here. So these things could be
> > moved outside of the locked section which limits it only to the
> > mn_active_invalidate_count value.
> 
> The implementation is essentially a deliberately unfair rwsem.  The "write" side
> in kvm_swap_active_memslots() subtly protects this code:
> 
>   rcu_assign_pointer(kvm->memslots[as_id], slots);
> 
> and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range().
> 
> KVM optimizes its mmu_notifier invalidation path to only take action if the
> to-be-invalidated range overlaps one or more memslots, i.e. affects memory that
> can be mapped into the guest.  The wrinkle with those optimizations is that
> KVM needs to prevent changes to the memslots between invalidation start() and end(),
> otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress
> will underflow or be left elevated and essentially hang the VM (among other bad
> things).
> 
> So simply making mn_active_invalidate_count an atomic won't suffice, because KVM
> needs to block start() to ensure start()+end() see the exact same set of memslots.


* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-03 18:49     ` shaikh kamaluddin
@ 2026-03-06 16:42       ` Sean Christopherson
  2026-03-06 18:14       ` Paolo Bonzini
  1 sibling, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2026-03-06 16:42 UTC
  To: shaikh kamaluddin
  Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel

On Wed, Mar 04, 2026, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> > positive for PREEMPT_RT.  OOM-killing a KVM guest on a PREEMPT_RT kernel seems like
> > a comically rare scenario, whereas contending mmu_lock in normal operation is
> > relatively common (assuming there are even use cases for running VMs with a
> > PREEMPT_RT host kernel).
> > 
> > In fact, the only reason the splat happens is that mmu_notifiers somewhat
> > artificially force an atomic context via non_block_start() since commit
> > 
> >   ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> > 
> > Given the massive amount of churn in KVM that would be required to fully eliminate
> > the splat, and that it's not at all obvious that it would be a good change overall,
> > at least for now:
> > 
> > NAK
> > 
> > I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
> >
> Hi Sean,
>
> Thanks for the detailed explanation and for spelling out the broader
> issue.
>
> Understood on both points:
> 	1. The changelog wording was too strong; PREEMPT_RT changes
> 	spin_lock() semantics, and the splat is fundamentally due to
> 	spinlocks becoming sleepable there.
> 	2. Converting only mn_invalidate_lock to raw is insufficient,
> 	since KVM can still take mmu_lock (and other locks that sleep
> 	on RT) in invalidate_range_start() when the invalidation hits a
> 	memslot.
>
> Given the above, it sounds like "convert locks to raw" is not the right
> direction without significant rework and justification.
>
> Would an acceptable direction be to handle the !blockable notifier case
> by deferring the heavyweight invalidation work (anything that takes
> mmu_lock or may sleep on RT) to a context that may block (e.g. queued
> work), while keeping start()/end() accounting consistent with memslot
> changes?

No, because the _only_ case where the invalidation is non-blockable is when the
kernel is OOM-killing.  Deferring the invalidations when we're OOM is likely to
make the problem *worse*.

That's the crux of my NAK.  We'd be making KVM and kernel behavior worse to "fix"
a largely hypothetical issue (OOM-killing a KVM guest in a RT kernel).

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-03 18:49     ` shaikh kamaluddin
  2026-03-06 16:42       ` Sean Christopherson
@ 2026-03-06 18:14       ` Paolo Bonzini
  2026-03-12 19:24         ` shaikh kamaluddin
  1 sibling, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2026-03-06 18:14 UTC (permalink / raw)
  To: shaikh kamaluddin, Sean Christopherson
  Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel

On 3/3/26 19:49, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
>> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
>>> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
>>>> mmu_notifier_invalidate_range_start() may be invoked via
>>>> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
>>>> where sleeping is explicitly forbidden.
>>>>
>>>> KVM's mmu_notifier invalidate_range_start currently takes
>>>> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
>>>> to rt_mutex and may sleep, triggering:
>>>>
>>>>    BUG: sleeping function called from invalid context
>>>>
>>>> This violates the MMU notifier contract regardless of PREEMPT_RT;
>>
>> I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
>> that in invalidate_range_start() since
>>
>>    e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
>>
>> which was a full decade before mmu_notifiers even added the blockable concept in
>>
>>    93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
>>
>> and even predates the current concept of a "raw" spinlock introduced by
>>
>>    c2f21ce2e312 ("locking: Implement new raw_spinlock")
>>
>>>> RT kernels merely make the issue deterministic.
>>
>> No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
>> sleepable.
>>
>>>> Fix by converting mn_invalidate_lock to a raw spinlock so that
>>>> invalidate_range_start() remains non-sleeping while preserving the
>>>> existing serialization between invalidate_range_start() and
>>>> invalidate_range_end().
>>
>> This is insufficient.  To actually "fix" this in KVM mmu_lock would need to be
>> turned into a raw lock on all KVM architectures.  I suspect the only reason there
>> haven't been bug reports is because no one trips an OOM kill on VM while running
>> with CONFIG_DEBUG_ATOMIC_SLEEP=y.
>>
>> That combination is required because since commit
>>
>>    8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
>>
>> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
>> i.e. affects memory that may be mapped into the guest.
>>
>> E.g. this hack to simulate a non-blockable invalidation
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7015edce5bd8..7a35a83420ec 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>                  .handler        = kvm_mmu_unmap_gfn_range,
>>                  .on_lock        = kvm_mmu_invalidate_begin,
>>                  .flush_on_ret   = true,
>> -               .may_block      = mmu_notifier_range_blockable(range),
>> +               .may_block      = false,//mmu_notifier_range_blockable(range),
>>          };
>>   
>>          trace_kvm_unmap_hva_range(range->start, range->end);
>> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>           */
>>          gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
>>   
>> +       non_block_start();
>>          /*
>>           * If one or more memslots were found and thus zapped, notify arch code
>>           * that guest memory has been reclaimed.  This needs to be done *after*
>> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>           */
>>          if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>>                  kvm_arch_guest_memory_reclaimed(kvm);
>> +       non_block_end();
>>   
>>          return 0;
>>   }
>>
>> immediately triggers
>>
>>    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
>>    in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
>>    preempt_count: 0, expected: 0
>>    RCU nest depth: 0, expected: 0
>>    CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
>>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>    Call Trace:
>>     <TASK>
>>     dump_stack_lvl+0x51/0x60
>>     __might_resched+0x10e/0x160
>>     rt_write_lock+0x49/0x310
>>     kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
>>     __mmu_notifier_invalidate_range_start+0x9b/0x230
>>     do_wp_page+0xce1/0xf30
>>     __handle_mm_fault+0x380/0x3a0
>>     handle_mm_fault+0xde/0x290
>>     __get_user_pages+0x20d/0xbe0
>>     get_user_pages_unlocked+0xf6/0x340
>>     hva_to_pfn+0x295/0x420 [kvm]
>>     __kvm_faultin_pfn+0x5d/0x90 [kvm]
>>     kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
>>     kvm_tdp_page_fault+0xb6/0x160 [kvm]
>>     kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
>>     kvm_mmu_page_fault+0x8d/0x600 [kvm]
>>     vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
>>     kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
>>     kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
>>     __x64_sys_ioctl+0x8a/0xd0
>>     do_syscall_64+0x5e/0x11b0
>>     entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>     </TASK>
>>    kvm: emulating exchange as write
>>
>>
>> It's not at all clear to me that switching mmu_lock to a raw lock would be a net
>> positive for PREEMPT_RT.  OOM-killing a KVM guest on a PREEMPT_RT kernel seems like a
>> comically rare scenario.  Whereas contending mmu_lock in normal operation is
>> relatively common (assuming there are even use cases for running VMs with a
>> PREEMPT_RT host kernel).
>>
>> In fact, the only reason the splat happens is because mmu_notifiers somewhat
>> artificially forces an atomic context via non_block_start() since commit
>>
>>    ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
>>
>> Given the massive amount of churn in KVM that would be required to fully eliminate
>> the splat, and that it's not at all obvious that it would be a good change overall,
>> at least for now:
>>
>> NAK
>>
>> I'm not fundamentally opposed to such a change, but there needs to be a _lot_
>> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
>>
> Hi Sean,
> Thanks for the detailed explanation and for spelling out the broader
> issue.
> Understood on both points:
> 	1. The changelog wording was too strong; PREEMPT_RT changes
> 	spin_lock() semantics, and the splat is fundamentally due to
> 	spinlocks becoming sleepable there.
> 	2. Converting only mn_invalidate_lock to raw is insufficient
> 	since KVM can still take the mmu_lock (and other sleeping locks
> 	on RT) in invalidate_range_start() when the invalidation hits a
> 	memslot.
> Given the above, it sounds like "convert locks to raw" is not the right
> direction without significant rework and justification.
> Would an acceptable direction be to handle the !blockable notifier case
> by deferring the heavyweight invalidation work (anything that takes
> mmu_lock/may sleep on RT) to a context that may block (e.g. queued work),
> while keeping start()/end() accounting consistent with memslot changes?
> If so, I can prototype a patch along those lines and share it for
> feedback.
> 
> Alternatively, if you think this needs to be addressed in
> mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to
> redirect my efforts there. Please advise.

Have you considered a "OOM entered" callback for MMU notifiers?  KVM's 
MMU notifier can just remove itself for example, in fact there is code 
in kvm_destroy_vm() to do that even if invalidations are unbalanced.

Paolo


* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-06 18:14       ` Paolo Bonzini
@ 2026-03-12 19:24         ` shaikh kamaluddin
  2026-03-14  7:47           ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: shaikh kamaluddin @ 2026-03-12 19:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel,
	linux-rt-devel

On Fri, Mar 06, 2026 at 07:14:40PM +0100, Paolo Bonzini wrote:
> On 3/3/26 19:49, shaikh kamaluddin wrote:
> > On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> > > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > > > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > > > > mmu_notifier_invalidate_range_start() may be invoked via
> > > > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > > > > where sleeping is explicitly forbidden.
> > > > > 
> > > > > KVM's mmu_notifier invalidate_range_start currently takes
> > > > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> > > > > to rt_mutex and may sleep, triggering:
> > > > > 
> > > > >    BUG: sleeping function called from invalid context
> > > > > 
> > > > > This violates the MMU notifier contract regardless of PREEMPT_RT;
> > > 
> > > I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
> > > that in invalidate_range_start() since
> > > 
> > >    e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
> > > 
> > > which was a full decade before mmu_notifiers even added the blockable concept in
> > > 
> > >    93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
> > > 
> > > and even predates the current concept of a "raw" spinlock introduced by
> > > 
> > >    c2f21ce2e312 ("locking: Implement new raw_spinlock")
> > > 
> > > > > RT kernels merely make the issue deterministic.
> > > 
> > > No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
> > > sleepable.
> > > 
> > > > > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > > > > invalidate_range_start() remains non-sleeping while preserving the
> > > > > existing serialization between invalidate_range_start() and
> > > > > invalidate_range_end().
> > > 
> > > This is insufficient.  To actually "fix" this in KVM mmu_lock would need to be
> > > turned into a raw lock on all KVM architectures.  I suspect the only reason there
> > > haven't been bug reports is because no one trips an OOM kill on VM while running
> > > with CONFIG_DEBUG_ATOMIC_SLEEP=y.
> > > 
> > > That combination is required because since commit
> > > 
> > >    8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
> > > 
> > > KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
> > > i.e. affects memory that may be mapped into the guest.
> > > 
> > > E.g. this hack to simulate a non-blockable invalidation
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 7015edce5bd8..7a35a83420ec 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >                  .handler        = kvm_mmu_unmap_gfn_range,
> > >                  .on_lock        = kvm_mmu_invalidate_begin,
> > >                  .flush_on_ret   = true,
> > > -               .may_block      = mmu_notifier_range_blockable(range),
> > > +               .may_block      = false,//mmu_notifier_range_blockable(range),
> > >          };
> > >          trace_kvm_unmap_hva_range(range->start, range->end);
> > > @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >           */
> > >          gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
> > > +       non_block_start();
> > >          /*
> > >           * If one or more memslots were found and thus zapped, notify arch code
> > >           * that guest memory has been reclaimed.  This needs to be done *after*
> > > @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >           */
> > >          if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
> > >                  kvm_arch_guest_memory_reclaimed(kvm);
> > > +       non_block_end();
> > >          return 0;
> > >   }
> > > 
> > > immediately triggers
> > > 
> > >    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
> > >    in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
> > >    preempt_count: 0, expected: 0
> > >    RCU nest depth: 0, expected: 0
> > >    CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
> > >    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >    Call Trace:
> > >     <TASK>
> > >     dump_stack_lvl+0x51/0x60
> > >     __might_resched+0x10e/0x160
> > >     rt_write_lock+0x49/0x310
> > >     kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
> > >     __mmu_notifier_invalidate_range_start+0x9b/0x230
> > >     do_wp_page+0xce1/0xf30
> > >     __handle_mm_fault+0x380/0x3a0
> > >     handle_mm_fault+0xde/0x290
> > >     __get_user_pages+0x20d/0xbe0
> > >     get_user_pages_unlocked+0xf6/0x340
> > >     hva_to_pfn+0x295/0x420 [kvm]
> > >     __kvm_faultin_pfn+0x5d/0x90 [kvm]
> > >     kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
> > >     kvm_tdp_page_fault+0xb6/0x160 [kvm]
> > >     kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
> > >     kvm_mmu_page_fault+0x8d/0x600 [kvm]
> > >     vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
> > >     kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
> > >     kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
> > >     __x64_sys_ioctl+0x8a/0xd0
> > >     do_syscall_64+0x5e/0x11b0
> > >     entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > >     </TASK>
> > >    kvm: emulating exchange as write
> > > 
> > > 
> > > It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> > > positive for PREEMPT_RT.  OOM-killing a KVM guest on a PREEMPT_RT kernel seems like a
> > > comically rare scenario.  Whereas contending mmu_lock in normal operation is
> > > relatively common (assuming there are even use cases for running VMs with a
> > > PREEMPT_RT host kernel).
> > > 
> > > In fact, the only reason the splat happens is because mmu_notifiers somewhat
> > > artificially forces an atomic context via non_block_start() since commit
> > > 
> > >    ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> > > 
> > > Given the massive amount of churn in KVM that would be required to fully eliminate
> > > the splat, and that it's not at all obvious that it would be a good change overall,
> > > at least for now:
> > > 
> > > NAK
> > > 
> > > I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> > > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
> > > 
> > Hi Sean,
> > Thanks for the detailed explanation and for spelling out the broader
> > issue.
> > Understood on both points:
> > 	1. The changelog wording was too strong; PREEMPT_RT changes
> > 	spin_lock() semantics, and the splat is fundamentally due to
> > 	spinlocks becoming sleepable there.
> > 	2. Converting only mn_invalidate_lock to raw is insufficient
> > 	since KVM can still take the mmu_lock (and other sleeping locks
> > 	on RT) in invalidate_range_start() when the invalidation hits a
> > 	memslot.
> > Given the above, it sounds like "convert locks to raw" is not the right
> > direction without significant rework and justification.
> > Would an acceptable direction be to handle the !blockable notifier case
> > by deferring the heavyweight invalidation work (anything that takes
> > mmu_lock/may sleep on RT) to a context that may block (e.g. queued work),
> > while keeping start()/end() accounting consistent with memslot changes?
> > If so, I can prototype a patch along those lines and share it for
> > feedback.
> > 
> > Alternatively, if you think this needs to be addressed in
> > mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to
> > redirect my efforts there. Please advise.
> 
> Have you considered a "OOM entered" callback for MMU notifiers?  KVM's MMU
> notifier can just remove itself for example, in fact there is code in
> kvm_destroy_vm() to do that even if invalidations are unbalanced.
> 
> Paolo
>
Thanks for the suggestion! That's a much cleaner approach than what I was considering.

If I understand correctly, the idea would be:
1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
2. Have KVM implement it to unregister the notifier when OOM reaper starts
3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations

This avoids the whole "convert locks to raw" problem and the complexity of deferring work.

I have questions about testing:
------------------------------------
I tried to reproduce the bug using virtme-ng, running stress-ng to put
the VM under memory pressure, but could not reproduce it.
This is what I tried:
vng -v -r ./arch/x86/boot/bzImage
Once the VM is up, I run stress-ng as below:
stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
The OOM killer is triggered, but I cannot reproduce the exact bug. Please
suggest how to reproduce it, since we will also need to verify the code
changes you have suggested.

Shaikh Kamal

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-12 19:24         ` shaikh kamaluddin
@ 2026-03-14  7:47           ` Paolo Bonzini
  2026-03-25  5:19             ` shaikh kamaluddin
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2026-03-14  7:47 UTC (permalink / raw)
  To: shaikh kamaluddin
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel,
	linux-rt-devel

On 3/12/26 20:24, shaikh kamaluddin wrote:
>>> Alternatively, if you think this needs to be addressed in
>>> mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
>>> redirect my efforts there-Please advise.
>>
>> Have you considered a "OOM entered" callback for MMU notifiers?  KVM's MMU
>> notifier can just remove itself for example, in fact there is code in
>> kvm_destroy_vm() to do that even if invalidations are unbalanced.
>>
>> Paolo
>>
> Thanks for the suggestion! That's a much cleaner approach than what I was considering.
> 
> If I understand correctly, the idea would be:
> 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
> 2. Have KVM implement it to unregister the notifier when OOM reaper starts
> 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations

Yes pretty much.  Essentially, move the existing logic to the new 
callback and invoke it from kvm_destroy_vm().

> This avoids the whole "convert locks to raw" problem and the complexity of deferring work.
> 
> I have questions about testing:
> ------------------------------------
> I tried to reproduce the bug using virtme-ng, running stress-ng to put
> the VM under memory pressure, but could not reproduce it.
> This is what I tried:
> vng -v -r ./arch/x86/boot/bzImage
> Once the VM is up, I run stress-ng as below:
> stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
> The OOM killer is triggered, but I cannot reproduce the exact bug. Please
> suggest how to reproduce it, since we will also need to verify the code
> changes you have suggested.

I don't know, sorry.  But with this new approach there will always be a 
call to the new callback from the OOM killer, so it's easier to test.

Thanks,

Paolo


* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-14  7:47           ` Paolo Bonzini
@ 2026-03-25  5:19             ` shaikh kamaluddin
  2026-03-26 18:23               ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: shaikh kamaluddin @ 2026-03-25  5:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel,
	linux-rt-devel, skhan, me

On Sat, Mar 14, 2026 at 08:47:40AM +0100, Paolo Bonzini wrote:
> On 3/12/26 20:24, shaikh kamaluddin wrote:
> > > > Alternatively, if you think this needs to be addressed in
> > > > mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
> > > > redirect my efforts there-Please advise.
> > > 
> > > Have you considered a "OOM entered" callback for MMU notifiers?  KVM's MMU
> > > notifier can just remove itself for example, in fact there is code in
> > > kvm_destroy_vm() to do that even if invalidations are unbalanced.
> > > 
> > > Paolo
> > > 
> > Thanks for the suggestion! That's a much cleaner approach than what I was considering.
> > 
> > If I understand correctly, the idea would be:
> > 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
> > 2. Have KVM implement it to unregister the notifier when OOM reaper starts
> > 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations
> 
> Yes pretty much.  Essentially, move the existing logic to the new callback
> and invoke it from kvm_destroy_vm().
>

Hi Paolo,
Thank you for the suggestion to use an oom_enter callback approach. I've implemented v2 based on your guidance and have successfully validated it.

Implementation Summary:
-------------------------------------
Following your recommendation, I've added a new oom_enter callback to the mmu_notifier_ops structure. The implementation:

1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h
2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks
3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations
4. As per your suggestion, moved the existing kvm_destroy_vm() logic that already handles unbalanced invalidations into a new helper, kvm_mmu_notifier_detach(), and invoke it from kvm_destroy_vm()
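
To make the shape of steps 1-3 concrete, here is a rough userspace model of the dispatch. It is not the kernel patch: every *_model name and detach_on_oom() are hypothetical stand-ins, and the real code would go through struct mmu_notifier_ops and the SRCU-protected notifier list rather than a plain linked list.

```c
#include <stdbool.h>
#include <stddef.h>

struct mm_model;
struct notifier_model;

/* Hypothetical stand-in for the ops table gaining an oom_enter hook. */
struct notifier_ops_model {
	void (*oom_enter)(struct notifier_model *n, struct mm_model *mm);
};

struct notifier_model {
	const struct notifier_ops_model *ops;
	struct notifier_model *next;
	bool registered;
};

/* Hypothetical stand-in for the per-mm notifier list. */
struct mm_model {
	struct notifier_model *notifiers;
};

static void notifier_register(struct mm_model *mm, struct notifier_model *n)
{
	n->registered = true;
	n->next = mm->notifiers;
	mm->notifiers = n;
}

/*
 * Models the dispatcher called once when the OOM path starts: invoke
 * oom_enter on every notifier that implements it, and return how many
 * callbacks ran. The next pointer is read before the call because a
 * callback may unlink its own notifier.
 */
static int oom_enter_all(struct mm_model *mm)
{
	int called = 0;

	for (struct notifier_model *n = mm->notifiers, *next; n; n = next) {
		next = n->next;
		if (n->ops && n->ops->oom_enter) {
			n->ops->oom_enter(n, mm);
			called++;
		}
	}
	return called;
}

/* Models a subscriber that reacts to OOM by removing itself from the list. */
static void detach_on_oom(struct notifier_model *n, struct mm_model *mm)
{
	struct notifier_model **pp = &mm->notifiers;

	while (*pp && *pp != n)
		pp = &(*pp)->next;
	if (*pp)
		*pp = n->next;
	n->next = NULL;
	n->registered = false;
}
```

In this model a notifier without an oom_enter hook is simply skipped and stays registered, so only subscribers that opt in are affected.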

Key Design Decision:
------------------------------
Regarding point 4: while testing, I hit a recursive locking problem with the SRCU lock, which is acquired twice in the same context. This happens across the __mmu_notifier_oom_enter() and __synchronize_srcu() calls and can lead to a deadlock.
Below is the log snippet seen while launching the guest VM:
------------------------------------------------------------------------------------------------
[  399.841599][T10882] OOM_REAPER: START reaping:func:__mmu_notifier_oom_enter
[  399.841608][T10882] KVM: oom_enter callback invoked for VM:kvm_mmu_notifier_oom_enter
[  399.841961][T10882]
[  399.841962][T10882] ============================================
[  399.841964][T10882] WARNING: possible recursive locking detected
[  399.841966][T10882] 7.0.0-rc2-00467-g4ae12d8bd9a8-dirty #12 Not tainted
[  399.841969][T10882] --------------------------------------------
[  399.841971][T10882] qemu-system-x86/10882 is trying to acquire lock:
[  399.841974][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x83/0x380
[  399.841991][T10882]
[  399.841991][T10882] but task is already holding lock:
[  399.841992][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __mmu_notifier_oom_enter+0x93/0x1f0
[  399.842005][T10882]
[  399.842005][T10882] other info that might help us debug this:
[  399.842006][T10882]  Possible unsafe locking scenario:
[  399.842006][T10882]
[  399.842008][T10882]        CPU0
[  399.842009][T10882]        ----
[  399.842010][T10882]   lock(srcu);
[  399.842014][T10882]   lock(srcu);
[  399.842017][T10882]
[  399.842017][T10882]  *** DEADLOCK ***
[  399.842017][T10882]
[  399.842018][T10882]  May be due to missing lock nesting notation

-------------------------------------------------------------------------------------------------------------------
I then deferred kvm_mmu_notifier_detach() to a workqueue, which made the above warning go away.
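
As a rough illustration of the ordering problem, here is a toy userspace model, assuming the lockdep report is genuine. The srcu_model type and all functions below are hypothetical; the only facts taken from the kernel are that notifier callbacks run inside an SRCU read-side section and that mmu_notifier_unregister() waits for an SRCU grace period.

```c
#include <stdbool.h>

/* Hypothetical toy model of an SRCU domain: just a count of active readers. */
struct srcu_model {
	int readers;
};

static void srcu_model_read_lock(struct srcu_model *s)
{
	s->readers++;
}

static void srcu_model_read_unlock(struct srcu_model *s)
{
	s->readers--;
}

/*
 * A grace period can only end once every reader has left. Returns false
 * in the situation where the real primitive would keep waiting.
 */
static bool srcu_model_synchronize(const struct srcu_model *s)
{
	return s->readers == 0;
}

/* Models unregistering a notifier, which needs a grace period. */
static bool unregister_model(const struct srcu_model *s)
{
	return srcu_model_synchronize(s);
}

/* Direct variant: the callback unregisters while still inside the reader. */
static bool oom_enter_direct(struct srcu_model *s)
{
	bool done;

	srcu_model_read_lock(s);
	done = unregister_model(s);	/* the reader would wait on itself */
	srcu_model_read_unlock(s);
	return done;
}

/* Deferred variant: the callback only records a request for later. */
static bool oom_enter_deferred(struct srcu_model *s)
{
	bool pending;

	srcu_model_read_lock(s);
	pending = true;			/* stands in for queueing the work */
	srcu_model_read_unlock(s);

	/* The "worker" runs after the read-side section has ended. */
	return pending && unregister_model(s);
}
```

This only models the ordering of the read-side section and the grace period; it does not say anything about what else may run between the callback and the deferred work.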


Testing:
-------------
I've validated the v2 approach with:

Kernel: v7.0-rc2 with PREEMPT_RT and DEBUG_ATOMIC_SLEEP enabled
Test: Triggered OOM conditions that killed a QEMU process with an active KVM VM
Commands used to generate the scenario:
1. vng -v -r ./arch/x86/boot/bzImage --qemu-opts='-m 2G -cpu EPYC,+svm,+npt,+tsc,+invtsc -s '
   After virtme-ng (QEMU) boots successfully, it acts as the host VM.
2. chmod 666 /dev/kvm
3. dmesg -c > /dev/null
4. Launch the guest VM with: qemu-system-x86_64 -enable-kvm -m 1000M -mem-prealloc \
        -monitor none -serial none -display none -nographic & sleep 10

Results:
-------------------
1. oom_enter callback was successfully invoked
2. No SRCU deadlock warnings
3. No "sleeping function called from invalid context" warnings
4. OOM reaper completed successfully
5. Process was reaped without errors



Question:
Before I send the v2 patch series, I want to confirm this approach aligns with your expectations. Specifically:
Is deferring the common helper kvm_mmu_notifier_detach() (which does the mmu_notifier_unregister() and handles unbalanced invalidations) to a workqueue a good design?
Are there any specific test cases or scenarios you'd like me to validate?

I can send the complete v2 patch series once you confirm this approach is on the right track.

Thanks again for the guidance!

Shaikh Kamal

> > This avoids the whole "convert locks to raw" problem and the complexity of deferring work.
> > 
> > I have questions about testing:
> > ------------------------------------
> > I tried to reproduce the bug using virtme-ng, running stress-ng to put
> > the VM under memory pressure, but could not reproduce it.
> > This is what I tried:
> > vng -v -r ./arch/x86/boot/bzImage
> > Once the VM is up, I run stress-ng as below:
> > stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
> > The OOM killer is triggered, but I cannot reproduce the exact bug. Please
> > suggest how to reproduce it, since we will also need to verify the code
> > changes you have suggested.
> 
> I don't know, sorry.  But with this new approach there will always be a call
> to the new callback from the OOM killer, so it's easier to test.
> 
> Thanks,
> 
> Paolo
> 

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-25  5:19             ` shaikh kamaluddin
@ 2026-03-26 18:23               ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2026-03-26 18:23 UTC (permalink / raw)
  To: shaikh kamaluddin
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm,
	Kernel Mailing List, Linux, linux-rt-devel, Shuah Khan, me

On Wed, Mar 25, 2026, 06:19 shaikh kamaluddin
<shaikhkamal2012@gmail.com> wrote:
>
> 1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h
> 2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks
> 3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations
> 4. As per your suggestion, moved the existing kvm_destroy_vm() logic that already handles unbalanced invalidations into a new helper, kvm_mmu_notifier_detach(), and invoke it from kvm_destroy_vm()

This is not fully clear to me... It could be caused by recursive
locking, or it could be a false positive. It's hard to say without seeing the
full backtrace, but seeing "lock(srcu)" is suspicious.

I wouldn't have expected deferral to be necessary; and it seems to me
that, if you defer removal to some time after the OOM reaper starts,
you'd have the same problem as before with sleeping spinlocks.

Can you post the original patch without deferral?

Paolo

>
> Key Design Decision:
> ------------------------------
> Regarding point 4: while testing, I hit a recursive locking problem with the SRCU lock, which is acquired twice in the same context. This happens across the __mmu_notifier_oom_enter() and __synchronize_srcu() calls and can lead to a deadlock.
> Below is the log snippet seen while launching the guest VM:


end of thread, other threads:[~2026-03-26 18:24 UTC | newest]

Thread overview: 10+ messages
2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal
2026-02-11 12:09 ` Sebastian Andrzej Siewior
2026-02-11 15:34   ` Sean Christopherson
2026-03-03 18:49     ` shaikh kamaluddin
2026-03-06 16:42       ` Sean Christopherson
2026-03-06 18:14       ` Paolo Bonzini
2026-03-12 19:24         ` shaikh kamaluddin
2026-03-14  7:47           ` Paolo Bonzini
2026-03-25  5:19             ` shaikh kamaluddin
2026-03-26 18:23               ` Paolo Bonzini
