[PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
@ 2026-02-09 16:15 shaikh.kamal
  2026-02-11 12:09 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 18+ messages in thread
From: shaikh.kamal @ 2026-02-09 16:15 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-rt-devel; +Cc: shaikh.kamal

mmu_notifier_invalidate_range_start() may be invoked via
mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
where sleeping is explicitly forbidden.

KVM's mmu_notifier invalidate_range_start currently takes
mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
to rt_mutex and may sleep, triggering:

  BUG: sleeping function called from invalid context

This violates the MMU notifier contract regardless of PREEMPT_RT; RT
kernels merely make the issue deterministic.

Fix by converting mn_invalidate_lock to a raw spinlock so that
invalidate_range_start() remains non-sleeping while preserving the
existing serialization between invalidate_range_start() and
invalidate_range_end().

Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>
---
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 18 +++++++++---------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..77a6d4833eda 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -797,7 +797,7 @@ struct kvm {
 	atomic_t nr_memslots_dirty_logging;
 
 	/* Used to wait for completion of MMU notifiers.  */
-	spinlock_t mn_invalidate_lock;
+	raw_spinlock_t mn_invalidate_lock;
 	unsigned long mn_active_invalidate_count;
 	struct rcuwait mn_memslots_update_rcuwait;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5fcd401a5897..7a9c33f01a37 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 *
 	 * Pairs with the decrement in range_end().
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	kvm->mn_active_invalidate_count++;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
@@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
 		--kvm->mn_active_invalidate_count;
 	wake = !kvm->mn_active_invalidate_count;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * There can only be one waiter, since the wait happens under
@@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	mutex_init(&kvm->slots_arch_lock);
-	spin_lock_init(&kvm->mn_invalidate_lock);
+	raw_spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
@@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 	 * progress, otherwise the locking in invalidate_range_start and
 	 * invalidate_range_end will be unbalanced.
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
 	while (kvm->mn_active_invalidate_count) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_unlock(&kvm->mn_invalidate_lock);
+		raw_spin_unlock(&kvm->mn_invalidate_lock);
 		schedule();
-		spin_lock(&kvm->mn_invalidate_lock);
+		raw_spin_lock(&kvm->mn_invalidate_lock);
 	}
 	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
 	rcu_assign_pointer(kvm->memslots[as_id], slots);
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * Acquired in kvm_set_memslot. Must be released before synchronize
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal
@ 2026-02-11 12:09 ` Sebastian Andrzej Siewior
  2026-02-11 15:34   ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-11 12:09 UTC (permalink / raw)
  To: shaikh.kamal; +Cc: kvm, linux-kernel, linux-rt-devel

On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> mmu_notifier_invalidate_range_start() may be invoked via
> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> where sleeping is explicitly forbidden.
> 
> KVM's mmu_notifier invalidate_range_start currently takes
> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> to rt_mutex and may sleep, triggering:
> 
>   BUG: sleeping function called from invalid context
> 
> This violates the MMU notifier contract regardless of PREEMPT_RT; RT
> kernels merely make the issue deterministic.
> 
> Fix by converting mn_invalidate_lock to a raw spinlock so that
> invalidate_range_start() remains non-sleeping while preserving the
> existing serialization between invalidate_range_start() and
> invalidate_range_end().
> 
> Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I don't see any down side doing this, but…

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5fcd401a5897..7a9c33f01a37 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	 *
>  	 * Pairs with the decrement in range_end().
>  	 */
> -	spin_lock(&kvm->mn_invalidate_lock);
> +	raw_spin_lock(&kvm->mn_invalidate_lock);
>  	kvm->mn_active_invalidate_count++;
> -	spin_unlock(&kvm->mn_invalidate_lock);
> +	raw_spin_unlock(&kvm->mn_invalidate_lock);

	atomic_inc(mn_active_invalidate_count)
>  
>  	/*
>  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(kvm, &hva_range);
>  
>  	/* Pairs with the increment in range_start(). */
> -	spin_lock(&kvm->mn_invalidate_lock);
> +	raw_spin_lock(&kvm->mn_invalidate_lock);
>  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
>  		--kvm->mn_active_invalidate_count;
>  	wake = !kvm->mn_active_invalidate_count;

	wake = atomic_dec_return_safe(mn_active_invalidate_count);
	WARN_ON_ONCE(wake < 0);
	wake = !wake;

> -	spin_unlock(&kvm->mn_invalidate_lock);
> +	raw_spin_unlock(&kvm->mn_invalidate_lock);
>  
>  	/*
>  	 * There can only be one waiter, since the wait happens under
> @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
>  	 * progress, otherwise the locking in invalidate_range_start and
>  	 * invalidate_range_end will be unbalanced.
>  	 */
> -	spin_lock(&kvm->mn_invalidate_lock);
> +	raw_spin_lock(&kvm->mn_invalidate_lock);
>  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	while (kvm->mn_active_invalidate_count) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> -		spin_unlock(&kvm->mn_invalidate_lock);
> +		raw_spin_unlock(&kvm->mn_invalidate_lock);
>  		schedule();

And this I don't understand. The lock protects the rcuwait assignment
which would be needed if multiple waiters are possible. But this goes
away after the unlock and schedule() here. So these things could be
moved outside of the locked section which limits it only to the
mn_active_invalidate_count value.

> -		spin_lock(&kvm->mn_invalidate_lock);
> +		raw_spin_lock(&kvm->mn_invalidate_lock);
>  	}
>  	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> -	spin_unlock(&kvm->mn_invalidate_lock);
> +	raw_spin_unlock(&kvm->mn_invalidate_lock);
>  
>  	/*
>  	 * Acquired in kvm_set_memslot. Must be released before synchronize

Sebastian

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-02-11 12:09 ` Sebastian Andrzej Siewior
@ 2026-02-11 15:34   ` Sean Christopherson
  2026-03-03 18:49     ` shaikh kamaluddin
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2026-02-11 15:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: shaikh.kamal, kvm, linux-kernel, linux-rt-devel

On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > mmu_notifier_invalidate_range_start() may be invoked via
> > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > where sleeping is explicitly forbidden.
> > 
> > KVM's mmu_notifier invalidate_range_start currently takes
> > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> > to rt_mutex and may sleep, triggering:
> > 
> >   BUG: sleeping function called from invalid context
> > 
> > This violates the MMU notifier contract regardless of PREEMPT_RT;

I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
that in invalidate_range_start() since

  e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")

which was a full decade before mmu_notifiers even added the blockable concept in

  93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")

and even predate the current concept of a "raw" spinlock introduced by

  c2f21ce2e312 ("locking: Implement new raw_spinlock")

> > RT kernels merely make the issue deterministic.

No, RT kernels change the rules, because suddenly a non-sleeping locking becomes
sleepable.

> > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > invalidate_range_start() remains non-sleeping while preserving the
> > existing serialization between invalidate_range_start() and
> > invalidate_range_end().

This is insufficient.  To actually "fix" this in KVM mmu_lock would need to be
turned into a raw lock on all KVM architectures.  I suspect the only reason there
haven't been bug reports is because no one trips an OOM kill on VM while running
with CONFIG_DEBUG_ATOMIC_SLEEP=y.

That combination is required because since commit

  8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")

KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
i.e. affects memory that may be mapped into the guest.

E.g. this hack to simulate a non-blockable invalidation

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7015edce5bd8..7a35a83420ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                .handler        = kvm_mmu_unmap_gfn_range,
                .on_lock        = kvm_mmu_invalidate_begin,
                .flush_on_ret   = true,
-               .may_block      = mmu_notifier_range_blockable(range),
+               .may_block      = false,//mmu_notifier_range_blockable(range),
        };
 
        trace_kvm_unmap_hva_range(range->start, range->end);
@@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
         */
        gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
 
+       non_block_start();
        /*
         * If one or more memslots were found and thus zapped, notify arch code
         * that guest memory has been reclaimed.  This needs to be done *after*
@@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
         */
        if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
                kvm_arch_guest_memory_reclaimed(kvm);
+       non_block_end();
 
        return 0;
 }

immediately triggers

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
  in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
  preempt_count: 0, expected: 0
  RCU nest depth: 0, expected: 0
  CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT 
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x51/0x60
   __might_resched+0x10e/0x160
   rt_write_lock+0x49/0x310
   kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
   __mmu_notifier_invalidate_range_start+0x9b/0x230
   do_wp_page+0xce1/0xf30
   __handle_mm_fault+0x380/0x3a0
   handle_mm_fault+0xde/0x290
   __get_user_pages+0x20d/0xbe0
   get_user_pages_unlocked+0xf6/0x340
   hva_to_pfn+0x295/0x420 [kvm]
   __kvm_faultin_pfn+0x5d/0x90 [kvm]
   kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
   kvm_tdp_page_fault+0xb6/0x160 [kvm]
   kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
   kvm_mmu_page_fault+0x8d/0x600 [kvm]
   vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
   kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0x5e/0x11b0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  kvm: emulating exchange as write


It's not at all clear to me that switching mmu_lock to a raw lock would be a net
positive for PREEMPT_RT.  OOM-killing a KVM guest in a PREEMPT_RT seems like a
comically rare scenario.  Whereas contending mmu_lock in normal operation is
relatively common (assuming there are even use cases for running VMs with a
PREEMPT_RT host kernel).

In fact, the only reason the splat happens is because mmu_notifiers somewhat
artificially forces an atomic context via non_block_start() since commit

  ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")

Given the massive amount of churn in KVM that would be required to fully eliminate
the splat, and that it's not at all obvious that it would be a good change overall,
at least for now:

NAK

I'm not fundamentally opposed to such a change, but there needs to be a _lot_
more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 5fcd401a5897..7a9c33f01a37 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	 *
> >  	 * Pairs with the decrement in range_end().
> >  	 */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	kvm->mn_active_invalidate_count++;
> > -	spin_unlock(&kvm->mn_invalidate_lock);
> > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> 
> 	atomic_inc(mn_active_invalidate_count)
> >  
> >  	/*
> >  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(kvm, &hva_range);
> >  
> >  	/* Pairs with the increment in range_start(). */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
> >  		--kvm->mn_active_invalidate_count;
> >  	wake = !kvm->mn_active_invalidate_count;
> 
> 	wake = atomic_dec_return_safe(mn_active_invalidate_count);
> 	WARN_ON_ONCE(wake < 0);
> 	wake = !wake;
> 
> > -	spin_unlock(&kvm->mn_invalidate_lock);
> > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> >  
> >  	/*
> >  	 * There can only be one waiter, since the wait happens under
> > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
> >  	 * progress, otherwise the locking in invalidate_range_start and
> >  	 * invalidate_range_end will be unbalanced.
> >  	 */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> >  	while (kvm->mn_active_invalidate_count) {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > -		spin_unlock(&kvm->mn_invalidate_lock);
> > +		raw_spin_unlock(&kvm->mn_invalidate_lock);
> >  		schedule();
> 
> And this I don't understand. The lock protects the rcuwait assignment
> which would be needed if multiple waiters are possible. But this goes
> away after the unlock and schedule() here. So these things could be
> moved outside of the locked section which limits it only to the
> mn_active_invalidate_count value.

The implementation is essentially a deliberately unfair rwswem.  The "write" side
in kvm_swap_active_memslots() subtly protect this code:

  rcu_assign_pointer(kvm->memslots[as_id], slots);

and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range().

KVM optimizes its mmu_notifier invalidation path to only take action if the
to-be-invalidated range overlaps one or more memslots, i.e. affects memory that
be can be mapped into the guest.  The wrinkle with those optimizations is that
KVM needs to prevent changes to the memslots between invalidation start() and end(),
otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress
will underflow or be left elevated and essentially hang the VM (among other bad
things).

So simply making mn_active_invalidate_count an atomic won't suffice, because KVM
needs to block start() to ensure start()+end() see the exact same set of memslots.

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-02-11 15:34   ` Sean Christopherson
@ 2026-03-03 18:49     ` shaikh kamaluddin
  2026-03-06 16:42       ` Sean Christopherson
  2026-03-06 18:14       ` Paolo Bonzini
  0 siblings, 2 replies; 18+ messages in thread
From: shaikh kamaluddin @ 2026-03-03 18:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel

On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > > mmu_notifier_invalidate_range_start() may be invoked via
> > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > > where sleeping is explicitly forbidden.
> > > 
> > > KVM's mmu_notifier invalidate_range_start currently takes
> > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> > > to rt_mutex and may sleep, triggering:
> > > 
> > >   BUG: sleeping function called from invalid context
> > > 
> > > This violates the MMU notifier contract regardless of PREEMPT_RT;
> 
> I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
> that in invalidate_range_start() since
> 
>   e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
> 
> which was a full decade before mmu_notifiers even added the blockable concept in
> 
>   93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
> 
> and even predate the current concept of a "raw" spinlock introduced by
> 
>   c2f21ce2e312 ("locking: Implement new raw_spinlock")
> 
> > > RT kernels merely make the issue deterministic.
> 
> No, RT kernels change the rules, because suddenly a non-sleeping locking becomes
> sleepable.
> 
> > > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > > invalidate_range_start() remains non-sleeping while preserving the
> > > existing serialization between invalidate_range_start() and
> > > invalidate_range_end().
> 
> This is insufficient.  To actually "fix" this in KVM mmu_lock would need to be
> turned into a raw lock on all KVM architectures.  I suspect the only reason there
> haven't been bug reports is because no one trips an OOM kill on VM while running
> with CONFIG_DEBUG_ATOMIC_SLEEP=y.
> 
> That combination is required because since commit
> 
>   8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
> 
> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
> i.e. affects memory that may be mapped into the guest.
> 
> E.g. this hack to simulate a non-blockable invalidation
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7015edce5bd8..7a35a83420ec 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .handler        = kvm_mmu_unmap_gfn_range,
>                 .on_lock        = kvm_mmu_invalidate_begin,
>                 .flush_on_ret   = true,
> -               .may_block      = mmu_notifier_range_blockable(range),
> +               .may_block      = false,//mmu_notifier_range_blockable(range),
>         };
>  
>         trace_kvm_unmap_hva_range(range->start, range->end);
> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>          */
>         gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
>  
> +       non_block_start();
>         /*
>          * If one or more memslots were found and thus zapped, notify arch code
>          * that guest memory has been reclaimed.  This needs to be done *after*
> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>          */
>         if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>                 kvm_arch_guest_memory_reclaimed(kvm);
> +       non_block_end();
>  
>         return 0;
>  }
> 
> immediately triggers
> 
>   BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
>   in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
>   preempt_count: 0, expected: 0
>   RCU nest depth: 0, expected: 0
>   CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT 
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x51/0x60
>    __might_resched+0x10e/0x160
>    rt_write_lock+0x49/0x310
>    kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
>    __mmu_notifier_invalidate_range_start+0x9b/0x230
>    do_wp_page+0xce1/0xf30
>    __handle_mm_fault+0x380/0x3a0
>    handle_mm_fault+0xde/0x290
>    __get_user_pages+0x20d/0xbe0
>    get_user_pages_unlocked+0xf6/0x340
>    hva_to_pfn+0x295/0x420 [kvm]
>    __kvm_faultin_pfn+0x5d/0x90 [kvm]
>    kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
>    kvm_tdp_page_fault+0xb6/0x160 [kvm]
>    kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
>    kvm_mmu_page_fault+0x8d/0x600 [kvm]
>    vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
>    kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
>    kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
>    __x64_sys_ioctl+0x8a/0xd0
>    do_syscall_64+0x5e/0x11b0
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   kvm: emulating exchange as write
> 
> 
> It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> positive for PREEMPT_RT.  OOM-killing a KVM guest in a PREEMPT_RT seems like a
> comically rare scenario.  Whereas contending mmu_lock in normal operation is
> relatively common (assuming there are even use cases for running VMs with a
> PREEMPT_RT host kernel).
> 
> In fact, the only reason the splat happens is because mmu_notifiers somewhat
> artificially forces an atomic context via non_block_start() since commit
> 
>   ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> 
> Given the massive amount of churn in KVM that would be required to fully eliminate
> the splat, and that it's not at all obvious that it would be a good change overall,
> at least for now:
> 
> NAK
> 
> I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
>
Hi Sean,
Thanks for the detailed explanation and for spelling out the border
issue.
Understood on both points:
	1. The changelog wording was too strong; PREEMPT_RT changes
	spin_lock() semantics, and the splat is fundamentally due to
	spinlocks becoming sleepable there.
	2. Converting only mm_invalidate_lock to raw is insufficient
	since KVM can still take the mmu_lock (and other sleeping locks
	RT) in invalidate_range_start() when the invalidation hits a
	memslot.
Given the above, it shounds like "convert locks to raw" is not the right
direction without sinificat rework and justification.
Would an acceptable direction be to handle the !blockable notifier case
by deferring the heavyweight invalidation work(anything that take
mmu_lock/may sleep on RT) to a context that may block(e.g. queued work),
while keeping start()/end() accounting consisting with memslot changes ?
if so, I can protoptype a patch along those lines and share for
feedback.

Alternatively, if you think this needs to be addressed in
mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
redirect my efforts there-Please advise.
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 5fcd401a5897..7a9c33f01a37 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >  	 *
> > >  	 * Pairs with the decrement in range_end().
> > >  	 */
> > > -	spin_lock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> > >  	kvm->mn_active_invalidate_count++;
> > > -	spin_unlock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> > 
> > 	atomic_inc(mn_active_invalidate_count)
> > >  
> > >  	/*
> > >  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> > > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > >  	kvm_handle_hva_range(kvm, &hva_range);
> > >  
> > >  	/* Pairs with the increment in range_start(). */
> > > -	spin_lock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> > >  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
> > >  		--kvm->mn_active_invalidate_count;
> > >  	wake = !kvm->mn_active_invalidate_count;
> > 
> > 	wake = atomic_dec_return_safe(mn_active_invalidate_count);
> > 	WARN_ON_ONCE(wake < 0);
> > 	wake = !wake;
> > 
> > > -	spin_unlock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> > >  
> > >  	/*
> > >  	 * There can only be one waiter, since the wait happens under
> > > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
> > >  	 * progress, otherwise the locking in invalidate_range_start and
> > >  	 * invalidate_range_end will be unbalanced.
> > >  	 */
> > > -	spin_lock(&kvm->mn_invalidate_lock);
> > > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> > >  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> > >  	while (kvm->mn_active_invalidate_count) {
> > >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > > -		spin_unlock(&kvm->mn_invalidate_lock);
> > > +		raw_spin_unlock(&kvm->mn_invalidate_lock);
> > >  		schedule();
> > 
> > And this I don't understand. The lock protects the rcuwait assignment
> > which would be needed if multiple waiters are possible. But this goes
> > away after the unlock and schedule() here. So these things could be
> > moved outside of the locked section which limits it only to the
> > mn_active_invalidate_count value.
> 
> The implementation is essentially a deliberately unfair rwswem.  The "write" side
> in kvm_swap_active_memslots() subtly protect this code:
> 
>   rcu_assign_pointer(kvm->memslots[as_id], slots);
> 
> and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range().
> 
> KVM optimizes its mmu_notifier invalidation path to only take action if the
> to-be-invalidated range overlaps one or more memslots, i.e. affects memory that
> be can be mapped into the guest.  The wrinkle with those optimizations is that
> KVM needs to prevent changes to the memslots between invalidation start() and end(),
> otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress
> will underflow or be left elevated and essentially hang the VM (among other bad
> things).
> 
> So simply making mn_active_invalidate_count an atomic won't suffice, because KVM
> needs to block start() to ensure start()+end() see the exact same set of memslots.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-03 18:49     ` shaikh kamaluddin
@ 2026-03-06 16:42       ` Sean Christopherson
  2026-03-06 18:14       ` Paolo Bonzini
  1 sibling, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2026-03-06 16:42 UTC (permalink / raw)
  To: shaikh kamaluddin
  Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel

On Wed, Mar 04, 2026, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> > positive for PREEMPT_RT.  OOM-killing a KVM guest in a PREEMPT_RT seems like a
> > comically rare scenario.  Whereas contending mmu_lock in normal operation is
> > relatively common (assuming there are even use cases for running VMs with a
> > PREEMPT_RT host kernel).
> > 
> > In fact, the only reason the splat happens is because mmu_notifiers somewhat
> > artificially forces an atomic context via non_block_start() since commit
> > 
> >   ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> > 
> > Given the massive amount of churn in KVM that would be required to fully eliminate
> > the splat, and that it's not at all obvious that it would be a good change overall,
> > at least for now:
> > 
> > NAK
> > 
> > I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
> >
> Hi Sean,
> Thanks for the detailed explanation and for spelling out the border
> issue.
> Understood on both points:
> 	1. The changelog wording was too strong; PREEMPT_RT changes
> 	spin_lock() semantics, and the splat is fundamentally due to
> 	spinlocks becoming sleepable there.
> 	2. Converting only mm_invalidate_lock to raw is insufficient
> 	since KVM can still take the mmu_lock (and other sleeping locks
> 	RT) in invalidate_range_start() when the invalidation hits a
> 	memslot.
> Given the above, it shounds like "convert locks to raw" is not the right
> direction without sinificat rework and justification.
> Would an acceptable direction be to handle the !blockable notifier case
> by deferring the heavyweight invalidation work(anything that take
> mmu_lock/may sleep on RT) to a context that may block(e.g. queued work),
> while keeping start()/end() accounting consisting with memslot changes ?

No, because the _only_ case where the invalidation is non-blockable is when the
kernel is OOM-killing.  Deferring the invalidations when we're OOM is likely to
make the problem *worse*.

That's the crux of my NAK.  We'd be making KVM and kernel behavior worse to "fix"
a largely hypothetical issue (OOM-killing a KVM guest in a RT kernel).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-03 18:49     ` shaikh kamaluddin
  2026-03-06 16:42       ` Sean Christopherson
@ 2026-03-06 18:14       ` Paolo Bonzini
  2026-03-12 19:24         ` shaikh kamaluddin
  1 sibling, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2026-03-06 18:14 UTC (permalink / raw)
  To: shaikh kamaluddin, Sean Christopherson
  Cc: Sebastian Andrzej Siewior, kvm, linux-kernel, linux-rt-devel

On 3/3/26 19:49, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
>> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
>>> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
>>>> mmu_notifier_invalidate_range_start() may be invoked via
>>>> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
>>>> where sleeping is explicitly forbidden.
>>>>
>>>> KVM's mmu_notifier invalidate_range_start currently takes
>>>> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
>>>> to rt_mutex and may sleep, triggering:
>>>>
>>>>    BUG: sleeping function called from invalid context
>>>>
>>>> This violates the MMU notifier contract regardless of PREEMPT_RT;
>>
>> I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
>> that in invalidate_range_start() since
>>
>>    e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
>>
>> which was a full decade before mmu_notifiers even added the blockable concept in
>>
>>    93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
>>
>> and even predate the current concept of a "raw" spinlock introduced by
>>
>>    c2f21ce2e312 ("locking: Implement new raw_spinlock")
>>
>>>> RT kernels merely make the issue deterministic.
>>
>> No, RT kernels change the rules, because suddenly a non-sleeping locking becomes
>> sleepable.
>>
>>>> Fix by converting mn_invalidate_lock to a raw spinlock so that
>>>> invalidate_range_start() remains non-sleeping while preserving the
>>>> existing serialization between invalidate_range_start() and
>>>> invalidate_range_end().
>>
>> This is insufficient.  To actually "fix" this in KVM mmu_lock would need to be
>> turned into a raw lock on all KVM architectures.  I suspect the only reason there
>> haven't been bug reports is because no one trips an OOM kill on VM while running
>> with CONFIG_DEBUG_ATOMIC_SLEEP=y.
>>
>> That combination is required because since commit
>>
>>    8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
>>
>> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
>> i.e. affects memory that may be mapped into the guest.
>>
>> E.g. this hack to simulate a non-blockable invalidation
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7015edce5bd8..7a35a83420ec 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>                  .handler        = kvm_mmu_unmap_gfn_range,
>>                  .on_lock        = kvm_mmu_invalidate_begin,
>>                  .flush_on_ret   = true,
>> -               .may_block      = mmu_notifier_range_blockable(range),
>> +               .may_block      = false,//mmu_notifier_range_blockable(range),
>>          };
>>   
>>          trace_kvm_unmap_hva_range(range->start, range->end);
>> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>           */
>>          gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
>>   
>> +       non_block_start();
>>          /*
>>           * If one or more memslots were found and thus zapped, notify arch code
>>           * that guest memory has been reclaimed.  This needs to be done *after*
>> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>           */
>>          if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>>                  kvm_arch_guest_memory_reclaimed(kvm);
>> +       non_block_end();
>>   
>>          return 0;
>>   }
>>
>> immediately triggers
>>
>>    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
>>    in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
>>    preempt_count: 0, expected: 0
>>    RCU nest depth: 0, expected: 0
>>    CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
>>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>    Call Trace:
>>     <TASK>
>>     dump_stack_lvl+0x51/0x60
>>     __might_resched+0x10e/0x160
>>     rt_write_lock+0x49/0x310
>>     kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
>>     __mmu_notifier_invalidate_range_start+0x9b/0x230
>>     do_wp_page+0xce1/0xf30
>>     __handle_mm_fault+0x380/0x3a0
>>     handle_mm_fault+0xde/0x290
>>     __get_user_pages+0x20d/0xbe0
>>     get_user_pages_unlocked+0xf6/0x340
>>     hva_to_pfn+0x295/0x420 [kvm]
>>     __kvm_faultin_pfn+0x5d/0x90 [kvm]
>>     kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
>>     kvm_tdp_page_fault+0xb6/0x160 [kvm]
>>     kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
>>     kvm_mmu_page_fault+0x8d/0x600 [kvm]
>>     vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
>>     kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
>>     kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
>>     __x64_sys_ioctl+0x8a/0xd0
>>     do_syscall_64+0x5e/0x11b0
>>     entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>     </TASK>
>>    kvm: emulating exchange as write
>>
>>
>> It's not at all clear to me that switching mmu_lock to a raw lock would be a net
>> positive for PREEMPT_RT.  OOM-killing a KVM guest in a PREEMPT_RT seems like a
>> comically rare scenario.  Whereas contending mmu_lock in normal operation is
>> relatively common (assuming there are even use cases for running VMs with a
>> PREEMPT_RT host kernel).
>>
>> In fact, the only reason the splat happens is because mmu_notifiers somewhat
>> artificially forces an atomic context via non_block_start() since commit
>>
>>    ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
>>
>> Given the massive amount of churn in KVM that would be required to fully eliminate
>> the splat, and that it's not at all obvious that it would be a good change overall,
>> at least for now:
>>
>> NAK
>>
>> I'm not fundamentally opposed to such a change, but there needs to be a _lot_
>> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
>>
> Hi Sean,
> Thanks for the detailed explanation and for spelling out the border
> issue.
> Understood on both points:
> 	1. The changelog wording was too strong; PREEMPT_RT changes
> 	spin_lock() semantics, and the splat is fundamentally due to
> 	spinlocks becoming sleepable there.
> 	2. Converting only mm_invalidate_lock to raw is insufficient
> 	since KVM can still take the mmu_lock (and other sleeping locks
> 	RT) in invalidate_range_start() when the invalidation hits a
> 	memslot.
> Given the above, it shounds like "convert locks to raw" is not the right
> direction without sinificat rework and justification.
> Would an acceptable direction be to handle the !blockable notifier case
> by deferring the heavyweight invalidation work(anything that take
> mmu_lock/may sleep on RT) to a context that may block(e.g. queued work),
> while keeping start()/end() accounting consisting with memslot changes ?
> if so, I can protoptype a patch along those lines and share for
> feedback.
> 
> Alternatively, if you think this needs to be addressed in
> mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
> redirect my efforts there-Please advise.

Have you considered a "OOM entered" callback for MMU notifiers?  KVM's 
MMU notifier can just remove itself for example, in fact there is code 
in kvm_destroy_vm() to do that even if invalidations are unbalanced.

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-06 18:14       ` Paolo Bonzini
@ 2026-03-12 19:24         ` shaikh kamaluddin
  2026-03-14  7:47           ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: shaikh kamaluddin @ 2026-03-12 19:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel,
	linux-rt-devel

On Fri, Mar 06, 2026 at 07:14:40PM +0100, Paolo Bonzini wrote:
> On 3/3/26 19:49, shaikh kamaluddin wrote:
> > On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> > > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > > > On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > > > > mmu_notifier_invalidate_range_start() may be invoked via
> > > > > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > > > > where sleeping is explicitly forbidden.
> > > > > 
> > > > > KVM's mmu_notifier invalidate_range_start currently takes
> > > > > mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
> > > > > to rt_mutex and may sleep, triggering:
> > > > > 
> > > > >    BUG: sleeping function called from invalid context
> > > > > 
> > > > > This violates the MMU notifier contract regardless of PREEMPT_RT;
> > > 
> > > I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
> > > that in invalidate_range_start() since
> > > 
> > >    e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
> > > 
> > > which was a full decade before mmu_notifiers even added the blockable concept in
> > > 
> > >    93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
> > > 
> > > and even predate the current concept of a "raw" spinlock introduced by
> > > 
> > >    c2f21ce2e312 ("locking: Implement new raw_spinlock")
> > > 
> > > > > RT kernels merely make the issue deterministic.
> > > 
> > > No, RT kernels change the rules, because suddenly a non-sleeping locking becomes
> > > sleepable.
> > > 
> > > > > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > > > > invalidate_range_start() remains non-sleeping while preserving the
> > > > > existing serialization between invalidate_range_start() and
> > > > > invalidate_range_end().
> > > 
> > > This is insufficient.  To actually "fix" this in KVM mmu_lock would need to be
> > > turned into a raw lock on all KVM architectures.  I suspect the only reason there
> > > haven't been bug reports is because no one trips an OOM kill on VM while running
> > > with CONFIG_DEBUG_ATOMIC_SLEEP=y.
> > > 
> > > That combination is required because since commit
> > > 
> > >    8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
> > > 
> > > KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
> > > i.e. affects memory that may be mapped into the guest.
> > > 
> > > E.g. this hack to simulate a non-blockable invalidation
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 7015edce5bd8..7a35a83420ec 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >                  .handler        = kvm_mmu_unmap_gfn_range,
> > >                  .on_lock        = kvm_mmu_invalidate_begin,
> > >                  .flush_on_ret   = true,
> > > -               .may_block      = mmu_notifier_range_blockable(range),
> > > +               .may_block      = false,//mmu_notifier_range_blockable(range),
> > >          };
> > >          trace_kvm_unmap_hva_range(range->start, range->end);
> > > @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >           */
> > >          gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
> > > +       non_block_start();
> > >          /*
> > >           * If one or more memslots were found and thus zapped, notify arch code
> > >           * that guest memory has been reclaimed.  This needs to be done *after*
> > > @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >           */
> > >          if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
> > >                  kvm_arch_guest_memory_reclaimed(kvm);
> > > +       non_block_end();
> > >          return 0;
> > >   }
> > > 
> > > immediately triggers
> > > 
> > >    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
> > >    in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
> > >    preempt_count: 0, expected: 0
> > >    RCU nest depth: 0, expected: 0
> > >    CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
> > >    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >    Call Trace:
> > >     <TASK>
> > >     dump_stack_lvl+0x51/0x60
> > >     __might_resched+0x10e/0x160
> > >     rt_write_lock+0x49/0x310
> > >     kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
> > >     __mmu_notifier_invalidate_range_start+0x9b/0x230
> > >     do_wp_page+0xce1/0xf30
> > >     __handle_mm_fault+0x380/0x3a0
> > >     handle_mm_fault+0xde/0x290
> > >     __get_user_pages+0x20d/0xbe0
> > >     get_user_pages_unlocked+0xf6/0x340
> > >     hva_to_pfn+0x295/0x420 [kvm]
> > >     __kvm_faultin_pfn+0x5d/0x90 [kvm]
> > >     kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
> > >     kvm_tdp_page_fault+0xb6/0x160 [kvm]
> > >     kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
> > >     kvm_mmu_page_fault+0x8d/0x600 [kvm]
> > >     vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
> > >     kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
> > >     kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
> > >     __x64_sys_ioctl+0x8a/0xd0
> > >     do_syscall_64+0x5e/0x11b0
> > >     entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > >     </TASK>
> > >    kvm: emulating exchange as write
> > > 
> > > 
> > > It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> > > positive for PREEMPT_RT.  OOM-killing a KVM guest in a PREEMPT_RT seems like a
> > > comically rare scenario.  Whereas contending mmu_lock in normal operation is
> > > relatively common (assuming there are even use cases for running VMs with a
> > > PREEMPT_RT host kernel).
> > > 
> > > In fact, the only reason the splat happens is because mmu_notifiers somewhat
> > > artificially forces an atomic context via non_block_start() since commit
> > > 
> > >    ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> > > 
> > > Given the massive amount of churn in KVM that would be required to fully eliminate
> > > the splat, and that it's not at all obvious that it would be a good change overall,
> > > at least for now:
> > > 
> > > NAK
> > > 
> > > I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> > > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
> > > 
> > Hi Sean,
> > Thanks for the detailed explanation and for spelling out the border
> > issue.
> > Understood on both points:
> > 	1. The changelog wording was too strong; PREEMPT_RT changes
> > 	spin_lock() semantics, and the splat is fundamentally due to
> > 	spinlocks becoming sleepable there.
> > 	2. Converting only mm_invalidate_lock to raw is insufficient
> > 	since KVM can still take the mmu_lock (and other sleeping locks
> > 	RT) in invalidate_range_start() when the invalidation hits a
> > 	memslot.
> > Given the above, it shounds like "convert locks to raw" is not the right
> > direction without sinificat rework and justification.
> > Would an acceptable direction be to handle the !blockable notifier case
> > by deferring the heavyweight invalidation work(anything that take
> > mmu_lock/may sleep on RT) to a context that may block(e.g. queued work),
> > while keeping start()/end() accounting consisting with memslot changes ?
> > if so, I can protoptype a patch along those lines and share for
> > feedback.
> > 
> > Alternatively, if you think this needs to be addressed in
> > mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
> > redirect my efforts there-Please advise.
> 
> Have you considered a "OOM entered" callback for MMU notifiers?  KVM's MMU
> notifier can just remove itself for example, in fact there is code in
> kvm_destroy_vm() to do that even if invalidations are unbalanced.
> 
> Paolo
>
Thanks for the suggestion! That's a much cleaner approach than what I was considering.

If I understand correctly, the idea would be:
1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
2. Have KVM implement it to unregister the notifier when OOM reaper starts
3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations

This avoids the whole "convert locks to raw" problem and the complexity of deferring work.

I have questions on Testing part:
------------------------------------
I tried to reproduce the bug scenario using the virtme-ng then running
the stress-ng putting memory pressure on VM, but not able to reproduce
the scenario.
I tried this way ..
vng -v -r ./arch/x86/boot/bzImage
VM is up, then running the stress-ng as below 
stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
OOM Killer is triggered, but exact bug not able to reproduce, Please
suggest how to reproduce this bug, even we need to verify after code
changes which you have suggested.

Shaikh Kamal

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-12 19:24         ` shaikh kamaluddin
@ 2026-03-14  7:47           ` Paolo Bonzini
  2026-03-25  5:19             ` shaikh kamaluddin
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2026-03-14  7:47 UTC (permalink / raw)
  To: shaikh kamaluddin
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel,
	linux-rt-devel

On 3/12/26 20:24, shaikh kamaluddin wrote:
>>> Alternatively, if you think this needs to be addressed in
>>> mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
>>> redirect my efforts there-Please advise.
>>
>> Have you considered a "OOM entered" callback for MMU notifiers?  KVM's MMU
>> notifier can just remove itself for example, in fact there is code in
>> kvm_destroy_vm() to do that even if invalidations are unbalanced.
>>
>> Paolo
>>
> Thanks for the suggestion! That's a much cleaner approach than what I was considering.
> 
> If I understand correctly, the idea would be:
> 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
> 2. Have KVM implement it to unregister the notifier when OOM reaper starts
> 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations

Yes pretty much.  Essentially, move the existing logic to the new 
callback and invoke it from kvm_destroy_vm().

> This avoids the whole "convert locks to raw" problem and the complexity of deferring work.
> 
> I have questions on Testing part:
> ------------------------------------
> I tried to reproduce the bug scenario using the virtme-ng then running
> the stress-ng putting memory pressure on VM, but not able to reproduce
> the scenario.
> I tried this way ..
> vng -v -r ./arch/x86/boot/bzImage
> VM is up, then running the stress-ng as below
> stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
> OOM Killer is triggered, but exact bug not able to reproduce, Please
> suggest how to reproduce this bug, even we need to verify after code
> changes which you have suggested.

I don't know, sorry.  But with this new approach there will always be a 
call to the new callback from the OOM killer, so it's easier to test.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-14  7:47           ` Paolo Bonzini
@ 2026-03-25  5:19             ` shaikh kamaluddin
  2026-03-26 18:23               ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: shaikh kamaluddin @ 2026-03-25  5:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm, linux-kernel,
	linux-rt-devel, skhan, me

On Sat, Mar 14, 2026 at 08:47:40AM +0100, Paolo Bonzini wrote:
> On 3/12/26 20:24, shaikh kamaluddin wrote:
> > > > Alternatively, if you think this needs to be addressed in
> > > > mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
> > > > redirect my efforts there-Please advise.
> > > 
> > > Have you considered a "OOM entered" callback for MMU notifiers?  KVM's MMU
> > > notifier can just remove itself for example, in fact there is code in
> > > kvm_destroy_vm() to do that even if invalidations are unbalanced.
> > > 
> > > Paolo
> > > 
> > Thanks for the suggestion! That's a much cleaner approach than what I was considering.
> > 
> > If I understand correctly, the idea would be:
> > 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
> > 2. Have KVM implement it to unregister the notifier when OOM reaper starts
> > 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations
> 
> Yes pretty much.  Essentially, move the existing logic to the new callback
> and invoke it from kvm_destroy_vm().
>

Hi Paolo,
Thank you for the suggestion to use an oom_enter callback approach. I've implemented v2 based on your guidance and have successfully validated it.

Implementation Summary:
-------------------------------------
Following your recommendation, I've added a new oom_enter callback to the mmu_notifier_ops structure. The implementation:

1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h
2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks
3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations
4. As per your suggestion, move existing kvm_destroy_vm() logic that already handles unbalanced invalidation to the new helper function kvm_mmu_notifier_detach() and invoke it from the kvm_destroy_vm()

Key Design Decision:
------------------------------
Implementation point no 4, while testing, Issue I was encountering is a recursive locking problem with the srcu lock, which is being acquired twice in the same context. This happens during the __mmu_notifier_oom_enter() and __synchronize_srcu() calls, leading to a potential deadlock.
Please find below log snippet while launching the Guest VM
------------------------------------------------------------------------------------------------
OOM_REAPER: START reaping:func:__mmu_notifier_oom_enter
[  399.841599][T10882] OOM_REAPER: START reaping:func:__mmu_notifier_oom_enter
[  399.841608][T10882] KVM: oom_enter callback invoked for VM:kvm_mmu_notifier_oom_enter
[  399.841608][T10882] KVM: oom_enter callback invoked for VM:kvm_mmu_notifier_oom_enter
[  399.841961][T10882]
[  399.841961][T10882]
[  399.841962][T10882] ============================================
[  399.841962][T10882] ============================================
[  399.841964][T10882] WARNING: possible recursive locking detected
[  399.841964][T10882] WARNING: possible recursive locking detected
[  399.841966][T10882] 7.0.0-rc2-00467-g4ae12d8bd9a8-dirty #12 Not tainted
[  399.841966][T10882] 7.0.0-rc2-00467-g4ae12d8bd9a8-dirty #12 Not tainted
[  399.841969][T10882] --------------------------------------------
[  399.841969][T10882] --------------------------------------------
[  399.841971][T10882] qemu-system-x86/10882 is trying to acquire lock:
[  399.841971][T10882] qemu-system-x86/10882 is trying to acquire lock:
[  399.841974][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x83/0x380
[  399.841974][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x83/0x380
[  399.841991][T10882]
[  399.841991][T10882] but task is already holding lock:
[  399.841991][T10882]
[  399.841991][T10882] but task is already holding lock:
[  399.841992][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __mmu_notifier_oom_enter+0x93/0x1f0
[  399.841992][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __mmu_notifier_oom_enter+0x93/0x1f0
[  399.842005][T10882]
[  399.842005][T10882] other info that might help us debug this:
[  399.842005][T10882]
[  399.842005][T10882] other info that might help us debug this:
[  399.842006][T10882]  Possible unsafe locking scenario:
[  399.842006][T10882]
[  399.842006][T10882]  Possible unsafe locking scenario:
[  399.842006][T10882]
[  399.842008][T10882]        CPU0
[  399.842008][T10882]        CPU0
[  399.842009][T10882]        ----
[  399.842009][T10882]        ----
[  399.842010][T10882]   lock(srcu);
[  399.842010][T10882]   lock(srcu);
[  399.842014][T10882]   lock(srcu);
[  399.842014][T10882]   lock(srcu);
[  399.842017][T10882]
[  399.842017][T10882]  *** DEADLOCK ***
[  399.842017][T10882]
[  399.842017][T10882]
[  399.842017][T10882]  *** DEADLOCK ***
[  399.842017][T10882]
[  399.842018][T10882]  May be due to missing lock nesting notation
[  399.842018][T10882]
[  399.842018][T10882]  May be due to missing lock nesting notation

-------------------------------------------------------------------------------------------------------------------
Then defered the kvm_mmu_notifier_detach() using workqueue, then above issue got fixed.


Testing:
-------------
I've validated the v2 approach with:

Kernel: v7.0-rc2 with PREEMPT_RT and DEBUG_ATOMIC_SLEEP enabled
Test: Triggered OOM conditions that killed a QEMU process with active KVM VM
Use these commands for generating scenario:
1. vng -v -r ./arch/x86/boot/bzImage --qemu-opts='-m 2G -cpu EPYC,+svm,+npt,+tsc,+invtsc -s '
After successfully booting the virtme-ng(QEMU) ------> Act Host VM
2. chmod 666 /dev/kvm
3. dmesg -c > /dev/null
4. launching Guest VM using this command $qemu-system-x86_64 -enable-kvm -m 1000M -mem-prealloc \
        -monitor none -serial none -display none -nographic & sleep 10

Results:
-------------------
1. oom_enter callback was successfully invoked
2 No SRCU deadlock warnings
3 No "sleeping function called from invalid context" warnings
4.OOM reaper completed successfully
5. Process was reaped without errors



Question:
Before I send the v2 patch series, I want to confirm this approach aligns with your expectations. Specifically:
Defered this coommon helper kvm_mmu_notifier_detach() for mmu_nottifier_unregister() and unbalanced invalidation using workque is good design?
Are there any specific test cases or scenarios you'd like me to validate?

I can send the complete v2 patch series once you confirm this approach is on the right track.

Thanks again for the guidance!

Shaikh Kamal

> > This avoids the whole "convert locks to raw" problem and the complexity of deferring work.
> > 
> > I have questions on Testing part:
> > ------------------------------------
> > I tried to reproduce the bug scenario using the virtme-ng then running
> > the stress-ng putting memory pressure on VM, but not able to reproduce
> > the scenario.
> > I tried this way ..
> > vng -v -r ./arch/x86/boot/bzImage
> > VM is up, then running the stress-ng as below
> > stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
> > OOM Killer is triggered, but exact bug not able to reproduce, Please
> > suggest how to reproduce this bug, even we need to verify after code
> > changes which you have suggested.
> 
> I don't know, sorry.  But with this new approach there will always be a call
> to the new callback from the OOM killer, so it's easier to test.
> 
> Thanks,
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-25  5:19             ` shaikh kamaluddin
@ 2026-03-26 18:23               ` Paolo Bonzini
  2026-03-28 14:50                 ` shaikh kamaluddin
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2026-03-26 18:23 UTC (permalink / raw)
  To: shaikh kamaluddin
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm,
	Kernel Mailing List, Linux, linux-rt-devel, Shuah Khan, me

Il mer 25 mar 2026, 06:19 shaikh kamaluddin
<shaikhkamal2012@gmail.com> ha scritto:
>
> 1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h
> 2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks
> 3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations
> 4. As per your suggestion, move existing kvm_destroy_vm() logic that already handles unbalanced invalidation to the new helper function kvm_mmu_notifier_detach() and invoke it from the kvm_destroy_vm()

This is not fully clear to me... It could be caused by a recursive
locking, or also a false positive. It's hard to say without seeing the
full backtrace, but seeing "lock(srcu)" is suspicious.

I wouldn't have expected deferral to be necessary; and it seems to me
that, if you defer removal to some time after the OOM reaper starts,
you'd have the same problem as before with sleeping spinlocks.

Can you post the original patch without deferral?

Paolo

>
> Key Design Decision:
> ------------------------------
> Implementation point no 4, while testing, Issue I was encountering is a recursive locking problem with the srcu lock, which is being acquired twice in the same context. This happens during the __mmu_notifier_oom_enter() and __synchronize_srcu() calls, leading to a potential deadlock.
> Please find below log snippet while launching the Guest VM


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-26 18:23               ` Paolo Bonzini
@ 2026-03-28 14:50                 ` shaikh kamaluddin
  2026-03-30 11:24                   ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: shaikh kamaluddin @ 2026-03-28 14:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm,
	Kernel Mailing List, Linux, linux-rt-devel, Shuah Khan, me

On Thu, Mar 26, 2026 at 07:23:58PM +0100, Paolo Bonzini wrote:
> Il mer 25 mar 2026, 06:19 shaikh kamaluddin
> <shaikhkamal2012@gmail.com> ha scritto:
> >
> > 1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h
> > 2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks
> > 3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations
> > 4. As per your suggestion, move existing kvm_destroy_vm() logic that already handles unbalanced invalidation to the new helper function kvm_mmu_notifier_detach() and invoke it from the kvm_destroy_vm()
> 
> This is not fully clear to me... It could be caused by a recursive
> locking, or also a false positive. It's hard to say without seeing the
> full backtrace, but seeing "lock(srcu)" is suspicious.
> 
> I wouldn't have expected deferral to be necessary; and it seems to me
> that, if you defer removal to some time after the OOM reaper starts,
> you'd have the same problem as before with sleeping spinlocks.
> 
> Can you post the original patch without deferral?
> 
> Paolo
>
Hi Paolo,

Here's the current implementation without deferral as you requested.

As you suspected, it causes an SRCU deadlock. The callback calls
kvm_mmu_notifier_detach() which attempts mmu_notifier_unregister()
while __mmu_notifier_oom_enter is holding SRCU.

Kernel log shows:
  WARNING: possible recursive locking detected
  lock(srcu) at __synchronize_srcu
  already holding lock at __mmu_notifier_oom_enter

Should the callback simply set a flag (kvm->oom_reaping) and have
invalidate_range_start check this flag to return early?

Current implementation (diff attached):
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..bdc035242f13 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -88,6 +88,23 @@ struct mmu_notifier_ops {
 	void (*release)(struct mmu_notifier *subscription,
 			struct mm_struct *mm);

+	/*
+	 * Called when the OOM reaper is about to reap this mm.
+	 * This is invoked before any invalidation attempts and allows
+	 * the subscriber to handle the fact that OOM reclaim will proceed
+	 * in non-blockable mode.
+	 *
+	 * This callback is optional and is called in atomic context.
+	 * It must not sleep or use any locks that may block.
+	 *
+	 * Common use case: unregister the MMU notifier to avoid being
+	 * called back in non-blockable invalidation context where
+	 * sleeping locks cannot be used.
+	 *
+	 * This is called with a reference held on the mm_struct.
+	 */
+	void (*oom_enter)(struct mmu_notifier *subscription,
+			  struct mm_struct *mm);
 	/*
 	 * clear_flush_young is called after the VM is
 	 * test-and-clearing the young/accessed bitflag in the
@@ -375,6 +392,7 @@ mmu_interval_check_retry(struct mmu_interval_notifier *interval_sub,

 extern void __mmu_notifier_subscriptions_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
+extern void __mmu_notifier_oom_enter(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
@@ -402,6 +420,13 @@ static inline void mmu_notifier_release(struct mm_struct *mm)
 		__mmu_notifier_release(mm);
 }

+static inline void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_oom_enter(mm);
+
+}
+
 static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..7c2259fabb6d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -359,6 +359,27 @@ void __mmu_notifier_release(struct mm_struct *mm)
 		mn_hlist_release(subscriptions, mm);
 }

+void __mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+	struct mmu_notifier *subscription;
+	int id;
+	pr_info("Entering :func:%s\n", __func__);
+	if (!mm->notifier_subscriptions)
+		return;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(subscription,
+				 &mm->notifier_subscriptions->list, hlist,
+				 rcu_read_lock_held(&srcu)) {
+		if(subscription->ops->oom_enter)
+			subscription->ops->oom_enter(subscription, mm);
+
+	}
+	srcu_read_unlock(&srcu, id);
+	pr_info("Done:%s\n", __func__);
+
+}
+
 /*
  * If no young bitflag is supported by the hardware, ->clear_flush_young can
  * unmap the address and return 1 or 0 depending if the mapping previously
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..9b487b210980 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -947,6 +947,9 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	mm = victim->mm;
 	mmgrab(mm);

+	/* Notify MMU notifiers about the OOM event */
+	mmu_notifier_oom_enter(mm);
+
 	/* Raise event before sending signal: task reaper must see this */
 	count_vm_event(OOM_KILL);
 	memcg_memory_event_mm(mm, MEMCG_OOM_KILL);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bc1da66b4b0..ffa40ebab452 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -885,6 +885,43 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }

+static void kvm_mmu_notifier_detach(struct kvm *kvm)
+{
+    /* Ensure this function is only executed once */
+    if (xchg(&kvm->mn_killed, 1))
+        return;
+
+    /* Unregister the MMU notifier */
+    mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+
+    /*
+     * At this point, pending calls to invalidate_range_start()
+     * have completed but no more MMU notifiers will run, so
+     * mn_active_invalidate_count may remain unbalanced.
+     * No threads can be waiting in kvm_swap_active_memslots() as the
+     * last reference on KVM has been dropped, but freeing
+     * memslots would deadlock without this manual intervention.
+     *
+     * If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
+     * notifier between a start() and end(), then there shouldn't be any
+     * in-progress invalidations.
+     */
+
+    WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+    if (kvm->mn_active_invalidate_count)
+        kvm->mn_active_invalidate_count = 0;
+    else
+        WARN_ON(kvm->mmu_invalidate_in_progress);
+}
+
+static void kvm_mmu_notifier_oom_enter(struct mmu_notifier *mn,
+					struct mm_struct *mm)
+{
+	struct kvm *kvm;
+	kvm = container_of(mn, struct kvm, mmu_notifier);
+	kvm_mmu_notifier_detach(kvm);
+}
+
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
@@ -892,6 +929,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.release		= kvm_mmu_notifier_release,
+	.oom_enter		= kvm_mmu_notifier_oom_enter,
 };

 static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -1280,24 +1318,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
-	/*
-	 * At this point, pending calls to invalidate_range_start()
-	 * have completed but no more MMU notifiers will run, so
-	 * mn_active_invalidate_count may remain unbalanced.
-	 * No threads can be waiting in kvm_swap_active_memslots() as the
-	 * last reference on KVM has been dropped, but freeing
-	 * memslots would deadlock without this manual intervention.
-	 *
-	 * If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
-	 * notifier between a start() and end(), then there shouldn't be any
-	 * in-progress invalidations.
-	 */
-	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
-	if (kvm->mn_active_invalidate_count)
-		kvm->mn_active_invalidate_count = 0;
-	else
-		WARN_ON(kvm->mmu_invalidate_in_progress);
+
+	/* Detach the MMU notifier before unregistering it */
+	kvm_mmu_notifier_detach(kvm);
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
 	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {


Thanks,
Kamal
> >
> > Key Design Decision:
> > ------------------------------
> > Implementation point no 4, while testing, Issue I was encountering is a recursive locking problem with the srcu lock, which is being acquired twice in the same context. This happens during the __mmu_notifier_oom_enter() and __synchronize_srcu() calls, leading to a potential deadlock.
> > Please find below log snippet while launching the Guest VM
> 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
  2026-03-28 14:50                 ` shaikh kamaluddin
@ 2026-03-30 11:24                   ` Paolo Bonzini
  2026-04-30 14:16                     ` [PATCH v2 0/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
  2026-04-30 14:17                     ` [PATCH v2 1/1] " shaikh.kamal
  0 siblings, 2 replies; 18+ messages in thread
From: Paolo Bonzini @ 2026-03-30 11:24 UTC (permalink / raw)
  To: shaikh kamaluddin
  Cc: Sean Christopherson, Sebastian Andrzej Siewior, kvm,
	Kernel Mailing List, Linux, linux-rt-devel, Shuah Khan, me

[-- Attachment #1: Type: text/plain, Size: 1406 bytes --]

On Sat, Mar 28, 2026 at 3:50 PM shaikh kamaluddin
<shaikhkamal2012@gmail.com> wrote:
> +void __mmu_notifier_oom_enter(struct mm_struct *mm)
> +{
> +       struct mmu_notifier *subscription;
> +       int id;
> +       pr_info("Entering :func:%s\n", __func__);
> +       if (!mm->notifier_subscriptions)
> +               return;
> +
> +       id = srcu_read_lock(&srcu);
> +       hlist_for_each_entry_rcu(subscription,
> +                                &mm->notifier_subscriptions->list, hlist,
> +                                rcu_read_lock_held(&srcu)) {
> +               if(subscription->ops->oom_enter)
> +                       subscription->ops->oom_enter(subscription, mm);
> +
> +       }
> +       srcu_read_unlock(&srcu, id);
> +       pr_info("Done:%s\n", __func__);

Yeah, calling mmu_notifier_unregister() won't work from within this function.

One possibility is for the new method to be something like this:

       void (*after_oom_unregister)(struct mmu_notifier *subscription);

So it only has to do

kvm->mn_registered = false; /* or xchg, it's the same */
WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
if (kvm->mn_active_invalidate_count)
    kvm->mn_active_invalidate_count = 0;
else
    WARN_ON(kvm->mmu_invalidate_in_progress);

or something like that. See the attached sketch, feel free to reuse it
as you see fit.

Paolo

[-- Attachment #2: mm.patch --]
[-- Type: text/x-patch, Size: 4393 bytes --]

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 3b670ee4eb26..7b14d8099cc1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1973,12 +1973,15 @@ static gpa_t svm_translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa,
 				      struct x86_exception *exception,
 				      u64 pte_access)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 
 	BUG_ON(!mmu_is_nested(vcpu));
 
-	/* NPT walks are always user-walks */
-	access |= PFERR_USER_MASK;
+	/* Non-GMET walks are always user-walks */
+	if (!(svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_GMET_ENABLE))
+		access |= PFERR_USER_MASK;
+
 	return mmu->gva_to_gpa(vcpu, mmu, gpa, access, exception);
 }
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index e4cb317807ab..4a1c1f5297c4 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7444,6 +7444,15 @@ static gpa_t vmx_translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa,
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 
 	BUG_ON(!mmu_is_nested(vcpu));
+
+	/*
+	 * MBEC differentiates based on the effective U/S bit of
+	 * the guest page tables; not the processor CPL.
+	 */
+	access &= ~PFERR_USER_MASK;
+	if ((pte_access & ACC_USER_MASK) && (access & PFERR_GUEST_FINAL_MASK))
+		access |= PFERR_USER_MASK;
+
 	return mmu->gva_to_gpa(vcpu, mmu, gpa, access, exception);
 }
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8450e18a87c2..3c67ec15c09c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -212,6 +212,14 @@ struct mmu_notifier_ops {
 	 */
 	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm);
 	void (*free_notifier)(struct mmu_notifier *subscription);
+
+	/*
+	 * Any mmu notifier that defines this is automatically unregistered
+	 * when its mm is the subject of an OOM kill.  after_oom_unregister()
+	 * is invoked after all other outstanding callbacks have terminated.
+	 */
+	void (*after_oom_unregister)(struct mmu_notifier *subscription,
+				     struct mm_struct *mm);
 };
 
 /*
@@ -287,6 +295,7 @@ mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm)
 }
 void mmu_notifier_put(struct mmu_notifier *subscription);
 void mmu_notifier_synchronize(void);
+void mmu_notifier_oom_enter(struct mm_struct *mm);
 
 extern int mmu_notifier_register(struct mmu_notifier *subscription,
 				 struct mm_struct *mm);
@@ -661,6 +670,10 @@ static inline void mmu_notifier_synchronize(void)
 {
 }
 
+static inline void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+}
+
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..deba056468b1 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -835,6 +835,56 @@ void mmu_notifier_unregister(struct mmu_notifier *subscription,
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
+void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+	struct mmu_notifier_subscriptions *subscriptions = mm->notifier_subscriptions;
+	struct mmu_notifier *subscription;
+	struct hlist_node *tmp;
+	HLIST_HEAD(oom_list);
+	int id;
+
+	id = srcu_read_lock(&srcu);
+
+	/*
+	 * Prevent further calls to the MMU notifier, except for
+	 * release and after_oom_unregister.
+	 */
+	spin_lock(&subscriptions->lock);
+	hlist_for_each_entry_safe(subscription, tmp, &subscriptions->list, hlist) {
+		if (!subscription->ops->after_oom_unregister)
+			continue;
+
+		/*
+		 * after_oom_unregister and alloc_notifier are incompatible,
+		 * because there could be other references to allocated
+		 * notifiers.
+		 */
+		if (WARN_ON(subscription->ops->alloc_notifier))
+			continue;
+
+		hlist_del_init_rcu(&subscription->hlist);
+		hlist_add_head(&subscription->hlist, &oom_list);
+	}
+	spin_unlock(&subscriptions->lock);
+
+	hlist_for_each_entry(subscription, &oom_list, hlist)
+		if (subscription->ops->release)
+			subscription->ops->release(subscription, mm);
+	srcu_read_unlock(&srcu, id);
+
+	if (hlist_empty(&oom_list))
+		return;
+
+	synchronize_srcu(&srcu);
+
+	hlist_for_each_entry_safe(subscription, tmp, &oom_list, hlist) {
+		subscription->ops->after_oom_unregister(subscription, mm);
+
+		BUG_ON(atomic_read(&mm->mm_count) <= 0);
+		mmdrop(mm);
+	}
+}
+
 static void mmu_notifier_free_rcu(struct rcu_head *rcu)
 {
 	struct mmu_notifier *subscription =

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 0/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
  2026-03-30 11:24                   ` Paolo Bonzini
@ 2026-04-30 14:16                     ` shaikh.kamal
  2026-04-30 14:17                     ` [PATCH v2 1/1] " shaikh.kamal
  1 sibling, 0 replies; 18+ messages in thread
From: shaikh.kamal @ 2026-04-30 14:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kvm, linux-rt-devel, pbonzini, skhan, me, shaikh.kamal,
	syzbot+c3178b6b512446632bac

This series implements the after_oom_unregister callback design
proposed by Paolo in v1 review [1].

The current OOM notifier path calls synchronize_srcu() inline from
mmu_notifier_oom_enter(), which can deadlock on PREEMPT_RT when
locks such as siglock are held. This series moves the cleanup to an
asynchronous context using call_srcu(), allowing the OOM path to
proceed without waiting for an SRCU grace period.

Subscribers opt in via a new after_oom_unregister callback in
struct mmu_notifier_ops.

KVM is the first (and currently only) user.

Changes since v1 [1]:
- Implement after_oom_unregister callback in struct
  mmu_notifier_ops as proposed by Paolo
- Add mmu_notifier_oom_enter() to detach subscriptions and
  schedule cleanup via call_srcu()
- Add mmu_notifier_barrier() (srcu_barrier wrapper) so consumers
  can wait for pending callbacks during teardown
- Move call site from __oom_kill_process() to __oom_reap_task_mm()
  to fix KASAN vmalloc-out-of-bounds observed in v1
- Use hlist_del_init() to keep hlist_unhashed() correct for the
  kvm_destroy_vm() detection path, avoiding use-after-free on the
  stack-allocated oom_list head
- Add KVM after_oom_unregister implementation to clear
  mn_active_invalidate_count
- Update kvm_destroy_vm() to detect detached subscriptions via
  hlist_unhashed() and use mmu_notifier_barrier() + mmdrop()
  instead of mmu_notifier_unregister()
- Remove pr_err() on GFP_ATOMIC failure per checkpatch; the
  trade-off is documented inline

Testing
-------

Developed and tested under virtme-ng with PREEMPT_RT, KASAN, and
lockdep enabled.

Test setup:
- simple_kvm.c: minimal userspace program that opens /dev/kvm,
  creates a VM, registers memory, creates a vCPU, and sleeps
- CONFIG_DEBUG_VM-only debugfs interface (not part of this
  submission) at /sys/kernel/debug/oom_reap_task to invoke
  __oom_reap_task_mm() on a target task

Test sequence:
  $ ./simple_kvm &
  $ echo $! | sudo tee /sys/kernel/debug/oom_reap_task

Observed with patch applied:
- __oom_reap_task_mm() completes
- mmu_notifier_oom_enter() detaches the KVM subscription
- call_srcu() callback runs after (SRCU grace period)
- KVM after_oom_unregister clears mn_active_invalidate_count
- mmu_notifier_barrier() returns cleanly
- No KASAN reports, no kernel BUGs, lockdep clean

Stress runs (20 iterations) showed consistent results.

Reproducing the syzbot-reported issue
-------------------------------------
The issue reported by syzbot is reproducible on an unpatched
PREEMPT_RT kernel, triggering a "sleeping function called from
invalid context" warning in kvm_mmu_notifier_invalidate_range_start().
With this patch applied, the warning is no longer observed..


Known limitations
-----------------

Failure of GFP_ATOMIC allocation in mmu_notifier_oom_enter()
causes the corresponding after_oom_unregister callback to be
skipped. The OOM path cannot sleep without reintroducing the
deadlock this series fixes, and synchronous execution would
require waiting for SRCU readers. Cleanup still occurs later via
the normal unregister path. A mempool-backed allocator could
address this in the future.

[1] https://lore.kernel.org/all/CABgObfZQM0Eq1=vzm812D+CAcjOaE1f1QAUqGo5rTzXgLnR9cQ@mail.gmail.com

Reported-by: syzbot+c3178b6b512446632bac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c3178b6b512446632bac
Tested-by: Shaikh Kamaluddin <shaikhkamal2012@gmail.com>

shaikh.kamal (1):
  mm/mmu_notifier: Add async OOM cleanup via call_srcu()

 include/linux/mmu_notifier.h |  10 +++
 mm/mmu_notifier.c            | 123 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c                |   3 +
 virt/kvm/kvm_main.c          |  27 +++++++-
 4 files changed, 162 insertions(+), 1 deletion(-)

--
2.43.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
  2026-03-30 11:24                   ` Paolo Bonzini
  2026-04-30 14:16                     ` [PATCH v2 0/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
@ 2026-04-30 14:17                     ` shaikh.kamal
  1 sibling, 0 replies; 18+ messages in thread
From: shaikh.kamal @ 2026-04-30 14:17 UTC (permalink / raw)
  To: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, David Rientjes, Shakeel Butt,
	linux-mm, linux-kernel, kvm, linux-rt-devel
  Cc: pbonzini, skhan, me, shaikh.kamal, syzbot+c3178b6b512446632bac

When an mm undergoes OOM kill, the OOM reaper unmaps memory while
holding the mmap_lock. MMU notifier subscribers (notably KVM) need
to be informed so they can tear down their secondary mappings. The
current synchronous unregister path can deadlock on PREEMPT_RT
because synchronize_srcu() is called from contexts that cannot
safely sleep.

This patch implements the asynchronous cleanup design proposed by
Paolo Bonzini in v1 review: a new optional after_oom_unregister
callback in struct mmu_notifier_ops, invoked after the SRCU grace
period via call_srcu() so that no readers can still reference the
subscription when cleanup runs.

The flow is:

  1. The OOM reaper calls mmu_notifier_oom_enter() from
     __oom_reap_task_mm().
  2. mmu_notifier_oom_enter() walks the subscription list and, for
     each subscriber that provides after_oom_unregister, detaches
     the subscription from the active list and schedules a
     call_srcu() callback.
  3. The deferred callback invokes after_oom_unregister once the
     grace period has elapsed and all in-flight readers have
     finished.
  4. Subsystems waiting to free structures referenced by the
     callback can call the new mmu_notifier_barrier() helper, which
     wraps srcu_barrier() to wait for all outstanding callbacks
     scheduled this way.

after_oom_unregister is mutually exclusive with alloc_notifier
because allocated notifiers can have additional outstanding
references that the OOM path cannot safely drop.

KVM is updated to provide after_oom_unregister, which clears
mn_active_invalidate_count, and to detect via hlist_unhashed() in
kvm_destroy_vm() when its subscription was already detached by the
OOM path; in that case it calls mmu_notifier_barrier() and drops
the mm reference rather than calling mmu_notifier_unregister().

Reported-by: syzbot+c3178b6b512446632bac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c3178b6b512446632bac
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/all/20260209161527.31978-1-shaikhkamal2012@gmail.com/

Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>
---
 include/linux/mmu_notifier.h |  10 +++
 mm/mmu_notifier.c            | 123 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c                |   3 +
 virt/kvm/kvm_main.c          |  27 +++++++-
 4 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..0ccd590f55d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -88,6 +88,14 @@ struct mmu_notifier_ops {
 	void (*release)(struct mmu_notifier *subscription,
 			struct mm_struct *mm);

+	/*
+	 * Any mmu notifier that defines this is automatically unregistered
+	 * when its mm is the subject of an OOM kill.  after_oom_unregister()
+	 * is invoked after all other outstanding callbacks have terminated.
+	 */
+	void (*after_oom_unregister)(struct mmu_notifier *subscription,
+				     struct mm_struct *mm);
+
 	/*
 	 * clear_flush_young is called after the VM is
 	 * test-and-clearing the young/accessed bitflag in the
@@ -375,6 +383,8 @@ mmu_interval_check_retry(struct mmu_interval_notifier *interval_sub,

 extern void __mmu_notifier_subscriptions_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
+void mmu_notifier_oom_enter(struct mm_struct *mm);
+extern void mmu_notifier_barrier(void);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..b8fa58fe6b7d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -49,6 +49,37 @@ struct mmu_notifier_subscriptions {
 	struct hlist_head deferred_list;
 };

+/*
+ * Callback structure for asynchronous OOM cleanup.
+ * Used with call_srcu() to defer after_oom_unregister callbacks
+ * until after SRCU grace period completes.
+ */
+struct mmu_notifier_oom_callback {
+	struct rcu_head rcu;
+	struct mmu_notifier *subscription;
+	struct mm_struct *mm;
+};
+
+/*
+ * Callback function invoked after SRCU grace period.
+ * Safely calls after_oom_unregister once all readers have finished.
+ */
+static void mmu_notifier_oom_callback_fn(struct rcu_head *rcu)
+{
+	struct mmu_notifier_oom_callback *cb =
+		container_of(rcu, struct mmu_notifier_oom_callback, rcu);
+
+	/* Safe - all SRCU readers have finished */
+	cb->subscription->ops->after_oom_unregister(cb->subscription, cb->mm);
+
+	/* Release mm reference taken when callback was scheduled */
+	WARN_ON_ONCE(atomic_read(&cb->mm->mm_count) <= 0);
+	mmdrop(cb->mm);
+
+	/* Free callback structure */
+	kfree(cb);
+}
+
 /*
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at
@@ -359,6 +390,85 @@ void __mmu_notifier_release(struct mm_struct *mm)
 		mn_hlist_release(subscriptions, mm);
 }

+void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+	struct mmu_notifier_subscriptions *subscriptions =
+						mm->notifier_subscriptions;
+	struct mmu_notifier *subscription;
+	struct hlist_node *tmp;
+	HLIST_HEAD(oom_list);
+	int id;
+
+	if (!subscriptions)
+		return;
+
+	id = srcu_read_lock(&srcu);
+
+	/*
+	 * Prevent further calls to the MMU notifier, except for
+	 * release and after_oom_unregister.
+	 */
+	spin_lock(&subscriptions->lock);
+	hlist_for_each_entry_safe(subscription, tmp,
+				  &subscriptions->list, hlist) {
+		if (!subscription->ops->after_oom_unregister)
+			continue;
+
+		/*
+		 * after_oom_unregister and alloc_notifier are incompatible,
+		 * because there could be other references to allocated
+		 * notifiers.
+		 */
+		if (WARN_ON(subscription->ops->alloc_notifier))
+			continue;
+
+		hlist_del_init_rcu(&subscription->hlist);
+		hlist_add_head(&subscription->hlist, &oom_list);
+	}
+	spin_unlock(&subscriptions->lock);
+	hlist_for_each_entry(subscription, &oom_list, hlist)
+		if (subscription->ops->release)
+			subscription->ops->release(subscription, mm);
+
+	srcu_read_unlock(&srcu, id);
+
+	if (hlist_empty(&oom_list))
+		return;
+
+	hlist_for_each_entry_safe(subscription, tmp,
+				  &oom_list, hlist) {
+		struct mmu_notifier_oom_callback *cb;
+		/*
+		 * Remove from stack-based oom_list and reset hlist to unhashed state.
+		 * This sets subscription->hlist.pprev = NULL, so future callers of
+		 * mmu_notifier_unregister() (e.g. kvm_destroy_vm) will see
+		 * hlist_unhashed() == true and take the safe path, avoiding
+		 * use-after-free on the stack-allocated oom_list head.
+		 */
+		hlist_del_init(&subscription->hlist);
+
+		/*
+		 * GFP_ATOMIC failure is exceedingly rare. We cannot sleep
+		 * here (would reintroduce the deadlock this patch fixes)
+		 * and cannot call after_oom_unregister synchronously
+		 * without first waiting for SRCU readers. The subscriber
+		 * will not receive after_oom_unregister but cleanup will
+		 * eventually happen via the unregister path.
+		 */
+		cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+		if (!cb)
+			continue;
+
+		cb->subscription = subscription;
+		cb->mm = mm;
+		mmgrab(mm);
+
+		/* Schedule callback - returns immediately */
+		call_srcu(&srcu, &cb->rcu, mmu_notifier_oom_callback_fn);
+	}
+
+}
+
 /*
  * If no young bitflag is supported by the hardware, ->clear_flush_young can
  * unmap the address and return 1 or 0 depending if the mapping previously
@@ -1096,3 +1206,16 @@ void mmu_notifier_synchronize(void)
 	synchronize_srcu(&srcu);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
+
+/**
+ * mmu_notifier_barrier - Wait for all pending MMU notifier callbacks
+ *
+ * Waits for all call_srcu() callbacks scheduled by mmu_notifier_oom_enter()
+ * to complete. Used by subsystems during cleanup to prevent use-after-free
+ * when destroying structures accessed by the callbacks.
+ */
+void mmu_notifier_barrier(void)
+{
+	srcu_barrier(&srcu);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_barrier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..029e041afc57 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -519,6 +519,9 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	bool ret = true;
 	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

+	/* Notify MMU notifiers about the OOM event */
+	mmu_notifier_oom_enter(mm);
+
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bc1da66b4b0..a2df83d3b413 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -885,6 +885,24 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }

+static void kvm_mmu_notifier_after_oom_unregister(struct mmu_notifier *mn,
+					struct mm_struct *mm)
+{
+	struct kvm *kvm;
+
+	kvm = mmu_notifier_to_kvm(mn);
+
+	/*
+	 * At this point the unregister has completed and all other callbacks
+	 * have terminated. Clean up any unbalanced invalidation counts.
+	 */
+	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+	if (kvm->mn_active_invalidate_count)
+		kvm->mn_active_invalidate_count = 0;
+	else
+		WARN_ON(kvm->mmu_invalidate_in_progress);
+}
+
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
@@ -892,6 +910,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.release		= kvm_mmu_notifier_release,
+	.after_oom_unregister	= kvm_mmu_notifier_after_oom_unregister,
 };

 static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -1280,7 +1299,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	if (hlist_unhashed(&kvm->mmu_notifier.hlist)) {
+		/* Subscription removed by OOM. Wait for async callback. */
+		mmu_notifier_barrier();
+		mmdrop(kvm->mm);
+	} else {
+		mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	}
 	/*
 	 * At this point, pending calls to invalidate_range_start()
 	 * have completed but no more MMU notifiers will run, so
--
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] KVM: x86/xen: Fix sleeping lock in hard IRQ context on PREEMPT_RT
@ 2026-04-01 15:40 Sean Christopherson
  2026-04-29 22:25 ` [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2026-04-01 15:40 UTC (permalink / raw)
  To: David Woodhouse
  Cc: rostedt@goodmis.org, shaikhkamal2012@gmail.com,
	syzbot+919877893c9d28162dc2@syzkaller.appspotmail.com,
	me@brighamcampbell.com, linux-rt-devel@lists.linux.dev,
	hpa@zytor.com, linux-kernel@vger.kernel.org, paul@xen.org,
	kvm@vger.kernel.org, skhan@linuxfoundation.org

On Mon, Mar 30, 2026, David Woodhouse wrote:
> On Mon, 2026-03-30 at 10:18 -0400, Steven Rostedt wrote:
> > 
> > > +static void xen_timer_inject_irqwork(struct irq_work *work)
> > > +{
> > > +     struct kvm_vcpu_xen *xen = container_of(work, struct kvm_vcpu_xen,
> > > +                                             timer_inject_irqwork);
> > > +     struct kvm_vcpu *vcpu = container_of(xen, struct kvm_vcpu, arch.xen);
> > > +     struct kvm_xen_evtchn e;
> > > +     int rc;
> > > +
> > > +     e.vcpu_id = vcpu->vcpu_id;
> > > +     e.vcpu_idx = vcpu->vcpu_idx;
> > > +     e.port = vcpu->arch.xen.timer_virq;
> > > +     e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> > > +
> > > +     rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> > > +     if (rc != -EWOULDBLOCK)
> > > +             vcpu->arch.xen.timer_expires = 0;
> > > +}
> > 
> > Why duplicate this code and not simply make a static inline helper
> > function that is used in both places?
> 
> It's already duplicating the functionality; the original
> xen_timer_callback() will already fall back to injecting the IRQ in
> process context when it needs to (by setting vcpu-
> >arch.xen.timer_pending and then setting KVM_REQ_UNBLOCK).
> 
> All you had to do was make kvm_xen_set_evtchn_fast() return 
> -EWOULDBLOCK in the in_hardirq() case in order to use the existing
> fallback, surely? 
> 
> Better still, can't kvm_xen_set_evtchn_fast() just use read_trylock()
> instead?

Re-reading through the thread where you proposed using trylock, and through
commit bbe17c625d68 ("KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()"),
I think I agree with using trylock for "fast" paths.

Though I would prefer to not make it unconditional for the "fast" helper instead
of conditional based on in_interrupt().  And before we start doing surgery to
"fix" a setup no one uses, and also before we use gpcs more broadly, I think we
should try to up-level the gpc APIs to reduce the amount of duplicate, boilerplate
code.  kvm_xen_update_runstate_guest() and maybe kvm_xen_set_evtchn() will likely
need to open code some amount of logic, but 

Side topic, looks like kvm_xen_shared_info_init() is buggy in that it fails to
mark the slot as dirty.

E.g. sans the API implementations, I think we can and should end up with code
like this:

---
 arch/x86/kvm/x86.c |  14 ++---
 arch/x86/kvm/xen.c | 127 ++++++++++++---------------------------------
 2 files changed, 37 insertions(+), 104 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b5d48e75b65..65bad25fd9d4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3274,15 +3274,8 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
 
 	memcpy(&hv_clock, ref_hv_clock, sizeof(hv_clock));
 
-	read_lock_irqsave(&gpc->lock, flags);
-	while (!kvm_gpc_check(gpc, offset + sizeof(*guest_hv_clock))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
-
-		if (kvm_gpc_refresh(gpc, offset + sizeof(*guest_hv_clock)))
-			return;
-
-		read_lock_irqsave(&gpc->lock, flags);
-	}
+	if (kvm_gpc_acquire(gpc))
+		return;
 
 	guest_hv_clock = (void *)(gpc->khva + offset);
 
@@ -3305,8 +3298,7 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
 
 	guest_hv_clock->version = ++hv_clock.version;
 
-	kvm_gpc_mark_dirty_in_slot(gpc);
-	read_unlock_irqrestore(&gpc->lock, flags);
+	kvm_gpc_release_dirty(gpc);
 
 	trace_kvm_pvclock_update(vcpu->vcpu_id, &hv_clock);
 }
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 91fd3673c09a..a97fd88ee99c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -42,19 +42,12 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	u32 *wc_sec_hi;
 	u32 wc_version;
 	u64 wall_nsec;
-	int ret = 0;
 	int idx = srcu_read_lock(&kvm->srcu);
+	int ret;
 
-	read_lock_irq(&gpc->lock);
-	while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
-		read_unlock_irq(&gpc->lock);
-
-		ret = kvm_gpc_refresh(gpc, PAGE_SIZE);
-		if (ret)
-			goto out;
-
-		read_lock_irq(&gpc->lock);
-	}
+	ret = kvm_gpc_acquire(gpc);
+	if (ret)
+		goto out;
 
 	/*
 	 * This code mirrors kvm_write_wall_clock() except that it writes
@@ -96,7 +89,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	smp_wmb();
 
 	wc->version = wc_version + 1;
-	read_unlock_irq(&gpc->lock);
+	kvm_gpc_release_dirty(gpc);
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
 
@@ -155,22 +148,14 @@ static int xen_get_guest_pvclock(struct kvm_vcpu *vcpu,
 				 struct gfn_to_pfn_cache *gpc,
 				 unsigned int offset)
 {
-	unsigned long flags;
 	int r;
 
-	read_lock_irqsave(&gpc->lock, flags);
-	while (!kvm_gpc_check(gpc, offset + sizeof(*hv_clock))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
-
-		r = kvm_gpc_refresh(gpc, offset + sizeof(*hv_clock));
-		if (r)
-			return r;
-
-		read_lock_irqsave(&gpc->lock, flags);
-	}
+	r = kvm_gpc_acquire(gpc);
+	if (r)
+		return r;
 
 	memcpy(hv_clock, gpc->khva + offset, sizeof(*hv_clock));
-	read_unlock_irqrestore(&gpc->lock, flags);
+	kvm_gpc_release_clean(gpc);
 
 	/*
 	 * Sanity check TSC shift+multiplier to verify the guest's view of time
@@ -420,27 +405,8 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * Attempt to obtain the GPC lock on *both* (if there are two)
 	 * gfn_to_pfn caches that cover the region.
 	 */
-	if (atomic) {
-		local_irq_save(flags);
-		if (!read_trylock(&gpc1->lock)) {
-			local_irq_restore(flags);
-			return;
-		}
-	} else {
-		read_lock_irqsave(&gpc1->lock, flags);
-	}
-	while (!kvm_gpc_check(gpc1, user_len1)) {
-		read_unlock_irqrestore(&gpc1->lock, flags);
-
-		/* When invoked from kvm_sched_out() we cannot sleep */
-		if (atomic)
-			return;
-
-		if (kvm_gpc_refresh(gpc1, user_len1))
-			return;
-
-		read_lock_irqsave(&gpc1->lock, flags);
-	}
+	if (__kvm_gpc_acquire(gpc, atomic))
+		return;
 
 	if (likely(!user_len2)) {
 		/*
@@ -465,6 +431,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		 * gpc1 lock to make lockdep shut up about it.
 		 */
 		lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
+
 		if (atomic) {
 			if (!read_trylock(&gpc2->lock)) {
 				read_unlock_irqrestore(&gpc1->lock, flags);
@@ -575,13 +542,10 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		smp_wmb();
 	}
 
-	if (user_len2) {
-		kvm_gpc_mark_dirty_in_slot(gpc2);
-		read_unlock(&gpc2->lock);
-	}
+	if (user_len2)
+		kvm_gpc_release_dirty(gpc2);
 
-	kvm_gpc_mark_dirty_in_slot(gpc1);
-	read_unlock_irqrestore(&gpc1->lock, flags);
+	kvm_gpc_release_dirty(gpc1);
 }
 
 void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
@@ -645,20 +609,8 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 	if (!evtchn_pending_sel)
 		return;
 
-	/*
-	 * Yes, this is an open-coded loop. But that's just what put_user()
-	 * does anyway. Page it in and retry the instruction. We're just a
-	 * little more honest about it.
-	 */
-	read_lock_irqsave(&gpc->lock, flags);
-	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
-
-		if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
-			return;
-
-		read_lock_irqsave(&gpc->lock, flags);
-	}
+	if (kvm_gpc_acquire(gpc))
+		return;
 
 	/* Now gpc->khva is a valid kernel address for the vcpu_info */
 	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
@@ -686,8 +638,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 		WRITE_ONCE(vi->evtchn_upcall_pending, 1);
 	}
 
-	kvm_gpc_mark_dirty_in_slot(gpc);
-	read_unlock_irqrestore(&gpc->lock, flags);
+	kvm_gpc_release_dirty(gpc);
 
 	/* For the per-vCPU lapic vector, deliver it as MSI. */
 	if (v->arch.xen.upcall_vector)
@@ -697,8 +648,8 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 {
 	struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 	u8 rc = 0;
+	int r;
 
 	/*
 	 * If the global upcall vector (HVMIRQ_callback_vector) is set and
@@ -713,33 +664,23 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 	BUILD_BUG_ON(sizeof(rc) !=
 		     sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
 
-	read_lock_irqsave(&gpc->lock, flags);
-	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
-
-		/*
-		 * This function gets called from kvm_vcpu_block() after setting the
-		 * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately
-		 * from a HLT. So we really mustn't sleep. If the page ended up absent
-		 * at that point, just return 1 in order to trigger an immediate wake,
-		 * and we'll end up getting called again from a context where we *can*
-		 * fault in the page and wait for it.
-		 */
-		if (in_atomic() || !task_is_running(current))
-			return 1;
-
-		if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info))) {
-			/*
-			 * If this failed, userspace has screwed up the
-			 * vcpu_info mapping. No interrupts for you.
-			 */
-			return 0;
-		}
-		read_lock_irqsave(&gpc->lock, flags);
-	}
+	/*
+	 * This function gets called from kvm_vcpu_block() after setting the
+	 * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately
+	 * from a HLT. So we really mustn't sleep. If the page ended up absent
+	 * at that point, just return 1 in order to trigger an immediate wake,
+	 * and we'll end up getting called again from a context where we *can*
+	 * fault in the page and wait for it.
+	 *
+	 * If acquiring the cache fails completely, then userspace has screwed
+	 * up the vcpu_info mapping. No interrupts for you.
+	 */
+	r = __kvm_gpc_acquire(gpc, in_atomic() || !task_is_running(current));
+	if (r)
+		return r == -EWOULDBLOCK ? 1 : 0;
 
 	rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
-	read_unlock_irqrestore(&gpc->lock, flags);
+	kvm_gpc_release_clean(gpc);
 	return rc;
 }
 

base-commit: 3d6cdcc8883b5726513d245eef0e91cabfc397f7
-- 

[*] https://lore.kernel.org/all/76c61e1cb86e04df892d74c10976597700fe4cb5.camel@infradead.org

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
  2026-04-01 15:40 [PATCH] KVM: x86/xen: Fix sleeping lock in hard IRQ context on PREEMPT_RT Sean Christopherson
@ 2026-04-29 22:25 ` shaikh.kamal
  2026-05-03  3:26   ` kernel test robot
  2026-05-03  3:26   ` kernel test robot
  0 siblings, 2 replies; 18+ messages in thread
From: shaikh.kamal @ 2026-04-29 22:25 UTC (permalink / raw)
  To: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, David Rientjes, Shakeel Butt,
	linux-mm, linux-kernel, kvm, linux-rt-devel
  Cc: pbonzini, skhan, me, syzbot+c3178b6b512446632bac, shaikh.kamal

When an mm undergoes OOM kill, the OOM reaper unmaps memory while
holding the mmap_lock. MMU notifier subscribers (notably KVM) need
to be informed so they can tear down their secondary mappings. The
current synchronous unregister path can deadlock on PREEMPT_RT
because synchronize_srcu() is called from contexts that cannot
safely sleep.

This patch implements the asynchronous cleanup design proposed by
Paolo Bonzini in v1 review: a new optional after_oom_unregister
callback in struct mmu_notifier_ops, invoked after the SRCU grace
period via call_srcu() so that no readers can still reference the
subscription when cleanup runs.

The flow is:

  1. The OOM reaper calls mmu_notifier_oom_enter() from
     __oom_reap_task_mm().
  2. mmu_notifier_oom_enter() walks the subscription list and, for
     each subscriber that provides after_oom_unregister, detaches
     the subscription from the active list and schedules a
     call_srcu() callback.
  3. The deferred callback invokes after_oom_unregister once the
     grace period has elapsed and all in-flight readers have
     finished.
  4. Subsystems waiting to free structures referenced by the
     callback can call the new mmu_notifier_barrier() helper, which
     wraps srcu_barrier() to wait for all outstanding callbacks
     scheduled this way.

after_oom_unregister is mutually exclusive with alloc_notifier
because allocated notifiers can have additional outstanding
references that the OOM path cannot safely drop.

KVM is updated to provide after_oom_unregister, which clears
mn_active_invalidate_count, and to detect via hlist_unhashed() in
kvm_destroy_vm() when its subscription was already detached by the
OOM path; in that case it calls mmu_notifier_barrier() and drops
the mm reference rather than calling mmu_notifier_unregister().

Reported-by: syzbot+c3178b6b512446632bac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c3178b6b512446632bac
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/all/20260209161527.31978-1-shaikhkamal2012@gmail.com/

Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>
---
 include/linux/mmu_notifier.h |  10 +++
 mm/mmu_notifier.c            | 123 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c                |   3 +
 virt/kvm/kvm_main.c          |  27 +++++++-
 4 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..0ccd590f55d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -88,6 +88,14 @@ struct mmu_notifier_ops {
 	void (*release)(struct mmu_notifier *subscription,
 			struct mm_struct *mm);

+	/*
+	 * Any mmu notifier that defines this is automatically unregistered
+	 * when its mm is the subject of an OOM kill.  after_oom_unregister()
+	 * is invoked after all other outstanding callbacks have terminated.
+	 */
+	void (*after_oom_unregister)(struct mmu_notifier *subscription,
+				     struct mm_struct *mm);
+
 	/*
 	 * clear_flush_young is called after the VM is
 	 * test-and-clearing the young/accessed bitflag in the
@@ -375,6 +383,8 @@ mmu_interval_check_retry(struct mmu_interval_notifier *interval_sub,

 extern void __mmu_notifier_subscriptions_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
+void mmu_notifier_oom_enter(struct mm_struct *mm);
+extern void mmu_notifier_barrier(void);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..b8fa58fe6b7d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -49,6 +49,37 @@ struct mmu_notifier_subscriptions {
 	struct hlist_head deferred_list;
 };

+/*
+ * Callback structure for asynchronous OOM cleanup.
+ * Used with call_srcu() to defer after_oom_unregister callbacks
+ * until after SRCU grace period completes.
+ */
+struct mmu_notifier_oom_callback {
+	struct rcu_head rcu;
+	struct mmu_notifier *subscription;
+	struct mm_struct *mm;
+};
+
+/*
+ * Callback function invoked after SRCU grace period.
+ * Safely calls after_oom_unregister once all readers have finished.
+ */
+static void mmu_notifier_oom_callback_fn(struct rcu_head *rcu)
+{
+	struct mmu_notifier_oom_callback *cb =
+		container_of(rcu, struct mmu_notifier_oom_callback, rcu);
+
+	/* Safe - all SRCU readers have finished */
+	cb->subscription->ops->after_oom_unregister(cb->subscription, cb->mm);
+
+	/* Release mm reference taken when callback was scheduled */
+	WARN_ON_ONCE(atomic_read(&cb->mm->mm_count) <= 0);
+	mmdrop(cb->mm);
+
+	/* Free callback structure */
+	kfree(cb);
+}
+
 /*
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at
@@ -359,6 +390,85 @@ void __mmu_notifier_release(struct mm_struct *mm)
 		mn_hlist_release(subscriptions, mm);
 }

+void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+	struct mmu_notifier_subscriptions *subscriptions =
+						mm->notifier_subscriptions;
+	struct mmu_notifier *subscription;
+	struct hlist_node *tmp;
+	HLIST_HEAD(oom_list);
+	int id;
+
+	if (!subscriptions)
+		return;
+
+	id = srcu_read_lock(&srcu);
+
+	/*
+	 * Prevent further calls to the MMU notifier, except for
+	 * release and after_oom_unregister.
+	 */
+	spin_lock(&subscriptions->lock);
+	hlist_for_each_entry_safe(subscription, tmp,
+				  &subscriptions->list, hlist) {
+		if (!subscription->ops->after_oom_unregister)
+			continue;
+
+		/*
+		 * after_oom_unregister and alloc_notifier are incompatible,
+		 * because there could be other references to allocated
+		 * notifiers.
+		 */
+		if (WARN_ON(subscription->ops->alloc_notifier))
+			continue;
+
+		hlist_del_init_rcu(&subscription->hlist);
+		hlist_add_head(&subscription->hlist, &oom_list);
+	}
+	spin_unlock(&subscriptions->lock);
+	hlist_for_each_entry(subscription, &oom_list, hlist)
+		if (subscription->ops->release)
+			subscription->ops->release(subscription, mm);
+
+	srcu_read_unlock(&srcu, id);
+
+	if (hlist_empty(&oom_list))
+		return;
+
+	hlist_for_each_entry_safe(subscription, tmp,
+				  &oom_list, hlist) {
+		struct mmu_notifier_oom_callback *cb;
+		/*
+		 * Remove from stack-based oom_list and reset hlist to unhashed state.
+		 * This sets subscription->hlist.pprev = NULL, so future callers of
+		 * mmu_notifier_unregister() (e.g. kvm_destroy_vm) will see
+		 * hlist_unhashed() == true and take the safe path, avoiding
+		 * use-after-free on the stack-allocated oom_list head.
+		 */
+		hlist_del_init(&subscription->hlist);
+
+		/*
+		 * GFP_ATOMIC failure is exceedingly rare. We cannot sleep
+		 * here (would reintroduce the deadlock this patch fixes)
+		 * and cannot call after_oom_unregister synchronously
+		 * without first waiting for SRCU readers. The subscriber
+		 * will not receive after_oom_unregister but cleanup will
+		 * eventually happen via the unregister path.
+		 */
+		cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+		if (!cb)
+			continue;
+
+		cb->subscription = subscription;
+		cb->mm = mm;
+		mmgrab(mm);
+
+		/* Schedule callback - returns immediately */
+		call_srcu(&srcu, &cb->rcu, mmu_notifier_oom_callback_fn);
+	}
+
+}
+
 /*
  * If no young bitflag is supported by the hardware, ->clear_flush_young can
  * unmap the address and return 1 or 0 depending if the mapping previously
@@ -1096,3 +1206,16 @@ void mmu_notifier_synchronize(void)
 	synchronize_srcu(&srcu);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
+
+/**
+ * mmu_notifier_barrier - Wait for all pending MMU notifier callbacks
+ *
+ * Waits for all call_srcu() callbacks scheduled by mmu_notifier_oom_enter()
+ * to complete. Used by subsystems during cleanup to prevent use-after-free
+ * when destroying structures accessed by the callbacks.
+ */
+void mmu_notifier_barrier(void)
+{
+	srcu_barrier(&srcu);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_barrier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..029e041afc57 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -519,6 +519,9 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	bool ret = true;
 	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

+	/* Notify MMU notifiers about the OOM event */
+	mmu_notifier_oom_enter(mm);
+
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bc1da66b4b0..a2df83d3b413 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -885,6 +885,24 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }

+static void kvm_mmu_notifier_after_oom_unregister(struct mmu_notifier *mn,
+					struct mm_struct *mm)
+{
+	struct kvm *kvm;
+
+	kvm = mmu_notifier_to_kvm(mn);
+
+	/*
+	 * At this point the unregister has completed and all other callbacks
+	 * have terminated. Clean up any unbalanced invalidation counts.
+	 */
+	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+	if (kvm->mn_active_invalidate_count)
+		kvm->mn_active_invalidate_count = 0;
+	else
+		WARN_ON(kvm->mmu_invalidate_in_progress);
+}
+
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
@@ -892,6 +910,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.release		= kvm_mmu_notifier_release,
+	.after_oom_unregister	= kvm_mmu_notifier_after_oom_unregister,
 };

 static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -1280,7 +1299,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	if (hlist_unhashed(&kvm->mmu_notifier.hlist)) {
+		/* Subscription removed by OOM. Wait for async callback. */
+		mmu_notifier_barrier();
+		mmdrop(kvm->mm);
+	} else {
+		mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	}
 	/*
 	 * At this point, pending calls to invalidate_range_start()
 	 * have completed but no more MMU notifiers will run, so
--
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
  2026-04-29 22:25 ` [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
@ 2026-05-03  3:26   ` kernel test robot
  2026-05-03  3:26   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: kernel test robot @ 2026-05-03  3:26 UTC (permalink / raw)
  To: shaikh.kamal, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, David Rientjes,
	Shakeel Butt, linux-mm, linux-kernel, kvm, linux-rt-devel
  Cc: llvm, oe-kbuild-all, pbonzini, skhan, me,
	syzbot+c3178b6b512446632bac, shaikh.kamal

Hi shaikh.kamal,

kernel test robot noticed the following build errors:

[auto build test ERROR on v7.0]
[cannot apply to akpm-mm/mm-everything kvm/queue kvm/next kvm/linux-next v7.1-rc1 linus/master next-20260430]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/shaikh-kamal/mm-mmu_notifier-Add-async-OOM-cleanup-via-call_srcu/20260430-202943
base:   v7.0
patch link:    https://lore.kernel.org/r/20260429222548.25475-1-shaikhkamal2012%40gmail.com
patch subject: [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
config: hexagon-allnoconfig (https://download.01.org/0day-ci/archive/20260503/202605031115.qmkkOLQc-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 5bac06718f502014fade905512f1d26d578a18f3)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031115.qmkkOLQc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031115.qmkkOLQc-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/oom_kill.c:523:2: error: call to undeclared function 'mmu_notifier_oom_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     523 |         mmu_notifier_oom_enter(mm);
         |         ^
   mm/oom_kill.c:523:2: note: did you mean 'mmu_notifier_release'?
   include/linux/mmu_notifier.h:610:20: note: 'mmu_notifier_release' declared here
     610 | static inline void mmu_notifier_release(struct mm_struct *mm)
         |                    ^
   mm/oom_kill.c:511:28: warning: variable 'oom_reaper_th' set but not used [-Wunused-but-set-global]
     511 | static struct task_struct *oom_reaper_th;
         |                            ^
   1 warning and 1 error generated.


vim +/mmu_notifier_oom_enter +523 mm/oom_kill.c

   515	
   516	static bool __oom_reap_task_mm(struct mm_struct *mm)
   517	{
   518		struct vm_area_struct *vma;
   519		bool ret = true;
   520		MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
   521	
   522		/* Notify MMU notifiers about the OOM event */
 > 523		mmu_notifier_oom_enter(mm);
   524	
   525		/*
   526		 * Tell all users of get_user/copy_from_user etc... that the content
   527		 * is no longer stable. No barriers really needed because unmapping
   528		 * should imply barriers already and the reader would hit a page fault
   529		 * if it stumbled over a reaped memory.
   530		 */
   531		mm_flags_set(MMF_UNSTABLE, mm);
   532	
   533		/*
   534		 * It might start racing with the dying task and compete for shared
   535		 * resources - e.g. page table lock contention has been observed.
   536		 * Reduce those races by reaping the oom victim from the other end
   537		 * of the address space.
   538		 */
   539		mas_for_each_rev(&mas, vma, 0) {
   540			if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
   541				continue;
   542	
   543			/*
   544			 * Only anonymous pages have a good chance to be dropped
   545			 * without additional steps which we cannot afford as we
   546			 * are OOM already.
   547			 *
   548			 * We do not even care about fs backed pages because all
   549			 * which are reclaimable have already been reclaimed and
   550			 * we do not want to block exit_mmap by keeping mm ref
   551			 * count elevated without a good reason.
   552			 */
   553			if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
   554				struct mmu_notifier_range range;
   555				struct mmu_gather tlb;
   556	
   557				mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
   558							mm, vma->vm_start,
   559							vma->vm_end);
   560				tlb_gather_mmu(&tlb, mm);
   561				if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
   562					tlb_finish_mmu(&tlb);
   563					ret = false;
   564					continue;
   565				}
   566				unmap_page_range(&tlb, vma, range.start, range.end, NULL);
   567				mmu_notifier_invalidate_range_end(&range);
   568				tlb_finish_mmu(&tlb);
   569			}
   570		}
   571	
   572		return ret;
   573	}
   574	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
  2026-04-29 22:25 ` [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
  2026-05-03  3:26   ` kernel test robot
@ 2026-05-03  3:26   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: kernel test robot @ 2026-05-03  3:26 UTC (permalink / raw)
  To: shaikh.kamal, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, David Rientjes,
	Shakeel Butt, linux-mm, linux-kernel, kvm, linux-rt-devel
  Cc: oe-kbuild-all, pbonzini, skhan, me, syzbot+c3178b6b512446632bac,
	shaikh.kamal

Hi shaikh.kamal,

kernel test robot noticed the following build errors:

[auto build test ERROR on v7.0]
[cannot apply to akpm-mm/mm-everything kvm/queue kvm/next kvm/linux-next v7.1-rc1 linus/master next-20260430]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/shaikh-kamal/mm-mmu_notifier-Add-async-OOM-cleanup-via-call_srcu/20260430-202943
base:   v7.0
patch link:    https://lore.kernel.org/r/20260429222548.25475-1-shaikhkamal2012%40gmail.com
patch subject: [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20260503/202605031109.uxckW5L3-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031109.uxckW5L3-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031109.uxckW5L3-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/oom_kill.c: In function '__oom_reap_task_mm':
>> mm/oom_kill.c:523:9: error: implicit declaration of function 'mmu_notifier_oom_enter'; did you mean 'mmu_notifier_release'? [-Wimplicit-function-declaration]
     523 |         mmu_notifier_oom_enter(mm);
         |         ^~~~~~~~~~~~~~~~~~~~~~
         |         mmu_notifier_release


vim +523 mm/oom_kill.c

   515	
   516	static bool __oom_reap_task_mm(struct mm_struct *mm)
   517	{
   518		struct vm_area_struct *vma;
   519		bool ret = true;
   520		MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
   521	
   522		/* Notify MMU notifiers about the OOM event */
 > 523		mmu_notifier_oom_enter(mm);
   524	
   525		/*
   526		 * Tell all users of get_user/copy_from_user etc... that the content
   527		 * is no longer stable. No barriers really needed because unmapping
   528		 * should imply barriers already and the reader would hit a page fault
   529		 * if it stumbled over a reaped memory.
   530		 */
   531		mm_flags_set(MMF_UNSTABLE, mm);
   532	
   533		/*
   534		 * It might start racing with the dying task and compete for shared
   535		 * resources - e.g. page table lock contention has been observed.
   536		 * Reduce those races by reaping the oom victim from the other end
   537		 * of the address space.
   538		 */
   539		mas_for_each_rev(&mas, vma, 0) {
   540			if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
   541				continue;
   542	
   543			/*
   544			 * Only anonymous pages have a good chance to be dropped
   545			 * without additional steps which we cannot afford as we
   546			 * are OOM already.
   547			 *
   548			 * We do not even care about fs backed pages because all
   549			 * which are reclaimable have already been reclaimed and
   550			 * we do not want to block exit_mmap by keeping mm ref
   551			 * count elevated without a good reason.
   552			 */
   553			if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
   554				struct mmu_notifier_range range;
   555				struct mmu_gather tlb;
   556	
   557				mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
   558							mm, vma->vm_start,
   559							vma->vm_end);
   560				tlb_gather_mmu(&tlb, mm);
   561				if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
   562					tlb_finish_mmu(&tlb);
   563					ret = false;
   564					continue;
   565				}
   566				unmap_page_range(&tlb, vma, range.start, range.end, NULL);
   567				mmu_notifier_invalidate_range_end(&range);
   568				tlb_finish_mmu(&tlb);
   569			}
   570		}
   571	
   572		return ret;
   573	}
   574	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
@ 2026-04-30  4:48 shaikh.kamal
  0 siblings, 0 replies; 18+ messages in thread
From: shaikh.kamal @ 2026-04-30  4:48 UTC (permalink / raw)
  To: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, David Rientjes, Shakeel Butt,
	linux-mm, linux-kernel, kvm, linux-rt-devel
  Cc: pbonzini, skhan, me, shaikh.kamal, syzbot+c3178b6b512446632bac

When an mm undergoes OOM kill, the OOM reaper unmaps memory while
holding the mmap_lock. MMU notifier subscribers (notably KVM) need
to be informed so they can tear down their secondary mappings. The
current synchronous unregister path can deadlock on PREEMPT_RT
because synchronize_srcu() is called from contexts that cannot
safely sleep.

This patch implements the asynchronous cleanup design proposed by
Paolo Bonzini in v1 review: a new optional after_oom_unregister
callback in struct mmu_notifier_ops, invoked after the SRCU grace
period via call_srcu() so that no readers can still reference the
subscription when cleanup runs.

The flow is:

  1. The OOM reaper calls mmu_notifier_oom_enter() from
     __oom_reap_task_mm().
  2. mmu_notifier_oom_enter() walks the subscription list and, for
     each subscriber that provides after_oom_unregister, detaches
     the subscription from the active list and schedules a
     call_srcu() callback.
  3. The deferred callback invokes after_oom_unregister once the
     grace period has elapsed and all in-flight readers have
     finished.
  4. Subsystems waiting to free structures referenced by the
     callback can call the new mmu_notifier_barrier() helper, which
     wraps srcu_barrier() to wait for all outstanding callbacks
     scheduled this way.

after_oom_unregister is mutually exclusive with alloc_notifier
because allocated notifiers can have additional outstanding
references that the OOM path cannot safely drop.

KVM is updated to provide after_oom_unregister, which clears
mn_active_invalidate_count, and to detect via hlist_unhashed() in
kvm_destroy_vm() when its subscription was already detached by the
OOM path; in that case it calls mmu_notifier_barrier() and drops
the mm reference rather than calling mmu_notifier_unregister().

Reported-by: syzbot+c3178b6b512446632bac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c3178b6b512446632bac
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/all/20260209161527.31978-1-shaikhkamal2012@gmail.com/

Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>
---
 include/linux/mmu_notifier.h |  10 +++
 mm/mmu_notifier.c            | 123 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c                |   3 +
 virt/kvm/kvm_main.c          |  27 +++++++-
 4 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..0ccd590f55d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -88,6 +88,14 @@ struct mmu_notifier_ops {
 	void (*release)(struct mmu_notifier *subscription,
 			struct mm_struct *mm);

+	/*
+	 * Any mmu notifier that defines this is automatically unregistered
+	 * when its mm is the subject of an OOM kill.  after_oom_unregister()
+	 * is invoked after all other outstanding callbacks have terminated.
+	 */
+	void (*after_oom_unregister)(struct mmu_notifier *subscription,
+				     struct mm_struct *mm);
+
 	/*
 	 * clear_flush_young is called after the VM is
 	 * test-and-clearing the young/accessed bitflag in the
@@ -375,6 +383,8 @@ mmu_interval_check_retry(struct mmu_interval_notifier *interval_sub,

 extern void __mmu_notifier_subscriptions_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
+void mmu_notifier_oom_enter(struct mm_struct *mm);
+extern void mmu_notifier_barrier(void);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..b8fa58fe6b7d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -49,6 +49,37 @@ struct mmu_notifier_subscriptions {
 	struct hlist_head deferred_list;
 };

+/*
+ * Callback structure for asynchronous OOM cleanup.
+ * Used with call_srcu() to defer after_oom_unregister callbacks
+ * until after SRCU grace period completes.
+ */
+struct mmu_notifier_oom_callback {
+	struct rcu_head rcu;
+	struct mmu_notifier *subscription;
+	struct mm_struct *mm;
+};
+
+/*
+ * Callback function invoked after SRCU grace period.
+ * Safely calls after_oom_unregister once all readers have finished.
+ */
+static void mmu_notifier_oom_callback_fn(struct rcu_head *rcu)
+{
+	struct mmu_notifier_oom_callback *cb =
+		container_of(rcu, struct mmu_notifier_oom_callback, rcu);
+
+	/* Safe - all SRCU readers have finished */
+	cb->subscription->ops->after_oom_unregister(cb->subscription, cb->mm);
+
+	/* Release mm reference taken when callback was scheduled */
+	WARN_ON_ONCE(atomic_read(&cb->mm->mm_count) <= 0);
+	mmdrop(cb->mm);
+
+	/* Free callback structure */
+	kfree(cb);
+}
+
 /*
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at
@@ -359,6 +390,85 @@ void __mmu_notifier_release(struct mm_struct *mm)
 		mn_hlist_release(subscriptions, mm);
 }

+void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+	struct mmu_notifier_subscriptions *subscriptions =
+						mm->notifier_subscriptions;
+	struct mmu_notifier *subscription;
+	struct hlist_node *tmp;
+	HLIST_HEAD(oom_list);
+	int id;
+
+	if (!subscriptions)
+		return;
+
+	id = srcu_read_lock(&srcu);
+
+	/*
+	 * Prevent further calls to the MMU notifier, except for
+	 * release and after_oom_unregister.
+	 */
+	spin_lock(&subscriptions->lock);
+	hlist_for_each_entry_safe(subscription, tmp,
+				  &subscriptions->list, hlist) {
+		if (!subscription->ops->after_oom_unregister)
+			continue;
+
+		/*
+		 * after_oom_unregister and alloc_notifier are incompatible,
+		 * because there could be other references to allocated
+		 * notifiers.
+		 */
+		if (WARN_ON(subscription->ops->alloc_notifier))
+			continue;
+
+		hlist_del_init_rcu(&subscription->hlist);
+		hlist_add_head(&subscription->hlist, &oom_list);
+	}
+	spin_unlock(&subscriptions->lock);
+	hlist_for_each_entry(subscription, &oom_list, hlist)
+		if (subscription->ops->release)
+			subscription->ops->release(subscription, mm);
+
+	srcu_read_unlock(&srcu, id);
+
+	if (hlist_empty(&oom_list))
+		return;
+
+	hlist_for_each_entry_safe(subscription, tmp,
+				  &oom_list, hlist) {
+		struct mmu_notifier_oom_callback *cb;
+		/*
+		 * Remove from stack-based oom_list and reset hlist to unhashed state.
+		 * This sets subscription->hlist.pprev = NULL, so future callers of
+		 * mmu_notifier_unregister() (e.g. kvm_destroy_vm) will see
+		 * hlist_unhashed() == true and take the safe path, avoiding
+		 * use-after-free on the stack-allocated oom_list head.
+		 */
+		hlist_del_init(&subscription->hlist);
+
+		/*
+		 * GFP_ATOMIC failure is exceedingly rare. We cannot sleep
+		 * here (would reintroduce the deadlock this patch fixes)
+		 * and cannot call after_oom_unregister synchronously
+		 * without first waiting for SRCU readers. The subscriber
+		 * will not receive after_oom_unregister but cleanup will
+		 * eventually happen via the unregister path.
+		 */
+		cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+		if (!cb)
+			continue;
+
+		cb->subscription = subscription;
+		cb->mm = mm;
+		mmgrab(mm);
+
+		/* Schedule callback - returns immediately */
+		call_srcu(&srcu, &cb->rcu, mmu_notifier_oom_callback_fn);
+	}
+
+}
+
 /*
  * If no young bitflag is supported by the hardware, ->clear_flush_young can
  * unmap the address and return 1 or 0 depending if the mapping previously
@@ -1096,3 +1206,16 @@ void mmu_notifier_synchronize(void)
 	synchronize_srcu(&srcu);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
+
+/**
+ * mmu_notifier_barrier - Wait for all pending MMU notifier callbacks
+ *
+ * Waits for all call_srcu() callbacks scheduled by mmu_notifier_oom_enter()
+ * to complete. Used by subsystems during cleanup to prevent use-after-free
+ * when destroying structures accessed by the callbacks.
+ */
+void mmu_notifier_barrier(void)
+{
+	srcu_barrier(&srcu);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_barrier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..029e041afc57 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -519,6 +519,9 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	bool ret = true;
 	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

+	/* Notify MMU notifiers about the OOM event */
+	mmu_notifier_oom_enter(mm);
+
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bc1da66b4b0..a2df83d3b413 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -885,6 +885,24 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }

+static void kvm_mmu_notifier_after_oom_unregister(struct mmu_notifier *mn,
+					struct mm_struct *mm)
+{
+	struct kvm *kvm;
+
+	kvm = mmu_notifier_to_kvm(mn);
+
+	/*
+	 * At this point the unregister has completed and all other callbacks
+	 * have terminated. Clean up any unbalanced invalidation counts.
+	 */
+	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+	if (kvm->mn_active_invalidate_count)
+		kvm->mn_active_invalidate_count = 0;
+	else
+		WARN_ON(kvm->mmu_invalidate_in_progress);
+}
+
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
@@ -892,6 +910,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.release		= kvm_mmu_notifier_release,
+	.after_oom_unregister	= kvm_mmu_notifier_after_oom_unregister,
 };

 static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -1280,7 +1299,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	if (hlist_unhashed(&kvm->mmu_notifier.hlist)) {
+		/* Subscription removed by OOM. Wait for async callback. */
+		mmu_notifier_barrier();
+		mmdrop(kvm->mm);
+	} else {
+		mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	}
 	/*
 	 * At this point, pending calls to invalidate_range_start()
 	 * have completed but no more MMU notifiers will run, so
--
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2026-05-03  3:27 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal
2026-02-11 12:09 ` Sebastian Andrzej Siewior
2026-02-11 15:34   ` Sean Christopherson
2026-03-03 18:49     ` shaikh kamaluddin
2026-03-06 16:42       ` Sean Christopherson
2026-03-06 18:14       ` Paolo Bonzini
2026-03-12 19:24         ` shaikh kamaluddin
2026-03-14  7:47           ` Paolo Bonzini
2026-03-25  5:19             ` shaikh kamaluddin
2026-03-26 18:23               ` Paolo Bonzini
2026-03-28 14:50                 ` shaikh kamaluddin
2026-03-30 11:24                   ` Paolo Bonzini
2026-04-30 14:16                     ` [PATCH v2 0/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
2026-04-30 14:17                     ` [PATCH v2 1/1] " shaikh.kamal
  -- strict thread matches above, loose matches on Subject: below --
2026-04-01 15:40 [PATCH] KVM: x86/xen: Fix sleeping lock in hard IRQ context on PREEMPT_RT Sean Christopherson
2026-04-29 22:25 ` [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
2026-05-03  3:26   ` kernel test robot
2026-05-03  3:26   ` kernel test robot
2026-04-30  4:48 shaikh.kamal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox