From: "shaikh.kamal" <shaikhkamal2012@gmail.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
David Rientjes <rientjes@google.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, linux-rt-devel@lists.linux.dev
Cc: pbonzini@redhat.com, skhan@linuxfoundation.org,
me@brighamcampbell.com,
syzbot+c3178b6b512446632bac@syzkaller.appspotmail.com,
"shaikh.kamal" <shaikhkamal2012@gmail.com>
Subject: [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu()
Date: Thu, 30 Apr 2026 03:55:48 +0530
Message-ID: <20260429222548.25475-1-shaikhkamal2012@gmail.com>
In-Reply-To: <ac08V4TaM2yh9SY1@google.com>
When an mm undergoes OOM kill, the OOM reaper unmaps memory while
holding the mmap_lock. MMU notifier subscribers (notably KVM) need
to be informed so they can tear down their secondary mappings. The
current synchronous unregister path can deadlock on PREEMPT_RT
because synchronize_srcu() is called from contexts that cannot
safely sleep.
Implement the asynchronous cleanup design proposed by Paolo Bonzini
in the v1 review: a new optional after_oom_unregister callback in
struct mmu_notifier_ops, invoked after the SRCU grace period via
call_srcu() so that no readers can still reference the subscription
when cleanup runs.
The flow is:
1. The OOM reaper calls mmu_notifier_oom_enter() from
__oom_reap_task_mm().
2. mmu_notifier_oom_enter() walks the subscription list and, for
each subscriber that provides after_oom_unregister, detaches
the subscription from the active list and schedules a
call_srcu() callback.
3. The deferred callback invokes after_oom_unregister once the
grace period has elapsed and all in-flight readers have
finished.
4. Subsystems waiting to free structures referenced by the
callback can call the new mmu_notifier_barrier() helper, which
wraps srcu_barrier() to wait for all outstanding callbacks
scheduled this way.
after_oom_unregister is mutually exclusive with alloc_notifier
because allocated notifiers can have additional outstanding
references that the OOM path cannot safely drop.
KVM is updated to provide after_oom_unregister, which clears
mn_active_invalidate_count, and to detect via hlist_unhashed() in
kvm_destroy_vm() when its subscription was already detached by the
OOM path; in that case it calls mmu_notifier_barrier() and drops
the mm reference rather than calling mmu_notifier_unregister().
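The hlist_unhashed() test works because hlist_del_init() leaves the node
in the same state as a freshly initialized one (pprev == NULL), so the
node doubles as its own "already detached by OOM" flag. A self-contained
sketch of those list semantics (a simplified re-implementation for
illustration, not the kernel's list.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified versions of the kernel's hlist primitives. */
struct hnode {
	struct hnode *next;
	struct hnode **pprev;
};

struct hlist {
	struct hnode *first;
};

static void hnode_init(struct hnode *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

/* A node is "unhashed" when it is not linked into any list. */
static bool hnode_unhashed(const struct hnode *n)
{
	return n->pprev == NULL;
}

static void hlist_add(struct hlist *h, struct hnode *n)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink and re-initialize, so hnode_unhashed() becomes true again. */
static void hnode_del_init(struct hnode *n)
{
	if (!hnode_unhashed(n)) {
		*n->pprev = n->next;
		if (n->next)
			n->next->pprev = n->pprev;
		hnode_init(n);
	}
}
```

kvm_destroy_vm() can therefore distinguish "still registered" from
"detached by mmu_notifier_oom_enter()" without any extra state in
struct kvm.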
Reported-by: syzbot+c3178b6b512446632bac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c3178b6b512446632bac
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/all/20260209161527.31978-1-shaikhkamal2012@gmail.com/
Signed-off-by: shaikh.kamal <shaikhkamal2012@gmail.com>
---
include/linux/mmu_notifier.h | 10 +++
mm/mmu_notifier.c | 123 +++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 3 +
virt/kvm/kvm_main.c | 27 +++++++-
4 files changed, 162 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..0ccd590f55d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -88,6 +88,14 @@ struct mmu_notifier_ops {
void (*release)(struct mmu_notifier *subscription,
struct mm_struct *mm);
+ /*
+ * Any mmu notifier that defines this is automatically unregistered
+ * when its mm is the subject of an OOM kill. after_oom_unregister()
+ * is invoked after all other outstanding callbacks have terminated.
+ */
+ void (*after_oom_unregister)(struct mmu_notifier *subscription,
+ struct mm_struct *mm);
+
/*
* clear_flush_young is called after the VM is
* test-and-clearing the young/accessed bitflag in the
@@ -375,6 +383,8 @@ mmu_interval_check_retry(struct mmu_interval_notifier *interval_sub,
extern void __mmu_notifier_subscriptions_destroy(struct mm_struct *mm);
extern void __mmu_notifier_release(struct mm_struct *mm);
+void mmu_notifier_oom_enter(struct mm_struct *mm);
+extern void mmu_notifier_barrier(void);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long start,
unsigned long end);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..b8fa58fe6b7d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -49,6 +49,37 @@ struct mmu_notifier_subscriptions {
struct hlist_head deferred_list;
};
+/*
+ * Callback structure for asynchronous OOM cleanup.
+ * Used with call_srcu() to defer after_oom_unregister callbacks
+ * until after SRCU grace period completes.
+ */
+struct mmu_notifier_oom_callback {
+ struct rcu_head rcu;
+ struct mmu_notifier *subscription;
+ struct mm_struct *mm;
+};
+
+/*
+ * Callback function invoked after SRCU grace period.
+ * Safely calls after_oom_unregister once all readers have finished.
+ */
+static void mmu_notifier_oom_callback_fn(struct rcu_head *rcu)
+{
+ struct mmu_notifier_oom_callback *cb =
+ container_of(rcu, struct mmu_notifier_oom_callback, rcu);
+
+ /* Safe - all SRCU readers have finished */
+ cb->subscription->ops->after_oom_unregister(cb->subscription, cb->mm);
+
+ /* Release mm reference taken when callback was scheduled */
+ WARN_ON_ONCE(atomic_read(&cb->mm->mm_count) <= 0);
+ mmdrop(cb->mm);
+
+ /* Free callback structure */
+ kfree(cb);
+}
+
/*
* This is a collision-retry read-side/write-side 'lock', a lot like a
* seqcount, however this allows multiple write-sides to hold it at
@@ -359,6 +390,85 @@ void __mmu_notifier_release(struct mm_struct *mm)
mn_hlist_release(subscriptions, mm);
}
+void mmu_notifier_oom_enter(struct mm_struct *mm)
+{
+ struct mmu_notifier_subscriptions *subscriptions =
+ mm->notifier_subscriptions;
+ struct mmu_notifier *subscription;
+ struct hlist_node *tmp;
+ HLIST_HEAD(oom_list);
+ int id;
+
+ if (!subscriptions)
+ return;
+
+ id = srcu_read_lock(&srcu);
+
+ /*
+ * Prevent further calls to the MMU notifier, except for
+ * release and after_oom_unregister.
+ */
+ spin_lock(&subscriptions->lock);
+ hlist_for_each_entry_safe(subscription, tmp,
+ &subscriptions->list, hlist) {
+ if (!subscription->ops->after_oom_unregister)
+ continue;
+
+ /*
+ * after_oom_unregister and alloc_notifier are incompatible,
+ * because there could be other references to allocated
+ * notifiers.
+ */
+ if (WARN_ON(subscription->ops->alloc_notifier))
+ continue;
+
+ hlist_del_init_rcu(&subscription->hlist);
+ hlist_add_head(&subscription->hlist, &oom_list);
+ }
+ spin_unlock(&subscriptions->lock);
+ hlist_for_each_entry(subscription, &oom_list, hlist)
+ if (subscription->ops->release)
+ subscription->ops->release(subscription, mm);
+
+ srcu_read_unlock(&srcu, id);
+
+ if (hlist_empty(&oom_list))
+ return;
+
+ hlist_for_each_entry_safe(subscription, tmp,
+ &oom_list, hlist) {
+ struct mmu_notifier_oom_callback *cb;
+ /*
+ * Remove from stack-based oom_list and reset hlist to unhashed state.
+ * This sets subscription->hlist.pprev = NULL, so future callers of
+ * mmu_notifier_unregister() (e.g. kvm_destroy_vm) will see
+ * hlist_unhashed() == true and take the safe path, avoiding
+ * use-after-free on the stack-allocated oom_list head.
+ */
+ hlist_del_init(&subscription->hlist);
+
+ /*
+ * GFP_ATOMIC failure is exceedingly rare. We cannot sleep
+ * here (would reintroduce the deadlock this patch fixes)
+ * and cannot call after_oom_unregister synchronously
+ * without first waiting for SRCU readers. The subscriber
+ * will not receive after_oom_unregister but cleanup will
+ * eventually happen via the unregister path.
+ */
+ cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+ if (!cb)
+ continue;
+
+ cb->subscription = subscription;
+ cb->mm = mm;
+ mmgrab(mm);
+
+ /* Schedule callback - returns immediately */
+ call_srcu(&srcu, &cb->rcu, mmu_notifier_oom_callback_fn);
+ }
+}
+
/*
* If no young bitflag is supported by the hardware, ->clear_flush_young can
* unmap the address and return 1 or 0 depending if the mapping previously
@@ -1096,3 +1206,16 @@ void mmu_notifier_synchronize(void)
synchronize_srcu(&srcu);
}
EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
+
+/**
+ * mmu_notifier_barrier - Wait for all pending MMU notifier callbacks
+ *
+ * Waits for all call_srcu() callbacks scheduled by mmu_notifier_oom_enter()
+ * to complete. Used by subsystems during cleanup to prevent use-after-free
+ * when destroying structures accessed by the callbacks.
+ */
+void mmu_notifier_barrier(void)
+{
+ srcu_barrier(&srcu);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_barrier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..029e041afc57 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -519,6 +519,9 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
bool ret = true;
MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
+ /* Notify MMU notifiers about the OOM event */
+ mmu_notifier_oom_enter(mm);
+
/*
* Tell all users of get_user/copy_from_user etc... that the content
* is no longer stable. No barriers really needed because unmapping
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bc1da66b4b0..a2df83d3b413 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -885,6 +885,24 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
}
+static void kvm_mmu_notifier_after_oom_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm)
+{
+ struct kvm *kvm;
+
+ kvm = mmu_notifier_to_kvm(mn);
+
+ /*
+ * At this point the unregister has completed and all other callbacks
+ * have terminated. Clean up any unbalanced invalidation counts.
+ */
+ WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+ if (kvm->mn_active_invalidate_count)
+ kvm->mn_active_invalidate_count = 0;
+ else
+ WARN_ON(kvm->mmu_invalidate_in_progress);
+}
+
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
@@ -892,6 +910,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.clear_young = kvm_mmu_notifier_clear_young,
.test_young = kvm_mmu_notifier_test_young,
.release = kvm_mmu_notifier_release,
+ .after_oom_unregister = kvm_mmu_notifier_after_oom_unregister,
};
static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -1280,7 +1299,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm->buses[i] = NULL;
}
kvm_coalesced_mmio_free(kvm);
- mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+ if (hlist_unhashed(&kvm->mmu_notifier.hlist)) {
+ /* Subscription removed by OOM. Wait for async callback. */
+ mmu_notifier_barrier();
+ mmdrop(kvm->mm);
+ } else {
+ mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+ }
/*
* At this point, pending calls to invalidate_range_start()
* have completed but no more MMU notifiers will run, so
--
2.43.0
next prev parent reply other threads:[~2026-04-29 22:26 UTC|newest]
Thread overview: 16+ messages
2026-03-29 13:15 [PATCH] KVM: x86/xen: Fix sleeping lock in hard IRQ context on PREEMPT_RT shaikh.kamal
2026-03-30 14:18 ` Steven Rostedt
2026-03-30 14:51 ` Woodhouse, David
2026-04-01 15:40 ` Sean Christopherson
2026-04-02 1:30 ` [PATCH v2 0/1] KVM: x86/xen: Fix PREEMPT_RT sleeping lock bug shaikh.kamal
2026-04-02 1:31 ` [PATCH v2 1/1] KVM: x86/xen: Use trylock for fast path event channel delivery shaikh.kamal
2026-04-02 6:36 ` Sebastian Andrzej Siewior
2026-04-02 22:40 ` Sean Christopherson
2026-04-02 6:42 ` [PATCH] KVM: x86/xen: Fix sleeping lock in hard IRQ context on PREEMPT_RT Sebastian Andrzej Siewior
2026-04-02 22:23 ` Sean Christopherson
2026-04-29 22:25 ` [PATCH v2 0/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
2026-04-29 22:25 ` shaikh.kamal [this message]
2026-05-03 3:26 ` [PATCH v2 1/1] " kernel test robot
2026-05-03 3:26 ` kernel test robot
-- strict thread matches above, loose matches on Subject: below --
2026-03-30 11:24 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations Paolo Bonzini
2026-04-30 14:17 ` [PATCH v2 1/1] mm/mmu_notifier: Add async OOM cleanup via call_srcu() shaikh.kamal
2026-04-30 4:48 shaikh.kamal