From: Minchan Kim <minchan@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
Cc: akpm@linux-foundation.org, hca@linux.ibm.com,
linux-s390@vger.kernel.org, david@kernel.org, mhocko@suse.com,
brauner@kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, timmurray@google.com
Subject: Re: [PATCH v1 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Date: Mon, 27 Apr 2026 15:52:42 -0700 [thread overview]
Message-ID: <ae_oul_xgD5Kgnqi@google.com> (raw)
In-Reply-To: <CAJuCfpFK+qatN_vHArEKOYe+LRHQG0e3inJrzVdz-mABi-BSTg@mail.gmail.com>
On Mon, Apr 27, 2026 at 01:34:37PM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 21, 2026 at 4:03 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to the call. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> >
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In Android, arbitrary reference counts
> > (e.g., async I/O, reading /proc/<pid>/cmdline, or various other remote
> > VM accesses) frequently delay this teardown indefinitely, defeating the
> > purpose of expedited reclamation.
> >
> > This delay keeps memory pressure high, forcing the system to unnecessarily
> > kill additional innocent background apps before the memory from the first
> > victim is recovered.
> >
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task.
> >
> > To solve the race condition deterministically, we grab the mm reference
> > via mmget() and set the MMF_UNSTABLE flag *before* sending the SIGKILL.
> > Using mmget() instead of mmgrab() keeps mm_users > 0, preventing the
> > victim from calling exit_mmap() in its own exit path. This ensures that
> > the memory is reclaimed synchronously and deterministically by the reaper
> > in the context of process_mrelease(), avoiding delays caused by
> > non-deterministic scheduling of the victim task.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > include/uapi/linux/mman.h | 4 +++
> > mm/oom_kill.c | 56 +++++++++++++++++++++++++++------------
> > 2 files changed, 43 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> > __u64 nr_recently_evicted;
> > };
> >
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL (1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS (PROCESS_MRELEASE_REAP_KILL)
> > +
> > #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..730ba0d19b53 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >
> > #include <linux/oom.h>
> > #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> > #include <linux/err.h>
> > #include <linux/gfp.h>
> > #include <linux/sched.h>
> > @@ -850,7 +851,7 @@ bool oom_killer_disable(signed long timeout)
> > return true;
> > }
> >
> > -static inline bool __task_will_free_mem(struct task_struct *task)
> > +static inline bool __task_will_free_mem(struct task_struct *task, bool ignore_exit)
> > {
> > struct signal_struct *sig = task->signal;
> >
> > @@ -862,6 +863,9 @@ static inline bool __task_will_free_mem(struct task_struct *task)
> > if (sig->core_state)
> > return false;
> >
> > + if (ignore_exit)
> > + return true;
> > +
> > if (sig->flags & SIGNAL_GROUP_EXIT)
> > return true;
> >
> > @@ -878,7 +882,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
> > * Caller has to make sure that task->mm is stable (hold task_lock or
> > * it operates on the current).
> > */
> > -static bool task_will_free_mem(struct task_struct *task)
> > +static bool task_will_free_mem(struct task_struct *task, bool ignore_exit)
> > {
> > struct mm_struct *mm = task->mm;
> > struct task_struct *p;
> > @@ -892,7 +896,7 @@ static bool task_will_free_mem(struct task_struct *task)
> > if (!mm)
> > return false;
> >
> > - if (!__task_will_free_mem(task))
> > + if (!__task_will_free_mem(task, ignore_exit))
> > return false;
> >
> > /*
> > @@ -916,7 +920,7 @@ static bool task_will_free_mem(struct task_struct *task)
> > continue;
> > if (same_thread_group(task, p))
> > continue;
> > - ret = __task_will_free_mem(p);
> > + ret = __task_will_free_mem(p, false);
> > if (!ret)
> > break;
> > }
> > @@ -1034,7 +1038,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > * so it can die quickly
> > */
> > task_lock(victim);
> > - if (task_will_free_mem(victim)) {
> > + if (task_will_free_mem(victim, false)) {
> > mark_oom_victim(victim);
> > queue_oom_reaper(victim);
> > task_unlock(victim);
> > @@ -1135,7 +1139,7 @@ bool out_of_memory(struct oom_control *oc)
> > * select it. The goal is to allow it to allocate so that it may
> > * quickly exit and free its memory.
> > */
> > - if (task_will_free_mem(current)) {
> > + if (task_will_free_mem(current, false)) {
> > mark_oom_victim(current);
> > queue_oom_reaper(current);
> > return true;
> > @@ -1217,8 +1221,9 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > unsigned int f_flags;
> > bool reap = false;
> > long ret = 0;
> > + bool reap_kill;
> >
> > - if (flags)
> > + if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> > return -EINVAL;
> >
> > task = pidfd_get_task(pidfd, &f_flags);
> > @@ -1236,19 +1241,33 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > }
> >
> > mm = p->mm;
> > - mmgrab(mm);
> >
> > - if (task_will_free_mem(p))
> > - reap = true;
> > - else {
> > - /* Error only if the work has not been done already */
> > - if (!mm_flags_test(MMF_OOM_SKIP, mm))
> > + reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> > + reap = task_will_free_mem(p, reap_kill);
> > + if (!reap) {
> > + if (reap_kill || !mm_flags_test(MMF_OOM_SKIP, mm))
> > ret = -EINVAL;
> > +
> > + task_unlock(p);
> > + goto put_task;
> > }
> > - task_unlock(p);
> >
> > - if (!reap)
> > - goto drop_mm;
> > + if (reap_kill) {
> > + /*
> > + * We use mmget() instead of mmgrab() to keep mm_users > 0,
> > + * preventing the victim from calling exit_mmap() in its
> > + * own exit path. This ensures that the memory is reclaimed
> > + * synchronously and deterministically by the reaper.
> > + */
> > + mmget(mm);
>
> I don't quite understand why you need to mmget() and prevent the
> victim from calling exit_mmap() here. As long as we successfully
> mmgrab'ed the mm, we can safely proceed with mmap locking and doing
> __oom_reap_task_mm(). Victim can execute exit_mmap() in parallel with
> __oom_reap_task_mm(). In fact that's even desirable! I remember a
> recent patch that used mas_for_each_rev() in __oom_reap_task_mm() to
> reclaim memory in reverse order so that exit_mmap() and
> __oom_reap_task_mm() can start from opposite ends of the address space
> and converge in the middle, working concurrently.
Good information. Will change it in next respin.
prev parent reply other threads:[~2026-04-27 22:52 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-21 23:02 [PATCH v1 0/3] mm: process_mrelease: expedite clean file folio reclaim and add auto-kill Minchan Kim
2026-04-21 23:02 ` [PATCH v1 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
2026-04-24 7:56 ` David Hildenbrand (Arm)
2026-04-24 21:24 ` Minchan Kim
2026-04-27 9:29 ` David Hildenbrand (Arm)
2026-04-27 22:04 ` Minchan Kim
2026-04-24 19:33 ` Matthew Wilcox
2026-04-24 21:56 ` Minchan Kim
2026-04-21 23:02 ` [PATCH v1 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios Minchan Kim
2026-04-22 7:22 ` Baolin Wang
2026-04-23 23:38 ` Minchan Kim
2026-04-24 7:51 ` Michal Hocko
2026-04-24 7:57 ` David Hildenbrand (Arm)
2026-04-24 19:15 ` Minchan Kim
2026-04-27 7:16 ` Michal Hocko
2026-04-27 16:48 ` Suren Baghdasaryan
2026-04-27 17:15 ` Michal Hocko
2026-04-27 23:05 ` Minchan Kim
2026-04-28 6:56 ` Michal Hocko
2026-04-24 19:26 ` Minchan Kim
2026-04-21 23:02 ` [PATCH v1 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-04-24 7:57 ` Michal Hocko
2026-04-24 22:49 ` Minchan Kim
2026-04-27 7:02 ` Michal Hocko
2026-04-27 22:03 ` Minchan Kim
2026-04-28 7:01 ` Michal Hocko
2026-04-27 20:34 ` Suren Baghdasaryan
2026-04-27 22:52 ` Minchan Kim [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ae_oul_xgD5Kgnqi@google.com \
--to=minchan@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=david@kernel.org \
--cc=hca@linux.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=mhocko@suse.com \
--cc=surenb@google.com \
--cc=timmurray@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox