Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Minchan Kim <minchan@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	akpm@linux-foundation.org, hca@linux.ibm.com,
	linux-s390@vger.kernel.org, david@kernel.org, mhocko@suse.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	surenb@google.com, timmurray@google.com
Subject: Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Date: Fri, 15 May 2026 13:52:40 -0700	[thread overview]
Message-ID: <ageHmE1QIzK4R3nO@google.com> (raw)
In-Reply-To: <20260515-nachdenken-umbenannt-a90006a46e14@brauner>

On Fri, May 15, 2026 at 04:41:18PM +0200, Christian Brauner wrote:
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> > 
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
> 
> To be quite frank about the patch as written below: I think this is a
> completely crazy api. And I really dislike it.
> 
> Right now we have clear and simple signal sending semantics for pidfds:
> 
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal
> 
> with specific overrides for pidfd_send_signal():
> 
> * PIDFD_SIGNAL_THREAD           -> only signal thread
> * PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
> * PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group
> 
> And now this patch aims to elevate process_mrelease() to a _signal
> sending function_. And the semantics are complete special sauce too.
> 
> You are effectively introducing a custom signal scope that is mm-based
> and then also plumbing it into a completely unrelated function that
> should have absolutely nothing to do with this.
> 
> This is such Zebroid API. I really would hate to see it land.
> 
> This came up because of the ptrace bug that was just recently discovered
> and that Linus fixed yesterday. This is another instance where I think
> the correct fix is to keep task->mm around until the process is reaped
> and then you can throw away all of the really ugly semantic in this
> patch afaict. I'd really like to see that merged as it would also clean
> up the ptrace code:
> 
> https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

Thank you for pointing this out. I completely agree.

The only reason for adding PROCESS_MRELEASE_REAP_KILL and its complex signaling
was to work around task->mm being cleared early in exit_mm(), which causes
process_mrelease() to fail with -ESRCH.

If Jann's patch to keep task->mm around is merged, this race naturally vanishes.
Userspace can reliably send SIGKILL and call the normal process_mrelease()
with the exit_mm

I would be thrilled to see Jann's patch merged. It allows us to drop this
complex REAP_KILL patch entirely while still fully resolving the expedited
reclaim issue.

> 
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,
> > defeating the purpose of expedited reclamation.
> > 
> > In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> > the system to unnecessarily kill additional innocent background apps before
> > the memory from the first victim is recovered.
> > 
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task after finding its mm.
> > 
> > To solve the race condition, we grab the mm reference via mmgrab() before
> > sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> > it will free its memory and proceed with reaping, making the logic as simple
> > as reap = reap_kill || task_will_free_mem(p).
> > 
> > To handle shared address spaces, we deliver SIGKILL to all processes sharing
> > the same address space using do_pidfd_send_signal_pidns(). This ensures the
> > target pid resides inside the caller's PID namespace hierarchy prior to
> > signal delivery. We iterate over all processes sharing the mm and deliver
> > SIGKILL to each. If delivering the signal to any of the sharing processes
> > fails, we return an error. Note that this approach may leave partial
> > side-effects if some processes are killed successfully before a failure occurs.
> > 
> > Cc: Christian Brauner <brauner@kernel.org>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/signal.h    |  4 +++
> >  include/uapi/linux/mman.h |  4 +++
> >  kernel/signal.c           | 29 ++++++++++++++++++---
> >  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
> >  4 files changed, 81 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index f19816832f05..bdbe6b3addec 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
> >  
> >  struct timespec;
> >  struct pt_regs;
> > +struct mm_struct;
> > +struct pid;
> >  enum pid_type;
> >  
> >  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> > @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
> >  				struct task_struct *p, enum pid_type type);
> >  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
> >  			       struct task_struct *p, enum pid_type type);
> > +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +				      siginfo_t __user *info, unsigned int flags);
> >  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> >  			      struct task_struct *p, enum pid_type type);
> >  extern int sigprocmask(int, sigset_t *, sigset_t *);
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >  	__u64 nr_recently_evicted;
> >  };
> >  
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..b2dc08a9bdd3 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
> >  	return kill_pid_info_type(sig, &kinfo, pid, type);
> >  }
> >  
> > +/**
> > + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> > + *                              while validating PID namespace hierarchy.
> > + * @pid:   the struct pid of the target process
> > + * @sig:   signal to send
> > + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> > + * @info:  signal info payload
> > + * @flags: signaling flags
> > + *
> > + * Verify that the target pid resides inside the caller's PID namespace
> > + * hierarchy prior to signal delivery.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +			       siginfo_t __user *info, unsigned int flags)
> > +{
> > +	/* Enforce PID namespace hierarchy boundary */
> > +	if (!access_pidfd_pidns(pid))
> > +		return -EINVAL;
> > +
> > +	return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +}
> > +
> >  /**
> >   * sys_pidfd_send_signal - Signal a process through a pidfd
> >   * @pidfd:  file descriptor of the process
> > @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  		if (IS_ERR(pid))
> >  			return PTR_ERR(pid);
> >  
> > -		if (!access_pidfd_pidns(pid))
> > -			return -EINVAL;
> > -
> >  		/* Infer scope from the type of pidfd. */
> >  		if (fd_file(f)->f_flags & PIDFD_THREAD)
> >  			type = PIDTYPE_PID;
> >  		else
> >  			type = PIDTYPE_TGID;
> >  
> > -		return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
> >  	}
> >  	}
> >  
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..253aa80770f2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >  
> >  #include <linux/oom.h>
> >  #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> >  #include <linux/err.h>
> >  #include <linux/gfp.h>
> >  #include <linux/sched.h>
> > @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> > + * @victim: the targeted OOM process group leader
> > + * @mm:     the virtual memory space being reaped
> > + *
> > + * Traverse all threads globally and signal any user processes sharing the identical
> > + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> > + * global init and kernel worker threads.
> > + */
> > +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> > +{
> > +	struct task_struct *p;
> > +	bool failed = false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		if (!process_shares_mm(p, mm))
> > +			continue;
> > +		if (is_global_init(p)) {
> 
> You can't signal init in any shape or form any way. Why bother reporting
> failure at all.
> 
> > +			failed = true;
> > +			continue;
> > +		}
> > +		if (unlikely(p->flags & PF_KTHREAD))
> > +			continue;
> > +
> > +		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> > +			failed = true;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return failed ? -EBUSY : 0;
> 
> Why are you returning EBUSY? This makes no sense imho.
> 
> > +}
> > +
> >  static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  {
> >  	struct task_struct *p;
> > @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	unsigned int f_flags;
> >  	bool reap = false;
> >  	long ret = 0;
> > +	bool reap_kill;
> >  
> > -	if (flags)
> > +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >  		return -EINVAL;
> > +	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> >  
> >  	task = pidfd_get_task(pidfd, &f_flags);
> >  	if (IS_ERR(task))
> > @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	}
> >  
> >  	mm = p->mm;
> > -	mmgrab(mm);
> >  
> > -	if (task_will_free_mem(p))
> > -		reap = true;
> > -	else {
> > +	reap = reap_kill || task_will_free_mem(p);
> > +	if (!reap) {
> >  		/* Error only if the work has not been done already */
> >  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >  			ret = -EINVAL;
> > +		task_unlock(p);
> > +		goto put_task;
> >  	}
> > +
> > +	mmgrab(mm);
> >  	task_unlock(p);
> >  
> > -	if (!reap)
> > -		goto drop_mm;
> > +	if (reap_kill) {
> > +		ret = kill_all_shared_mm(task, mm);
> > +		if (ret)
> > +			goto drop_mm;
> > +	}
> >  
> >  	if (mmap_read_lock_killable(mm)) {
> >  		ret = -EINTR;
> > -- 
> > 2.54.0.563.g4f69b47b94-goog
> > 
> >

     prev parent reply	other threads:[~2026-05-15 20:52 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-05-15 14:41 ` Christian Brauner
2026-05-15 15:49   ` Oleg Nesterov
2026-05-15 20:15   ` Oleg Nesterov
2026-05-15 22:33     ` Minchan Kim
2026-05-15 23:45       ` Linus Torvalds
2026-05-16  0:00         ` Linus Torvalds
2026-05-16  5:47         ` Minchan Kim
2026-05-15 20:42   ` Suren Baghdasaryan
2026-05-15 20:52   ` Minchan Kim [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ageHmE1QIzK4R3nO@google.com \
    --to=minchan@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@kernel.org \
    --cc=hca@linux.ibm.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.