[PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
@ 2026-05-11 21:42 Minchan Kim
  2026-05-15 14:41 ` Christian Brauner
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2026-05-11 21:42 UTC (permalink / raw)
  To: akpm
  Cc: hca, linux-s390, david, mhocko, brauner, linux-mm, linux-kernel,
	surenb, timmurray, Minchan Kim

Currently, process_mrelease() requires userspace to send a SIGKILL signal
prior to invocation. This separation introduces a scheduling race window
where the victim task may receive the signal and enter the exit path
before the reaper can invoke process_mrelease().

When the victim enters the exit path (do_exit -> exit_mm), it clears its
task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
leaving the actual address space teardown (exit_mmap) to be deferred until
the mm's reference count drops to zero. In the field (e.g., Android),
arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
remote VM accesses) frequently delay this teardown indefinitely,
defeating the purpose of expedited reclamation.

In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
the system to unnecessarily kill additional innocent background apps before
the memory from the first victim is recovered.

This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
an integrated auto-kill mode. When specified, process_mrelease() directly
injects a SIGKILL into the target task after finding its mm.

To solve the race condition, we grab the mm reference via mmgrab() before
sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
it will free its memory and proceed with reaping, making the logic as simple
as reap = reap_kill || task_will_free_mem(p).

To handle shared address spaces, we deliver SIGKILL to all processes sharing
the same address space using do_pidfd_send_signal_pidns(). This ensures the
target pid resides inside the caller's PID namespace hierarchy prior to
signal delivery. We iterate over all processes sharing the mm and deliver
SIGKILL to each. If delivering the signal to any of the sharing processes
fails, we return an error. Note that this approach may leave partial
side-effects if some processes are killed successfully before a failure occurs.

Cc: Christian Brauner <brauner@kernel.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/signal.h    |  4 +++
 include/uapi/linux/mman.h |  4 +++
 kernel/signal.c           | 29 ++++++++++++++++++---
 mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
 4 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index f19816832f05..bdbe6b3addec 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
 
 struct timespec;
 struct pt_regs;
+struct mm_struct;
+struct pid;
 enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
@@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
 				struct task_struct *p, enum pid_type type);
 extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
 			       struct task_struct *p, enum pid_type type);
+extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+				      siginfo_t __user *info, unsigned int flags);
 extern int send_signal_locked(int sig, struct kernel_siginfo *info,
 			      struct task_struct *p, enum pid_type type);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..b2dc08a9bdd3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
 	return kill_pid_info_type(sig, &kinfo, pid, type);
 }
 
+/**
+ * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
+ *                              while validating PID namespace hierarchy.
+ * @pid:   the struct pid of the target process
+ * @sig:   signal to send
+ * @type:  scope of the signal (e.g. PIDTYPE_TGID)
+ * @info:  signal info payload
+ * @flags: signaling flags
+ *
+ * Verify that the target pid resides inside the caller's PID namespace
+ * hierarchy prior to signal delivery.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+			       siginfo_t __user *info, unsigned int flags)
+{
+	/* Enforce PID namespace hierarchy boundary */
+	if (!access_pidfd_pidns(pid))
+		return -EINVAL;
+
+	return do_pidfd_send_signal(pid, sig, type, info, flags);
+}
+
 /**
  * sys_pidfd_send_signal - Signal a process through a pidfd
  * @pidfd:  file descriptor of the process
@@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		if (IS_ERR(pid))
 			return PTR_ERR(pid);
 
-		if (!access_pidfd_pidns(pid))
-			return -EINVAL;
-
 		/* Infer scope from the type of pidfd. */
 		if (fd_file(f)->f_flags & PIDFD_THREAD)
 			type = PIDTYPE_PID;
 		else
 			type = PIDTYPE_TGID;
 
-		return do_pidfd_send_signal(pid, sig, type, info, flags);
+		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
 	}
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..253aa80770f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,7 @@
 
 #include <linux/oom.h>
 #include <linux/mm.h>
+#include <uapi/linux/mman.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
@@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
+/*
+ * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
+ * @victim: the targeted OOM process group leader
+ * @mm:     the virtual memory space being reaped
+ *
+ * Traverse all threads globally and signal any user processes sharing the identical
+ * mm footprints, ensuring no concurrent users pin the memory. Skips the system
+ * global init and kernel worker threads.
+ */
+static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
+{
+	struct task_struct *p;
+	bool failed = false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (is_global_init(p)) {
+			failed = true;
+			continue;
+		}
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+
+		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
+			failed = true;
+	}
+	rcu_read_unlock();
+
+	return failed ? -EBUSY : 0;
+}
+
 static void __oom_kill_process(struct task_struct *victim, const char *message)
 {
 	struct task_struct *p;
@@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	bool reap_kill;
 
-	if (flags)
+	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
 		return -EINVAL;
+	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
 
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task))
@@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	}
 
 	mm = p->mm;
-	mmgrab(mm);
 
-	if (task_will_free_mem(p))
-		reap = true;
-	else {
+	reap = reap_kill || task_will_free_mem(p);
+	if (!reap) {
 		/* Error only if the work has not been done already */
 		if (!mm_flags_test(MMF_OOM_SKIP, mm))
 			ret = -EINVAL;
+		task_unlock(p);
+		goto put_task;
 	}
+
+	mmgrab(mm);
 	task_unlock(p);
 
-	if (!reap)
-		goto drop_mm;
+	if (reap_kill) {
+		ret = kill_all_shared_mm(task, mm);
+		if (ret)
+			goto drop_mm;
+	}
 
 	if (mmap_read_lock_killable(mm)) {
 		ret = -EINTR;
-- 
2.54.0.563.g4f69b47b94-goog



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
@ 2026-05-15 14:41 ` Christian Brauner
  2026-05-15 15:49   ` Oleg Nesterov
                     ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Christian Brauner @ 2026-05-15 14:41 UTC (permalink / raw)
  To: Minchan Kim, Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: akpm, hca, linux-s390, david, mhocko, linux-mm, linux-kernel,
	surenb, timmurray

On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> Currently, process_mrelease() requires userspace to send a SIGKILL signal
> prior to invocation. This separation introduces a scheduling race window
> where the victim task may receive the signal and enter the exit path
> before the reaper can invoke process_mrelease().
> 
> When the victim enters the exit path (do_exit -> exit_mm), it clears its
> task->mm immediately. This causes process_mrelease() to fail with -ESRCH,

To be quite frank about the patch as written below: I think this is a
completely crazy api. And I really dislike it.

Right now we have clear and simple signal sending semantics for pidfds:

* thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
* thread-group pidfd -> thread-group directed signal

with specific overrides for pidfd_send_signal():

* PIDFD_SIGNAL_THREAD           -> only signal thread
* PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
* PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group

And now this patch aims to elevate process_mrelease() to a _signal
sending function_. And the semantics are complete special sauce too.

You are effectively introducing a custom signal scope that is mm-based
and then also plumbing it into a completely unrelated function that
should have absolutely nothing to do with this.

This is such Zebroid API. I really would hate to see it land.

This came up because of the ptrace bug that was just recently discovered
and that Linus fixed yesterday. This is another instance where I think
the correct fix is to keep task->mm around until the process is reaped
and then you can throw away all of the really ugly semantic in this
patch afaict. I'd really like to see that merged as it would also clean
up the ptrace code:

https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

> leaving the actual address space teardown (exit_mmap) to be deferred until
> the mm's reference count drops to zero. In the field (e.g., Android),
> arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> remote VM accesses) frequently delay this teardown indefinitely,
> defeating the purpose of expedited reclamation.
> 
> In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> the system to unnecessarily kill additional innocent background apps before
> the memory from the first victim is recovered.
> 
> This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> an integrated auto-kill mode. When specified, process_mrelease() directly
> injects a SIGKILL into the target task after finding its mm.
> 
> To solve the race condition, we grab the mm reference via mmgrab() before
> sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> it will free its memory and proceed with reaping, making the logic as simple
> as reap = reap_kill || task_will_free_mem(p).
> 
> To handle shared address spaces, we deliver SIGKILL to all processes sharing
> the same address space using do_pidfd_send_signal_pidns(). This ensures the
> target pid resides inside the caller's PID namespace hierarchy prior to
> signal delivery. We iterate over all processes sharing the mm and deliver
> SIGKILL to each. If delivering the signal to any of the sharing processes
> fails, we return an error. Note that this approach may leave partial
> side-effects if some processes are killed successfully before a failure occurs.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/signal.h    |  4 +++
>  include/uapi/linux/mman.h |  4 +++
>  kernel/signal.c           | 29 ++++++++++++++++++---
>  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
>  4 files changed, 81 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index f19816832f05..bdbe6b3addec 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
>  
>  struct timespec;
>  struct pt_regs;
> +struct mm_struct;
> +struct pid;
>  enum pid_type;
>  
>  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
>  				struct task_struct *p, enum pid_type type);
>  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
>  			       struct task_struct *p, enum pid_type type);
> +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> +				      siginfo_t __user *info, unsigned int flags);
>  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
>  			      struct task_struct *p, enum pid_type type);
>  extern int sigprocmask(int, sigset_t *, sigset_t *);
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index e89d00528f2f..4266976b45ad 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -56,4 +56,8 @@ struct cachestat {
>  	__u64 nr_recently_evicted;
>  };
>  
> +/* Flags for process_mrelease */
> +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d65d0fe24bfb..b2dc08a9bdd3 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
>  	return kill_pid_info_type(sig, &kinfo, pid, type);
>  }
>  
> +/**
> + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> + *                              while validating PID namespace hierarchy.
> + * @pid:   the struct pid of the target process
> + * @sig:   signal to send
> + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> + * @info:  signal info payload
> + * @flags: signaling flags
> + *
> + * Verify that the target pid resides inside the caller's PID namespace
> + * hierarchy prior to signal delivery.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> +			       siginfo_t __user *info, unsigned int flags)
> +{
> +	/* Enforce PID namespace hierarchy boundary */
> +	if (!access_pidfd_pidns(pid))
> +		return -EINVAL;
> +
> +	return do_pidfd_send_signal(pid, sig, type, info, flags);
> +}
> +
>  /**
>   * sys_pidfd_send_signal - Signal a process through a pidfd
>   * @pidfd:  file descriptor of the process
> @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>  		if (IS_ERR(pid))
>  			return PTR_ERR(pid);
>  
> -		if (!access_pidfd_pidns(pid))
> -			return -EINVAL;
> -
>  		/* Infer scope from the type of pidfd. */
>  		if (fd_file(f)->f_flags & PIDFD_THREAD)
>  			type = PIDTYPE_PID;
>  		else
>  			type = PIDTYPE_TGID;
>  
> -		return do_pidfd_send_signal(pid, sig, type, info, flags);
> +		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
>  	}
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5c6c95c169ee..253aa80770f2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -20,6 +20,7 @@
>  
>  #include <linux/oom.h>
>  #include <linux/mm.h>
> +#include <uapi/linux/mman.h>
>  #include <linux/err.h>
>  #include <linux/gfp.h>
>  #include <linux/sched.h>
> @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
>  	return ret;
>  }
>  
> +/*
> + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> + * @victim: the targeted OOM process group leader
> + * @mm:     the virtual memory space being reaped
> + *
> + * Traverse all threads globally and signal any user processes sharing the identical
> + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> + * global init and kernel worker threads.
> + */
> +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> +{
> +	struct task_struct *p;
> +	bool failed = false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (is_global_init(p)) {

You can't signal init in any shape or form any way. Why bother reporting
failure at all.

> +			failed = true;
> +			continue;
> +		}
> +		if (unlikely(p->flags & PF_KTHREAD))
> +			continue;
> +
> +		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> +			failed = true;
> +	}
> +	rcu_read_unlock();
> +
> +	return failed ? -EBUSY : 0;

Why are you returning EBUSY? This makes no sense imho.

> +}
> +
>  static void __oom_kill_process(struct task_struct *victim, const char *message)
>  {
>  	struct task_struct *p;
> @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	unsigned int f_flags;
>  	bool reap = false;
>  	long ret = 0;
> +	bool reap_kill;
>  
> -	if (flags)
> +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
>  		return -EINVAL;
> +	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
>  
>  	task = pidfd_get_task(pidfd, &f_flags);
>  	if (IS_ERR(task))
> @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	}
>  
>  	mm = p->mm;
> -	mmgrab(mm);
>  
> -	if (task_will_free_mem(p))
> -		reap = true;
> -	else {
> +	reap = reap_kill || task_will_free_mem(p);
> +	if (!reap) {
>  		/* Error only if the work has not been done already */
>  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
>  			ret = -EINVAL;
> +		task_unlock(p);
> +		goto put_task;
>  	}
> +
> +	mmgrab(mm);
>  	task_unlock(p);
>  
> -	if (!reap)
> -		goto drop_mm;
> +	if (reap_kill) {
> +		ret = kill_all_shared_mm(task, mm);
> +		if (ret)
> +			goto drop_mm;
> +	}
>  
>  	if (mmap_read_lock_killable(mm)) {
>  		ret = -EINTR;
> -- 
> 2.54.0.563.g4f69b47b94-goog
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
@ 2026-05-15 15:49   ` Oleg Nesterov
  2026-05-15 20:15   ` Oleg Nesterov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Oleg Nesterov @ 2026-05-15 15:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Jann Horn, Linus Torvalds, akpm, hca, linux-s390,
	david, mhocko, linux-mm, linux-kernel, surenb, timmurray

On 05/15, Christian Brauner wrote:
>
> Right now we have clear and simple signal sending semantics for pidfds:
>
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal

I don't even understand why this patch uses pidfd to send SIGKILL...

> > +	reap = reap_kill || task_will_free_mem(p);
> > +	if (!reap) {
> >  		/* Error only if the work has not been done already */
> >  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >  			ret = -EINVAL;
> > +		task_unlock(p);
> > +		goto put_task;
> >  	}
> > +
> > +	mmgrab(mm);
> >  	task_unlock(p);
> >
> > -	if (!reap)
> > -		goto drop_mm;
> > +	if (reap_kill) {
> > +		ret = kill_all_shared_mm(task, mm);

If task_will_free_mem() above succeeded, we do not need to call
kill_all_shared_mm() ?

Oleg.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
  2026-05-15 15:49   ` Oleg Nesterov
@ 2026-05-15 20:15   ` Oleg Nesterov
  2026-05-15 22:33     ` Minchan Kim
  2026-05-15 20:42   ` Suren Baghdasaryan
  2026-05-15 20:52   ` Minchan Kim
  3 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2026-05-15 20:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Jann Horn, Linus Torvalds, akpm, hca, linux-s390,
	david, mhocko, linux-mm, linux-kernel, surenb, timmurray

In fact I don't even understand the motivation...

On 05/15, Christian Brauner wrote:
>
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,

Sure, get_task_cmdline() can delay mmput(). But indefinitely ?

Perhaps the changelog could be more clear? I don't see how any remote VM access
can pin mm->mm_users "indefinitely". Even if, say, a lot of threads read
/proc/<pid>/cmdline in an endless loop in parallel...

I must have missed something.

Oleg.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
  2026-05-15 15:49   ` Oleg Nesterov
  2026-05-15 20:15   ` Oleg Nesterov
@ 2026-05-15 20:42   ` Suren Baghdasaryan
  2026-05-15 20:52   ` Minchan Kim
  3 siblings, 0 replies; 12+ messages in thread
From: Suren Baghdasaryan @ 2026-05-15 20:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Jann Horn, Linus Torvalds, Oleg Nesterov, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, timmurray

On Fri, May 15, 2026 at 7:41 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> >
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
>
> To be quite frank about the patch as written below: I think this is a
> completely crazy api. And I really dislike it.
>
> Right now we have clear and simple signal sending semantics for pidfds:
>
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal
>
> with specific overrides for pidfd_send_signal():
>
> * PIDFD_SIGNAL_THREAD           -> only signal thread
> * PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
> * PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group
>
> And now this patch aims to elevate process_mrelease() to a _signal
> sending function_. And the semantics are complete special sauce too.
>
> You are effectively introducing a custom signal scope that is mm-based
> and then also plumbing it into a completely unrelated function that
> should have absolutely nothing to do with this.
>
> This is such Zebroid API. I really would hate to see it land.
>
> This came up because of the ptrace bug that was just recently discovered
> and that Linus fixed yesterday. This is another instance where I think
> the correct fix is to keep task->mm around until the process is reaped
> and then you can throw away all of the really ugly semantic in this
> patch afaict. I'd really like to see that merged as it would also clean
> up the ptrace code:
>
> https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

Yeah, I think this patchset would help process_mrelease() find the mm
to reap even after the process has passed its exit_mm(). We just need
to add:

if (!mm)
        mm = task->exit_mm;

Inside process_mrelease(). I had a similar idea of stashing mm and the
approach seems sane to me. Keeping UAPI unchanged is another upside
IMHO.

>
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,
> > defeating the purpose of expedited reclamation.
> >
> > In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> > the system to unnecessarily kill additional innocent background apps before
> > the memory from the first victim is recovered.
> >
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task after finding its mm.
> >
> > To solve the race condition, we grab the mm reference via mmgrab() before
> > sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> > it will free its memory and proceed with reaping, making the logic as simple
> > as reap = reap_kill || task_will_free_mem(p).
> >
> > To handle shared address spaces, we deliver SIGKILL to all processes sharing
> > the same address space using do_pidfd_send_signal_pidns(). This ensures the
> > target pid resides inside the caller's PID namespace hierarchy prior to
> > signal delivery. We iterate over all processes sharing the mm and deliver
> > SIGKILL to each. If delivering the signal to any of the sharing processes
> > fails, we return an error. Note that this approach may leave partial
> > side-effects if some processes are killed successfully before a failure occurs.
> >
> > Cc: Christian Brauner <brauner@kernel.org>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/signal.h    |  4 +++
> >  include/uapi/linux/mman.h |  4 +++
> >  kernel/signal.c           | 29 ++++++++++++++++++---
> >  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
> >  4 files changed, 81 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index f19816832f05..bdbe6b3addec 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
> >
> >  struct timespec;
> >  struct pt_regs;
> > +struct mm_struct;
> > +struct pid;
> >  enum pid_type;
> >
> >  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> > @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
> >                               struct task_struct *p, enum pid_type type);
> >  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
> >                              struct task_struct *p, enum pid_type type);
> > +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +                                   siginfo_t __user *info, unsigned int flags);
> >  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> >                             struct task_struct *p, enum pid_type type);
> >  extern int sigprocmask(int, sigset_t *, sigset_t *);
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >       __u64 nr_recently_evicted;
> >  };
> >
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL   (1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS (PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..b2dc08a9bdd3 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
> >       return kill_pid_info_type(sig, &kinfo, pid, type);
> >  }
> >
> > +/**
> > + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> > + *                              while validating PID namespace hierarchy.
> > + * @pid:   the struct pid of the target process
> > + * @sig:   signal to send
> > + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> > + * @info:  signal info payload
> > + * @flags: signaling flags
> > + *
> > + * Verify that the target pid resides inside the caller's PID namespace
> > + * hierarchy prior to signal delivery.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +                            siginfo_t __user *info, unsigned int flags)
> > +{
> > +     /* Enforce PID namespace hierarchy boundary */
> > +     if (!access_pidfd_pidns(pid))
> > +             return -EINVAL;
> > +
> > +     return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +}
> > +
> >  /**
> >   * sys_pidfd_send_signal - Signal a process through a pidfd
> >   * @pidfd:  file descriptor of the process
> > @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >               if (IS_ERR(pid))
> >                       return PTR_ERR(pid);
> >
> > -             if (!access_pidfd_pidns(pid))
> > -                     return -EINVAL;
> > -
> >               /* Infer scope from the type of pidfd. */
> >               if (fd_file(f)->f_flags & PIDFD_THREAD)
> >                       type = PIDTYPE_PID;
> >               else
> >                       type = PIDTYPE_TGID;
> >
> > -             return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +             return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
> >       }
> >       }
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..253aa80770f2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >
> >  #include <linux/oom.h>
> >  #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> >  #include <linux/err.h>
> >  #include <linux/gfp.h>
> >  #include <linux/sched.h>
> > @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
> >       return ret;
> >  }
> >
> > +/*
> > + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> > + * @victim: the targeted OOM process group leader
> > + * @mm:     the virtual memory space being reaped
> > + *
> > + * Traverse all threads globally and signal any user processes sharing the identical
> > + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> > + * global init and kernel worker threads.
> > + */
> > +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> > +{
> > +     struct task_struct *p;
> > +     bool failed = false;
> > +
> > +     rcu_read_lock();
> > +     for_each_process(p) {
> > +             if (!process_shares_mm(p, mm))
> > +                     continue;
> > +             if (is_global_init(p)) {
>
> You can't signal init in any shape or form any way. Why bother reporting
> failure at all.
>
> > +                     failed = true;
> > +                     continue;
> > +             }
> > +             if (unlikely(p->flags & PF_KTHREAD))
> > +                     continue;
> > +
> > +             if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> > +                     failed = true;
> > +     }
> > +     rcu_read_unlock();
> > +
> > +     return failed ? -EBUSY : 0;
>
> Why are you returning EBUSY? This makes no sense imho.
>
> > +}
> > +
> >  static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  {
> >       struct task_struct *p;
> > @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >       unsigned int f_flags;
> >       bool reap = false;
> >       long ret = 0;
> > +     bool reap_kill;
> >
> > -     if (flags)
> > +     if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >               return -EINVAL;
> > +     reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> >
> >       task = pidfd_get_task(pidfd, &f_flags);
> >       if (IS_ERR(task))
> > @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >       }
> >
> >       mm = p->mm;
> > -     mmgrab(mm);
> >
> > -     if (task_will_free_mem(p))
> > -             reap = true;
> > -     else {
> > +     reap = reap_kill || task_will_free_mem(p);
> > +     if (!reap) {
> >               /* Error only if the work has not been done already */
> >               if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >                       ret = -EINVAL;
> > +             task_unlock(p);
> > +             goto put_task;
> >       }
> > +
> > +     mmgrab(mm);
> >       task_unlock(p);
> >
> > -     if (!reap)
> > -             goto drop_mm;
> > +     if (reap_kill) {
> > +             ret = kill_all_shared_mm(task, mm);
> > +             if (ret)
> > +                     goto drop_mm;
> > +     }
> >
> >       if (mmap_read_lock_killable(mm)) {
> >               ret = -EINTR;
> > --
> > 2.54.0.563.g4f69b47b94-goog
> >
> >


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
                     ` (2 preceding siblings ...)
  2026-05-15 20:42   ` Suren Baghdasaryan
@ 2026-05-15 20:52   ` Minchan Kim
  3 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2026-05-15 20:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Linus Torvalds, Oleg Nesterov, akpm, hca, linux-s390,
	david, mhocko, linux-mm, linux-kernel, surenb, timmurray

On Fri, May 15, 2026 at 04:41:18PM +0200, Christian Brauner wrote:
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> > 
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
> 
> To be quite frank about the patch as written below: I think this is a
> completely crazy api. And I really dislike it.
> 
> Right now we have clear and simple signal sending semantics for pidfds:
> 
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal
> 
> with specific overrides for pidfd_send_signal():
> 
> * PIDFD_SIGNAL_THREAD           -> only signal thread
> * PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
> * PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group
> 
> And now this patch aims to elevate process_mrelease() to a _signal
> sending function_. And the semantics are complete special sauce too.
> 
> You are effectively introducing a custom signal scope that is mm-based
> and then also plumbing it into a completely unrelated function that
> should have absolutely nothing to do with this.
> 
> This is such Zebroid API. I really would hate to see it land.
> 
> This came up because of the ptrace bug that was just recently discovered
> and that Linus fixed yesterday. This is another instance where I think
> the correct fix is to keep task->mm around until the process is reaped
> and then you can throw away all of the really ugly semantic in this
> patch afaict. I'd really like to see that merged as it would also clean
> up the ptrace code:
> 
> https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

Thank you for pointing this out. I completely agree.

The only reason for adding PROCESS_MRELEASE_REAP_KILL and its complex signaling
was to work around task->mm being cleared early in exit_mm(), which causes
process_mrelease() to fail with -ESRCH.

If Jann's patch to keep task->mm around is merged, this race naturally vanishes.
Userspace can reliably send SIGKILL and call the normal process_mrelease()
with the exit_mm

I would be thrilled to see Jann's patch merged. It allows us to drop this
complex REAP_KILL patch entirely while still fully resolving the expedited
reclaim issue.

> 
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,
> > defeating the purpose of expedited reclamation.
> > 
> > In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> > the system to unnecessarily kill additional innocent background apps before
> > the memory from the first victim is recovered.
> > 
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task after finding its mm.
> > 
> > To solve the race condition, we grab the mm reference via mmgrab() before
> > sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> > it will free its memory and proceed with reaping, making the logic as simple
> > as reap = reap_kill || task_will_free_mem(p).
> > 
> > To handle shared address spaces, we deliver SIGKILL to all processes sharing
> > the same address space using do_pidfd_send_signal_pidns(). This ensures the
> > target pid resides inside the caller's PID namespace hierarchy prior to
> > signal delivery. We iterate over all processes sharing the mm and deliver
> > SIGKILL to each. If delivering the signal to any of the sharing processes
> > fails, we return an error. Note that this approach may leave partial
> > side-effects if some processes are killed successfully before a failure occurs.
> > 
> > Cc: Christian Brauner <brauner@kernel.org>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/signal.h    |  4 +++
> >  include/uapi/linux/mman.h |  4 +++
> >  kernel/signal.c           | 29 ++++++++++++++++++---
> >  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
> >  4 files changed, 81 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index f19816832f05..bdbe6b3addec 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
> >  
> >  struct timespec;
> >  struct pt_regs;
> > +struct mm_struct;
> > +struct pid;
> >  enum pid_type;
> >  
> >  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> > @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
> >  				struct task_struct *p, enum pid_type type);
> >  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
> >  			       struct task_struct *p, enum pid_type type);
> > +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +				      siginfo_t __user *info, unsigned int flags);
> >  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> >  			      struct task_struct *p, enum pid_type type);
> >  extern int sigprocmask(int, sigset_t *, sigset_t *);
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >  	__u64 nr_recently_evicted;
> >  };
> >  
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..b2dc08a9bdd3 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
> >  	return kill_pid_info_type(sig, &kinfo, pid, type);
> >  }
> >  
> > +/**
> > + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> > + *                              while validating PID namespace hierarchy.
> > + * @pid:   the struct pid of the target process
> > + * @sig:   signal to send
> > + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> > + * @info:  signal info payload
> > + * @flags: signaling flags
> > + *
> > + * Verify that the target pid resides inside the caller's PID namespace
> > + * hierarchy prior to signal delivery.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +			       siginfo_t __user *info, unsigned int flags)
> > +{
> > +	/* Enforce PID namespace hierarchy boundary */
> > +	if (!access_pidfd_pidns(pid))
> > +		return -EINVAL;
> > +
> > +	return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +}
> > +
> >  /**
> >   * sys_pidfd_send_signal - Signal a process through a pidfd
> >   * @pidfd:  file descriptor of the process
> > @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  		if (IS_ERR(pid))
> >  			return PTR_ERR(pid);
> >  
> > -		if (!access_pidfd_pidns(pid))
> > -			return -EINVAL;
> > -
> >  		/* Infer scope from the type of pidfd. */
> >  		if (fd_file(f)->f_flags & PIDFD_THREAD)
> >  			type = PIDTYPE_PID;
> >  		else
> >  			type = PIDTYPE_TGID;
> >  
> > -		return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
> >  	}
> >  	}
> >  
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..253aa80770f2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >  
> >  #include <linux/oom.h>
> >  #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> >  #include <linux/err.h>
> >  #include <linux/gfp.h>
> >  #include <linux/sched.h>
> > @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> > + * @victim: the targeted OOM process group leader
> > + * @mm:     the virtual memory space being reaped
> > + *
> > + * Traverse all threads globally and signal any user processes sharing the identical
> > + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> > + * global init and kernel worker threads.
> > + */
> > +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> > +{
> > +	struct task_struct *p;
> > +	bool failed = false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		if (!process_shares_mm(p, mm))
> > +			continue;
> > +		if (is_global_init(p)) {
> 
> You can't signal init in any shape or form any way. Why bother reporting
> failure at all.
> 
> > +			failed = true;
> > +			continue;
> > +		}
> > +		if (unlikely(p->flags & PF_KTHREAD))
> > +			continue;
> > +
> > +		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> > +			failed = true;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return failed ? -EBUSY : 0;
> 
> Why are you returning EBUSY? This makes no sense imho.
> 
> > +}
> > +
> >  static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  {
> >  	struct task_struct *p;
> > @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	unsigned int f_flags;
> >  	bool reap = false;
> >  	long ret = 0;
> > +	bool reap_kill;
> >  
> > -	if (flags)
> > +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >  		return -EINVAL;
> > +	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> >  
> >  	task = pidfd_get_task(pidfd, &f_flags);
> >  	if (IS_ERR(task))
> > @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	}
> >  
> >  	mm = p->mm;
> > -	mmgrab(mm);
> >  
> > -	if (task_will_free_mem(p))
> > -		reap = true;
> > -	else {
> > +	reap = reap_kill || task_will_free_mem(p);
> > +	if (!reap) {
> >  		/* Error only if the work has not been done already */
> >  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >  			ret = -EINVAL;
> > +		task_unlock(p);
> > +		goto put_task;
> >  	}
> > +
> > +	mmgrab(mm);
> >  	task_unlock(p);
> >  
> > -	if (!reap)
> > -		goto drop_mm;
> > +	if (reap_kill) {
> > +		ret = kill_all_shared_mm(task, mm);
> > +		if (ret)
> > +			goto drop_mm;
> > +	}
> >  
> >  	if (mmap_read_lock_killable(mm)) {
> >  		ret = -EINTR;
> > -- 
> > 2.54.0.563.g4f69b47b94-goog
> > 
> > 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 20:15   ` Oleg Nesterov
@ 2026-05-15 22:33     ` Minchan Kim
  2026-05-15 23:45       ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2026-05-15 22:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Jann Horn, Linus Torvalds, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, May 15, 2026 at 10:15:53PM +0200, Oleg Nesterov wrote:
> In fact I don't even understand the motivation...
> 
> On 05/15, Christian Brauner wrote:
> >
> > On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > > leaving the actual address space teardown (exit_mmap) to be deferred until
> > > the mm's reference count drops to zero. In the field (e.g., Android),
> > > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > > remote VM accesses) frequently delay this teardown indefinitely,
> 
> Sure, get_task_cmdline() can delay mmput(). But indefinitely ?
> 
> Perhaps the changelog could be more clear? I don't see how any remote VM access
> can pin mm->mm_users "indefinitely". Even if, say, a lot of threads read
> /proc/<pid>/cmdline in an endless loop in parallel...
> 
> I must have missed something.

Thank you for the review and questions. You are entirely right that under normal
uncongested conditions, a /proc reader drops mmput() quickly.

However, on any heavily loaded system under severe memory/CPU pressure, this delay
can be long enough to cause cascading issues. Here is exactly how this occurs
and why it acts as an indefinite delay from an emergency reclaim perspective.

When memory pressure is critical, a userspace OOM killer terminates a large
victim process. Simultaneously, another process (such as a monitoring tool) is
reading /proc/<pid>/smaps or cmdline. Because the system is heavily loaded, the
reader thread on CPU C can get preempted or blocked while holding mmget().

When the dying victim executes exit_mm(), mm_users drops from 2 to 1. Thus,
exit_mmap() does not run. For hundreds of milliseconds or seconds, the memory
remains fully trapped. The userspace OOM policy sees that memory is still
critically low and unnecessarily kills additional innocent processes.

Here is the exact timing chart illustrating the existing problem and why
process_mrelease() fails in this scenario:

CPU A (Userspace OOM Killer)      CPU B (Victim Task)      CPU C (/proc Reader)
----------------------------      -------------------      --------------------
						       open(/proc/pid/smaps)
						       get_task_mm()
						       [mm_users++ => 2]
						       (Preempted/Stalled)
							          |
1. Sends SIGKILL                                                  |
2. Victim receives SIGKILL                                        |
			      do_exit()                           |
				exit_mm()                         |
				  task->mm = NULL                 |
				  mmput() [mm_users => 1]         |
				  (Memory NOT freed!)             |
							          |
3. Calls process_mrelease()                                       |
							          |
find_lock_task_mm() sees task->mm == NULL                         |
Returns -ESRCH. Reaping fails!                                    |
(Memory remains trapped until CPU C finally finishes!) <==========/

I hope thisclarifies the motivation and mechanics behind this issue.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 22:33     ` Minchan Kim
@ 2026-05-15 23:45       ` Linus Torvalds
  2026-05-16  0:00         ` Linus Torvalds
  2026-05-16  5:47         ` Minchan Kim
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2026-05-15 23:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, 15 May 2026 at 15:33, Minchan Kim <minchan@kernel.org> wrote:
>
> I hope thisclarifies the motivation and mechanics behind this issue.

This still seems very hacky and a complete special case for one odd situation.

This all sounds like it's just because smap is a pig.

And yes, smap *is* a pig, but it should be trivial to just fix smap
for this case - fix the problem spot, don't add new horrid logic
elsewhere.

I really think the fix is to fix smap instead.

And I think smap is doing odd things. For example, it does

   pid_smaps_open -> do_maps_open -> proc_mem_open

and that proc_mem_open() takes that long-term ref to the mm. And then
does various memory allocations - and copying data to user space -
under that long-term ref, which is presumably what causes all the
latency issues.

But it doesn't actually seem to *need* a long-term ref to the mm. The
seqfile interface is designed so that it should all be chunkable, and
the locks and refs should be done at m_start/m_end time.

And the smap / maps m_start and m_end functions already *almost* seem
to do that. They literally look up the task again with

        priv->task = get_proc_task(priv->inode);

etc, but then they do that odd

        lock_ctx = &priv->lock_ctx;
        mm = lock_ctx->mm;
        if (!mm || !mmget_not_zero(mm)) {
                put_task_struct(priv->task);
                priv->task = NULL;
                return NULL;
        }

dance (where lock_ctx->mm is literally that long-term ref we hold).

And I don't see why they need to do any of this. I think it's all a
historical accident.

Because I think it could look up the mm from the task pointer every
time, without holding that long-term ref from proc_mem_open() at all.

IOW, at open time, we could save off the "this is the mm I opened",
but *without* any refcount, and then just verify that "yes, the task
mm still matches". No long-term refs anywhere.

Yes, yes, we'd need some sequence counter for when the mm changes due
to execve, but *that* should be absolutely trivial.

And wouldn't that make all of this go away entirely? And probably
clean up the code at the same time?

I think the only reason "proc_mem_open()" does what it does was that
it was simple. Not because it was a good idea.

                   Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 23:45       ` Linus Torvalds
@ 2026-05-16  0:00         ` Linus Torvalds
  2026-05-16  5:47         ` Minchan Kim
  1 sibling, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2026-05-16  0:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, 15 May 2026 at 16:45, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes, yes, we'd need some sequence counter for when the mm changes due
> to execve, but *that* should be absolutely trivial.

I guess we already  have it in the form of "self_exec_id". So the
tuple <mm,self_exec_id> should basically guarantee it's the same mm as
it was at open time.

Except it's not updated at quite the right time. If it was done inside
the task lock when task->mm was actually set in exec_mmap(), it would
work as-is.

This isn't exactly what that 'self_exec_id' was designed for, but it
*is* a sequence number for the mm of a task. So if a 'mm' pointer
value gets reused, you can tell that it's not the same mm as it used
to be by looking at the exec_id.

And I think the only reason for it being done outside the task lock is
that the current use is all synchronous to the task itself, so locking
or location simply didn't matter.

But there's some mqueue signal handling thing that looks like it's
actually violating that situation and would race with execve().

            Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 23:45       ` Linus Torvalds
  2026-05-16  0:00         ` Linus Torvalds
@ 2026-05-16  5:47         ` Minchan Kim
  2026-05-16 16:31           ` Linus Torvalds
  1 sibling, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2026-05-16  5:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, May 15, 2026 at 04:45:38PM -0700, Linus Torvalds wrote:
> On Fri, 15 May 2026 at 15:33, Minchan Kim <minchan@kernel.org> wrote:
> >
> > I hope thisclarifies the motivation and mechanics behind this issue.
> 
> This still seems very hacky and a complete special case for one odd situation.

Hi Linus,

Thank you for looking into this issue and sharing your thoughts.

Regarding proc_mem_open(), it actually operates very close to what you suggested.
It acquires a reference to the mm_struct itself via mmgrab() but immediately
unpins the address space memory via mmput(). Thus, no long-term mm_users
reference is held across the open file descriptor.

The latency issue occurs during seqfile iteration (m_start/m_stop) in
smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
temporarily acquires mm_users via mmget_not_zero() or get_task_mm().

Because these monitoring processes or background readers often run with lower
priority on heavily loaded systems under memory pressure, their temporary
mmget() reference can easily be delayed or preempted for extended periods.

When the victim process exits, its memory teardown (exit_mmap) is blocked by
this stalled reference. Because any kernel interface acquiring mmget() can
potentially cause this delay under heavy load, the issue appears more general
than just a single file interface.

This is why I felt allowing process_mrelease() to directly acquire the dying
address space could be a clean and robust way to resolve this expedited reclaim
issue, without modifying individual callers across the kernel. I would highly
appreciate your thoughts on this perspective.

> 
> This all sounds like it's just because smap is a pig.
> 
> And yes, smap *is* a pig, but it should be trivial to just fix smap
> for this case - fix the problem spot, don't add new horrid logic
> elsewhere.
> 
> I really think the fix is to fix smap instead.
> 
> And I think smap is doing odd things. For example, it does
> 
>    pid_smaps_open -> do_maps_open -> proc_mem_open
> 
> and that proc_mem_open() takes that long-term ref to the mm. And then
> does various memory allocations - and copying data to user space -
> under that long-term ref, which is presumably what causes all the
> latency issues.
> 
> But it doesn't actually seem to *need* a long-term ref to the mm. The
> seqfile interface is designed so that it should all be chunkable, and
> the locks and refs should be done at m_start/m_end time.
> 
> And the smap / maps m_start and m_end functions already *almost* seem
> to do that. They literally look up the task again with
> 
>         priv->task = get_proc_task(priv->inode);
> 
> etc, but then they do that odd
> 
>         lock_ctx = &priv->lock_ctx;
>         mm = lock_ctx->mm;
>         if (!mm || !mmget_not_zero(mm)) {
>                 put_task_struct(priv->task);
>                 priv->task = NULL;
>                 return NULL;
>         }
> 
> dance (where lock_ctx->mm is literally that long-term ref we hold).
> 
> And I don't see why they need to do any of this. I think it's all a
> historical accident.
> 
> Because I think it could look up the mm from the task pointer every
> time, without holding that long-term ref from proc_mem_open() at all.
> 
> IOW, at open time, we could save off the "this is the mm I opened",
> but *without* any refcount, and then just verify that "yes, the task
> mm still matches". No long-term refs anywhere.
> 
> Yes, yes, we'd need some sequence counter for when the mm changes due
> to execve, but *that* should be absolutely trivial.
> 
> And wouldn't that make all of this go away entirely? And probably
> clean up the code at the same time?
> 
> I think the only reason "proc_mem_open()" does what it does was that
> it was simple. Not because it was a good idea.
> 
>                    Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-16  5:47         ` Minchan Kim
@ 2026-05-16 16:31           ` Linus Torvalds
  2026-05-19 20:53             ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2026-05-16 16:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, 15 May 2026 at 22:47, Minchan Kim <minchan@kernel.org> wrote:
>
> Regarding proc_mem_open(), it actually operates very close to what you suggested.
> It acquires a reference to the mm_struct itself via mmgrab() but immediately
> unpins the address space memory via mmput(). Thus, no long-term mm_users
> reference is held across the open file descriptor.

Ahh, and we've actually done that since 2012. How time flies..

> The latency issue occurs during seqfile iteration (m_start/m_stop) in
> smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> temporarily acquires mm_users via mmget_not_zero() or get_task_mm().

Ok, so it's that much smaller region.

How about a completely different approach then - make exit_mmap() just
take the mmap_write_lock() properly, and allow walking the vma's
without ever grabbing mm_users at all?

IOW, just a mm_count ref would be sufficient to hold the mm_struct
around, and then the read-lock protects against exit_mm() actually
tearing the list down when the last "real" user goes away.

[ exit_mm() is currently a bit odd - it does take the mmap_write lock,
but it *starts* with the read-lock.

   I'm not sure why it does that - it used to do the write lock over
the whole sequence, but that was changed in commit bf3980c85212 ("mm:
drop oom code from exit_mmap").

  Sure, read-lock allows more concurrency, but that would seem to be a
complete non-issue for exit_mmap(), and switching locking seems to
just complicate things.

  But that's a separate issue that I just happened to notice while
looking at this ]

I may be missing something else again.

Also, I do really hate the smap code. People have optimized it because
it's so piggy, but that code is still just silly. The "rollup" case in
particular knows how bad it is, and does that whole "unlock and relock
under contention" because it knows it's a horrible latency pig.

Oh well. But it really feels like we *could* do this all without mm_users. No?

                  Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-16 16:31           ` Linus Torvalds
@ 2026-05-19 20:53             ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2026-05-19 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Sat, May 16, 2026 at 09:31:04AM -0700, Linus Torvalds wrote:
> On Fri, 15 May 2026 at 22:47, Minchan Kim <minchan@kernel.org> wrote:
> >
> > Regarding proc_mem_open(), it actually operates very close to what you suggested.
> > It acquires a reference to the mm_struct itself via mmgrab() but immediately
> > unpins the address space memory via mmput(). Thus, no long-term mm_users
> > reference is held across the open file descriptor.
> 
> Ahh, and we've actually done that since 2012. How time flies..
> 
> > The latency issue occurs during seqfile iteration (m_start/m_stop) in
> > smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> > temporarily acquires mm_users via mmget_not_zero() or get_task_mm().
> 
> Ok, so it's that much smaller region.
> 
> How about a completely different approach then - make exit_mmap() just
> take the mmap_write_lock() properly, and allow walking the vma's
> without ever grabbing mm_users at all?
> 
> IOW, just a mm_count ref would be sufficient to hold the mm_struct
> around, and then the read-lock protects against exit_mm() actually
> tearing the list down when the last "real" user goes away.
> 
> [ exit_mm() is currently a bit odd - it does take the mmap_write lock,
> but it *starts* with the read-lock.
> 
>    I'm not sure why it does that - it used to do the write lock over
> the whole sequence, but that was changed in commit bf3980c85212 ("mm:
> drop oom code from exit_mmap").
> 
>   Sure, read-lock allows more concurrency, but that would seem to be a
> complete non-issue for exit_mmap(), and switching locking seems to
> just complicate things.
> 
>   But that's a separate issue that I just happened to notice while
> looking at this ]
> 
> I may be missing something else again.

Hi Linus,

Sorry for the slow response.
Thank you for the incredibly detailed feedback and the suggestions.

Your proposal to avoid mm_users and synchronize via mmap_lock is an elegant
conceptual cleanup. However, from the perspective of userspace OOM recovery,
we hit two critical roadblocks that this alone cannot resolve:

First, the -ESRCH race remains unsolved.
Even if we don't grab mm_users, the victim process still clears its task->mm
to NULL early in exit_mm(). Here is the timing mismatch:

CPU A (Userspace OOM Killer)      CPU B (Victim Task)
----------------------------      -------------------
1. Sends SIGKILL
2. Victim receives SIGKILL
                                  do_exit()
                                    exit_mm()
                                      task->mm = NULL  <==== (Stops pinning mm)
                                      mmput()
3. Calls process_mrelease()
   (Looks up task->mm)
   (Sees NULL)
   Returns -ESRCH! <======================================== (Reaping fails!)

Without Jann's patch to preserve the mm pointer via task->exit_mm, the
userspace killer won't even have a chance to attempt reaping.

Second, the latency bottleneck transfers from mmput() to mmap_lock.
If a low-priority procfs reader is preempted or stalled while holding the
mmap_read_lock, the exiting process calling exit_mmap() will block indefinitely
when trying to acquire the mmap_write_lock.

Crucially, if this lock contention occurs, process_mrelease() itself would
also block on the same mmap_lock while trying to reap the memory, defeating the
synchronous and expedited nature of the API.

[An Alternative Proposal: Combining Kill and Reap via pidfd_send_signal()]

Taking a step back, I believe the fundamental issue stems from separating
the asynchronous "Kill" and synchronous "Reap" operations into two distinct
system calls. Because userspace cannot predict when the victim will execute
exit_mm(), the timing mismatch is practically unavoidable so the reaping
doesn't work in the end.

Since Christian understandably dislikes combining signaling semantics into
process_mrelease(), perhaps we could solve this from the signal side.

What if we introduce a new flag for pidfd_send_signal(), such as
PIDFD_SIGNAL_PROCESS_GROUP_EXPEDITE?

When invoked with this flag and SIGKILL, pidfd_send_signal() would deliver the
fatal signal and immediately trigger the oom_reaper's VM zapping on the target
mm within the same synchronous syscall context (where task->mm is guaranteed to
be valid and easily locked).

This would completely eliminate the -ESRCH race by making the kill-and-reap
operation atomic from userspace's perspective, while keeping each syscall
focused strictly on its primary responsibility (signaling vs. reclaiming)

Honestly, if we adopt this atomic interface, it might actually make the
separate process_mrelease() syscall obsolete. I am not entirely sure about
the historical reasons why they were split into two distinct APIs
in the first place, but merging them into a single pidfd-based atomic
operation seems much cleaner.

I would highly appreciate everyone's thoughts on this perspective and
alternative direction.

> 
> Also, I do really hate the smap code. People have optimized it because
> it's so piggy, but that code is still just silly. The "rollup" case in
> particular knows how bad it is, and does that whole "unlock and relock
> under contention" because it knows it's a horrible latency pig.

And yes, I completely agree with your frustration on the smaps code—it is
indeed a massive latency pig. In fact, userspace tools have increasingly moved
away from smaps and even PSS (Proportional Set Size) altogether because they
are simply too slow to be usable in production.

> 
> Oh well. But it really feels like we *could* do this all without mm_users. No?
> 
>                   Linus
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-05-19 20:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-05-15 14:41 ` Christian Brauner
2026-05-15 15:49   ` Oleg Nesterov
2026-05-15 20:15   ` Oleg Nesterov
2026-05-15 22:33     ` Minchan Kim
2026-05-15 23:45       ` Linus Torvalds
2026-05-16  0:00         ` Linus Torvalds
2026-05-16  5:47         ` Minchan Kim
2026-05-16 16:31           ` Linus Torvalds
2026-05-19 20:53             ` Minchan Kim
2026-05-15 20:42   ` Suren Baghdasaryan
2026-05-15 20:52   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox