[PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
@ 2026-05-11 21:42 Minchan Kim
  2026-05-15 14:41 ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Minchan Kim @ 2026-05-11 21:42 UTC (permalink / raw)
  To: akpm
  Cc: hca, linux-s390, david, mhocko, brauner, linux-mm, linux-kernel,
	surenb, timmurray, Minchan Kim

Currently, process_mrelease() requires userspace to send a SIGKILL signal
prior to invocation. This separation introduces a scheduling race window
where the victim task may receive the signal and enter the exit path
before the reaper can invoke process_mrelease().

When the victim enters the exit path (do_exit -> exit_mm), it clears its
task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
leaving the actual address space teardown (exit_mmap) to be deferred until
the mm's reference count drops to zero. In the field (e.g., Android),
arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
remote VM accesses) frequently delay this teardown indefinitely,
defeating the purpose of expedited reclamation.

In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
the system to unnecessarily kill additional innocent background apps before
the memory from the first victim is recovered.

This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
an integrated auto-kill mode. When specified, process_mrelease() directly
injects a SIGKILL into the target task after finding its mm.

To solve the race condition, we grab the mm reference via mmgrab() before
sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
it will free its memory and proceed with reaping, making the logic as simple
as reap = reap_kill || task_will_free_mem(p).

To handle shared address spaces, we deliver SIGKILL to all processes sharing
the same address space using do_pidfd_send_signal_pidns(). This ensures the
target pid resides inside the caller's PID namespace hierarchy prior to
signal delivery. We iterate over all processes sharing the mm and deliver
SIGKILL to each. If delivering the signal to any of the sharing processes
fails, we return an error. Note that this approach may leave partial
side-effects if some processes are killed successfully before a failure occurs.

Cc: Christian Brauner <brauner@kernel.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/signal.h    |  4 +++
 include/uapi/linux/mman.h |  4 +++
 kernel/signal.c           | 29 ++++++++++++++++++---
 mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
 4 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index f19816832f05..bdbe6b3addec 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
 
 struct timespec;
 struct pt_regs;
+struct mm_struct;
+struct pid;
 enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
@@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
 				struct task_struct *p, enum pid_type type);
 extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
 			       struct task_struct *p, enum pid_type type);
+extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+				      siginfo_t __user *info, unsigned int flags);
 extern int send_signal_locked(int sig, struct kernel_siginfo *info,
 			      struct task_struct *p, enum pid_type type);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..b2dc08a9bdd3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
 	return kill_pid_info_type(sig, &kinfo, pid, type);
 }
 
+/**
+ * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
+ *                              while validating PID namespace hierarchy.
+ * @pid:   the struct pid of the target process
+ * @sig:   signal to send
+ * @type:  scope of the signal (e.g. PIDTYPE_TGID)
+ * @info:  signal info payload
+ * @flags: signaling flags
+ *
+ * Verify that the target pid resides inside the caller's PID namespace
+ * hierarchy prior to signal delivery.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+			       siginfo_t __user *info, unsigned int flags)
+{
+	/* Enforce PID namespace hierarchy boundary */
+	if (!access_pidfd_pidns(pid))
+		return -EINVAL;
+
+	return do_pidfd_send_signal(pid, sig, type, info, flags);
+}
+
 /**
  * sys_pidfd_send_signal - Signal a process through a pidfd
  * @pidfd:  file descriptor of the process
@@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		if (IS_ERR(pid))
 			return PTR_ERR(pid);
 
-		if (!access_pidfd_pidns(pid))
-			return -EINVAL;
-
 		/* Infer scope from the type of pidfd. */
 		if (fd_file(f)->f_flags & PIDFD_THREAD)
 			type = PIDTYPE_PID;
 		else
 			type = PIDTYPE_TGID;
 
-		return do_pidfd_send_signal(pid, sig, type, info, flags);
+		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
 	}
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..253aa80770f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,7 @@
 
 #include <linux/oom.h>
 #include <linux/mm.h>
+#include <uapi/linux/mman.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
@@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
+/*
+ * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
+ * @victim: the targeted OOM process group leader
+ * @mm:     the virtual memory space being reaped
+ *
+ * Traverse all threads globally and signal any user processes sharing the identical
+ * mm footprints, ensuring no concurrent users pin the memory. Skips the system
+ * global init and kernel worker threads.
+ */
+static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
+{
+	struct task_struct *p;
+	bool failed = false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (is_global_init(p)) {
+			failed = true;
+			continue;
+		}
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+
+		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
+			failed = true;
+	}
+	rcu_read_unlock();
+
+	return failed ? -EBUSY : 0;
+}
+
 static void __oom_kill_process(struct task_struct *victim, const char *message)
 {
 	struct task_struct *p;
@@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	bool reap_kill;
 
-	if (flags)
+	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
 		return -EINVAL;
+	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
 
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task))
@@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	}
 
 	mm = p->mm;
-	mmgrab(mm);
 
-	if (task_will_free_mem(p))
-		reap = true;
-	else {
+	reap = reap_kill || task_will_free_mem(p);
+	if (!reap) {
 		/* Error only if the work has not been done already */
 		if (!mm_flags_test(MMF_OOM_SKIP, mm))
 			ret = -EINVAL;
+		task_unlock(p);
+		goto put_task;
 	}
+
+	mmgrab(mm);
 	task_unlock(p);
 
-	if (!reap)
-		goto drop_mm;
+	if (reap_kill) {
+		ret = kill_all_shared_mm(task, mm);
+		if (ret)
+			goto drop_mm;
+	}
 
 	if (mmap_read_lock_killable(mm)) {
 		ret = -EINTR;
-- 
2.54.0.563.g4f69b47b94-goog



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
@ 2026-05-15 14:41 ` Christian Brauner
  2026-05-15 15:49   ` Oleg Nesterov
                     ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Christian Brauner @ 2026-05-15 14:41 UTC (permalink / raw)
  To: Minchan Kim, Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: akpm, hca, linux-s390, david, mhocko, linux-mm, linux-kernel,
	surenb, timmurray

On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> Currently, process_mrelease() requires userspace to send a SIGKILL signal
> prior to invocation. This separation introduces a scheduling race window
> where the victim task may receive the signal and enter the exit path
> before the reaper can invoke process_mrelease().
> 
> When the victim enters the exit path (do_exit -> exit_mm), it clears its
> task->mm immediately. This causes process_mrelease() to fail with -ESRCH,

To be quite frank about the patch as written below: I think this is a
completely crazy api. And I really dislike it.

Right now we have clear and simple signal sending semantics for pidfds:

* thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
* thread-group pidfd -> thread-group directed signal

with specific overrides for pidfd_send_signal():

* PIDFD_SIGNAL_THREAD           -> only signal thread
* PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
* PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group

And now this patch aims to elevate process_mrelease() to a _signal
sending function_. And the semantics are complete special sauce too.

You are effectively introducing a custom signal scope that is mm-based
and then also plumbing it into a completely unrelated function that
should have absolutely nothing to do with this.

This is such Zebroid API. I really would hate to see it land.

This came up because of the ptrace bug that was just recently discovered
and that Linus fixed yesterday. This is another instance where I think
the correct fix is to keep task->mm around until the process is reaped
and then you can throw away all of the really ugly semantic in this
patch afaict. I'd really like to see that merged as it would also clean
up the ptrace code:

https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

> leaving the actual address space teardown (exit_mmap) to be deferred until
> the mm's reference count drops to zero. In the field (e.g., Android),
> arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> remote VM accesses) frequently delay this teardown indefinitely,
> defeating the purpose of expedited reclamation.
> 
> In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> the system to unnecessarily kill additional innocent background apps before
> the memory from the first victim is recovered.
> 
> This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> an integrated auto-kill mode. When specified, process_mrelease() directly
> injects a SIGKILL into the target task after finding its mm.
> 
> To solve the race condition, we grab the mm reference via mmgrab() before
> sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> it will free its memory and proceed with reaping, making the logic as simple
> as reap = reap_kill || task_will_free_mem(p).
> 
> To handle shared address spaces, we deliver SIGKILL to all processes sharing
> the same address space using do_pidfd_send_signal_pidns(). This ensures the
> target pid resides inside the caller's PID namespace hierarchy prior to
> signal delivery. We iterate over all processes sharing the mm and deliver
> SIGKILL to each. If delivering the signal to any of the sharing processes
> fails, we return an error. Note that this approach may leave partial
> side-effects if some processes are killed successfully before a failure occurs.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/signal.h    |  4 +++
>  include/uapi/linux/mman.h |  4 +++
>  kernel/signal.c           | 29 ++++++++++++++++++---
>  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
>  4 files changed, 81 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index f19816832f05..bdbe6b3addec 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
>  
>  struct timespec;
>  struct pt_regs;
> +struct mm_struct;
> +struct pid;
>  enum pid_type;
>  
>  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
>  				struct task_struct *p, enum pid_type type);
>  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
>  			       struct task_struct *p, enum pid_type type);
> +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> +				      siginfo_t __user *info, unsigned int flags);
>  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
>  			      struct task_struct *p, enum pid_type type);
>  extern int sigprocmask(int, sigset_t *, sigset_t *);
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index e89d00528f2f..4266976b45ad 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -56,4 +56,8 @@ struct cachestat {
>  	__u64 nr_recently_evicted;
>  };
>  
> +/* Flags for process_mrelease */
> +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d65d0fe24bfb..b2dc08a9bdd3 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
>  	return kill_pid_info_type(sig, &kinfo, pid, type);
>  }
>  
> +/**
> + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> + *                              while validating PID namespace hierarchy.
> + * @pid:   the struct pid of the target process
> + * @sig:   signal to send
> + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> + * @info:  signal info payload
> + * @flags: signaling flags
> + *
> + * Verify that the target pid resides inside the caller's PID namespace
> + * hierarchy prior to signal delivery.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> +			       siginfo_t __user *info, unsigned int flags)
> +{
> +	/* Enforce PID namespace hierarchy boundary */
> +	if (!access_pidfd_pidns(pid))
> +		return -EINVAL;
> +
> +	return do_pidfd_send_signal(pid, sig, type, info, flags);
> +}
> +
>  /**
>   * sys_pidfd_send_signal - Signal a process through a pidfd
>   * @pidfd:  file descriptor of the process
> @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>  		if (IS_ERR(pid))
>  			return PTR_ERR(pid);
>  
> -		if (!access_pidfd_pidns(pid))
> -			return -EINVAL;
> -
>  		/* Infer scope from the type of pidfd. */
>  		if (fd_file(f)->f_flags & PIDFD_THREAD)
>  			type = PIDTYPE_PID;
>  		else
>  			type = PIDTYPE_TGID;
>  
> -		return do_pidfd_send_signal(pid, sig, type, info, flags);
> +		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
>  	}
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5c6c95c169ee..253aa80770f2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -20,6 +20,7 @@
>  
>  #include <linux/oom.h>
>  #include <linux/mm.h>
> +#include <uapi/linux/mman.h>
>  #include <linux/err.h>
>  #include <linux/gfp.h>
>  #include <linux/sched.h>
> @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
>  	return ret;
>  }
>  
> +/*
> + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> + * @victim: the targeted OOM process group leader
> + * @mm:     the virtual memory space being reaped
> + *
> + * Traverse all threads globally and signal any user processes sharing the identical
> + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> + * global init and kernel worker threads.
> + */
> +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> +{
> +	struct task_struct *p;
> +	bool failed = false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (is_global_init(p)) {

You can't signal init in any shape or form any way. Why bother reporting
failure at all.

> +			failed = true;
> +			continue;
> +		}
> +		if (unlikely(p->flags & PF_KTHREAD))
> +			continue;
> +
> +		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> +			failed = true;
> +	}
> +	rcu_read_unlock();
> +
> +	return failed ? -EBUSY : 0;

Why are you returning EBUSY? This makes no sense imho.

> +}
> +
>  static void __oom_kill_process(struct task_struct *victim, const char *message)
>  {
>  	struct task_struct *p;
> @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	unsigned int f_flags;
>  	bool reap = false;
>  	long ret = 0;
> +	bool reap_kill;
>  
> -	if (flags)
> +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
>  		return -EINVAL;
> +	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
>  
>  	task = pidfd_get_task(pidfd, &f_flags);
>  	if (IS_ERR(task))
> @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	}
>  
>  	mm = p->mm;
> -	mmgrab(mm);
>  
> -	if (task_will_free_mem(p))
> -		reap = true;
> -	else {
> +	reap = reap_kill || task_will_free_mem(p);
> +	if (!reap) {
>  		/* Error only if the work has not been done already */
>  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
>  			ret = -EINVAL;
> +		task_unlock(p);
> +		goto put_task;
>  	}
> +
> +	mmgrab(mm);
>  	task_unlock(p);
>  
> -	if (!reap)
> -		goto drop_mm;
> +	if (reap_kill) {
> +		ret = kill_all_shared_mm(task, mm);
> +		if (ret)
> +			goto drop_mm;
> +	}
>  
>  	if (mmap_read_lock_killable(mm)) {
>  		ret = -EINTR;
> -- 
> 2.54.0.563.g4f69b47b94-goog
> 
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
@ 2026-05-15 15:49   ` Oleg Nesterov
  2026-05-15 20:15   ` Oleg Nesterov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2026-05-15 15:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Jann Horn, Linus Torvalds, akpm, hca, linux-s390,
	david, mhocko, linux-mm, linux-kernel, surenb, timmurray

On 05/15, Christian Brauner wrote:
>
> Right now we have clear and simple signal sending semantics for pidfds:
>
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal

I don't even understand why this patch uses pidfd to send SIGKILL...

> > +	reap = reap_kill || task_will_free_mem(p);
> > +	if (!reap) {
> >  		/* Error only if the work has not been done already */
> >  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >  			ret = -EINVAL;
> > +		task_unlock(p);
> > +		goto put_task;
> >  	}
> > +
> > +	mmgrab(mm);
> >  	task_unlock(p);
> >
> > -	if (!reap)
> > -		goto drop_mm;
> > +	if (reap_kill) {
> > +		ret = kill_all_shared_mm(task, mm);

If task_will_free_mem() above succeeded, we do not need to call
kill_all_shared_mm() ?

Oleg.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
  2026-05-15 15:49   ` Oleg Nesterov
@ 2026-05-15 20:15   ` Oleg Nesterov
  2026-05-15 22:33     ` Minchan Kim
  2026-05-15 20:42   ` Suren Baghdasaryan
  2026-05-15 20:52   ` Minchan Kim
  3 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2026-05-15 20:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Jann Horn, Linus Torvalds, akpm, hca, linux-s390,
	david, mhocko, linux-mm, linux-kernel, surenb, timmurray

In fact I don't even understand the motivation...

On 05/15, Christian Brauner wrote:
>
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,

Sure, get_task_cmdline() can delay mmput(). But indefinitely ?

Perhaps the changelog could be more clear? I don't see how any remote VM access
can pin mm->mm_users "indefinitely". Even if, say, a lot of threads read
/proc/<pid>/cmdline in an endless loop in parallel...

I must have missed something.

Oleg.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
  2026-05-15 15:49   ` Oleg Nesterov
  2026-05-15 20:15   ` Oleg Nesterov
@ 2026-05-15 20:42   ` Suren Baghdasaryan
  2026-05-15 20:52   ` Minchan Kim
  3 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2026-05-15 20:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Jann Horn, Linus Torvalds, Oleg Nesterov, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, timmurray

On Fri, May 15, 2026 at 7:41 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> >
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
>
> To be quite frank about the patch as written below: I think this is a
> completely crazy api. And I really dislike it.
>
> Right now we have clear and simple signal sending semantics for pidfds:
>
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal
>
> with specific overrides for pidfd_send_signal():
>
> * PIDFD_SIGNAL_THREAD           -> only signal thread
> * PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
> * PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group
>
> And now this patch aims to elevate process_mrelease() to a _signal
> sending function_. And the semantics are complete special sauce too.
>
> You are effectively introducing a custom signal scope that is mm-based
> and then also plumbing it into a completely unrelated function that
> should have absolutely nothing to do with this.
>
> This is such Zebroid API. I really would hate to see it land.
>
> This came up because of the ptrace bug that was just recently discovered
> and that Linus fixed yesterday. This is another instance where I think
> the correct fix is to keep task->mm around until the process is reaped
> and then you can throw away all of the really ugly semantic in this
> patch afaict. I'd really like to see that merged as it would also clean
> up the ptrace code:
>
> https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

Yeah, I think this patchset would help process_mrelease() find the mm
to reap even after the process has passed its exit_mm(). We just need
to add:

if (!mm)
        mm = task->exit_mm;

Inside process_mrelease(). I had a similar idea of stashing mm and the
approach seems sane to me. Keeping UAPI unchanged is another upside
IMHO.

>
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,
> > defeating the purpose of expedited reclamation.
> >
> > In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> > the system to unnecessarily kill additional innocent background apps before
> > the memory from the first victim is recovered.
> >
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task after finding its mm.
> >
> > To solve the race condition, we grab the mm reference via mmgrab() before
> > sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> > it will free its memory and proceed with reaping, making the logic as simple
> > as reap = reap_kill || task_will_free_mem(p).
> >
> > To handle shared address spaces, we deliver SIGKILL to all processes sharing
> > the same address space using do_pidfd_send_signal_pidns(). This ensures the
> > target pid resides inside the caller's PID namespace hierarchy prior to
> > signal delivery. We iterate over all processes sharing the mm and deliver
> > SIGKILL to each. If delivering the signal to any of the sharing processes
> > fails, we return an error. Note that this approach may leave partial
> > side-effects if some processes are killed successfully before a failure occurs.
> >
> > Cc: Christian Brauner <brauner@kernel.org>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/signal.h    |  4 +++
> >  include/uapi/linux/mman.h |  4 +++
> >  kernel/signal.c           | 29 ++++++++++++++++++---
> >  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
> >  4 files changed, 81 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index f19816832f05..bdbe6b3addec 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
> >
> >  struct timespec;
> >  struct pt_regs;
> > +struct mm_struct;
> > +struct pid;
> >  enum pid_type;
> >
> >  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> > @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
> >                               struct task_struct *p, enum pid_type type);
> >  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
> >                              struct task_struct *p, enum pid_type type);
> > +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +                                   siginfo_t __user *info, unsigned int flags);
> >  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> >                             struct task_struct *p, enum pid_type type);
> >  extern int sigprocmask(int, sigset_t *, sigset_t *);
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >       __u64 nr_recently_evicted;
> >  };
> >
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL   (1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS (PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..b2dc08a9bdd3 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
> >       return kill_pid_info_type(sig, &kinfo, pid, type);
> >  }
> >
> > +/**
> > + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> > + *                              while validating PID namespace hierarchy.
> > + * @pid:   the struct pid of the target process
> > + * @sig:   signal to send
> > + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> > + * @info:  signal info payload
> > + * @flags: signaling flags
> > + *
> > + * Verify that the target pid resides inside the caller's PID namespace
> > + * hierarchy prior to signal delivery.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +                            siginfo_t __user *info, unsigned int flags)
> > +{
> > +     /* Enforce PID namespace hierarchy boundary */
> > +     if (!access_pidfd_pidns(pid))
> > +             return -EINVAL;
> > +
> > +     return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +}
> > +
> >  /**
> >   * sys_pidfd_send_signal - Signal a process through a pidfd
> >   * @pidfd:  file descriptor of the process
> > @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >               if (IS_ERR(pid))
> >                       return PTR_ERR(pid);
> >
> > -             if (!access_pidfd_pidns(pid))
> > -                     return -EINVAL;
> > -
> >               /* Infer scope from the type of pidfd. */
> >               if (fd_file(f)->f_flags & PIDFD_THREAD)
> >                       type = PIDTYPE_PID;
> >               else
> >                       type = PIDTYPE_TGID;
> >
> > -             return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +             return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
> >       }
> >       }
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..253aa80770f2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >
> >  #include <linux/oom.h>
> >  #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> >  #include <linux/err.h>
> >  #include <linux/gfp.h>
> >  #include <linux/sched.h>
> > @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
> >       return ret;
> >  }
> >
> > +/*
> > + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> > + * @victim: the targeted OOM process group leader
> > + * @mm:     the virtual memory space being reaped
> > + *
> > + * Traverse all threads globally and signal any user processes sharing the identical
> > + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> > + * global init and kernel worker threads.
> > + */
> > +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> > +{
> > +     struct task_struct *p;
> > +     bool failed = false;
> > +
> > +     rcu_read_lock();
> > +     for_each_process(p) {
> > +             if (!process_shares_mm(p, mm))
> > +                     continue;
> > +             if (is_global_init(p)) {
>
> You can't signal init in any shape or form any way. Why bother reporting
> failure at all.
>
> > +                     failed = true;
> > +                     continue;
> > +             }
> > +             if (unlikely(p->flags & PF_KTHREAD))
> > +                     continue;
> > +
> > +             if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> > +                     failed = true;
> > +     }
> > +     rcu_read_unlock();
> > +
> > +     return failed ? -EBUSY : 0;
>
> Why are you returning EBUSY? This makes no sense imho.
>
> > +}
> > +
> >  static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  {
> >       struct task_struct *p;
> > @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >       unsigned int f_flags;
> >       bool reap = false;
> >       long ret = 0;
> > +     bool reap_kill;
> >
> > -     if (flags)
> > +     if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >               return -EINVAL;
> > +     reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> >
> >       task = pidfd_get_task(pidfd, &f_flags);
> >       if (IS_ERR(task))
> > @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >       }
> >
> >       mm = p->mm;
> > -     mmgrab(mm);
> >
> > -     if (task_will_free_mem(p))
> > -             reap = true;
> > -     else {
> > +     reap = reap_kill || task_will_free_mem(p);
> > +     if (!reap) {
> >               /* Error only if the work has not been done already */
> >               if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >                       ret = -EINVAL;
> > +             task_unlock(p);
> > +             goto put_task;
> >       }
> > +
> > +     mmgrab(mm);
> >       task_unlock(p);
> >
> > -     if (!reap)
> > -             goto drop_mm;
> > +     if (reap_kill) {
> > +             ret = kill_all_shared_mm(task, mm);
> > +             if (ret)
> > +                     goto drop_mm;
> > +     }
> >
> >       if (mmap_read_lock_killable(mm)) {
> >               ret = -EINTR;
> > --
> > 2.54.0.563.g4f69b47b94-goog
> >
> >


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 14:41 ` Christian Brauner
                     ` (2 preceding siblings ...)
  2026-05-15 20:42   ` Suren Baghdasaryan
@ 2026-05-15 20:52   ` Minchan Kim
  3 siblings, 0 replies; 9+ messages in thread
From: Minchan Kim @ 2026-05-15 20:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Linus Torvalds, Oleg Nesterov, akpm, hca, linux-s390,
	david, mhocko, linux-mm, linux-kernel, surenb, timmurray

On Fri, May 15, 2026 at 04:41:18PM +0200, Christian Brauner wrote:
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> > 
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
> 
> To be quite frank about the patch as written below: I think this is a
> completely crazy api. And I really dislike it.
> 
> Right now we have clear and simple signal sending semantics for pidfds:
> 
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal
> 
> with specific overrides for pidfd_send_signal():
> 
> * PIDFD_SIGNAL_THREAD           -> only signal thread
> * PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
> * PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group
> 
> And now this patch aims to elevate process_mrelease() to a _signal
> sending function_. And the semantics are complete special sauce too.
> 
> You are effectively introducing a custom signal scope that is mm-based
> and then also plumbing it into a completely unrelated function that
> should have absolutely nothing to do with this.
> 
> This is such Zebroid API. I really would hate to see it land.
> 
> This came up because of the ptrace bug that was just recently discovered
> and that Linus fixed yesterday. This is another instance where I think
> the correct fix is to keep task->mm around until the process is reaped
> and then you can throw away all of the really ugly semantic in this
> patch afaict. I'd really like to see that merged as it would also clean
> up the ptrace code:
> 
> https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

Thank you for pointing this out. I completely agree.

The only reason for adding PROCESS_MRELEASE_REAP_KILL and its complex signaling
was to work around task->mm being cleared early in exit_mm(), which causes
process_mrelease() to fail with -ESRCH.

If Jann's patch to keep task->mm around is merged, this race naturally vanishes.
Userspace can reliably send SIGKILL and call the normal process_mrelease()
with the exit_mm

I would be thrilled to see Jann's patch merged. It allows us to drop this
complex REAP_KILL patch entirely while still fully resolving the expedited
reclaim issue.

> 
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,
> > defeating the purpose of expedited reclamation.
> > 
> > In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> > the system to unnecessarily kill additional innocent background apps before
> > the memory from the first victim is recovered.
> > 
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task after finding its mm.
> > 
> > To solve the race condition, we grab the mm reference via mmgrab() before
> > sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> > it will free its memory and proceed with reaping, making the logic as simple
> > as reap = reap_kill || task_will_free_mem(p).
> > 
> > To handle shared address spaces, we deliver SIGKILL to all processes sharing
> > the same address space using do_pidfd_send_signal_pidns(). This ensures the
> > target pid resides inside the caller's PID namespace hierarchy prior to
> > signal delivery. We iterate over all processes sharing the mm and deliver
> > SIGKILL to each. If delivering the signal to any of the sharing processes
> > fails, we return an error. Note that this approach may leave partial
> > side-effects if some processes are killed successfully before a failure occurs.
> > 
> > Cc: Christian Brauner <brauner@kernel.org>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/signal.h    |  4 +++
> >  include/uapi/linux/mman.h |  4 +++
> >  kernel/signal.c           | 29 ++++++++++++++++++---
> >  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
> >  4 files changed, 81 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index f19816832f05..bdbe6b3addec 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
> >  
> >  struct timespec;
> >  struct pt_regs;
> > +struct mm_struct;
> > +struct pid;
> >  enum pid_type;
> >  
> >  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> > @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
> >  				struct task_struct *p, enum pid_type type);
> >  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
> >  			       struct task_struct *p, enum pid_type type);
> > +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +				      siginfo_t __user *info, unsigned int flags);
> >  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> >  			      struct task_struct *p, enum pid_type type);
> >  extern int sigprocmask(int, sigset_t *, sigset_t *);
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >  	__u64 nr_recently_evicted;
> >  };
> >  
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..b2dc08a9bdd3 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
> >  	return kill_pid_info_type(sig, &kinfo, pid, type);
> >  }
> >  
> > +/**
> > + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> > + *                              while validating PID namespace hierarchy.
> > + * @pid:   the struct pid of the target process
> > + * @sig:   signal to send
> > + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> > + * @info:  signal info payload
> > + * @flags: signaling flags
> > + *
> > + * Verify that the target pid resides inside the caller's PID namespace
> > + * hierarchy prior to signal delivery.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +			       siginfo_t __user *info, unsigned int flags)
> > +{
> > +	/* Enforce PID namespace hierarchy boundary */
> > +	if (!access_pidfd_pidns(pid))
> > +		return -EINVAL;
> > +
> > +	return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +}
> > +
> >  /**
> >   * sys_pidfd_send_signal - Signal a process through a pidfd
> >   * @pidfd:  file descriptor of the process
> > @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  		if (IS_ERR(pid))
> >  			return PTR_ERR(pid);
> >  
> > -		if (!access_pidfd_pidns(pid))
> > -			return -EINVAL;
> > -
> >  		/* Infer scope from the type of pidfd. */
> >  		if (fd_file(f)->f_flags & PIDFD_THREAD)
> >  			type = PIDTYPE_PID;
> >  		else
> >  			type = PIDTYPE_TGID;
> >  
> > -		return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
> >  	}
> >  	}
> >  
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..253aa80770f2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >  
> >  #include <linux/oom.h>
> >  #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> >  #include <linux/err.h>
> >  #include <linux/gfp.h>
> >  #include <linux/sched.h>
> > @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> > + * @victim: the targeted OOM process group leader
> > + * @mm:     the virtual memory space being reaped
> > + *
> > + * Traverse all threads globally and signal any user processes sharing the identical
> > + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> > + * global init and kernel worker threads.
> > + */
> > +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> > +{
> > +	struct task_struct *p;
> > +	bool failed = false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		if (!process_shares_mm(p, mm))
> > +			continue;
> > +		if (is_global_init(p)) {
> 
> You can't signal init in any shape or form any way. Why bother reporting
> failure at all.
> 
> > +			failed = true;
> > +			continue;
> > +		}
> > +		if (unlikely(p->flags & PF_KTHREAD))
> > +			continue;
> > +
> > +		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> > +			failed = true;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return failed ? -EBUSY : 0;
> 
> Why are you returning EBUSY? This makes no sense imho.
> 
> > +}
> > +
> >  static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  {
> >  	struct task_struct *p;
> > @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	unsigned int f_flags;
> >  	bool reap = false;
> >  	long ret = 0;
> > +	bool reap_kill;
> >  
> > -	if (flags)
> > +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >  		return -EINVAL;
> > +	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> >  
> >  	task = pidfd_get_task(pidfd, &f_flags);
> >  	if (IS_ERR(task))
> > @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	}
> >  
> >  	mm = p->mm;
> > -	mmgrab(mm);
> >  
> > -	if (task_will_free_mem(p))
> > -		reap = true;
> > -	else {
> > +	reap = reap_kill || task_will_free_mem(p);
> > +	if (!reap) {
> >  		/* Error only if the work has not been done already */
> >  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >  			ret = -EINVAL;
> > +		task_unlock(p);
> > +		goto put_task;
> >  	}
> > +
> > +	mmgrab(mm);
> >  	task_unlock(p);
> >  
> > -	if (!reap)
> > -		goto drop_mm;
> > +	if (reap_kill) {
> > +		ret = kill_all_shared_mm(task, mm);
> > +		if (ret)
> > +			goto drop_mm;
> > +	}
> >  
> >  	if (mmap_read_lock_killable(mm)) {
> >  		ret = -EINTR;
> > -- 
> > 2.54.0.563.g4f69b47b94-goog
> > 
> > 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 20:15   ` Oleg Nesterov
@ 2026-05-15 22:33     ` Minchan Kim
  2026-05-15 23:45       ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Minchan Kim @ 2026-05-15 22:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Jann Horn, Linus Torvalds, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, May 15, 2026 at 10:15:53PM +0200, Oleg Nesterov wrote:
> In fact I don't even understand the motivation...
> 
> On 05/15, Christian Brauner wrote:
> >
> > On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > > leaving the actual address space teardown (exit_mmap) to be deferred until
> > > the mm's reference count drops to zero. In the field (e.g., Android),
> > > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > > remote VM accesses) frequently delay this teardown indefinitely,
> 
> Sure, get_task_cmdline() can delay mmput(). But indefinitely ?
> 
> Perhaps the changelog could be more clear? I don't see how any remote VM access
> can pin mm->mm_users "indefinitely". Even if, say, a lot of threads read
> /proc/<pid>/cmdline in an endless loop in parallel...
> 
> I must have missed something.

Thank you for the review and questions. You are entirely right that under normal
uncongested conditions, a /proc reader drops mmput() quickly.

However, on any heavily loaded system under severe memory/CPU pressure, this delay
can be long enough to cause cascading issues. Here is exactly how this occurs
and why it acts as an indefinite delay from an emergency reclaim perspective.

When memory pressure is critical, a userspace OOM killer terminates a large
victim process. Simultaneously, another process (such as a monitoring tool) is
reading /proc/<pid>/smaps or cmdline. Because the system is heavily loaded, the
reader thread on CPU C can get preempted or blocked while holding mmget().

When the dying victim executes exit_mm(), mm_users drops from 2 to 1. Thus,
exit_mmap() does not run. For hundreds of milliseconds or seconds, the memory
remains fully trapped. The userspace OOM policy sees that memory is still
critically low and unnecessarily kills additional innocent processes.

Here is the exact timing chart illustrating the existing problem and why
process_mrelease() fails in this scenario:

CPU A (Userspace OOM Killer)      CPU B (Victim Task)      CPU C (/proc Reader)
----------------------------      -------------------      --------------------
						       open(/proc/pid/smaps)
						       get_task_mm()
						       [mm_users++ => 2]
						       (Preempted/Stalled)
							          |
1. Sends SIGKILL                                                  |
2. Victim receives SIGKILL                                        |
			      do_exit()                           |
				exit_mm()                         |
				  task->mm = NULL                 |
				  mmput() [mm_users => 1]         |
				  (Memory NOT freed!)             |
							          |
3. Calls process_mrelease()                                       |
							          |
find_lock_task_mm() sees task->mm == NULL                         |
Returns -ESRCH. Reaping fails!                                    |
(Memory remains trapped until CPU C finally finishes!) <==========/

I hope thisclarifies the motivation and mechanics behind this issue.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 22:33     ` Minchan Kim
@ 2026-05-15 23:45       ` Linus Torvalds
  2026-05-16  0:00         ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2026-05-15 23:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, 15 May 2026 at 15:33, Minchan Kim <minchan@kernel.org> wrote:
>
> I hope thisclarifies the motivation and mechanics behind this issue.

This still seems very hacky and a complete special case for one odd situation.

This all sounds like it's just because smap is a pig.

And yes, smap *is* a pig, but it should be trivial to just fix smap
for this case - fix the problem spot, don't add new horrid logic
elsewhere.

I really think the fix is to fix smap instead.

And I think smap is doing odd things. For example, it does

   pid_smaps_open -> do_maps_open -> proc_mem_open

and that proc_mem_open() takes that long-term ref to the mm. And then
does various memory allocations - and copying data to user space -
under that long-term ref, which is presumably what causes all the
latency issues.

But it doesn't actually seem to *need* a long-term ref to the mm. The
seqfile interface is designed so that it should all be chunkable, and
the locks and refs should be done at m_start/m_end time.

And the smap / maps m_start and m_end functions already *almost* seem
to do that. They literally look up the task again with

        priv->task = get_proc_task(priv->inode);

etc, but then they do that odd

        lock_ctx = &priv->lock_ctx;
        mm = lock_ctx->mm;
        if (!mm || !mmget_not_zero(mm)) {
                put_task_struct(priv->task);
                priv->task = NULL;
                return NULL;
        }

dance (where lock_ctx->mm is literally that long-term ref we hold).

And I don't see why they need to do any of this. I think it's all a
historical accident.

Because I think it could look up the mm from the task pointer every
time, without holding that long-term ref from proc_mem_open() at all.

IOW, at open time, we could save off the "this is the mm I opened",
but *without* any refcount, and then just verify that "yes, the task
mm still matches". No long-term refs anywhere.

Yes, yes, we'd need some sequence counter for when the mm changes due
to execve, but *that* should be absolutely trivial.

And wouldn't that make all of this go away entirely? And probably
clean up the code at the same time?

I think the only reason "proc_mem_open()" does what it does was that
it was simple. Not because it was a good idea.

                   Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-15 23:45       ` Linus Torvalds
@ 2026-05-16  0:00         ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2026-05-16  0:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm, hca,
	linux-s390, david, mhocko, linux-mm, linux-kernel, surenb,
	timmurray

On Fri, 15 May 2026 at 16:45, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes, yes, we'd need some sequence counter for when the mm changes due
> to execve, but *that* should be absolutely trivial.

I guess we already  have it in the form of "self_exec_id". So the
tuple <mm,self_exec_id> should basically guarantee it's the same mm as
it was at open time.

Except it's not updated at quite the right time. If it was done inside
the task lock when task->mm was actually set in exec_mmap(), it would
work as-is.

This isn't exactly what that 'self_exec_id' was designed for, but it
*is* a sequence number for the mm of a task. So if a 'mm' pointer
value gets reused, you can tell that it's not the same mm as it used
to be by looking at the exec_id.

And I think the only reason for it being done outside the task lock is
that the current use is all synchronous to the task itself, so locking
or location simply didn't matter.

But there's some mqueue signal handling thing that looks like it's
actually violating that situation and would race with execve().

            Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-05-16  0:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-05-15 14:41 ` Christian Brauner
2026-05-15 15:49   ` Oleg Nesterov
2026-05-15 20:15   ` Oleg Nesterov
2026-05-15 22:33     ` Minchan Kim
2026-05-15 23:45       ` Linus Torvalds
2026-05-16  0:00         ` Linus Torvalds
2026-05-15 20:42   ` Suren Baghdasaryan
2026-05-15 20:52   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox