[PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP

public inbox for linux-s390@vger.kernel.org

* [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
@ 2026-04-29 21:13 Minchan Kim
  2026-04-30  9:55 ` Michal Hocko
  2026-04-30 14:43 ` Andrew Morton
  0 siblings, 2 replies; 12+ messages in thread
From: Minchan Kim @ 2026-04-29 21:13 UTC (permalink / raw)
  To: akpm
  Cc: hca, linux-s390, david, mhocko, brauner, linux-mm, linux-kernel,
	surenb, timmurray, Minchan Kim

Currently, process_mrelease() requires userspace to send a SIGKILL signal
prior to invocation. This separation introduces a scheduling race window
where the victim task may receive the signal and enter the exit path
before the reaper can invoke process_mrelease().

When the victim enters the exit path (do_exit -> exit_mm), it clears its
task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
leaving the actual address space teardown (exit_mmap) to be deferred until
the mm's reference count drops to zero. In the field (e.g., Android),
arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
remote VM accesses) frequently delay this teardown indefinitely,
defeating the purpose of expedited reclamation.

In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
the system to unnecessarily kill additional innocent background apps before
the memory from the first victim is recovered.
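For illustration, the two-step sequence that userspace performs today can be sketched as below. This is a minimal sketch and is not taken from LMKD: the helper name kill_then_release() is hypothetical, and the fallback syscall numbers assume the generic syscall table.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback numbers (generic syscall table) for older libc headers. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/*
 * Hypothetical helper: kill the victim, then ask the kernel to reap it.
 * Returns 0 on success or a negative errno value.
 */
static int kill_then_release(int pidfd)
{
	/* Step 1: deliver SIGKILL through the pidfd. */
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
		return -errno;

	/*
	 * Race window: the victim may already run do_exit() -> exit_mm()
	 * here and clear task->mm before step 2 is reached.
	 */

	/* Step 2: expedite reclaim; fails with ESRCH if the race was lost. */
	if (syscall(SYS_process_mrelease, pidfd, 0) < 0)
		return -errno;

	return 0;
}
```

A caller that sees -ESRCH from the second step cannot tell when the memory will actually come back, which is the situation described above.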

This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
an integrated auto-kill mode. When specified, process_mrelease() directly
injects a SIGKILL into the target task after finding its mm.
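From the caller's side, the integrated mode would collapse kill and reap into one call. The sketch below assumes the flag value from this patch (it is not in any released header) and uses a hypothetical helper name, kill_and_release().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback number (generic syscall table) for older libc headers. */
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/* Value proposed by this patch; an assumption until it is merged. */
#ifndef PROCESS_MRELEASE_REAP_KILL
#define PROCESS_MRELEASE_REAP_KILL (1 << 0)
#endif

/*
 * Hypothetical helper: kill and reap in a single syscall.
 * Returns 0 on success or a negative errno value. A kernel without
 * this patch rejects any nonzero flags with EINVAL.
 */
static int kill_and_release(int pidfd)
{
	if (syscall(SYS_process_mrelease, pidfd,
		    PROCESS_MRELEASE_REAP_KILL) < 0)
		return -errno;
	return 0;
}
```

Note that with the semantic proposed here -EINVAL is ambiguous: it can mean an unpatched kernel or a target whose mm is marked MMF_MULTIPROCESS.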

To solve the race condition, we grab the mm reference via mmgrab() before
sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
it will free its memory and proceed with reaping, making the logic as simple
as reap = reap_kill || task_will_free_mem(p).

To handle shared address spaces safely in the auto-kill mode, we bail out
immediately if the mm is marked with MMF_MULTIPROCESS when
PROCESS_MRELEASE_REAP_KILL is specified. This protects existing users of
process_mrelease() from behavior changes while preventing unsafe reaping of
shared memory.
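The resulting decision logic can be modeled in plain userspace C. This is a toy model for discussion, not kernel code: struct victim_state and mrelease_decide() are hypothetical stand-ins for the state the syscall inspects under task_lock().

```c
#include <errno.h>
#include <stdbool.h>

#ifndef PROCESS_MRELEASE_REAP_KILL
#define PROCESS_MRELEASE_REAP_KILL	(1u << 0)
#endif
#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)

/* Hypothetical snapshot of what the syscall checks. */
struct victim_state {
	bool mm_multiprocess;	/* models MMF_MULTIPROCESS */
	bool will_free_mem;	/* models task_will_free_mem(p) */
	bool oom_skip;		/* models MMF_OOM_SKIP */
};

/*
 * Returns 1 when the caller should go on to reap, 0 when the work was
 * already done, and a negative errno value when the request is refused.
 */
static int mrelease_decide(unsigned int flags, const struct victim_state *v)
{
	bool reap_kill;

	/* Unknown flag bits are rejected up front. */
	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
		return -EINVAL;
	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);

	/* Auto-kill mode refuses shared address spaces. */
	if (reap_kill && v->mm_multiprocess)
		return -EINVAL;

	/* reap = reap_kill || task_will_free_mem(p) */
	if (reap_kill || v->will_free_mem)
		return 1;

	/* Error only if the work has not been done already. */
	return v->oom_skip ? 0 : -EINVAL;
}
```

With flags == 0 the model behaves like the existing syscall, which is the "no behavior change for existing users" property claimed above.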

This policy differs from the global OOM killer, which kills all processes
sharing the same mm to guarantee memory reclamation at all costs (preventing
system hangs). However, process_mrelease() is invoked by userspace policy.
If it fails due to sharing, userspace can simply adapt and select another
victim process (such as another background app in Android case) to release
memory. We do not need to force success or affect processes that were not
targeted.

Fundamentally, this allows process_mrelease() to trigger targeted memory
reclaim (via oom_reaper infrastructure) quickly, even if the victim is
not yet in the exit path, while reusing existing race handling between
reaper and exit_mmap.

Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/uapi/linux/mman.h |  4 ++++
 mm/oom_kill.c             | 27 ++++++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };

+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..efa6541b1c47 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,7 @@

 #include <linux/oom.h>
 #include <linux/mm.h>
+#include <uapi/linux/mman.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
@@ -1217,9 +1218,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	bool reap_kill;

-	if (flags)
+	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
 		return -EINVAL;
+	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);

 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task))
@@ -1236,19 +1239,29 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	}

 	mm = p->mm;
-	mmgrab(mm);
+	if (reap_kill && mm_flags_test(MMF_MULTIPROCESS, mm)) {
+		ret = -EINVAL;
+		task_unlock(p);
+		goto put_task;
+	}

-	if (task_will_free_mem(p))
-		reap = true;
-	else {
+	reap = reap_kill || task_will_free_mem(p);
+	if (!reap) {
 		/* Error only if the work has not been done already */
 		if (!mm_flags_test(MMF_OOM_SKIP, mm))
 			ret = -EINVAL;
+		task_unlock(p);
+		goto put_task;
 	}
+
+	mmgrab(mm);
 	task_unlock(p);

-	if (!reap)
-		goto drop_mm;
+	if (reap_kill) {
+		ret = kill_pid(task_tgid(task), SIGKILL, 0);
+		if (ret)
+			goto drop_mm;
+	}

 	if (mmap_read_lock_killable(mm)) {
 		ret = -EINTR;
-- 
2.54.0.545.g6539524ca2-goog

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-29 21:13 [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
@ 2026-04-30  9:55 ` Michal Hocko
  2026-05-01 21:17   ` Minchan Kim
  2026-04-30 14:43 ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2026-04-30  9:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, hca, linux-s390, david, brauner, linux-mm, linux-kernel,
	surenb, timmurray

On Wed 29-04-26 14:13:59, Minchan Kim wrote:
> This policy differs from the global OOM killer, which kills all processes
> sharing the same mm to guarantee memory reclamation at all costs (preventing
> system hangs).

Incorrect, we do the same for the memcg OOM killer as well. This is not
about preventing system hangs. But rather to

> However, process_mrelease() is invoked by userspace policy.
> If it fails due to sharing, userspace can simply adapt and select another
> victim process (such as another background app in Android case) to release
> memory. We do not need to force success or affect processes that were not
> targeted.

This is a wrong justification for the proposed semantic. You seem to be
assuming this is just fine, rather than showing that the alternative
would be problematic for reasons a), b) and c). If there are no strong
reasons _against_ following the global policy then we should stick with
it. There are very good reasons why we are doing that on the global level.

If for no other reason, then because the proposed semantic severely
cripples the shared MM case. You are left with a racy
kill-then-call-process_mrelease approach. You certainly do not want to
allow a simple way for tasks to evade your LMK, do you? So "just choose
something else" is a very bad approach.

So unless you are aware of specific reason(s) why collective kill is
clearly incorrect behavior, I believe the proper way is to kill
all processes sharing the mm (unless you are crossing any security
boundary when doing that).
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-30  9:55 ` Michal Hocko
@ 2026-05-01 21:17   ` Minchan Kim
  2026-05-04  7:51     ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2026-05-01 21:17 UTC (permalink / raw)
  To: Michal Hocko, Christian Brauner
  Cc: akpm, hca, linux-s390, david, brauner, linux-mm, linux-kernel,
	surenb, timmurray

On Thu, Apr 30, 2026 at 11:55:54AM +0200, Michal Hocko wrote:
> On Wed 29-04-26 14:13:59, Minchan Kim wrote:
> > This policy differs from the global OOM killer, which kills all processes
> > sharing the same mm to guarantee memory reclamation at all costs (preventing
> > system hangs).
> 
> Incorrect, we do the same for memcg OOM killer as well. This is not
> about preventing system hangs. But rather to
> 
> > However, process_mrelease() is invoked by userspace policy.
> > If it fails due to sharing, userspace can simply adapt and select another
> > victim process (such as another background app in Android case) to release
> > memory. We do not need to force success or affect processes that were not
> > targeted.
> 
> This is a wrong justification for the proposed semantic. You seem to be
> assuming this is just fine rather than this would be problematic for
> reasons a), b) and c). If there are no strong reasons _against_
> following the global policy then we should stick with it. There are very
> good reasons why we are doing that on the global level.
> 
> If for no other reasons then the proposed semantic severely cripples the
> shared MM case. You are left with a racy kill and call process_mrelease
> approach. You certainly do not want to allow a simple way for tasks to
> evade your LMK, do you? So just choose something else is a very bad
> approach.
> 
> So unless you are aware of a specific reason(s) where collective kill is a
> clearly an incorrect behavior then I believe the proper way is to kill
> all processes sharing the mm (unless you are crossing any security
> boundary when doing that).

I agree that in the case of a global or memcg OOM, the kernel deals with an
emergency, system-wide crisis where killing all sibling processes sharing
the same mm is an absolute necessity for system survival, bypassing
user-space privilege screening.

However, process_mrelease() is an explicit user-space initiated system call,
and I am still hesitant to place that same raw, destructive policy blindly
at the UAPI syscall level even though I don't know of any known security
issues right now.

If we really want to go that way for the collective kill, at least, we should
evaluate signal authorization (kill permission) against *every single*
sibling process beforehand instead of only the target task of
process_mrelease. Do you agree?

Also, I wonder what the signal/process maintainer thinks about this approach.
Christian Brauner <brauner@kernel.org>?

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-01 21:17   ` Minchan Kim
@ 2026-05-04  7:51     ` Michal Hocko
  2026-05-05  5:04       ` Minchan Kim
  2026-05-05  9:30       ` Christian Brauner
  0 siblings, 2 replies; 12+ messages in thread
From: Michal Hocko @ 2026-05-04  7:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Christian Brauner, akpm, hca, linux-s390, david, linux-mm,
	linux-kernel, surenb, timmurray

On Fri 01-05-26 14:17:50, Minchan Kim wrote:
> On Thu, Apr 30, 2026 at 11:55:54AM +0200, Michal Hocko wrote:
> > On Wed 29-04-26 14:13:59, Minchan Kim wrote:
> > > This policy differs from the global OOM killer, which kills all processes
> > > sharing the same mm to guarantee memory reclamation at all costs (preventing
> > > system hangs).
> > 
> > Incorrect, we do the same for memcg OOM killer as well. This is not
> > about preventing system hangs. But rather to
> > 
> > > However, process_mrelease() is invoked by userspace policy.
> > > If it fails due to sharing, userspace can simply adapt and select another
> > > victim process (such as another background app in Android case) to release
> > > memory. We do not need to force success or affect processes that were not
> > > targeted.
> > 
> > This is a wrong justification for the proposed semantic. You seem to be
> > assuming this is just fine rather than this would be problematic for
> > reasons a), b) and c). If there are no strong reasons _against_
> > following the global policy then we should stick with it. There are very
> > good reasons why we are doing that on the global level.
> > 
> > If for no other reasons then the proposed semantic severely cripples the
> > shared MM case. You are left with a racy kill and call process_mrelease
> > approach. You certainly do not want to allow a simple way for tasks to
> > evade your LMK, do you? So just choose something else is a very bad
> > approach.
> > 
> > So unless you are aware of a specific reason(s) where collective kill is a
> > clearly an incorrect behavior then I believe the proper way is to kill
> > all processes sharing the mm (unless you are crossing any security
> > boundary when doing that).
> 
> I agree that in the case of a global or memcg OOM, the kernel deals with an
> emergency, system-wide crisis where killing all sibling processes sharing
> the same mm is an absolute necessity for system survival, bypassing
> user-space privilege screening.

You are misinterpreting or missing my point. I am not suggesting to
cross privilege boundaries. The syscall should fail if the mm is shared
with tasks the caller cannot kill (same as it does now).
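The semantic described here (refuse the whole call when any task sharing the mm is not killable by the caller) could be modeled as a two-pass check. This is a toy userspace model for discussion only: struct proc and kill_mm_group() are hypothetical, and a real kernel implementation would also have to handle tasks appearing or changing credentials between the passes, which this sketch ignores.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a process attached to some address space. */
struct proc {
	int mm_id;	/* processes with equal mm_id share an mm */
	bool may_kill;	/* caller holds kill permission over it */
	bool killed;	/* set once SIGKILL would have been sent */
};

/*
 * All-or-nothing kill of every process sharing mm_id.
 * Returns 0 on success, -EPERM if any sharer is off limits; in the
 * failure case no process is marked as killed.
 */
static int kill_mm_group(struct proc *procs, size_t n, int mm_id)
{
	size_t i;

	/* Pass 1: refuse the operation if any sharer cannot be killed. */
	for (i = 0; i < n; i++)
		if (procs[i].mm_id == mm_id && !procs[i].may_kill)
			return -EPERM;

	/* Pass 2: every sharer is killable, so take them all down. */
	for (i = 0; i < n; i++)
		if (procs[i].mm_id == mm_id)
			procs[i].killed = true;

	return 0;
}
```

Checking before killing avoids the partial outcome where some sharers die but the memory is still pinned by a survivor.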

> However, process_mrelease() is an explicit user-space initiated system call,
> and I am still hesitant to place that same raw, destructive policy blindly
> at the UAPI syscall level even though I don't know of any known security
> issues right now.

This is a very wrong argument for introducing a potentially crippled
syscall semantic.
 
> If we really want to go that way for the collective kill, at least, we should
> evaluate signal authorization (kill permission) against *every single*
> sibling process beforehand instead of only the target task of
> process_mrelease. Do you agree?

This is what I've proposed already.

> Also, I wonder what the signal/process maintainer thinks about this approach.
> Christian Brauner <brauner@kernel.org>?

Yes, this makes sense. There might be a very good reason why we might
not want to introduce a way to kill across thread groups from userspace
when they share an mm. I do not see any as long as you keep the proper
permissions for all affected tasks. Maybe we cannot do that sanely now.
But these reasons have to be properly documented. Your whole argument
that this is different from in-kernel oom killing is just not valid.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-04  7:51     ` Michal Hocko
@ 2026-05-05  5:04       ` Minchan Kim
  2026-05-05  9:30       ` Christian Brauner
  1 sibling, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2026-05-05  5:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christian Brauner, akpm, hca, linux-s390, david, linux-mm,
	linux-kernel, surenb, timmurray

On Mon, May 04, 2026 at 09:51:35AM +0200, Michal Hocko wrote:
> On Fri 01-05-26 14:17:50, Minchan Kim wrote:
> > On Thu, Apr 30, 2026 at 11:55:54AM +0200, Michal Hocko wrote:
> > > On Wed 29-04-26 14:13:59, Minchan Kim wrote:
> > > > This policy differs from the global OOM killer, which kills all processes
> > > > sharing the same mm to guarantee memory reclamation at all costs (preventing
> > > > system hangs).
> > > 
> > > Incorrect, we do the same for memcg OOM killer as well. This is not
> > > about preventing system hangs. But rather to
> > > 
> > > > However, process_mrelease() is invoked by userspace policy.
> > > > If it fails due to sharing, userspace can simply adapt and select another
> > > > victim process (such as another background app in Android case) to release
> > > > memory. We do not need to force success or affect processes that were not
> > > > targeted.
> > > 
> > > This is a wrong justification for the proposed semantic. You seem to be
> > > assuming this is just fine rather than this would be problematic for
> > > reasons a), b) and c). If there are no strong reasons _against_
> > > following the global policy then we should stick with it. There are very
> > > good reasons why we are doing that on the global level.
> > > 
> > > If for no other reasons then the proposed semantic severely cripples the
> > > shared MM case. You are left with a racy kill and call process_mrelease
> > > approach. You certainly do not want to allow a simple way for tasks to
> > > evade your LMK, do you? So just choose something else is a very bad
> > > approach.
> > > 
> > > So unless you are aware of a specific reason(s) where collective kill is a
> > > clearly an incorrect behavior then I believe the proper way is to kill
> > > all processes sharing the mm (unless you are crossing any security
> > > boundary when doing that).
> > 
> > I agree that in the case of a global or memcg OOM, the kernel deals with an
> > emergency, system-wide crisis where killing all sibling processes sharing
> > the same mm is an absolute necessity for system survival, bypassing
> > user-space privilege screening.
> 
> You are misinterpreting or missing my point. I am not suggesting to
> cross privilege boundaries. The syscall should fail if the mm is shared
> with tasks the caller cannot kill (same as it does now).
> 
> > However, process_mrelease() is an explicit user-space initiated system call,
> > and I am still hesitant to place that same raw, destructive policy blindly
> > at the UAPI syscall level even though I don't know of any known security
> > issues right now.
> 
> This is very wrong argument to introduce a potentially crippled syscall
> semantic.
>  
> > If we really want to go that way for the collective kill, at least, we should
> > evaluate signal authorization (kill permission) against *every single*
> > sibling process beforehand instead of only the target task of
> > process_mrelease. Do you agree?
> 
> This is what I've proposed already.

Sounds good.

One thing to note is that this approach is still not perfect, as some sibling
processes sharing the mm might not be killed due to different UID or SELinux
policies while others are. In such cases, the actual memory reaping via
process_mrelease() will still fail anyway.

If we are okay with this limitation, meaning it acts as a best-effort
approach where we might end up killing some processes without successfully
releasing the memory, then I can proceed with this design.

> 
> > Also, I wonder what the signal/process maintainer thinks about this approach.
> > Christian Brauner <brauner@kernel.org>?
> 
> Yes, this makes sense. There might be a very good reason why we might
> not want to introduce a way to kill cross thread groups when they share
> mm from userspace. I do not see any as long as you keep the proper
> permissions for all affected tasks. Maybe we cannot do that sanely now.
> But these reasons have to be properly documented. Your whole argument
> that this is different from in-kernel oom killing is just not valid.

Okay, let's wait for any valid reasons or concerns they might raise.

Thanks.

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-04  7:51     ` Michal Hocko
  2026-05-05  5:04       ` Minchan Kim
@ 2026-05-05  9:30       ` Christian Brauner
  2026-05-05 16:03         ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2026-05-05  9:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, akpm, hca, linux-s390, david, linux-mm, linux-kernel,
	surenb, timmurray

On Mon, May 04, 2026 at 09:51:35AM +0200, Michal Hocko wrote:
> On Fri 01-05-26 14:17:50, Minchan Kim wrote:
> > On Thu, Apr 30, 2026 at 11:55:54AM +0200, Michal Hocko wrote:
> > > On Wed 29-04-26 14:13:59, Minchan Kim wrote:
> > > > This policy differs from the global OOM killer, which kills all processes
> > > > sharing the same mm to guarantee memory reclamation at all costs (preventing
> > > > system hangs).
> > > 
> > > Incorrect, we do the same for memcg OOM killer as well. This is not
> > > about preventing system hangs. But rather to
> > > 
> > > > However, process_mrelease() is invoked by userspace policy.
> > > > If it fails due to sharing, userspace can simply adapt and select another
> > > > victim process (such as another background app in Android case) to release
> > > > memory. We do not need to force success or affect processes that were not
> > > > targeted.
> > > 
> > > This is a wrong justification for the proposed semantic. You seem to be
> > > assuming this is just fine rather than this would be problematic for
> > > reasons a), b) and c). If there are no strong reasons _against_
> > > following the global policy then we should stick with it. There are very
> > > good reasons why we are doing that on the global level.
> > > 
> > > If for no other reasons then the proposed semantic severely cripples the
> > > shared MM case. You are left with a racy kill and call process_mrelease
> > > approach. You certainly do not want to allow a simple way for tasks to
> > > evade your LMK, do you? So just choose something else is a very bad
> > > approach.
> > > 
> > > So unless you are aware of a specific reason(s) where collective kill is a
> > > clearly an incorrect behavior then I believe the proper way is to kill
> > > all processes sharing the mm (unless you are crossing any security
> > > boundary when doing that).
> > 
> > I agree that in the case of a global or memcg OOM, the kernel deals with an
> > emergency, system-wide crisis where killing all sibling processes sharing
> > the same mm is an absolute necessity for system survival, bypassing
> > user-space privilege screening.
> 
> You are misinterpreting or missing my point. I am not suggesting to
> cross privilege boundaries. The syscall should fail if the mm is shared
> with tasks the caller cannot kill (same as it does now).
> 
> > However, process_mrelease() is an explicit user-space initiated system call,
> > and I am still hesitant to place that same raw, destructive policy blindly
> > at the UAPI syscall level even though I don't know of any known security
> > issues right now.
> 
> This is very wrong argument to introduce a potentially crippled syscall
> semantic.
>  
> > If we really want to go that way for the collective kill, at least, we should
> > evaluate signal authorization (kill permission) against *every single*
> > sibling process beforehand instead of only the target task of
> > process_mrelease. Do you agree?
> 
> This is what I've proposed already.
> 
> > Also, I wonder what the signal/process maintainer thinks about this approach.
> > Christian Brauner <brauner@kernel.org>?
> 
> Yes, this makes sense. There might be a very good reason why we might
> not want to introduce a way to kill cross thread groups when they share
> mm from userspace. I do not see any as long as you keep the proper
> permissions for all affected tasks. Maybe we cannot do that sanely now.
> But these reasons have to be properly documented. Your whole argument
> that this is different from in-kernel oom killing is just not valid.

IIUC, an OOM kill invoked from the kernel just takes down what it wants
to take down, without permission checking. That makes a lot
of sense and is mostly safe - after all it is the kernel that initiates
the kill.

However, when userspace initiates the kill we need at least the
semantics you proposed, Michal. You can only kill processes that you
have the necessary privileges over; otherwise you end up allowing
SIGKILL of setuid binaries over which you hold no privileges, possibly
generating information leaks or worse.

The other thing to keep in mind is that currently pidfds explicitly do
not allow signaling tasks that are outside of their pid namespace
hierarchy - see pidfd_send_signal()'s permission checking. I don't want
to break these semantics - it's just very bad API design if signaling
suddenly behaves differently and pidfds suddenly convey the ability to
signal with a very wide scope.

The other thing is that pidfds are handles that can be sent around using
SCM_RIGHTS which means they could be forwarded to a container or another
privileged user that then initiates kill semantics.

The other thing is that the type of pidfd selects the scope of the
signaling operation:

* If the pidfd was created via PIDFD_THREAD then the scope of the signal
  is by default the individual thread - unless the signal itself is
  thread-group oriented ofc.

* If the pidfd was created without PIDFD_THREAD then the scope of the
  signal is by default the thread-group.

* pidfd_send_signal() provides explicit scope overrides:

  (1) PIDFD_SIGNAL_THREAD
  (2) PIDFD_SIGNAL_THREAD_GROUP
  (3) PIDFD_SIGNAL_PROCESS_GROUP

  The flags should be mostly self-explanatory.

  So I really dislike the idea of letting the pidfd passed to
  process_mrelease() suddenly have an implicit scope. The problem is
  that this is very opaque to userspace and introduces another way to
  signal a group of processes.

IOW, I still dislike the fact that process_mrelease() is suddenly turned
into a signal sending syscall and I really dislike the fact that it
implies a "kill everything with that mm and cross other thread-groups".

I wonder if you couldn't just add PIDFD_SIGNAL_MM_GROUP or something to
pidfd_send_signal() instead.
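For reference, the explicit scope overrides listed above are passed in the flags argument of pidfd_send_signal(). The sketch below is illustrative only: the helper name signal_with_scope() is hypothetical, the fallback flag values reflect my reading of include/uapi/linux/pidfd.h and should be treated as an assumption, and PIDFD_SIGNAL_MM_GROUP does not exist, so it is deliberately not defined here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback number (generic syscall table) for older libc headers. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

/* Fallback values; assumed to match the uapi header. */
#ifndef PIDFD_SIGNAL_THREAD
#define PIDFD_SIGNAL_THREAD		(1UL << 0)
#define PIDFD_SIGNAL_THREAD_GROUP	(1UL << 1)
#define PIDFD_SIGNAL_PROCESS_GROUP	(1UL << 2)
#endif

/*
 * Hypothetical helper: send sig through pidfd with an explicit scope.
 * A scope of 0 keeps the default implied by the pidfd type.
 * Returns 0 on success or a negative errno value.
 */
static int signal_with_scope(int pidfd, int sig, unsigned int scope)
{
	if (syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, scope) < 0)
		return -errno;
	return 0;
}
```

A new mm-wide scope would slot into the same argument, which keeps the widening of the signal visible at the call site instead of implied by process_mrelease().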

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-05  9:30       ` Christian Brauner
@ 2026-05-05 16:03         ` Michal Hocko
  2026-05-05 17:59           ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2026-05-05 16:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, akpm, hca, linux-s390, david, linux-mm, linux-kernel,
	surenb, timmurray

On Tue 05-05-26 11:30:22, Christian Brauner wrote:
> IIUC, then the OOM kill if invoked from the kernel just takes down
> without permission checking what it wants to take down. That makes a lot
> of sense and is mostly safe - after all it is the kernel that initiates
> the kill.
> 
> However, when userspace initiates the kill we need at least the
> semantics you proposed, Michal. You can only kill processes that you
> have the necessary privileges over otherwise you end up allowing to
> SIGKILL setuid binaries over which you hold no privileged possibly
> generating information leaks or worse.

Agreed!

> The other thing to keep in mind is that currently pidfds explicitly do
> not allow signaling tasks that are outside of their pid namespace
> hierarchy - see pidfd_send_signal()'s permission checking. I don't want
> to break these semantics - it's just very bad api design if signaling
> suddenly behaves differently and pidfd suddenly convey the ability to
> do a very wide signal scope.

Agreed!

> The other thing is that pidfds are handles that can be sent around using
> SCM_RIGHTS which means they could be forwarded to a container or another
> privileged user that then initiates kill semantics.
> 
> The other thing is that the type of pidfd selects the scope of the
> signaling operation:
> 
> * If the pidfd was created via PIDFD_THREAD then the scope of the signal
>   is by default the individual thread - unless the signal itself is
>   thread-group oriented ofc.
> 
> * If the pidfd was created without PIDFD_THREAD then the scope of the
>   signal is by default the thread-group.
> 
> * pidfd_send_signal() provides explicitly scope overrides:
> 
>   (1) PIDFD_SIGNAL_THREAD
>   (2) PIDFD_SIGNAL_THREAD_GROUP
>   (3) PIDFD_SIGNAL_PROCESS_GROUP
> 
>   The flags should be mostly self-explanatory.
> 
>   So I really dislike the idea of now letting the pidfd passed to
>   process_mrelease() to have an implicit scope suddenly. The problem is
>   that this is very opaque to userspace and introduces another way to
>   signal a group of processes.

I do see your point. Unfortunately the whole concept of an mm shared
across thread (signal) groups does not fit well into the overall
model. For most use cases this is not a big problem. But oom handlers
do care. If you do not kill all owners of the mm you are not releasing
any memory.

> IOW, I still dislike the fact that process_mrelease() is suddenly turned
> into a signal sending syscall and I really dislike the fact that it
> implies a "kill everything with that mm and cross other thread-groups".
> 
> I wonder if you couldn't just add PIDFD_SIGNAL_MM_GROUP or something to
> pidfd_send_signal() instead.

That would be a clean interface for sure. The thing we are struggling
with here is not just the killing side of things but also grabbing the mm
before it disappears, which is the primary reason why process_mrelease is
turning into a signal sending syscall (which you seem not to be in favor
of).

So I can see these options on the table:
1) Keep process_mrelease as is and live with the race. This sucks
because it makes userspace low memory (oom) killers harder to predict.
2) Add the proposed option to kill&release into process_mrelease that
is not aware of the shared mm case. This sucks because it creates an easy
way to evade the said oom killer.
3) Same as 2 but add PIDFD_SIGNAL_MM_GROUP that would do the right thing
on the signal handling side. You seem to like the idea from the
pidfd_send_signal POV but I am not sure you are OK with that being
implanted into process_mrelease.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-05-05 16:03         ` Michal Hocko
@ 2026-05-05 17:59           ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2026-05-05 17:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christian Brauner, akpm, hca, linux-s390, david, linux-mm,
	linux-kernel, surenb, timmurray

On Tue, May 05, 2026 at 06:03:03PM +0200, Michal Hocko wrote:
> On Tue 05-05-26 11:30:22, Christian Brauner wrote:
> > IIUC, then the OOM kill if invoked from the kernel just takes down
> > without permission checking what it wants to take down. That makes a lot
> > of sense and is mostly safe - after all it is the kernel that initiates
> > the kill.
> > 
> > However, when userspace initiates the kill we need at least the
> > semantics you proposed, Michal. You can only kill processes that you
> > have the necessary privileges over otherwise you end up allowing to
> > SIGKILL setuid binaries over which you hold no privileged possibly
> > generating information leaks or worse.
> 
> Agreed!
> 
> > The other thing to keep in mind is that currently pidfds explicitly do
> > not allow signaling tasks that are outside of their pid namespace
> > hierarchy - see pidfd_send_signal()'s permission checking. I don't want
> > to break these semantics - it's just very bad api design if signaling
> > suddenly behaves differently and pidfd suddenly convey the ability to
> > do a very wide signal scope.
> 
> Agreed!
> 
> > The other thing is that pidfds are handles that can be sent around using
> > SCM_RIGHTS which means they could be forwarded to a container or another
> > privileged user that then initiates kill semantics.
> > 
> > The other thing is that the type of pidfd selects the scope of the
> > signaling operation:
> > 
> > * If the pidfd was created via PIDFD_THREAD then the scope of the signal
> >   is by default the individual thread - unless the signal itself is
> >   thread-group oriented ofc.
> > 
> > * If the pidfd was created without PIDFD_THREAD then the scope of the
> >   signal is by default the thread-group.
> > 
> > * pidfd_send_signal() provides explicitly scope overrides:
> > 
> >   (1) PIDFD_SIGNAL_THREAD
> >   (2) PIDFD_SIGNAL_THREAD_GROUP
> >   (3) PIDFD_SIGNAL_PROCESS_GROUP
> > 
> >   The flags should be mostly self-explanatory.
> > 
> >   So I really dislike the idea of now letting the pidfd passed to
> >   process_mrelease() to have an implicit scope suddenly. The problem is
> >   that this is very opaque to userspace and introduces another way to
> >   signal a group of processes.
> 
> I do see your point. Unfortunately the whole concept of mm shared
> across thread (signal) groups is not fitting well into the overall
> model. For the most usecases this is not a big problem. But oom handlers
> do care. If you do not kill all owners of the mm you are not releasing
> any memory.
> 
> > IOW, I still dislike the fact that process_mrelease() is suddenly turned
> > into a signal sending syscall and I really dislike the fact that it
> > implies a "kill everything with that mm and cross other thread-groups".
> > 
> > I wonder if you couldn't just add PIDFD_SIGNAL_MM_GROUP or something to
> > pidfd_send_signal() instead.
> 
> That would be a clean interface for sure. The thing we are struggling
> with here is not just the killing side of things but also grabbing the mm
> before it disappears, which is the primary reason why process_mrelease is
> turning into a signal-sending syscall (which you seem to be not in favor
> of).
> 
> So I can see these options on the table:
> 1) keep process_mrelease as is and live with the race. This sucks
> because it makes userspace low memory (oom) killers harder to predict.
> 2) add the proposed option to kill&release into process_mrelease that
> is not aware of the shared mm case. This sucks because it creates an easy
> way to evade said oom killer.
> 3) same as 2 but add PIDFD_SIGNAL_MM_GROUP that would do the right thing
> on the signal handling side. You seem to like the idea from the
> pidfd_send_signal POV but I am not sure you are OK with that being
> implanted into process_mrelease.

For 3, maybe something like this?
(Just to show the concept for further discussion)

---
 include/linux/signal.h    |  4 +++
 include/uapi/linux/mman.h |  4 +++
 kernel/signal.c           | 29 ++++++++++++++++++---
 mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
 4 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index f19816832f05..bdbe6b3addec 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
 
 struct timespec;
 struct pt_regs;
+struct mm_struct;
+struct pid;
 enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
@@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
 				struct task_struct *p, enum pid_type type);
 extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
 			       struct task_struct *p, enum pid_type type);
+extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+				      siginfo_t __user *info, unsigned int flags);
 extern int send_signal_locked(int sig, struct kernel_siginfo *info,
 			      struct task_struct *p, enum pid_type type);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..b2dc08a9bdd3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
 	return kill_pid_info_type(sig, &kinfo, pid, type);
 }
 
+/**
+ * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
+ *                              while validating PID namespace hierarchy.
+ * @pid:   the struct pid of the target process
+ * @sig:   signal to send
+ * @type:  scope of the signal (e.g. PIDTYPE_TGID)
+ * @info:  signal info payload
+ * @flags: signaling flags
+ *
+ * Verify that the target pid resides inside the caller's PID namespace
+ * hierarchy prior to signal delivery.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+			       siginfo_t __user *info, unsigned int flags)
+{
+	/* Enforce PID namespace hierarchy boundary */
+	if (!access_pidfd_pidns(pid))
+		return -EINVAL;
+
+	return do_pidfd_send_signal(pid, sig, type, info, flags);
+}
+
 /**
  * sys_pidfd_send_signal - Signal a process through a pidfd
  * @pidfd:  file descriptor of the process
@@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		if (IS_ERR(pid))
 			return PTR_ERR(pid);
 
-		if (!access_pidfd_pidns(pid))
-			return -EINVAL;
-
 		/* Infer scope from the type of pidfd. */
 		if (fd_file(f)->f_flags & PIDFD_THREAD)
 			type = PIDTYPE_PID;
 		else
 			type = PIDTYPE_TGID;
 
-		return do_pidfd_send_signal(pid, sig, type, info, flags);
+		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
 	}
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..253aa80770f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,7 @@
 
 #include <linux/oom.h>
 #include <linux/mm.h>
+#include <uapi/linux/mman.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
@@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
+/*
+ * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
+ * @victim: the targeted OOM process group leader
+ * @mm:     the virtual memory space being reaped
+ *
+ * Traverse all threads globally and signal any user processes sharing the identical
+ * mm footprints, ensuring no concurrent users pin the memory. Skips the system
+ * global init and kernel worker threads.
+ */
+static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
+{
+	struct task_struct *p;
+	bool failed = false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (is_global_init(p)) {
+			failed = true;
+			continue;
+		}
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+
+		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
+			failed = true;
+	}
+	rcu_read_unlock();
+
+	return failed ? -EBUSY : 0;
+}
+
 static void __oom_kill_process(struct task_struct *victim, const char *message)
 {
 	struct task_struct *p;
@@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	bool reap_kill;
 
-	if (flags)
+	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
 		return -EINVAL;
+	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
 
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task))
@@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	}
 
 	mm = p->mm;
-	mmgrab(mm);
 
-	if (task_will_free_mem(p))
-		reap = true;
-	else {
+	reap = reap_kill || task_will_free_mem(p);
+	if (!reap) {
 		/* Error only if the work has not been done already */
 		if (!mm_flags_test(MMF_OOM_SKIP, mm))
 			ret = -EINVAL;
+		task_unlock(p);
+		goto put_task;
 	}
+
+	mmgrab(mm);
 	task_unlock(p);
 
-	if (!reap)
-		goto drop_mm;
+	if (reap_kill) {
+		ret = kill_all_shared_mm(task, mm);
+		if (ret)
+			goto drop_mm;
+	}
 
 	if (mmap_read_lock_killable(mm)) {
 		ret = -EINTR;
-- 
2.54.0.545.g6539524ca2-goog
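
For discussion only, here is a rough sketch of how a userspace reaper might call the interface proposed in the concept patch above. PROCESS_MRELEASE_REAP_KILL and its value (1 << 0) are taken from that patch and are not part of any released UAPI header. The helper name mrelease_reap_kill() is hypothetical, and the fallback syscall number 448 assumes an architecture using the unified syscall table (e.g. x86_64, arm64). Kernels without the patch reject any nonzero flags with -EINVAL, so a caller would still need the old kill-then-reap path as a fallback.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: unified syscall numbering; older libc headers may lack this. */
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/* Assumption: value copied from the concept patch, not from a released header. */
#ifndef PROCESS_MRELEASE_REAP_KILL
#define PROCESS_MRELEASE_REAP_KILL (1 << 0)
#endif

/*
 * Hypothetical helper: ask the kernel to SIGKILL the target and reap its
 * address space in one call. Returns 0 on success or a negative errno.
 * On a kernel without the patch this returns -EINVAL because flags != 0.
 */
static int mrelease_reap_kill(int pidfd)
{
	if (syscall(__NR_process_mrelease, pidfd, PROCESS_MRELEASE_REAP_KILL) < 0)
		return -errno;
	return 0;
}
```

With the concept patch applied, -EBUSY from this call would mean that at least one process sharing the mm could not be signaled.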





* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-29 21:13 [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
  2026-04-30  9:55 ` Michal Hocko
@ 2026-04-30 14:43 ` Andrew Morton
  2026-04-30 15:32   ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2026-04-30 14:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: hca, linux-s390, david, mhocko, brauner, linux-mm, linux-kernel,
	surenb, timmurray

On Wed, 29 Apr 2026 14:13:59 -0700 Minchan Kim <minchan@kernel.org> wrote:

> Currently, process_mrelease() requires userspace to send a SIGKILL signal
> prior to invocation. This separation introduces a scheduling race window
> where the victim task may receive the signal and enter the exit path
> before the reaper can invoke process_mrelease().

Does process_mrelease() have a manpage?  My googling was a fail.




* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-30 14:43 ` Andrew Morton
@ 2026-04-30 15:32   ` Michal Hocko
  2026-04-30 16:34     ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2026-04-30 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, hca, linux-s390, david, brauner, linux-mm,
	linux-kernel, surenb, timmurray

On Thu 30-04-26 07:43:05, Andrew Morton wrote:
> On Wed, 29 Apr 2026 14:13:59 -0700 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> 
> Does process_mrelease() have a manpage?  My googling was a fail.

It does. Very well hidden in 884a7e5964e06
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-30 15:32   ` Michal Hocko
@ 2026-04-30 16:34     ` Andrew Morton
  2026-04-30 17:24       ` Suren Baghdasaryan
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2026-04-30 16:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, hca, linux-s390, david, brauner, linux-mm,
	linux-kernel, surenb, timmurray

On Thu, 30 Apr 2026 17:32:40 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 30-04-26 07:43:05, Andrew Morton wrote:
> > On Wed, 29 Apr 2026 14:13:59 -0700 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > > prior to invocation. This separation introduces a scheduling race window
> > > where the victim task may receive the signal and enter the exit path
> > > before the reaper can invoke process_mrelease().
> > 
> > Does process_mrelease() have a manpage?  My googling was a fail.
> 
> It does. Very well hidden in 884a7e5964e06


Well, that didn't appear to make it into the manpages project and it
doesn't describe the expected usage: need to kill the process first. 
But I guess all the needed info is in
tools/testing/selftests/mm/mrelease_test.c.

https://lwn.net/Articles/864184/ is useful.
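
To make the "kill the process first" usage concrete, here is a minimal userspace sketch of the current two-step sequence. The helper name kill_then_reap() is hypothetical, and the fallback syscall numbers assume an architecture using the unified syscall table (e.g. x86_64, arm64). The comment marks the window that the patch description says can lead to -ESRCH.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: unified syscall numbering; older libc headers may lack these. */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/*
 * Hypothetical helper: SIGKILL the process, then ask the kernel to reap
 * its address space. Returns 0 on success or a negative errno.
 */
static int kill_then_reap(pid_t pid)
{
	int ret = 0;
	int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -errno;

	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0) {
		ret = -errno;
		goto out;
	}

	/*
	 * Race window: if the victim has already run exit_mm() by now,
	 * the call below fails with ESRCH and nothing is reaped early.
	 */
	if (syscall(__NR_process_mrelease, pidfd, 0) < 0)
		ret = -errno;
out:
	close(pidfd);
	return ret;
}
```

A caller would normally treat -ESRCH from this helper as "the victim is already exiting" rather than as a hard failure.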


* Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
  2026-04-30 16:34     ` Andrew Morton
@ 2026-04-30 17:24       ` Suren Baghdasaryan
  0 siblings, 0 replies; 12+ messages in thread
From: Suren Baghdasaryan @ 2026-04-30 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Minchan Kim, hca, linux-s390, david, brauner,
	linux-mm, linux-kernel, timmurray

On Thu, Apr 30, 2026 at 9:34 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 30 Apr 2026 17:32:40 +0200 Michal Hocko <mhocko@suse.com> wrote:
>
> > On Thu 30-04-26 07:43:05, Andrew Morton wrote:
> > > On Wed, 29 Apr 2026 14:13:59 -0700 Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > > > prior to invocation. This separation introduces a scheduling race window
> > > > where the victim task may receive the signal and enter the exit path
> > > > before the reaper can invoke process_mrelease().
> > >
> > > Does process_mrelease() have a manpage?  My googling was a fail.
> >
> > It does. Very well hidden in 884a7e5964e06
>
>
> Well, that didn't appear to make it into the manpages project and it
> doesn't describe the expected usage: need to kill the process first.
> But I guess all the needed info is in
> tools/testing/selftests/mm/mrelease_test.c.
>
> https://lwn.net/Articles/864184/ is useful.

I'll try to carve out some time to post a proper manpage for it.
Thanks for pointing this out!


end of thread, other threads:[~2026-05-05 17:59 UTC | newest]

Thread overview: 12+ messages
2026-04-29 21:13 [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-04-30  9:55 ` Michal Hocko
2026-05-01 21:17   ` Minchan Kim
2026-05-04  7:51     ` Michal Hocko
2026-05-05  5:04       ` Minchan Kim
2026-05-05  9:30       ` Christian Brauner
2026-05-05 16:03         ` Michal Hocko
2026-05-05 17:59           ` Minchan Kim
2026-04-30 14:43 ` Andrew Morton
2026-04-30 15:32   ` Michal Hocko
2026-04-30 16:34     ` Andrew Morton
2026-04-30 17:24       ` Suren Baghdasaryan
