public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>,
	akpm@linux-foundation.org, hca@linux.ibm.com,
	linux-s390@vger.kernel.org, david@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, surenb@google.com,
	timmurray@google.com
Subject: Re: [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Date: Tue, 5 May 2026 10:59:43 -0700	[thread overview]
Message-ID: <afowD31YsGVdVUBP@google.com> (raw)
In-Reply-To: <afoUt3te1k2TNao-@tiehlicka>

On Tue, May 05, 2026 at 06:03:03PM +0200, Michal Hocko wrote:
> On Tue 05-05-26 11:30:22, Christian Brauner wrote:
> > IIUC, then the OOM kill if invoked from the kernel just takes down
> > without permission checking what it wants to take down. That makes a lot
> > of sense and is mostly safe - after all it is the kernel that initiates
> > the kill.
> > 
> > However, when userspace initiates the kill we need at least the
> > semantics you proposed, Michal. You can only kill processes that you
> > have the necessary privileges over otherwise you end up allowing to
> > SIGKILL setuid binaries over which you hold no privileged possibly
> > generating information leaks or worse.
> 
> Agreed!
> 
> > The other thing to keep in mind is that currently pidfds explicitly do
> > not to allow to signal taks that are outside of their pid namespace
> > hierarchy - see pidfd_send_signal()'s permission checking. I don't want
> > to break these semantics - it's just very bad api design if signaling
> > suddenly behaves differently and pidfd suddenly convey the ability to
> > do a very wide signal scope.
> 
> Agreed!
> 
> > The other thing is that pidfds are handles that can be sent around using
> > SCM_RIGHTS which means they could be forwarded to a container or another
> > privileged user that then initiates kill semantics.
> > 
> > The other thing is that the type of pidfd selects the scope of the
> > signaling operation:
> > 
> > * If the pidfd was created via PIDFD_THREAD then the scope of the signal
> >   is by default the individual thread - unless the signal itself is
> >   thread-group oriented ofc.
> > 
> > * If the pidfd was created wihout PIDFD_THREAD then the scope of the
> >   signal is by default the thread-group.
> > 
> > * pidfd_send_signal() provides explicitly scope overrides:
> > 
> >   (1) PIDFD_SIGNAL_THREAD
> >   (2) PIDFD_SIGNAL_THREAD_GROUP
> >   (3) PIDFD_SIGNAL_PROCESS_GROUP
> > 
> >   The flags should be mostly self-explanatory.
> > 
> >   So I really dislike the idea of now letting the pidfd passed to
> >   process_mrelease() to have an implicit scope suddenly. The problem is
> >   that this is very opaque to userspace and introduces another way to
> >   signal a group of processes.
> 
> I do see your point. Unfortunately the whole concept of mm shared
> across thread (signal) groups is not fitting well into the overall
> model. For the most usecases this is not a big problem. But oom handlers
> do care. If you do not kill all owners of the mm you are not releasing
> any memory.
> 
> > IOW, I still dislike the fact that process_mrelease() is suddenly turned
> > into a signal sending syscall and I really dislike the fact that it
> > implies a "kill everything with that mm and cross other thread-groups".
> > 
> > I wonder if you couldn't just add PIDFD_SIGNAL_MM_GROUP or something to
> > pidfd_send_signal() instead.
> 
> That would be a clean interface for sure. The thing we are struggling
> here is not just the killing side of things but also grabbing the mm
> before it disappears which is the primary reason why process_mrelease is
> turning into signal sending syscall (which you seem to be not in favor
> of).
> 
> So I can see these options on the table
> 1) keep process_mrelease as is and live with the race. This sucks
> because it makes userspace low memory (oom) killers harder to predict.
> 2) we add the proposed option to kill&release into process_mrelease that
> is not aware of shared mm case. This sucks because it creates an easy
> way to evade from the said oom killer
> 3) same as 2 but add PIDFD_SIGNAL_MM_GROUP that would do the right thing
> on the signal handling side. You seem to like the idea from the
> pidfd_send_signal POV but I am not sure you are OK with that being
> implanted into process_mrelease.

For 3, maybe something likle this?
(Just to show the concept for further discussion)

---
 include/linux/signal.h    |  4 +++
 include/uapi/linux/mman.h |  4 +++
 kernel/signal.c           | 29 ++++++++++++++++++---
 mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
 4 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index f19816832f05..bdbe6b3addec 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
 
 struct timespec;
 struct pt_regs;
+struct mm_struct;
+struct pid;
 enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
@@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
 				struct task_struct *p, enum pid_type type);
 extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
 			       struct task_struct *p, enum pid_type type);
+extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+				      siginfo_t __user *info, unsigned int flags);
 extern int send_signal_locked(int sig, struct kernel_siginfo *info,
 			      struct task_struct *p, enum pid_type type);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..b2dc08a9bdd3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
 	return kill_pid_info_type(sig, &kinfo, pid, type);
 }
 
+/**
+ * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
+ *                              while validating PID namespace hierarchy.
+ * @pid:   the struct pid of the target process
+ * @sig:   signal to send
+ * @type:  scope of the signal (e.g. PIDTYPE_TGID)
+ * @info:  signal info payload
+ * @flags: signaling flags
+ *
+ * Verify that the target pid resides inside the caller's PID namespace
+ * hierarchy prior to signal delivery.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
+			       siginfo_t __user *info, unsigned int flags)
+{
+	/* Enforce PID namespace hierarchy boundary */
+	if (!access_pidfd_pidns(pid))
+		return -EINVAL;
+
+	return do_pidfd_send_signal(pid, sig, type, info, flags);
+}
+
 /**
  * sys_pidfd_send_signal - Signal a process through a pidfd
  * @pidfd:  file descriptor of the process
@@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		if (IS_ERR(pid))
 			return PTR_ERR(pid);
 
-		if (!access_pidfd_pidns(pid))
-			return -EINVAL;
-
 		/* Infer scope from the type of pidfd. */
 		if (fd_file(f)->f_flags & PIDFD_THREAD)
 			type = PIDTYPE_PID;
 		else
 			type = PIDTYPE_TGID;
 
-		return do_pidfd_send_signal(pid, sig, type, info, flags);
+		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
 	}
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..253aa80770f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,7 @@
 
 #include <linux/oom.h>
 #include <linux/mm.h>
+#include <uapi/linux/mman.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
@@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
+/*
+ * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
+ * @victim: the targeted OOM process group leader
+ * @mm:     the virtual memory space being reaped
+ *
+ * Traverse all threads globally and signal any user processes sharing the identical
+ * mm footprints, ensuring no concurrent users pin the memory. Skips the system
+ * global init and kernel worker threads.
+ */
+static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
+{
+	struct task_struct *p;
+	bool failed = false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (is_global_init(p)) {
+			failed = true;
+			continue;
+		}
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+
+		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
+			failed = true;
+	}
+	rcu_read_unlock();
+
+	return failed ? -EBUSY : 0;
+}
+
 static void __oom_kill_process(struct task_struct *victim, const char *message)
 {
 	struct task_struct *p;
@@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	bool reap_kill;
 
-	if (flags)
+	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
 		return -EINVAL;
+	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
 
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task))
@@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	}
 
 	mm = p->mm;
-	mmgrab(mm);
 
-	if (task_will_free_mem(p))
-		reap = true;
-	else {
+	reap = reap_kill || task_will_free_mem(p);
+	if (!reap) {
 		/* Error only if the work has not been done already */
 		if (!mm_flags_test(MMF_OOM_SKIP, mm))
 			ret = -EINVAL;
+		task_unlock(p);
+		goto put_task;
 	}
+
+	mmgrab(mm);
 	task_unlock(p);
 
-	if (!reap)
-		goto drop_mm;
+	if (reap_kill) {
+		ret = kill_all_shared_mm(task, mm);
+		if (ret)
+			goto drop_mm;
+	}
 
 	if (mmap_read_lock_killable(mm)) {
 		ret = -EINTR;
-- 
2.54.0.545.g6539524ca2-goog




  reply	other threads:[~2026-05-05 17:59 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-29 21:13 [PATCH v2] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-04-30  9:55 ` Michal Hocko
2026-05-01 21:17   ` Minchan Kim
2026-05-04  7:51     ` Michal Hocko
2026-05-05  5:04       ` Minchan Kim
2026-05-05  9:30       ` Christian Brauner
2026-05-05 16:03         ` Michal Hocko
2026-05-05 17:59           ` Minchan Kim [this message]
2026-04-30 14:43 ` Andrew Morton
2026-04-30 15:32   ` Michal Hocko
2026-04-30 16:34     ` Andrew Morton
2026-04-30 17:24       ` Suren Baghdasaryan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afowD31YsGVdVUBP@google.com \
    --to=minchan@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@kernel.org \
    --cc=hca@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox