From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 846AA3806BE;
	Fri, 15 May 2026 20:52:42 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778878362; cv=none; b=o4D0owc9ECY+vtMZxuyCrJaGMRHzCYzqWDY6na/Q7IgQnwlGGZeSA/GV30LLAaa0dXfxgO4pd9Mr6Fe5Z4s0QB1IaNEhVEEIoeh/fAaRLi9AUgW+lWw5IVR91FGog9SEW+F8sY/BwAUUU9xkI3yccvMmQz5ygIh4bMitCS/SzjE=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778878362; c=relaxed/simple;
	bh=k2L1X5wmfaJTfYccSe1oeRjdS8+g4LrLFB3drvRjVAw=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=jez6ah4bBCiL0NsDOJ4VhzMCBLQfSNAvodEyV1DIvDHzUWv5VRfkcwW5VinQ9vwUJCL9mWJ+gTy6h8EvYKy6ULNCySD2mGXmpgMKKo7Okn4A7jTr/+IV8jtRQZeXOUujZh9TF9KTG221ag7O9AN/lUZCi/v4PIkYbDapD6ZOcko=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=SpmVlAMQ; arc=none smtp.client-ip=10.30.226.201
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="SpmVlAMQ"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id DC9FEC2BCB0;
	Fri, 15 May 2026 20:52:41 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1778878362;
	bh=k2L1X5wmfaJTfYccSe1oeRjdS8+g4LrLFB3drvRjVAw=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=SpmVlAMQsenNFryEZGvkmUsYvajcCxB+UXSgXy7PomyUJAmaxY2qvcQpNuQGNBuo+
	 rKvvWnkzc0qtUC6cZFTH6wULPH0+xpBt8UUUeN4Z8Tf/E46x1ZuN9cNz1yBMlJowTQ
	 yia5uzJvauumibe0UxMyy3VwfVcONTffh4sTBTe03ZW/2smt7kmEM6mdHIkRq+TCaS
	 bbTi459OKk41infK3HC50SQmQPqpkXTW9GShLNPxJ9p+u0SaetHekAT4daUwNI86qK
	 BkyrEar0LDJp2i7SoQ8IUE0XIAi4rWYiK6Chrk9rAj7xenkVhHUwMdN5nY5SRfWoWT
	 O38nq5eEfYWcw==
Date: Fri, 15 May 2026 13:52:40 -0700
From: Minchan Kim <minchan@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>, akpm@linux-foundation.org,
	hca@linux.ibm.com, linux-s390@vger.kernel.org, david@kernel.org,
	mhocko@suse.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	surenb@google.com, timmurray@google.com
Subject: Re: [PATCH v3] mm: process_mrelease: introduce
 PROCESS_MRELEASE_REAP_KILL flag
Message-ID: <ageHmE1QIzK4R3nO@google.com>
References: <20260511214226.937793-1-minchan@kernel.org>
 <20260515-nachdenken-umbenannt-a90006a46e14@brauner>
Precedence: bulk
X-Mailing-List: linux-s390@vger.kernel.org
List-Id: <linux-s390.vger.kernel.org>
List-Subscribe: <mailto:linux-s390+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-s390+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260515-nachdenken-umbenannt-a90006a46e14@brauner>

On Fri, May 15, 2026 at 04:41:18PM +0200, Christian Brauner wrote:
> On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a scheduling race window
> > where the victim task may receive the signal and enter the exit path
> > before the reaper can invoke process_mrelease().
> > 
> > When the victim enters the exit path (do_exit -> exit_mm), it clears its
> > task->mm immediately. This causes process_mrelease() to fail with -ESRCH,
> 
> To be quite frank about the patch as written below: I think this is a
> completely crazy api. And I really dislike it.
> 
> Right now we have clear and simple signal sending semantics for pidfds:
> 
> * thread-specific pidfd -> thread-directed signal (unless signal is thread-group scoped by default)
> * thread-group pidfd -> thread-group directed signal
> 
> with specific overrides for pidfd_send_signal():
> 
> * PIDFD_SIGNAL_THREAD           -> only signal thread
> * PIDFD_SIGNAL_THREAD_GROUP     -> signal thread-group
> * PIDFD_SIGNAL_PROCESS_GROUP    -> signal process-group
> 
> And now this patch aims to elevate process_mrelease() to a _signal
> sending function_. And the semantics are complete special sauce too.
> 
> You are effectively introducing a custom signal scope that is mm-based
> and then also plumbing it into a completely unrelated function that
> should have absolutely nothing to do with this.
> 
> This is such Zebroid API. I really would hate to see it land.
> 
> This came up because of the ptrace bug that was just recently discovered
> and that Linus fixed yesterday. This is another instance where I think
> the correct fix is to keep task->mm around until the process is reaped
> and then you can throw away all of the really ugly semantic in this
> patch afaict. I'd really like to see that merged as it would also clean
> up the ptrace code:
> 
> https://lore.kernel.org/all/20201016024019.1882062-1-jannh@google.com

Thank you for pointing this out. I completely agree.

The only reason for adding PROCESS_MRELEASE_REAP_KILL and its complex signaling
was to work around task->mm being cleared early in exit_mm(), which causes
process_mrelease() to fail with -ESRCH.

If Jann's patch to keep task->mm around is merged, this race naturally vanishes.
Userspace can reliably send SIGKILL and call the normal process_mrelease()
with the exit_mm

I would be thrilled to see Jann's patch merged. It allows us to drop this
complex REAP_KILL patch entirely while still fully resolving the expedited
reclaim issue.

> 
> > leaving the actual address space teardown (exit_mmap) to be deferred until
> > the mm's reference count drops to zero. In the field (e.g., Android),
> > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > remote VM accesses) frequently delay this teardown indefinitely,
> > defeating the purpose of expedited reclamation.
> > 
> > In Android's LMKD scenarios, this delay keeps memory pressure high, forcing
> > the system to unnecessarily kill additional innocent background apps before
> > the memory from the first victim is recovered.
> > 
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task after finding its mm.
> > 
> > To solve the race condition, we grab the mm reference via mmgrab() before
> > sending the SIGKILL. If the user passed PROCESS_MRELEASE_REAP_KILL, we assume
> > it will free its memory and proceed with reaping, making the logic as simple
> > as reap = reap_kill || task_will_free_mem(p).
> > 
> > To handle shared address spaces, we deliver SIGKILL to all processes sharing
> > the same address space using do_pidfd_send_signal_pidns(). This ensures the
> > target pid resides inside the caller's PID namespace hierarchy prior to
> > signal delivery. We iterate over all processes sharing the mm and deliver
> > SIGKILL to each. If delivering the signal to any of the sharing processes
> > fails, we return an error. Note that this approach may leave partial
> > side-effects if some processes are killed successfully before a failure occurs.
> > 
> > Cc: Christian Brauner <brauner@kernel.org>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/signal.h    |  4 +++
> >  include/uapi/linux/mman.h |  4 +++
> >  kernel/signal.c           | 29 ++++++++++++++++++---
> >  mm/oom_kill.c             | 55 ++++++++++++++++++++++++++++++++++-----
> >  4 files changed, 81 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index f19816832f05..bdbe6b3addec 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -276,6 +276,8 @@ static inline int valid_signal(unsigned long sig)
> >  
> >  struct timespec;
> >  struct pt_regs;
> > +struct mm_struct;
> > +struct pid;
> >  enum pid_type;
> >  
> >  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> > @@ -283,6 +285,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
> >  				struct task_struct *p, enum pid_type type);
> >  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
> >  			       struct task_struct *p, enum pid_type type);
> > +extern int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +				      siginfo_t __user *info, unsigned int flags);
> >  extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> >  			      struct task_struct *p, enum pid_type type);
> >  extern int sigprocmask(int, sigset_t *, sigset_t *);
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >  	__u64 nr_recently_evicted;
> >  };
> >  
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..b2dc08a9bdd3 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -4046,6 +4046,30 @@ static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
> >  	return kill_pid_info_type(sig, &kinfo, pid, type);
> >  }
> >  
> > +/**
> > + * do_pidfd_send_signal_pidns - Send a signal to a process via its struct pid
> > + *                              while validating PID namespace hierarchy.
> > + * @pid:   the struct pid of the target process
> > + * @sig:   signal to send
> > + * @type:  scope of the signal (e.g. PIDTYPE_TGID)
> > + * @info:  signal info payload
> > + * @flags: signaling flags
> > + *
> > + * Verify that the target pid resides inside the caller's PID namespace
> > + * hierarchy prior to signal delivery.
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int do_pidfd_send_signal_pidns(struct pid *pid, int sig, enum pid_type type,
> > +			       siginfo_t __user *info, unsigned int flags)
> > +{
> > +	/* Enforce PID namespace hierarchy boundary */
> > +	if (!access_pidfd_pidns(pid))
> > +		return -EINVAL;
> > +
> > +	return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +}
> > +
> >  /**
> >   * sys_pidfd_send_signal - Signal a process through a pidfd
> >   * @pidfd:  file descriptor of the process
> > @@ -4094,16 +4118,13 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  		if (IS_ERR(pid))
> >  			return PTR_ERR(pid);
> >  
> > -		if (!access_pidfd_pidns(pid))
> > -			return -EINVAL;
> > -
> >  		/* Infer scope from the type of pidfd. */
> >  		if (fd_file(f)->f_flags & PIDFD_THREAD)
> >  			type = PIDTYPE_PID;
> >  		else
> >  			type = PIDTYPE_TGID;
> >  
> > -		return do_pidfd_send_signal(pid, sig, type, info, flags);
> > +		return do_pidfd_send_signal_pidns(pid, sig, type, info, flags);
> >  	}
> >  	}
> >  
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..253aa80770f2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,7 @@
> >  
> >  #include <linux/oom.h>
> >  #include <linux/mm.h>
> > +#include <uapi/linux/mman.h>
> >  #include <linux/err.h>
> >  #include <linux/gfp.h>
> >  #include <linux/sched.h>
> > @@ -925,6 +926,39 @@ static bool task_will_free_mem(struct task_struct *task)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * kill_all_shared_mm - Deliver SIGKILL to all processes sharing the given address space.
> > + * @victim: the targeted OOM process group leader
> > + * @mm:     the virtual memory space being reaped
> > + *
> > + * Traverse all threads globally and signal any user processes sharing the identical
> > + * mm footprints, ensuring no concurrent users pin the memory. Skips the system
> > + * global init and kernel worker threads.
> > + */
> > +static int kill_all_shared_mm(struct task_struct *victim, struct mm_struct *mm)
> > +{
> > +	struct task_struct *p;
> > +	bool failed = false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		if (!process_shares_mm(p, mm))
> > +			continue;
> > +		if (is_global_init(p)) {
> 
> You can't signal init in any shape or form any way. Why bother reporting
> failure at all.
> 
> > +			failed = true;
> > +			continue;
> > +		}
> > +		if (unlikely(p->flags & PF_KTHREAD))
> > +			continue;
> > +
> > +		if (do_pidfd_send_signal_pidns(task_pid(p), SIGKILL, PIDTYPE_TGID, NULL, 0))
> > +			failed = true;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return failed ? -EBUSY : 0;
> 
> Why are you returning EBUSY? This makes no sense imho.
> 
> > +}
> > +
> >  static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  {
> >  	struct task_struct *p;
> > @@ -1217,9 +1251,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	unsigned int f_flags;
> >  	bool reap = false;
> >  	long ret = 0;
> > +	bool reap_kill;
> >  
> > -	if (flags)
> > +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >  		return -EINVAL;
> > +	reap_kill = !!(flags & PROCESS_MRELEASE_REAP_KILL);
> >  
> >  	task = pidfd_get_task(pidfd, &f_flags);
> >  	if (IS_ERR(task))
> > @@ -1236,19 +1272,24 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	}
> >  
> >  	mm = p->mm;
> > -	mmgrab(mm);
> >  
> > -	if (task_will_free_mem(p))
> > -		reap = true;
> > -	else {
> > +	reap = reap_kill || task_will_free_mem(p);
> > +	if (!reap) {
> >  		/* Error only if the work has not been done already */
> >  		if (!mm_flags_test(MMF_OOM_SKIP, mm))
> >  			ret = -EINVAL;
> > +		task_unlock(p);
> > +		goto put_task;
> >  	}
> > +
> > +	mmgrab(mm);
> >  	task_unlock(p);
> >  
> > -	if (!reap)
> > -		goto drop_mm;
> > +	if (reap_kill) {
> > +		ret = kill_all_shared_mm(task, mm);
> > +		if (ret)
> > +			goto drop_mm;
> > +	}
> >  
> >  	if (mmap_read_lock_killable(mm)) {
> >  		ret = -EINTR;
> > -- 
> > 2.54.0.563.g4f69b47b94-goog
> > 
> >