Date: Thu, 16 Apr 2026 23:30:09 -0700
From: Minchan Kim
To: Christian Brauner
Cc: akpm@linux-foundation.org, david@kernel.org, mhocko@suse.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, surenb@google.com,
	timmurray@google.com
Subject: Re: [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
References: <20260413223948.556351-1-minchan@kernel.org>
	<20260413223948.556351-4-minchan@kernel.org>
	<20260416-planktont-abwinken-b9499483b939@brauner>
X-Mailing-List: linux-kernel@vger.kernel.org
In-Reply-To: <20260416-planktont-abwinken-b9499483b939@brauner>

On Thu, Apr 16, 2026 at 11:13:35AM +0200, Christian Brauner wrote:
> On Mon, Apr 13, 2026 at 03:39:48PM -0700, Minchan Kim wrote:
> > Currently, process_mrelease() requires userspace to send a SIGKILL signal
> > prior to invocation. This separation introduces a race window where the
> > victim task may receive the signal and enter the exit path before the
> > reaper can invoke process_mrelease().
> >
> > In this case, the victim task frees its memory via the standard, unoptimized
> > exit path, bypassing the expedited clean file folio reclamation optimization
> > introduced in the previous patch (which relies on the MMF_UNSTABLE flag).
> >
> > This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> > an integrated auto-kill mode. When specified, process_mrelease() directly
> > injects a SIGKILL into the target task.
> >
> > Crucially, this patch utilizes a dedicated signal code (KILL_MRELEASE)
> > during signal injection, belonging to a new SIGKILL si_codes section.
> > This special code ensures that the kernel's signal delivery path reliably
> > intercepts the request and marks the target address space as unstable
> > (MMF_UNSTABLE). This mechanism guarantees that the MMF_UNSTABLE flag is set
> > before either the victim task or the reaper proceeds, ensuring that the
> > expedited reclamation optimization is utilized regardless of scheduling
> > order.
> >
> > Signed-off-by: Minchan Kim
> > ---
> >  include/uapi/asm-generic/siginfo.h |  6 ++++++
> >  include/uapi/linux/mman.h          |  4 ++++
> >  kernel/signal.c                    |  4 ++++
> >  mm/oom_kill.c                      | 20 +++++++++++++++++++-
> >  4 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> > index 5a1ca43b5fc6..0f59b791dab4 100644
> > --- a/include/uapi/asm-generic/siginfo.h
> > +++ b/include/uapi/asm-generic/siginfo.h
> > @@ -252,6 +252,12 @@ typedef struct siginfo {
> >  #define BUS_MCEERR_AO	5
> >  #define NSIGBUS		5
> >
> > +/*
> > + * SIGKILL si_codes
> > + */
> > +#define KILL_MRELEASE	1	/* sent by process_mrelease */
> > +#define NSIGKILL	1
> > +
> >  /*
> >   * SIGTRAP si_codes
> >   */
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index e89d00528f2f..4266976b45ad 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -56,4 +56,8 @@ struct cachestat {
> >  	__u64 nr_recently_evicted;
> >  };
> >
> > +/* Flags for process_mrelease */
> > +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> > +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> > +
> >  #endif /* _UAPI_LINUX_MMAN_H */
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..c21b2176dc5e 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1134,6 +1134,10 @@ static int __send_signal_locked(int sig, struct kernel_siginfo *info,
> >
> >  out_set:
> >  	signalfd_notify(t, sig);
> > +
> > +	if (sig == SIGKILL && !is_si_special(info) &&
> > +	    info->si_code == KILL_MRELEASE && t->mm)
> > +		mm_flags_set(MMF_UNSTABLE, t->mm);
> >  	sigaddset(&pending->signal, sig);
> >
> >  	/* Let multiprocess signals appear after on-going forks */
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5c6c95c169ee..0b5da5208707 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -20,6 +20,8 @@
> >
> >  #include
> >  #include
> > +#include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -1218,13 +1220,29 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	bool reap = false;
> >  	long ret = 0;
> >
> > -	if (flags)
> > +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
> >  		return -EINVAL;
> >
> >  	task = pidfd_get_task(pidfd, &f_flags);
> >  	if (IS_ERR(task))
> >  		return PTR_ERR(task);
> >
> > +	if (flags & PROCESS_MRELEASE_REAP_KILL) {
> > +		struct kernel_siginfo info;
> > +
> > +		if (!capable(CAP_KILL)) {
>
> Why? Just call a function that uses check_kill_permission() before
> firing the signal? What's the rationale for doing it this way?

Thanks for pointing that out. I wasn't aware of check_kill_permission().
I took a look at it, and it seems check_kill_permission() handles
permissions primarily for signals sent from userspace.

Since we are injecting the signal from the kernel side using a positive
si_code (KILL_MRELEASE), check_kill_permission() would just return 0 and
skip the permission checks entirely.

I am open to better ideas if there is a more standard way to handle
permission checks for kernel-injected signals.

>
> Tbh, I really hate that process_mrelease() now has a kill side effect
> with non-standard permission handling as well.
>
> Seems like bad api design. Why can't you just raise the MMF_UNSTABLE bit
> before the SIGKILL as that's the problem you're trying to solve.

The problem is that process_mrelease() strictly requires the target
process to already have a pending fatal signal or be in the exit path
before it allows any operation.
Therefore, we cannot invoke process_mrelease() to just set the
MMF_UNSTABLE flag *before* the SIGKILL is sent.

If I send the SIGKILL first to satisfy the process_mrelease() requirement,
we immediately run into the scheduling race condition where the victim can
enter the exit path before the reaper can set the flag.

This circular dependency is exactly why I had to integrate the kill
operation into process_mrelease() to make it atomic.

>
> > +			ret = -EPERM;
> > +			goto put_task;
> > +		}
> > +		clear_siginfo(&info);
> > +		info.si_signo = SIGKILL;
> > +		info.si_code = KILL_MRELEASE;
> > +		info.si_pid = task_tgid_vnr(current);
> > +		info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
>
> This should not be open-coded like this.

Good point. Maybe I can reuse prepare_kill_siginfo().

>
> > +
> > +		do_send_sig_info(SIGKILL, &info, task, PIDTYPE_TGID);
> > +	}
> > +
> >  	/*
> >  	 * Make sure to choose a thread which still has a reference to mm
> >  	 * during the group exit
> > --
> > 2.54.0.rc0.605.g598a273b03-goog
> >
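
To make the check_kill_permission() point above concrete, here is a minimal
userspace sketch of the origin test as I understand it. It is not kernel
code: the model_* helpers are hypothetical names used only for illustration,
and KILL_MRELEASE is the value proposed in this RFC, not something present
in any released header. The assumption is that an si_code of zero or below
is treated as user-originated, and that anything positive is treated as
kernel-generated and therefore skips the credential checks.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>

/* Value proposed by this RFC; assumption, not in any released header. */
#define KILL_MRELEASE 1

/*
 * Hypothetical userspace model of the origin test: si_code <= 0
 * (SI_USER, SI_QUEUE, ...) counts as sent from userspace, while a
 * positive si_code counts as generated by the kernel.
 */
bool model_si_from_user(int si_code)
{
	return si_code <= 0;
}

/*
 * Hypothetical model of the early return discussed above: the
 * credential checks only run for user-originated signals.
 */
bool model_permission_check_runs(int si_code)
{
	return model_si_from_user(si_code);
}
```

With this model, model_permission_check_runs(SI_USER) is true while
model_permission_check_runs(KILL_MRELEASE) is false, which is why the RFC
carries its own explicit capability check instead.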