Generic Linux architectural discussions

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Al Viro @ 2026-06-25 16:55 UTC
  To: Xin Zhao
  Cc: brauner, alex.aring, allen.lkml, arnd, chuck.lever, david,
	ebiederm, j.granados, jack, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, linux-mm, ljs, mcgrof, mjguzik,
	pfalcato, rppt
In-Reply-To: <20260625085018.989584-1-jackzxcui1989@163.com>

On Thu, Jun 25, 2026 at 04:50:18PM +0800, Xin Zhao wrote:
> > [Severity: Medium]
> > Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> > of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> > hold file->f_lock.
> > 
> > Also, if a file has duplicated file descriptors (e.g., via dup()), will
> > clearing O_TMPCLOS here prematurely skip the closure of the remaining
> > descriptors? When encountering the duplicated descriptor later, the flag
> > will already be cleared, leaving the shared file actively referenced.
> 
> Currently, this flag will only be used by the logic we added, so I believe
> there won't be any issues.

What makes you (or whatever LLM you happen to use) think that file is referenced
only by descriptor table of the coredumping process?  Or that only one coredumping
process exists at any time, for that matter - and each might hold references to
the same struct file.
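
The sharing at issue here is visible from userspace. The following is a minimal sketch, assuming a Linux or other POSIX system; the helper name dup_shares_status_flags() is hypothetical and not part of the patch. It sets O_NONBLOCK through one descriptor and reads it back through a dup()ed one, which only works because both descriptors refer to the same open file description (struct file in the kernel):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a status flag set through one descriptor is visible through
 * its dup()ed sibling, 0 if it is not, and -1 on error.
 */
int dup_shares_status_flags(void)
{
	int pipefd[2];
	int dupfd, flags, shared = -1;

	if (pipe(pipefd) < 0)
		return -1;

	dupfd = dup(pipefd[0]);
	if (dupfd < 0)
		goto out;

	/* Set O_NONBLOCK via the original descriptor only. */
	flags = fcntl(pipefd[0], F_GETFL);
	if (flags < 0 || fcntl(pipefd[0], F_SETFL, flags | O_NONBLOCK) < 0)
		goto out_dup;

	/* Read the status flags back through the duplicate. */
	flags = fcntl(dupfd, F_GETFL);
	if (flags >= 0)
		shared = !!(flags & O_NONBLOCK);

out_dup:
	close(dupfd);
out:
	close(pipefd[0]);
	close(pipefd[1]);
	return shared;
}
```

The same holds for descriptors inherited across fork() or passed over a UNIX socket, so a flag stored in f_flags is potentially visible to, and modifiable on behalf of, every holder of that file.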

* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-25 15:51 UTC
  To: John Ericson
  Cc: Al Viro, Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <a75a9b82-a15b-4893-8f92-62b62664ea83@app.fastmail.com>

On Wed, Jun 24, 2026 at 8:41 PM John Ericson <mail@johnericson.me> wrote:
>
> Ah, I started replying to your first email, but this is better, this
> gets to the heart of the matter. Please don't mind me responding to your
> two questions in reverse.
>
> On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> > What's the fundamental difference between CWD and any open descriptor
> > for a directory?  Why does it make sense to ban the former, but allow
> > the equivalents done via the latter?
>
> Yes! These two notions are very close --- but that's the *problem*, not
> a reason to not care about the existence of the CWD and root FS. I want
> to get rid of CWD in my processes not because it is fundamentally
> different (it isn't), but because it is superfluous.
>
> If one is capability-minded like me, it's a bad mistake that we ever had
> this "working directory" notion to begin with, and yet another example
> of the folks at Bell Labs sticking something in the kernel that was
> really only needed by the shell, and that could have just been done in
> userland.
>
> The current working directory, roughly, is *just* some global state
> holding a directory file descriptor. But I don't want that global state.
> If I am writing my userland program (that is not a shell), I would not
> create the global variable. I do not appreciate the fact that the kernel
> foists that state upon me whether I like it or not.
>
> Now obviously we cannot have a giant breaking change removing the notion
> of a current working directory altogether. But we can allow individual
> processes which don't want it to opt out, and that is what nulling out
> these fields (and updating the path resolution code to cope with that)
> allows.
>
> There is no loss of expressive power doing this, because one can (and
> should!) just use the `*at` and file descriptors. But there is, however,
> the imposition of discipline. The programmer (or coding agent) is
> encouraged to do everything with file descriptors rather than path
> concatenations etc., because they need to use `*at` anyways, and then
> voilà, without browbeating anyone in security seminars or code review, a
> bunch of TOCTOU issues disappear simply because doing the right thing is
> now the path of least resistance.
>
> > Please, start with explaining what, in your opinion, a mount namespace
> > _is_, and where does "mount X is attached at path P relative to mount
> > Y" belong.
>
> Let's take a pathological example:
>
> - Process A has `/foo` bind-mounted at `/bar/foo`
>
> - Process B has `/bar` without that bind mount, and `/foo` mounted at
>   `/baz/foo`, as is possible because it is in a different mount
>   namespace.
>
> If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
> does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
> because `..` is resolved according to the mount referenced in the open
> file. (This is, by the way, very good! Directory file descriptors would
> be perilous to use if this were not the case!)
>
> The moral of the story is that "mount X is attached at path P relative
> to mount Y" is information accessed in the mounts themselves (maybe via
> their containing mount namespace, per the `mnt_ns` field, or maybe not,
> I am not sure, but it is immaterial). In contrast, the mount namespace
> of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
> doesn't matter at all for this purpose.

It's sort of a combination -- read the data structures :)  Other than
the propagation part, they're really not that bad.

In any event, I think this discussion is sort of immaterial to the
proposed API change.  No one is about to remove the concept of a mount
namespace.  But maybe it makes sense to have a way to have a task that
doesn't actually belong to a mount namespace.  A mount namespace is
certainly going to exist.

There will definitely be subtleties.  For example, what happens if a
task with "no mount namespace" tries to do OPEN_TREE_CLONE?  In some
logical sense it ought to work but it ought to be impossible to
actually mount the resulting tree anywhere, but this risks running
afoul of all kinds of checks.  Maybe you get a whole new mount
namespace (that does not become your current mnt_ns) if you
OPEN_TREE_CLONE?

This stuff is complex and it probably makes more sense to keep changes simple.

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25 15:45 UTC
  To: ljs
  Cc: akpm, alex.aring, allen.lkml, arnd, brauner, chuck.lever, corbet,
	david, ebiederm, j.granados, jack, jackzxcui1989, jlayton,
	juri.lelli, keescook, liam, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, mcgrof, mingo, mjguzik, peterz, pfalcato,
	vincent.guittot, viro
In-Reply-To: <aj0cUrwdXYKIicC-@lucifer>

On Thu, 25 Jun 2026 13:48:10 +0100 Lorenzo Stoakes <ljs@kernel.org> wrote:

> +cc missing maintainers, lists.
> 
> NAK.
> 
> This is un-upstreamable for numerous reasons.
> 
> The stuff you're doing in mm is broken, wrong and invasive and you've not
> even bothered to cc mm people. I'm annoyed by this.
> 
> You're also making incredibly silly mistakes at v4 of something that should have
> been an RFC.
> 
> You don't seem to understand the concept of patch _series_ (break it up into
> smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
> you're radically altering.
> 
> I'm annoyed as you have a history where you were told not to add insane hacks
> before ([0], my reply at [1]).
> 
> [0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
> [1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/
> 
> Was I wasting my time there? Am I wasting my time responding now?
> 
> And how hard is it to run a simple perl script?
> 
> Let me run it for you for _just_ the maintainers:

I probably shouldn't reply to this email and waste more of your time, but I
can't help responding, because your comments have been very beneficial to
me, and I enjoy the process.

The v4 version changed too much compared to the v3 version. I should
have re-run the get_maintainer script, but I mistakenly copied the
previous recipient list and sent it out. I sincerely apologize for that.

There are quite a few issues now, and I haven't come up with a good
overall solution. I actually want to resolve the problems we encountered
in our project with minimal kernel modifications, but I can't think of a
good way to do it. It seems that the v4 version has turned out to be a
complete disaster of a patch, and I sincerely hope that my example won't
be used as a counterexample in the future. Thank you for that.

Suddenly, I have some thoughts about this issue, but I even question
whether I should have these ideas. Let me sit down and sort things out
properly. I hope the v5 version won't be a disaster.

Thanks
Xin Zhao

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Lorenzo Stoakes @ 2026-06-25 12:48 UTC
  To: Xin Zhao
  Cc: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
	allen.lkml, linux-fsdevel, linux-kernel, linux-arch,
	Jonathan Corbet, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Liam R. Howlett,
	linux-doc, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

+cc missing maintainers, lists.

NAK.

This is un-upstreamable for numerous reasons.

The stuff you're doing in mm is broken, wrong and invasive and you've not
even bothered to cc mm people. I'm annoyed by this.

You're also making incredibly silly mistakes at v4 of something that should have
been an RFC.

You don't seem to understand the concept of patch _series_ (break it up into
smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
you're radically altering.

I'm annoyed as you have a history where you were told not to add insane hacks
before ([0], my reply at [1]).

[0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
[1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/

Was I wasting my time there? Am I wasting my time responding now?

And how hard is it to run a simple perl script?

Let me run it for you for _just_ the maintainers:

$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor your_patch.patch
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Arnd Bergmann <arnd@arndb.de> (maintainer:GENERIC INCLUDE/ASM HEADER FILES)
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
Juri Lelli <juri.lelli@redhat.com> (maintainer:SCHEDULER)
Vincent Guittot <vincent.guittot@linaro.org> (maintainer:SCHEDULER)
Kees Cook <kees@kernel.org> (maintainer:EXEC & BINFMT API, ELF)
"Liam R. Howlett" <liam@infradead.org> (maintainer:MEMORY MAPPING)
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-fsdevel@vger.kernel.org (open list:PROC FILESYSTEM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-arch@vger.kernel.org (open list:GENERIC INCLUDE/ASM HEADER FILES)
EXEC & BINFMT API, ELF status: Supported

You're missing the majority of these. That's _not OK_.

On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
>
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can

This is a horrible idea.

> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.

What, people set this ahead of time? For a dynamic thing like files?
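
For context on why "files" are a moving target here: on Linux, a flock() lock belongs to the open file description, not to a descriptor number or a process. The following is a minimal userspace sketch of that behaviour; the function name and the path passed to it are hypothetical. It shows that the lock survives closing one of two dup()ed descriptors and is only dropped when the last reference goes away:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical demo. Returns 0 if the lock behaves as described in the
 * comments below, -1 otherwise. The file at 'path' is created and removed.
 */
int flock_follows_open_file_description(const char *path)
{
	int a = -1, a_dup = -1, b = -1, rc = -1;

	a = open(path, O_CREAT | O_RDWR, 0600);
	if (a < 0)
		return -1;
	b = open(path, O_RDWR);		/* a second, independent open */
	a_dup = dup(a);			/* shares a's open file description */
	if (b < 0 || a_dup < 0)
		goto out;

	if (flock(a, LOCK_EX) < 0)
		goto out;

	/* The independent open cannot take the lock while 'a' holds it. */
	if (flock(b, LOCK_EX | LOCK_NB) == 0 || errno != EWOULDBLOCK)
		goto out;

	/* Closing 'a' does not release it: a_dup still references the file. */
	close(a);
	a = -1;
	if (flock(b, LOCK_EX | LOCK_NB) == 0 || errno != EWOULDBLOCK)
		goto out;

	/* Only dropping the last reference releases the lock. */
	close(a_dup);
	a_dup = -1;
	if (flock(b, LOCK_EX | LOCK_NB) < 0)
		goto out;

	rc = 0;
out:
	if (a >= 0)
		close(a);
	if (a_dup >= 0)
		close(a_dup);
	if (b >= 0)
		close(b);
	unlink(path);
	return rc;
}
```

Replace a_dup with a descriptor held by a forked child or another process and the same applies, which is why closing descriptors in the dumping task alone does not guarantee that the lock goes away.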

>
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.

This sentence doesn't even make sense?

And also !VM_SHARED means !vma->vm_file so your code would NULL deref if you
didn't check that. But !VM_SHARED VMAs can absolutely be file-backed...

>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
>
> Change in v4:
> - Christian pointed out that the coredump process will traverse file
>   descriptors (fd), so certain fds should not be closed by default.
>   Rework the whole feature, add /proc/<pid>/coredump_pre_exit for user
>   pre-exit resources selection, default is NOT pre-exit anything.
> - Mateusz suggested that walking the fd table and release the file-lock is
>   reasonable. No longer release all the fd(s). Based on user config, only
>   the flock fd(s) and the fd(s) correspondent to file-backed shared memory
>   will be released at most.
>
> Change in v3:
> - Add comment and commit-log to explain why do the MMF_DUMP_MAPPED_SHARED
>   mm_flags_test() check, note that memory mapped files keep their own
>   separate references to the files. The case to work around is that early
>   unlocking a flock on a file allows other processes to lock and modify
>   the mapped data protected by the flock,
>   as suggested by Pedro Falcato.
> - Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/
>
> Change in v2:
> - Get rid of the implement of adding new fcntl API, the issue does not
>   worth inflicting the cost on everyone,
>   as suggested by Al Viro.
> - Call exit_files() in coredump_wait(),
>   as suggested by Eric W. Biederman.
>   Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
>   need to dump file-backed shared memory.
> - Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/
>
> v1:
> - Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 ++
>  Documentation/filesystems/proc.rst            | 58 +++++++++-----
>  fs/coredump.c                                 | 23 ++++++
>  fs/file.c                                     | 46 +++++++++++
>  fs/proc/base.c                                | 78 +++++++++++++++++++
>  include/linux/mm.h                            |  1 +

No.

>  include/linux/mm_types.h                      |  9 +++

No.

>  include/linux/sched/task.h                    |  1 +
>  include/uapi/asm-generic/fcntl.h              |  4 +
>  kernel/fork.c                                 | 12 +++
>  mm/mmap.c                                     | 21 +++++

No.

>  11 files changed, 238 insertions(+), 20 deletions(-)

This is a completely insane diffstat for a single patch. Ridiculous.

AND YOU HAVEN'T ADDED A SINGLE TEST.

>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d4508..bc6d3859f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.
> +
>  	coresight_cpu_debug.enable
>  			[ARM,ARM64]
>  			Format: <bool>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167bef..6a637d31d 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -39,16 +39,17 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
>    3.2	/proc/<pid>/oom_score - Display current oom-killer score
>    3.3	/proc/<pid>/io - Display the IO accounting fields
>    3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> -  3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> -  3.7   /proc/<pid>/task/<tid>/children - Information about task children
> -  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
> -  3.9   /proc/<pid>/map_files - Information about memory mapped files
> -  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> -  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> -  3.12	/proc/<pid>/arch_status - Task architecture specific information
> -  3.13  /proc/<pid>/fd - List of symlinks to open files
> -  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status.
> +  3.5  /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
> +  3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +  3.8   /proc/<pid>/task/<tid>/children - Information about task children
> +  3.9   /proc/<pid>/fdinfo/<fd> - Information about opened file
> +  3.10   /proc/<pid>/map_files - Information about memory mapped files
> +  3.11  /proc/<pid>/timerslack_ns - Task timerslack value
> +  3.12	/proc/<pid>/patch_state - Livepatch patch operation state
> +  3.13	/proc/<pid>/arch_status - Task architecture specific information
> +  3.14  /proc/<pid>/fd - List of symlinks to open files
> +  3.15  /proc/<pid>/ksm_stat - Information about the process's ksm status.
>
>    4	Configuring procfs
>    4.1	Mount options
> @@ -1961,7 +1962,24 @@ For example::
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>
> -3.5	/proc/<pid>/mountinfo - Information about mounts
> +3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +---------------------------------------------------------------
> +A coredump typically takes some time to complete. If we happen to hold a write
> +lock with flock just before triggering the coredump, that write lock will not
> +be released during the entire coredump process. As a result, other processes
> +attempting to acquire the same write lock may experience significant delays.
> +Another typical scenario is that shared memory, such as dma-buf, remains
> +occupied and is not released for a long time due to core dumps.
> +
> +/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
> +dumping core.
> +
> +The following two types are supported:
> +
> +  - (bit 0) flock files
> +  - (bit 1) file-backed shared memory
> +
> +3.6	/proc/<pid>/mountinfo - Information about mounts
>  --------------------------------------------------------
>
>  This file contains lines of the form::
> @@ -2001,7 +2019,7 @@ For more information on mount propagation see:
>    Documentation/filesystems/sharedsubtree.rst
>
>
> -3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
>  --------------------------------------------------------
>  These files provide a method to access a task's comm value. It also allows for
>  a task to set its own or one of its thread siblings comm value. The comm value
> @@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
>  terminator) will result in a truncated comm value.
>
>
> -3.7	/proc/<pid>/task/<tid>/children - Information about task children
> +3.8	/proc/<pid>/task/<tid>/children - Information about task children
>  -------------------------------------------------------------------------
>  This file provides a fast way to retrieve first level children pids
>  of a task pointed by <pid>/<tid> pair. The format is a space separated
> @@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
>  if precise results are needed.
>
>
> -3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
> +3.9	/proc/<pid>/fdinfo/<fd> - Information about opened file
>  ---------------------------------------------------------------
>  This file provides information associated with an opened file. The regular
>  files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
> @@ -2198,7 +2216,7 @@ VFIO Device files
>  where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
>  file.
>
> -3.9	/proc/<pid>/map_files - Information about memory mapped files
> +3.10	/proc/<pid>/map_files - Information about memory mapped files
>  ---------------------------------------------------------------------
>  This directory contains symbolic links which represent memory mapped files
>  the process is maintaining.  Example output::
> @@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
>  comparing their inode numbers to figure out which anonymous memory areas
>  are actually shared.
>
> -3.10	/proc/<pid>/timerslack_ns - Task timerslack value
> +3.11	/proc/<pid>/timerslack_ns - Task timerslack value
>  ---------------------------------------------------------
>  This file provides the value of the task's timerslack value in nanoseconds.
>  This value specifies an amount of time that normal timers may be deferred
> @@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
>  An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
>  permissions on the task specified to change its timerslack_ns value.
>
> -3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> +3.12	/proc/<pid>/patch_state - Livepatch patch operation state
>  -----------------------------------------------------------------
>  When CONFIG_LIVEPATCH is enabled, this file displays the value of the
>  patch state for the task.
> @@ -2253,7 +2271,7 @@ patched.  If the patch is being enabled, then the task has already been
>  patched.  If the patch is being disabled, then the task hasn't been
>  unpatched yet.
>
> -3.12 /proc/<pid>/arch_status - task architecture specific status
> +3.13 /proc/<pid>/arch_status - task architecture specific status
>  -------------------------------------------------------------------
>  When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
>  architecture specific status of the task.
> @@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>
> -3.13 /proc/<pid>/fd - List of symlinks to open files
> +3.14 /proc/<pid>/fd - List of symlinks to open files
>  -------------------------------------------------------
>  This directory contains symbolic links which represent open files
>  the process is maintaining.  Example output::
> @@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
>  of stat() output for /proc/<pid>/fd for fast access.
>  -------------------------------------------------------
>
> -3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
> +3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
>  ----------------------------------------------------------------------
>  When CONFIG_KSM is enabled, each process has this file which displays
>  the information of ksm merging status.
> diff --git a/fs/coredump.c b/fs/coredump.c
> index bb6fdb1f4..e08a8a6c4 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
>  	return nr;
>  }
>
> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);

What the hell are you doing?

This is not where we unmap VMAs?

This is likely broken in subtle ways.

> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b16..a58ffffcc 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>
>  #include "internal.h"
>
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}

This code hurts my eyes.

> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c..99b5f219f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +

No comment, obviously.

> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +

Yeah who needs a comment...

> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {

What?

> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9..dfd4717c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);

You don't use extern.

>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6..0555aaf50 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12

Err do we have space for this?

You really want to add 2 more bits to mm_struct flags for this insanity?

> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)

So are these dumpable bits or not? Why are you not just incrementing
MMF_DUMPABLE_BITS?

> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cf..b4becbf6c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285..360604d65 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif
> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448..84f1ee7f3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>
>  __setup("coredump_filter=", coredump_filter_setup);
>
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> +
>  #include <linux/init_task.h>
>
>  static void mm_init_aio(struct mm_struct *mm)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36..b955c47c0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  }
>
> +void exit_mmap_mapped_shared(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +	VMA_ITERATOR(vmi, mm, 0);
> +
> +	mmap_write_lock(mm);
> +	lru_add_drain();

Why?

> +
> +	for_each_vma(vmi, vma) {

Literally every single VMA? Including the gate VMA too?

No VMA locks... so that's already broken.

> +		if (vma->vm_flags & VM_HUGETLB)
> +			continue;

That's not how you test for hugetlb.

> +		if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)

This isn't how we work with flags any more.

> +			continue;
> +		vma->vm_file->f_flags |= O_TMPCLOS;


Not sure directly manipulating file flags like this is valid in any way, shape,
or form.

> +		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);

This is utterly broken, the outer loop will be invalidated by you removing
these, do_munmap() has its own iterator...

And this is just madly inefficient. Why wouldn't you just loop over the VMAs to
alter flags then unmap the whole range?

But this is also introducing a completely separate, duplicative, version of
exit_mmap().

You're not doing any of what that function does. You're just very inefficiently
unmapping everything?

> +		cond_resched();

Of course!

> +	}
> +
> +	mmap_write_unlock(mm);

And VMAs can be mapped again now?

> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> --
> 2.34.1
>

I'm not sure if this idea can be made upstreamable in any way. But this patch or
anything that looks like it or fundamentally alters mm is just not acceptable,
sorry.

Lorenzo

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: David Hildenbrand (Arm) @ 2026-06-25 11:43 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	ebiederm, viro, jack, jlayton, chuck.lever, alex.aring, arnd,
	keescook, mcgrof, j.granados, allen.lkml, linux-fsdevel,
	linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <aj0Mr3e9yt0kU-Qj@pedro-suse>

On 6/25/26 13:18, Pedro Falcato wrote:
> On Thu, Jun 25, 2026 at 12:57:02PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> This makes no sense. I think you really need to sit down and think about
>>> a design for this that doesn't introduce state machinery for boot, mm,
>>> and the VFS in one shot to solve a fringe problem...
>>
>> Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
>> munmap and set some magical flags").
>>
>> We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
>> in the coredump". And for some reason someone should configure that, that's a
>> rather weird toggle tbh.
>>
>> And the granularity ("file-backed shared memory") is completely odd.
>>
>>
>> Aren't there other ways we could optimize this internally?
>>
>> Like, if we know that a process is dead and cannot run anymore, downgrade writes
>> to reads (and make sure we block GUP write attempts accordingly), or would that
>> also not be sufficient?
>>
>>
>> Another thought:
>>
>> fs/coredump.c calls get_dump_page().
>>
>> get_dump_page() will not fault in any memory. So if a page is not in the page
>> tables at the time of the dump, it will not get included in the coredump. Which
>> means, that whether most non-anonymous memory will be included in a coredump is
>> already like playing the lottery.
>>
>> This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
>> private modifications.
>>
>> Which makes me wonder: How much is tooling relying on file-backed pages to end
>> up in a coredump?
> 
> FWIW this mechanism already exists, see /proc/self/coredump_filter. The
> default is bits 0, 1, 4 and 5 (see core(5)), which maps back to no file pages
> being dumped to a core dump, apart from ELF headers (these help the debugger
> trace back the mapped binary to the debug info using the buildid).
> 
> So the answer to this question is "approximately none" :)
> 

Ah, thanks! vma_dump_size() honors this, and I am sure through some magical
routing the information stored in m->dump_size will end up not dumping these pages.

Staring at elf_core_dump(), this "unmap some stuff" part is really, really
nasty, as it effectively removes the VMAs->segments from the dump. (unless I am
missing something important)

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Pedro Falcato @ 2026-06-25 11:18 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	ebiederm, viro, jack, jlayton, chuck.lever, alex.aring, arnd,
	keescook, mcgrof, j.granados, allen.lkml, linux-fsdevel,
	linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <9105c433-44a7-4e8f-bacb-def93d11a7f2@kernel.org>

On Thu, Jun 25, 2026 at 12:57:02PM +0200, David Hildenbrand (Arm) wrote:
> >> +
> >>  #define F_DUPFD		0	/* dup */
> >>  #define F_GETFD		1	/* get close_on_exec */
> >>  #define F_SETFD		2	/* set/clear close_on_exec */
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index a679b2448234..84f1ee7f32cf 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
> >>  
> >>  __setup("coredump_filter=", coredump_filter_setup);
> >>  
> >> +static unsigned long default_dump_pre_exit;
> >> +
> >> +static int __init coredump_pre_exit_setup(char *s)
> >> +{
> >> +	default_dump_pre_exit =
> >> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> >> +		MMF_DUMP_PRE_EXIT_MASK;
> >> +	return 1;
> >> +}
> >> +
> >> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> > 
> > This makes no sense. I think you really need to sit down and think about
> > a design for this that doesn't introduce state machinery for boot, mm,
> > and the VFS in one shot to solve a fringe problem...
> 
> Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
> munmap and set some magical flags").
> 
> We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
> in the coredump". And for some reason someone should configure that, that's a
> rather weird toggle tbh.
> 
> And the granularity ("file-backed shared memory") is completely odd.
> 
> 
> Aren't there other ways we could optimize this internally?
> 
> Like, if we know that a process is dead and cannot run anymore, downgrade writes
> to reads (and make sure we block GUP write attempts accordingly), or would that
> also not be sufficient?
> 
> 
> Another thought:
> 
> fs/coredump.c calls get_dump_page().
> 
> get_dump_page() will not fault in any memory. So if a page is not in the page
> tables at the time of the dump, it will not get included in the coredump. Which
> means, that whether most non-anonymous memory will be included in a coredump is
> already like playing the lottery.
> 
> This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
> private modifications.
> 
> Which makes me wonder: How much is tooling relying on file-backed pages to end
> up in a coredump?

FWIW this mechanism already exists, see /proc/self/coredump_filter. The
default is bits 0, 1, 4 and 5 (see core(5)), which maps back to no file pages
being dumped to a core dump, apart from ELF headers (these help the debugger
trace back the mapped binary to the debug info using the buildid).

So the answer to this question is "approximately none" :)
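
To make the bit positions concrete, here is a small userspace sketch that decodes a coredump_filter value. The bit meanings follow core(5); the enum and function names are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions as documented in core(5); names are illustrative only. */
enum {
	FILTER_ANON_PRIVATE	= 1u << 0,
	FILTER_ANON_SHARED	= 1u << 1,
	FILTER_FILE_PRIVATE	= 1u << 2,
	FILTER_FILE_SHARED	= 1u << 3,
	FILTER_ELF_HEADERS	= 1u << 4,
	FILTER_HUGETLB_PRIVATE	= 1u << 5,
};

/* The default described above: bits 0, 1, 4 and 5. */
#define FILTER_DEFAULT	(FILTER_ANON_PRIVATE | FILTER_ANON_SHARED | \
			 FILTER_ELF_HEADERS | FILTER_HUGETLB_PRIVATE)

static_assert(FILTER_DEFAULT == 0x33, "bits 0, 1, 4 and 5 give 0x33");

/* True if the mask asks for the contents of any file-backed mapping. */
bool filter_dumps_file_pages(unsigned int mask)
{
	return (mask & (FILTER_FILE_PRIVATE | FILTER_FILE_SHARED)) != 0;
}
```

filter_dumps_file_pages(FILTER_DEFAULT) is false, so with the default filter only the ELF header pages of file mappings are requested.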

-- 
Pedro

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: David Hildenbrand (Arm) @ 2026-06-25 10:57 UTC (permalink / raw)
  To: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	pfalcato, ebiederm, viro, jack, jlayton, chuck.lever, alex.aring,
	arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

>> +
>>  #define F_DUPFD		0	/* dup */
>>  #define F_GETFD		1	/* get close_on_exec */
>>  #define F_SETFD		2	/* set/clear close_on_exec */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index a679b2448234..84f1ee7f32cf 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>>  
>>  __setup("coredump_filter=", coredump_filter_setup);
>>  
>> +static unsigned long default_dump_pre_exit;
>> +
>> +static int __init coredump_pre_exit_setup(char *s)
>> +{
>> +	default_dump_pre_exit =
>> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
>> +		MMF_DUMP_PRE_EXIT_MASK;
>> +	return 1;
>> +}
>> +
>> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
munmap and set some magical flags").

We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
in the coredump". And for some reason someone should configure that, that's a
rather weird toggle tbh.

And the granularity ("file-backed shared memory") is completely odd.

Aren't there other ways we could optimize this internally?

Like, if we know that a process is dead and cannot run anymore, downgrade writes
to reads (and make sure we block GUP write attempts accordingly), or would that
also not be sufficient?

Another thought:

fs/coredump.c calls get_dump_page().

get_dump_page() will not fault in any memory. So if a page is not in the page
tables at the time of the dump, it will not get included in the coredump. Which
means, that whether most non-anonymous memory will be included in a coredump is
already like playing the lottery.

This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
private modifications.

Which makes me wonder: How much is tooling relying on file-backed pages to end
up in a coredump?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  8:50 UTC (permalink / raw)
  To: brauner
  Cc: alex.aring, allen.lkml, arnd, chuck.lever, david, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, linux-mm, ljs, mcgrof, mjguzik,
	pfalcato, rppt, viro
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

On Thu, 25 Jun 2026 09:28:08 +0200 Christian Brauner <brauner@kernel.org> wrote:

> > +	coredump_pre_exit=
> > +			[KNL] Change the default value for
> > +			/proc/<pid>/coredump_pre_exit.
> > +			See also Documentation/filesystems/proc.rst.
> 
> Nah, we're not doing a separate file for this. That makes no sense
> whatsoever. I've already explained this in the first mail. There are
> effectively three modes:
> 
> (1) dump to a file
> (2) spawn a super-privileged usermode helper process and connect the
>     coredumping process to said helper via a pipe
> (3) coredumping process connects to AF_UNIX socket
> 
> Parameterize (1) and (2) via command line arguments. I strongly
> suspect you're using some AI tooling so it should be able to figure out
> how this was done in the past.
> 
> (3) can be extended by just introducing a new flag value for struct
>     coredump_req. That is also illustrated by previous work.
> 
> We're not spreading procfs files. It's terrible api design especially
> for security sensitive changes.

The coredump socket approach is easier to implement because it allows for
interaction between the server and client, enabling the customization of
protocols. However, for the coredump file method, I can only think of
defining "r" and "R" through core_pattern to release flock and file-backed
shared data in advance. I'm unsure if this is feasible, as it changes the
original definition of core_pattern.

Regarding the coredump pipe, there is also a lack of a mechanism for the
pipe program to notify the coredump process, so it might still require
adding "r" and "R" at the end of core_pattern to indicate this, allowing
the coredump process to handle the early release on its own. I'm not sure
if my understanding is correct.

Even if the coredump pipe program obtains the file pointer from the process
that generated the coredump, it cannot reduce the reference count of the
file (which I understand is a very bad attempt). Since it cannot decrease
the reference count of the file, the early release must still be performed
by the task that generated the coredump. Given this situation, it seems
that we indeed need to use core_pattern for marking. I've thought for a
long time about more suitable solutions, but I haven't come up with any.

> > +#ifndef O_TMPCLOS
> > +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> > +#endif
> 
> Sorry, not going to happen. This doesn't justify the addition of a
> new uapi value at all.

OK, if I end up using it, I will not put it in a userspace-visible header.

> > +
> > +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

I'll get rid of the attempt to add a new boot-up argument for this feature.

> [Severity: High]
> Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
> iteration invalidate the outer iterator? The loop traverses the maple tree
> using the iterator vmi. However, do_munmap() creates its own internal
> VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
> iterator is not updated to reflect these structural changes, its cached
> state becomes stale, which can lead to a use-after-free when vma_next()
> is subsequently called.
> 
> via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

When executing this traversal logic, we have already acquired a lock, and
the process has been frozen. The traversal logic goes from start to finish.
Are you sure that this approach could still have issues?

> [Severity: High]
> Is it safe to iterate the file descriptor table without holding
> rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
> kills other threads, concurrent threads can still trigger expand_files(),
> which replaces the fdt and frees the old one after an RCU grace period.

Since the process has already been frozen, surely we don't need to consider
such concurrency issues?

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
> 
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

Currently, this flag will only be used by the logic we added, so I believe
there won't be any issues.

Thanks
Xin Zhao

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Christian Brauner @ 2026-06-25  7:28 UTC (permalink / raw)
  To: David Hildenbrand, Mike Rapoport, Lorenzo Stoakes, brauner,
	mjguzik, pfalcato, ebiederm, viro, jack, jlayton, chuck.lever,
	alex.aring, arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
> 
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can
> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.
> 
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d450861e..bc6d3859f874 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>  
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.

Nah, we're not doing a separate file for this. That makes no sense
whatsoever. I've already explained this in the first mail. There are
effectively three modes:

(1) dump to a file
(2) spawn a super-privileged usermode helper process and connect the
    coredumping process to said helper via a pipe
(3) coredumping process connects to AF_UNIX socket

Parameterize (1) and (2) via command line arguments. I strongly
suspect you're using some AI tooling so it should be able to figure out
how this was done in the past.

(3) can be extended by just introducing a new flag value for struct
    coredump_req. That is also illustrated by previous work.

We're not spreading procfs files. It's terrible api design especially
for security sensitive changes.
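
As a rough illustration of how the three modes are told apart from the core_pattern string, here is a userspace sketch. The leading '|' for the pipe helper is documented in core(5); treating a leading '@' as the AF_UNIX socket mode is an assumption about recent kernels, and the enum and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names, not kernel identifiers. */
enum core_mode {
	CORE_MODE_FILE,		/* (1) dump to a file */
	CORE_MODE_PIPE,		/* (2) usermode helper fed via a pipe */
	CORE_MODE_SOCKET,	/* (3) connect to an AF_UNIX socket */
};

/* Classify a core_pattern value by its first character. */
enum core_mode classify_core_pattern(const char *pattern)
{
	assert(pattern != NULL);

	if (pattern[0] == '|')
		return CORE_MODE_PIPE;
	if (pattern[0] == '@')	/* assumed socket prefix */
		return CORE_MODE_SOCKET;
	/* Anything else is treated as a file name template. */
	return CORE_MODE_FILE;
}
```

Any per-mode parameter would then live in the same string, after the prefix and alongside the existing % specifiers, rather than in a separate procfs file.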

> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);
> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>  
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b162d0..a58ffffcc31d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>  
>  #include "internal.h"
>  
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>  
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}
> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c894..99b5f219f7fa 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +
> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +
> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>  
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9dbd..dfd4717c7e3e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);
>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6a30..0555aaf50001 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>  
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cffc9..b4becbf6c0eb 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>  
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..360604d653b4 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>  
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif

Sorry, not going to happen. This doesn't justify the addition of a
new uapi value at all.

I'm also including various Sashiko comments:

sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Is it safe to expose an internal, temporary kernel flag in a UAPI header?
Userspace applications could intentionally or accidentally pass O_TMPCLOS to
open(), which might permanently pollute the userspace ABI and trigger
unexpected behavior during a coredump.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448234..84f1ee7f32cf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>  
>  __setup("coredump_filter=", coredump_filter_setup);
>  
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);

This makes no sense. I think you really need to sit down and think about
a design for this that doesn't introduce state machinery for boot, mm,
and the VFS in one shot to solve a fringe problem...

sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
iteration invalidate the outer iterator? The loop traverses the maple tree
using the iterator vmi. However, do_munmap() creates its own internal
VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
iterator is not updated to reflect these structural changes, its cached
state becomes stale, which can lead to a use-after-free when vma_next()
is subsequently called.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Is it safe to iterate the file descriptor table without holding
rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
kills other threads, concurrent threads can still trigger expand_files(),
which replaces the fdt and frees the old one after an RCU grace period.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
of file->f_flags risks losing concurrent fcntl() updates since it doesn't
hold file->f_lock.

Also, if a file has duplicated file descriptors (e.g., via dup()), will
clearing O_TMPCLOS here prematurely skip the closure of the remaining
descriptors? When encountering the duplicated descriptor later, the flag
will already be cleared, leaving the shared file actively referenced.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com
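
The premise behind the dup() concern can be checked from userspace: file status flags belong to the open file description, so every descriptor created by dup() sees the same flags. A minimal sketch, using O_NONBLOCK as a stand-in for the proposed flag; dup_shares_status_flags() is a name made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Returns true if a status flag set through one descriptor is visible
 * through its dup()ed sibling, i.e. both share one open file description.
 */
bool dup_shares_status_flags(void)
{
	int p[2], copy, flags;
	bool shared = false;

	if (pipe(p) < 0)
		return false;
	copy = dup(p[0]);
	assert(copy >= 0);

	flags = fcntl(p[0], F_GETFL);
	assert(flags >= 0 && !(flags & O_NONBLOCK));

	/* Set the flag through the original descriptor only... */
	if (fcntl(p[0], F_SETFL, flags | O_NONBLOCK) == 0) {
		/* ...and read it back through the duplicate. */
		shared = (fcntl(copy, F_GETFL) & O_NONBLOCK) != 0;
	}

	close(copy);
	close(p[0]);
	close(p[1]);
	return shared;
}
```

Because the flag is shared, clearing it when the first descriptor is visited leaves nothing to test when the second descriptor for the same file comes up.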

-- 
Christian Brauner <brauner@kernel.org>

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-25  3:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <20260625011023.GM2636677@ZenIV>

Ah, I started replying to your first email, but this is better, this
gets to the heart of the matter. Please don't mind me responding to your
two questions in reverse.

On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> What's the fundamental difference between CWD and any open descriptor
> for a directory?  Why does it make sense to ban the former, but allow
> the equivalents done via the latter?

Yes! These two notions are very close --- but that's the *problem*, not
a reason to not care about the existence of the CWD and root FS. I want
to get rid of CWD in my processes not because it is fundamentally
different (it isn't), but because it is superfluous.

If one is capability-minded like me, it's a bad mistake that we ever had
this "working directory" notion to begin with, and yet another example
of the folks at Bell Labs sticking something in the kernel that was
really only needed by the shell, and that could have just been done in
userland.

The current working directory, roughly, is *just* some global state
holding a directory file descriptor. But I don't want that global state.
If I am writing my userland program (that is not a shell), I would not
create the global variable. I do not appreciate the fact that the kernel
foists that state upon me whether I like it or not.

Now obviously we cannot have a giant breaking change removing the notion
of a current working directory altogether. But we can allow individual
processes which don't want it to opt out, and that is what nulling out
these fields (and updating the path resolution code to cope with that)
allows.

There is no loss of expressive power doing this, because one can (and
should!) just use the `*at` syscalls and file descriptors. There is,
however, the imposition of discipline. The programmer (or coding agent) is
encouraged to do everything with file descriptors rather than path
concatenations etc., because they need to use `*at` anyways, and then
voilà, without browbeating anyone in security seminars or code review, a
bunch of TOCTOU issues disappear simply because doing the right thing is
now the path of least resistance.
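
A minimal sketch of that discipline, assuming a POSIX system that provides
`openat`. The helper names (`open_dir`, `open_in_dir`, `demo_rename_safe`)
and the temporary path are illustrative only and are not part of the
proposal. It shows a file being found through a directory descriptor even
after the directory's path has been renamed, a case where a concatenated
path string would have gone stale:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Open a directory once; later lookups are anchored to this descriptor. */
int open_dir(const char *path)
{
	return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Open `name` relative to dirfd: no CWD and no path concatenation. */
int open_in_dir(int dirfd, const char *name)
{
	/* An absolute name would make openat() ignore dirfd entirely. */
	assert(name != NULL && name[0] != '/');
	return openat(dirfd, name, O_RDONLY | O_NOFOLLOW | O_CLOEXEC);
}

/* Returns 0 if the file stays reachable via dirfd after a rename. */
int demo_rename_safe(void)
{
	char dir[] = "/tmp/at_demoXXXXXX";	/* illustrative location */
	char moved[64];
	char buf[2] = { 0 };
	int dfd, fd, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(moved, sizeof(moved), "%s.moved", dir);

	dfd = open_dir(dir);
	if (dfd < 0) {
		rmdir(dir);
		return -1;
	}

	fd = openat(dfd, "data", O_CREAT | O_WRONLY | O_CLOEXEC, 0600);
	if (fd < 0 || write(fd, "ok", 2) != 2)
		goto out;
	close(fd);
	fd = -1;

	/* The path changes underneath us; the descriptor does not. */
	if (rename(dir, moved) != 0)
		goto out;

	fd = open_in_dir(dfd, "data");
	if (fd >= 0 && read(fd, buf, 2) == 2 && buf[0] == 'o' && buf[1] == 'k')
		ret = 0;
out:
	if (fd >= 0)
		close(fd);
	unlinkat(dfd, "data", 0);
	close(dfd);
	rmdir(moved);	/* exactly one of these two names exists */
	rmdir(dir);
	return ret;
}
```

Calling `demo_rename_safe()` from a small `main` is enough to try it. The
sketch only covers the file-descriptor side of the argument; it does not
model a process whose CWD or root has been nulled out.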

> Please, start with explaining what, in your opinion, a mount namespace
> _is_, and where does "mount X is attached at path P relative to mount
> Y" belong.

Let's take a pathological example:

- Process A has `/foo` bind-mounted at `/bar/foo`

- Process B has `/bar` without that bind mount, and `/foo` mounted at
  `/baz/foo`, as is possible because it is in a different mount
  namespace.

If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
because `..` is resolved according to the mount referenced in the open
file. (This is, by the way, very good! Directory file descriptors would
be perilous to use if this were not the case!)

The moral of the story is that "mount X is attached at path P relative
to mount Y" is information accessed in the mounts themselves (maybe via
their containing mount namespace, per the `mnt_ns` field, or maybe not,
I am not sure, but it is immaterial). In contrast, the mount namespace
of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
doesn't matter at all for this purpose.

I am not on a crusade against `struct mnt_namespace` in general; I am
just trying to null out `(struct nsproxy)::mnt_ns` in particular. (This
is just as I am not on a crusade against `struct path`, just `root` and
`pwd` of `struct fs_struct`.)

These days, `current->nsproxy->mnt_ns` is, to me, first and foremost,
there for the legacy mount API. Again, just like our CWD example above,
this is mostly just global state.

The new mount API drastically [^1] reduces the need for it, since it
allows referring to mounts explicitly via file descriptors. That's OK!
The argument is the same as the above --- I am *not* trying to limit
what can be done if one has all the right files open with the right
perms. I am just trying to limit what works out of the box --- to reduce
the default set of privileges, *especially* where the resources involved
are implicit and/or stateful.

[^1]: It doesn't *quite* eliminate the need for `nsproxy->mnt_ns`
    entirely, since (as I understand it, from reading the `move_mount`
    man page) it is still used for some authorization checks, since
    `O_PATH` file descriptors do not grant privileges other than mere
    discoverability. But that's a problem that could be solved later
    with an `O_MOUNT` option analogous to `O_RDONLY` or `O_WRONLY`. In
    the meantime, I am perfectly happy if my processes with null mount
    namespaces get `move_mount` permission errors.

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  2:51 UTC (permalink / raw)
  To: viro
  Cc: alex.aring, allen.lkml, arnd, brauner, chuck.lever, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, mcgrof, mjguzik, pfalcato
In-Reply-To: <20260624162844.GK2636677@ZenIV>

On Wed, 24 Jun 2026 17:28:44 +0100 Al Viro <viro@zeniv.linux.org.uk> wrote:

> > +			if (file->f_flags & O_TMPCLOS) {
> > +				file->f_flags &= ~O_TMPCLOS;
> > +				goto close_fd;
> > +			}
> 
> *blink*
> 
> 	How could that possibly make sense?  Many descriptors
> may refer to the same file; what's more, many descriptor tables
> may contain such descriptors, so... just what is that code
> trying to do?

This is yet another serious mistake. Perhaps my test scenarios were not
complex enough, or I was overly confident in removing the logic that
cleared the O_TMPCLOS flag and performed debug printing only when the
reference count dropped to zero during that single close operation,
without conducting further tests.

In v5, I plan to avoid clearing the O_TMPCLOS flag to handle the situation
where multiple file descriptors map to a single file. Of course, there are
some cases where the lifecycle of this file may extend beyond the process
exit, but AFAICT such situations either cannot last long or do not involve
memory in the case where i_nlink != 0. Therefore, keeping this flag seems
unlikely to cause any issues.

Since this flag is no longer temporary (it will never be cleared),
I would like to rename it to O_PRECLOS.

Thanks
Xin Zhao

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Al Viro @ 2026-06-25  1:10 UTC (permalink / raw)
  To: John Ericson
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <103524f8-1658-41df-88e9-cf49c628a721@app.fastmail.com>

On Wed, Jun 24, 2026 at 07:53:53PM -0400, John Ericson wrote:
> I wanted to discuss a bit about each type of namespace to indicate that
> this is a concept I think works across the board --- it wouldn't be such
> a good solution for the process spawning API if it was only applicable
> to some but not all namespace types. But the truth is that I have
> thought about the FS cases the most, as I think you have picked up on.
> 
> If there is interest in landing
> 
>   1. null CWD
>   2. null root fs
>   3. null mount namespace
> 
> in isolation, and then returning to the other namespaces to iron out
> their details, that would be fantastic. It would be much nicer for me to
> get some momentum that way, without having to design everything all at
> once first before getting to implement anything.

Please, start with explaining what, in your opinion, a mount namespace _is_,
and where does "mount X is attached at path P relative to mount Y" belong.

What's the fundamental difference between CWD and any open descriptor for
a directory?  Why does it make sense to ban the former, but allow the
equivalents done via the latter?

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-24 23:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan, Al Viro, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrU3bgUxp0k1y-U-uL0-fW2016Gmsyu9O_=830czEUGMcQ@mail.gmail.com>

On Wed, Jun 24, 2026, at 7:20 PM, Andy Lutomirski wrote:
> I think I like this, but some comments:

Thanks, that's really nice to hear!

While arguably this is just the culmination of a direction Linux has
been going in for a while, it could also be seen as a very "out there"
idea. That at least one person likes the rough sound of things makes me
feel a lot better!

> On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> > >   - null current working directory: relative paths with traditional,
> > >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> >
> > It's perfectly valid to cd to a directory that does not belong to
> > one's namespace.  We have fchdir.  What's wrong with letting it
> > continue working?
> >
> > Regardless of that, the current directory either needs to be a
> > directory or to be nothing at all, and if we support the latter, we
> > need to figure out what /proc will show.
>
> Thinking about this more: I think that handling CWD might actually be
> a prerequisite for the series and has little to do with namespaces.
> Maybe try adding, as a standalone feature, the ability to have a null
> CWD.  Define semantics and see what the implementation looks like.
>
> Then, if you add null namespaces, you could optionally make
> transitioning to a null namespace set a null CWD.  Or those features
> could be orthogonal.

Hehe, I had the same thought after working on the filesystem patches,
along with the analogous thought for the root filesystem. It had been so
long since I had done a `chroot` without also doing a mount namespace
`unshare` --- despite the former being much older --- that I had
forgotten this separation of concerns.

My apologies for forgetting to include this insight in the original
email.

> Maybe the way to go is to implement the ones that have clearer
> semantics and to defer the others.

I would much prefer this, actually.

I wanted to discuss a bit about each type of namespace to indicate that
this is a concept I think works across the board --- it wouldn't be such
a good solution for the process spawning API if it was only applicable
to some but not all namespace types. But the truth is that I have
thought about the FS cases the most, as I think you have picked up on.

If there is interest in landing

  1. null CWD
  2. null root fs
  3. null mount namespace

in isolation, and then returning to the other namespaces to iron out
their details, that would be fantastic. It would be much nicer for me to
get some momentum that way, without having to design everything all at
once first before getting to implement anything.

John

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-24 23:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Ericson, Li Chen, Cong Wang, Christian Brauner, linux-arch,
	linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrWhXNetw-BsAaoyT31suMmjYLdMh9MAuLB2Lvk2ac-31g@mail.gmail.com>

On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:

> >   - null current working directory: relative paths with traditional,
> >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
>
> It's perfectly valid to cd to a directory that does not belong to
> one's namespace.  We have fchdir.  What's wrong with letting it
> continue working?
>
> Regardless of that, the current directory either needs to be a
> directory or to be nothing at all, and if we support the latter, we
> need to figure out what /proc will show.

Thinking about this more: I think that handling CWD might actually be
a prerequisite for the series and has little to do with namespaces.
Maybe try adding, as a standalone feature, the ability to have a null
CWD.  Define semantics and see what the implementation looks like.

Then, if you add null namespaces, you could optionally make
transitioning to a null namespace set a null CWD.  Or those features
could be orthogonal.

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Al Viro @ 2026-06-24 23:12 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <a49ce818-f38d-41b0-bbf7-80b8aad998b1@app.fastmail.com>

On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:

> #### Null mount namespace
> 
> - requires:
> 
>   - null root file system: absolute paths don't work.
> 
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> 
> - All operations relating to the "ambient" mount tree don't work.
> 
> - `*at` operations with a file descriptor do work.

Huh?  The last bit looks like it contradicts the previous one - if you have
an opened directory in a mount from some namespace, those `*at` operations
with that descriptor *will* be seeing the mount tree of that namespace,
whatever the hell is "ambient" supposed to mean.  Either that, or you
will be exposing whatever's overmounted in that mount, which is a huge
can of worms.

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-24 23:06 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria
In-Reply-To: <a49ce818-f38d-41b0-bbf7-80b8aad998b1@app.fastmail.com>

On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> Hello, I am hoping to discuss an idea I've had for a while, that I am
> calling "null namespaces" that has become more relevant with some recent
> other discussions. First I'll discuss null namespaces in general terms,
> and then I'll link those recent discussions and relate null namespaces
> to them.
>
> ### Null namespaces
>
> The essence of null namespaces is trying to give processes as little
> ambient authority as possible, so they are lighter weight and allowed to
> do even less than fully unshared processes today.
>
> Namespaces as they exist today are frequently described as an isolation
> mechanism, but I think this is the conflation of two different things.
> *Removing* a new process from its parent's namespaces unquestionably is
> increasing isolation --- no disagreement there. But putting the process
> in new namespaces is something else; I would call it supporting
> "delusions of grandeur" of that process. For example, namespaces allow a
> process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
> look up other processes by PID, etc.
>
> Conceptually, to remove a process from one ambient authority scope (the
> very name "namespaces" indicates they are about ambient authority)
> should not require putting it in some ambient authority scope. Just
> because, for example, the process cannot see one mount tree, doesn't
> mean it needs to see another.

I think I like this, but some comments:

>
> Here's what I am thinking would happen concretely:
>
> First, the simpler cases:
>
> #### Null mount namespace
>
> - requires:
>
>   - null root file system: absolute paths don't work.
>
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.

It's perfectly valid to cd to a directory that does not belong to
one's namespace.  We have fchdir.  What's wrong with letting it
continue working?

Regardless of that, the current directory either needs to be a
directory or to be nothing at all, and if we support the latter, we
need to figure out what /proc will show.

> #### Null user namespace

A user namespace is kind of about how *non-current* uids and gids work
for the process and how it perceives its own uid and gid and not so
much about what uid and gid it has when accessing outside resources.
So...

>
> - Process has no user or group ids

What does that mean?  What does ps show?



Maybe the way to go is to implement the ones that have clearer
semantics and to defer the others.

^ permalink raw reply

* [RFC] Null Namespaces
From: John Ericson @ 2026-06-24 22:51 UTC (permalink / raw)
  To: Li Chen, Cong Wang, Christian Brauner, linux-arch
  Cc: linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan, Alexander Viro, Kees Cook, Sergei Zimmerman,
	Farid Zakaria

Hello, I am hoping to discuss an idea I've had for a while, that I am
calling "null namespaces" that has become more relevant with some recent
other discussions. First I'll discuss null namespaces in general terms,
and then I'll link those recent discussions and relate null namespaces
to them.

### Null namespaces

The essence of null namespaces is trying to give processes as little
ambient authority as possible, so they are lighter weight and allowed to
do even less than fully unshared processes today.

Namespaces as they exist today are frequently described as an isolation
mechanism, but I think this is the conflation of two different things.
*Removing* a new process from its parent's namespaces unquestionably is
increasing isolation --- no disagreement there. But putting the process
in new namespaces is something else; I would call it supporting
"delusions of grandeur" of that process. For example, namespaces allow a
process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
look up other processes by PID, etc.

Conceptually, to remove a process from one ambient authority scope (the
very name "namespaces" indicates they are about ambient authority)
should not require putting it in some ambient authority scope. Just
because, for example, the process cannot see one mount tree, doesn't
mean it needs to see another.

Here's what I am thinking would happen concretely:

First, the simpler cases:

#### Null mount namespace

- requires:

  - null root file system: absolute paths don't work.

  - null current working directory: relative paths with traditional,
    non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.

- All operations relating to the "ambient" mount tree don't work.

- `*at` operations with a file descriptor do work.

- The new fd-based mount APIs with detached mounts do work, modulo
  the calling process having enough permissions (as usual).

#### Null network namespace

- No network interfaces

- No abstract Unix sockets

#### Null IPC namespace

- cannot create or look up either type of message queue

#### Null UTS namespace

- no hostname or domainname: `uname`, `gethostname`/`sethostname`, and the
  related `/proc/sys/kernel` sysctls all fail.

#### Null user namespace

- Process has no user or group ids

- All uid/gid-based authorization lookups return "denied"

- -1 / "nobody" IDs for operations we don't want to fail (like `fstat`)
  can be used.

Note how in each of these, the notion of there "existing" a "single"
null namespace or not is degenerate --- every process with a null
namespace field is as isolated from one another (in terms of the axis
that namespace regulates) as they are from processes that are in other
namespaces. It is truly a minimal permission level, and (as we shall
see) cheap too, because it is just a null pointer in `task_struct`.

Then for the nested ones --- PID and cgroup --- we cannot have quite a
null namespace in the same sense, because it is an important property
that these namespaces are hierarchical up to the root namespaces.
Instead of having a disjoint null namespace, we need a null namespace
with a parent.

#### Null PID namespace

- cannot look up other processes by PID

- current process ID lookup fails

- current process's parent process ID lookup fails

- current process still assigned IDs in parent PID namespaces, per usual

#### Null cgroup namespace

- Process still can have resources restricted according to parent cgroup

- Process unaware of cgroup hierarchy though --- blind to who/how it is
  constrained

In these cases, we cannot just implement with a null pointer, because we
still need a valid parent namespace. However, we shouldn't need any info
*but* the parent namespace. A pair of a pointer and a bool indicating
null namespace with parent namespace or actual namespace membership,
with some sort of helper to get the parent namespace in either case
(since the actual namespace has its parent), should implement this.
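
As a rough userspace sketch of that "pointer plus bool" idea, assuming
nothing about the real kernel structures: `struct ns_sketch`,
`struct ns_ref`, and the helpers below are hypothetical names made up for
illustration, not existing kernel types or the draft patches' layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hierarchical namespace (PID or cgroup). */
struct ns_sketch {
	struct ns_sketch *parent;	/* NULL only for the root namespace */
	int level;
};

/*
 * Hypothetical membership record: either real membership in `ns`, or a
 * "null namespace" whose only information is its parent, kept in `ns`.
 */
struct ns_ref {
	struct ns_sketch *ns;
	bool is_null;
};

/* Parent namespace in either case. */
struct ns_sketch *ns_ref_parent(const struct ns_ref *ref)
{
	assert(ref != NULL && ref->ns != NULL);
	return ref->is_null ? ref->ns : ref->ns->parent;
}

/* Namespace the task can look into; NULL for a null namespace. */
struct ns_sketch *ns_ref_visible(const struct ns_ref *ref)
{
	assert(ref != NULL);
	return ref->is_null ? NULL : ref->ns;
}

/* Returns 1 if a real member and a null ref resolve parents as expected. */
int ns_ref_selfcheck(void)
{
	struct ns_sketch root = { .parent = NULL, .level = 0 };
	struct ns_sketch child = { .parent = &root, .level = 1 };
	struct ns_ref member = { .ns = &child, .is_null = false };
	struct ns_ref nullref = { .ns = &root, .is_null = true };

	return ns_ref_parent(&member) == &root &&
	       ns_ref_parent(&nullref) == &root &&
	       ns_ref_visible(&member) == &child &&
	       ns_ref_visible(&nullref) == NULL;
}
```

The point of the sketch is only that one pointer can serve both roles, so a
null namespace with a parent needs no allocation of its own.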

Finally there is the time namespace. Conceptually a null time namespace
is simple enough --- you cannot look up the time! --- but the
implementation is a bit more complex to get right because of the vDSO
for certain timing operations.

### General Motivation

Why am I so interested in this stuff?

Firstly it is because I have always been interested in a more strictly
object-capability-based userland, and projects like
Capsicum/CloudABI/WASI. I think going all in on file descriptors is
generally the direction that Linux has been going in, and it creates a
genuinely better programming model than the traditional Unix one with
all its ambient authority, and the TOCTOU and other issues that attend
it.

Today's container idioms and the "delusions of grandeur" that namespaces
provide are great for retrofitting existing software to run in a more
isolated environment. But I don't want that to be the ceiling of our
ambitions. Especially in this age of LLM refactoring, it is very easy to
get both new and existing software to abide by the more limited set of
allowed operations that null-namespace processes allow. And the
modifications that that entails (more `openat`, more socket activation,
etc.) make that software (in my view) simply *better* --- I would want
it to work that way with or without these constraints forcing the issue.

Secondly, and more concretely/imminently as a Nix developer, I am very
interested in the performance and overhead of process isolation. It is
very much my ambition to move Nix into the Bazel/Buck space of ever more
numerous and fine-grained atomic build steps (i.e. small compilation
units, not "packages"), but to do this *without* sacrificing Nix's
strong sandboxing guarantees that make our build plans so self-contained
and thus the ease of onboarding new Nix users.

I think this "null namespace" sandboxing will likely be simpler and more
performant than creating and destroying a bunch of regular namespaces
for each compilation unit. And while it will no doubt take some compiler
/ other tool patching to fix up any assumptions that get in the way of
running processes with so few permissions, I am happy to take a stab at
that too. Nix is, after all, for "tool-assisted yak shaves" as one put
it --- patching GCC / Clang / whatever and then rebuilding the world is
something we are quite good at.

Lastly, I'll add that the traditional way people have thought about
things like Capsicum/CloudABI is custom personalities/seccomp rules, but
IMO trying to tackle the massive UAPI surface area so shallowly is ugly
and unmaintainable. Nulling out namespace fields in `task_struct`,
conversely, attacks the problem at its core, much more elegantly, and
makes it easy to handle both current *and future* syscalls in a
minimally invasive and maintainable manner.

### Null namespaces and process spawning

Why bring this up now?

Recently [1], Li Chen took a stab at the venerable old goal of making a
better process spawning UAPI than fork/clone + exec. I am quite excited
to see this happen, as it generally dovetails very nicely with the
object capability goals I have above. (E.g. making it performant and
idiomatic to opt-in, rather than opt-out of sharing file descriptors
with a child process is very good for a world where all
resource/privilege sharing is done with file descriptors.)

One problem with clone that didn't yet come up is that its defaults are
not good from a security perspective: sharing by default, and unsharing
as the opt in means that one must remember and take active measures to
ensure that child processes get *less* privileges. This is very bad ---
secure practices mean that the "lazy programmer" and the "smallest
program" must always err on the side of giving the child process *less*
privileges. This is the only way economics and the "principle of least
privilege" will work together, rather than against each other (and
economics is quite likely to win when they are working against each
other).

The reason that clone *doesn't* work that way is, of course,
performance: it would be wasteful to unshare and create new namespaces
when they are just going to be thrown away because the user wants to
share after all.

Null namespaces I think elegantly work around this performance/security
trade-off, while also avoiding the need for gazillion-parameter syscalls
like clone. This is because, as the most secure option, and a cheap
option, they are the rightful default for a new process creation API.

1. When an "embryonic" (under construction, not yet ready to be
   scheduled) task is first created, it should have all null namespaces.

2. Separate syscalls (`io_uring` exists for batching, we don't need to
   reinvent an ad-hoc batch solution) can exist for setting the
   namespaces on the process, where either "sharing" (use parent process
   namespace) or "unsharing" (use fresh namespace, usually derived from
   the parent process namespace but perhaps derived from a different
   one) are choices that can be opted into instead of the null namespace
   default.

3. After all state is initialized (arguments, environment variables,
   file descriptors, namespaces, etc.), the process can be "birthed",
   and submitted as ready to be scheduled.

This design is very natural to me, but its full naturality is *only*
available with the null namespace option. Otherwise we are stuck in a
place of no good defaults, and the "builder pattern" seems more awkward.

Also in [2], I bring up a design for unix sockets without the file
system or the "abstract" socket namespace, and how I want to avoid both
in order to firmly rule out TOCTOU and other ambient authority issues. I
think those arguments stand on their own, but the possibility of a null
network namespace sharpens the issue: it forces the `O_PATH` FD stuff I
discuss to be the only viable option.

### Implementation

I've "LLM'd" out some draft patches [3] for this. I'm not submitting
them because I still need to review and test them, and I don't want
(currently, pre those steps) low-quality slop to tarnish this proposal.
What this initial exploration did, however, confirm for me is that these
changes should be quite lightweight to implement. (Also, what I propose
is slightly different from my implementation draft in a few cases where
I think the design I proposed here is better than my draft
implementation.)

If the discussion here starts moving towards consensus, I'll clean up
and rework those patches along the lines of the consensus. Ideally I
would submit them one at a time, I figure, since the implementations for
different namespaces are necessarily changes to different subsystems.

Cheers!

John

[1]: https://lore.kernel.org/all/20260528095235.2491226-1-me@linux.beauty/

[2]: https://lore.kernel.org/all/455281ec-3ee1-4f27-989b-c239f0690d8b@app.fastmail.com/

[3]: https://github.com/Ericson2314/linux/commits/null-namespace

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Al Viro @ 2026-06-24 16:28 UTC (permalink / raw)
  To: Xin Zhao
  Cc: brauner, mjguzik, pfalcato, ebiederm, jack, jlayton, chuck.lever,
	alex.aring, arnd, keescook, mcgrof, j.granados, allen.lkml,
	linux-fsdevel, linux-kernel, linux-arch
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}

*blink*

	How could that possibly make sense?  Many descriptors
may refer to the same file; what's more, many descriptor tables
may contain such descriptors, so... just what is that code
trying to do?
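
A small userspace sketch of the sharing in question, with `O_NONBLOCK`
standing in for the patch's `O_TMPCLOS` (an assumption made purely for
illustration: both are bits in the flags of the shared open file, but only
`O_NONBLOCK` can be toggled from userspace via `fcntl`). The function name
is hypothetical. Descriptors created by `dup()`, inherited across `fork()`,
or passed over a Unix socket all refer to the same open file, so a flag
changed through one is seen through the others:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a status-flag change made through one descriptor is
 * visible through its dup()ed sibling, 0 if not, -1 on error.
 */
int flags_shared_across_dup(void)
{
	int p[2], d, before, after, ret = -1;

	if (pipe(p) != 0)
		return -1;
	d = dup(p[0]);	/* second descriptor, same open file */
	if (d < 0)
		goto out;
	assert(d != p[0]);	/* distinct descriptor numbers */

	before = fcntl(d, F_GETFL);
	if (before < 0 || (before & O_NONBLOCK))
		goto out_dup;

	/* Set the flag through the original descriptor only. */
	if (fcntl(p[0], F_SETFL, before | O_NONBLOCK) != 0)
		goto out_dup;

	/* Read it back through the duplicate. */
	after = fcntl(d, F_GETFL);
	if (after < 0)
		goto out_dup;
	ret = (after & O_NONBLOCK) ? 1 : 0;
out_dup:
	close(d);
out:
	close(p[0]);
	close(p[1]);
	return ret;
}
```

By the same mechanism, clearing a per-file flag while walking one
descriptor table changes what every other descriptor for that file sees.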

^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24 15:24 UTC (permalink / raw)
  To: Petr Mladek
  Cc: K Prateek Nayak, linux-arch, linux-kernel, sched-ext, netdev,
	David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, Paolo Abeni, Peter Zijlstra,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <ajugq8VAciqtMx9F@pathway.suse.cz>

On 2026-06-24 11:17:31 [+0200], Petr Mladek wrote:
> For Linus, it was a no-go, definitely.
…
> I would vote for adding the WARN_*DEFERRED() into the scheduler code
> at least until majority of console drivers are converted to nbcon API.

I see four nbcon serial console drivers (+netconsole, + drm_log). We
have at least four times that many console drivers. What is the
majority from your point of view? The 8250 should cover all of x86.

> Best Regards,
> Petr

Sebastian

^ permalink raw reply

* [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-24 14:55 UTC (permalink / raw)
  To: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
	allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao

A coredump typically takes some time to complete. If we happen to hold a
write lock with flock just before triggering the coredump, that write lock
will not be released during the entire coredump process. As a result,
other processes attempting to acquire the same write lock may experience
significant delays. Another typical scenario is that shared memory, such
as dma-buf, remains occupied and is not released for a long time due to
core dumps.
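
The lock lifetime behind the first scenario can be reproduced in userspace.
The following is a sketch under stated assumptions, not part of the patch:
the function names and the temporary path are hypothetical, and the
"waiter" is modeled inside the same process by a second, independent
`open()` of the file. It shows that an exclusive flock stays held until
every descriptor for the locking open file has been closed:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Non-blocking exclusive flock: 1 = acquired, 0 = busy, -1 = error. */
int try_lock(int fd)
{
	assert(fd >= 0);
	if (flock(fd, LOCK_EX | LOCK_NB) == 0)
		return 1;
	return errno == EWOULDBLOCK ? 0 : -1;
}

/* Returns 0 if the lock is released only when the last descriptor closes. */
int flock_lifetime_demo(void)
{
	char path[] = "/tmp/flock_demoXXXXXX";	/* illustrative location */
	int holder, holder_dup = -1, waiter = -1, ret = -1;

	holder = mkstemp(path);
	if (holder < 0)
		return -1;
	holder_dup = dup(holder);	/* same open file as holder */
	waiter = open(path, O_RDWR);	/* independent open of the same file */
	if (holder_dup < 0 || waiter < 0)
		goto out;

	if (flock(holder, LOCK_EX) != 0)
		goto out;
	if (try_lock(waiter) != 0)	/* must be busy */
		goto out;

	close(holder);
	holder = -1;
	if (try_lock(waiter) != 0)	/* still busy: holder_dup keeps it alive */
		goto out;

	close(holder_dup);
	holder_dup = -1;
	if (try_lock(waiter) != 1)	/* last reference gone, lock released */
		goto out;
	ret = 0;
out:
	if (holder >= 0)
		close(holder);
	if (holder_dup >= 0)
		close(holder_dup);
	if (waiter >= 0)
		close(waiter);
	unlink(path);
	return ret;
}
```

A process that is busy dumping core has not closed its descriptors yet,
which is why other lockers keep waiting in the scenario described above.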

To address this, add a /proc/<pid>/coredump_pre_exit node so that users can
specify which resources to release before dumping core. This patch
implements the early release of two types of resources: flock-locked files
and file-backed shared memory. By default, nothing is released early.

A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
the execution of the newly introduced exit_mmap_mapped_shared() function.
In this way, the subsequent exit_files_pre_exit() function does not need
to find the corresponding vma through the file to check for the VM_SHARED
attribute, thereby reducing the traversal cost.

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---

Change in v4:
- Christian pointed out that the coredump process will traverse file
  descriptors (fd), so certain fds should not be closed by default.
  Rework the whole feature, add /proc/<pid>/coredump_pre_exit for user
  pre-exit resources selection, default is NOT pre-exit anything.
- Mateusz suggested that walking the fd table and releasing the file locks
  is reasonable. No longer release all fds. Depending on user
  configuration, at most the flock fds and the fds corresponding to
  file-backed shared memory are released.

Change in v3:
- Add a comment and commit-log text explaining why the
  MMF_DUMP_MAPPED_SHARED mm_flags_test() check is done; note that memory
  mapped files keep their own separate references to the files. The case
  to work around is that unlocking a flock on a file early allows other
  processes to lock and modify the mapped data protected by the flock,
  as suggested by Pedro Falcato.
- Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/

Change in v2:
- Drop the new fcntl API; the issue is not worth inflicting the cost on
  everyone,
  as suggested by Al Viro.
- Call exit_files() in coredump_wait(),
  as suggested by Eric W. Biederman.
  Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
  need to dump file-backed shared memory.
- Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/

v1:
- Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
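
For readers who want to try the interface, here is a minimal userspace
sketch. It assumes this patch is applied and that the node takes the bit
layout given in the proc.rst hunk (bit 0 for flock files, bit 1 for
file-backed shared memory). The macro names PRE_EXIT_FLOCK and
PRE_EXIT_SHARED_FILE and the helper set_coredump_pre_exit() are made up
for illustration and are not part of the patch.

```c
#include <stdio.h>

/* Hypothetical names; bit positions follow the proposed proc.rst text. */
#define PRE_EXIT_FLOCK        (1u << 0)  /* bit 0: flock files */
#define PRE_EXIT_SHARED_FILE  (1u << 1)  /* bit 1: file-backed shared memory */

/*
 * Write a pre-exit mask to the given path. For a real process the path
 * would be "/proc/self/coredump_pre_exit", which only exists with the
 * patch applied. Returns 0 on success, -1 on failure.
 */
static int set_coredump_pre_exit(const char *path, unsigned int mask)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	/* The proposed write handler parses with base 0, so "0x.." is accepted. */
	if (fprintf(f, "0x%x\n", mask) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller would typically pass PRE_EXIT_FLOCK | PRE_EXIT_SHARED_FILE early
in process start-up, much like the existing coredump_filter example in
proc.rst.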
---
 .../admin-guide/kernel-parameters.txt         |  5 ++
 Documentation/filesystems/proc.rst            | 58 +++++++++-----
 fs/coredump.c                                 | 23 ++++++
 fs/file.c                                     | 46 +++++++++++
 fs/proc/base.c                                | 78 +++++++++++++++++++
 include/linux/mm.h                            |  1 +
 include/linux/mm_types.h                      |  9 +++
 include/linux/sched/task.h                    |  1 +
 include/uapi/asm-generic/fcntl.h              |  4 +
 kernel/fork.c                                 | 12 +++
 mm/mmap.c                                     | 21 +++++
 11 files changed, 238 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f575d4508..bc6d3859f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1024,6 +1024,11 @@ Kernel parameters
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.rst.
 
+	coredump_pre_exit=
+			[KNL] Change the default value for
+			/proc/<pid>/coredump_pre_exit.
+			See also Documentation/filesystems/proc.rst.
+
 	coresight_cpu_debug.enable
 			[ARM,ARM64]
 			Format: <bool>
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index db6167bef..6a637d31d 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -39,16 +39,17 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
-  3.5	/proc/<pid>/mountinfo - Information about mounts
-  3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
-  3.7   /proc/<pid>/task/<tid>/children - Information about task children
-  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
-  3.9   /proc/<pid>/map_files - Information about memory mapped files
-  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
-  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
-  3.12	/proc/<pid>/arch_status - Task architecture specific information
-  3.13  /proc/<pid>/fd - List of symlinks to open files
-  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status.
+  3.5  /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
+  3.6	/proc/<pid>/mountinfo - Information about mounts
+  3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.8   /proc/<pid>/task/<tid>/children - Information about task children
+  3.9   /proc/<pid>/fdinfo/<fd> - Information about opened file
+  3.10   /proc/<pid>/map_files - Information about memory mapped files
+  3.11  /proc/<pid>/timerslack_ns - Task timerslack value
+  3.12	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.13	/proc/<pid>/arch_status - Task architecture specific information
+  3.14  /proc/<pid>/fd - List of symlinks to open files
+  3.15  /proc/<pid>/ksm_stat - Information about the process's ksm status.
 
   4	Configuring procfs
   4.1	Mount options
@@ -1961,7 +1962,24 @@ For example::
   $ echo 0x7 > /proc/self/coredump_filter
   $ ./some_program
 
-3.5	/proc/<pid>/mountinfo - Information about mounts
+3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
+---------------------------------------------------------------
+A coredump typically takes some time to complete. If we happen to hold a write
+lock with flock just before triggering the coredump, that write lock will not
+be released during the entire coredump process. As a result, other processes
+attempting to acquire the same write lock may experience significant delays.
+Another typical scenario is that shared memory, such as dma-buf, remains
+occupied and is not released for a long time due to core dumps.
+
+/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
+dumping core.
+
+The following two types are supported:
+
+  - (bit 0) flock files
+  - (bit 1) file-backed shared memory
+
+3.6	/proc/<pid>/mountinfo - Information about mounts
 --------------------------------------------------------
 
 This file contains lines of the form::
@@ -2001,7 +2019,7 @@ For more information on mount propagation see:
   Documentation/filesystems/sharedsubtree.rst
 
 
-3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
 --------------------------------------------------------
 These files provide a method to access a task's comm value. It also allows for
 a task to set its own or one of its thread siblings comm value. The comm value
@@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
 terminator) will result in a truncated comm value.
 
 
-3.7	/proc/<pid>/task/<tid>/children - Information about task children
+3.8	/proc/<pid>/task/<tid>/children - Information about task children
 -------------------------------------------------------------------------
 This file provides a fast way to retrieve first level children pids
 of a task pointed by <pid>/<tid> pair. The format is a space separated
@@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
 if precise results are needed.
 
 
-3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
+3.9	/proc/<pid>/fdinfo/<fd> - Information about opened file
 ---------------------------------------------------------------
 This file provides information associated with an opened file. The regular
 files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
@@ -2198,7 +2216,7 @@ VFIO Device files
 where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
 file.
 
-3.9	/proc/<pid>/map_files - Information about memory mapped files
+3.10	/proc/<pid>/map_files - Information about memory mapped files
 ---------------------------------------------------------------------
 This directory contains symbolic links which represent memory mapped files
 the process is maintaining.  Example output::
@@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
 comparing their inode numbers to figure out which anonymous memory areas
 are actually shared.
 
-3.10	/proc/<pid>/timerslack_ns - Task timerslack value
+3.11	/proc/<pid>/timerslack_ns - Task timerslack value
 ---------------------------------------------------------
 This file provides the value of the task's timerslack value in nanoseconds.
 This value specifies an amount of time that normal timers may be deferred
@@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
 An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
 permissions on the task specified to change its timerslack_ns value.
 
-3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+3.12	/proc/<pid>/patch_state - Livepatch patch operation state
 -----------------------------------------------------------------
 When CONFIG_LIVEPATCH is enabled, this file displays the value of the
 patch state for the task.
@@ -2253,7 +2271,7 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
-3.12 /proc/<pid>/arch_status - task architecture specific status
+3.13 /proc/<pid>/arch_status - task architecture specific status
 -------------------------------------------------------------------
 When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
 architecture specific status of the task.
@@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
   the task is unlikely an AVX512 user, but depends on the workload and the
   scheduling scenario, it also could be a false negative mentioned above.
 
-3.13 /proc/<pid>/fd - List of symlinks to open files
+3.14 /proc/<pid>/fd - List of symlinks to open files
 -------------------------------------------------------
 This directory contains symbolic links which represent open files
 the process is maintaining.  Example output::
@@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
 of stat() output for /proc/<pid>/fd for fast access.
 -------------------------------------------------------
 
-3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
+3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
 ----------------------------------------------------------------------
 When CONFIG_KSM is enabled, each process has this file which displays
 the information of ksm merging status.
diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f4..e08a8a6c4 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
 	return nr;
 }
 
+static void coredump_pre_exit(void)
+{
+	struct task_struct *tsk = current;
+	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
+
+	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
+		return;
+
+	/*
+	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
+	 */
+	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
+	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
+		exit_mmap_mapped_shared(tsk->mm);
+
+	/*
+	 * Check O_TMPCLOS of file f_flags to close file and clear it.
+	 */
+	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
+}
+
 static int coredump_wait(int exit_code, struct core_state *core_state)
 {
 	struct task_struct *tsk = current;
@@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
 		return;
 	}
 
+	coredump_pre_exit();
+
 	switch (cn->core_type) {
 	case COREDUMP_FILE:
 		if (!coredump_file(cn, cprm, binfmt))
diff --git a/fs/file.c b/fs/file.c
index 2c81c0b16..a58ffffcc 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -23,6 +23,7 @@
 #include <linux/file_ref.h>
 #include <net/sock.h>
 #include <linux/init_task.h>
+#include <linux/filelock.h>
 
 #include "internal.h"
 
@@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
 	}
 }
 
+void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
+{
+	struct files_struct *files = tsk->files;
+	struct fdtable *fdt;
+	struct file *file;
+	unsigned int i, j = 0;
+
+	if (!files)
+		return;
+
+	fdt = rcu_dereference_raw(files->fdt);
+	for (;;) {
+		unsigned long set;
+
+		i = j * BITS_PER_LONG;
+		if (i >= fdt->max_fds)
+			break;
+		set = fdt->open_fds[j++];
+		while (set) {
+			if (!(set & 1))
+				goto next_fd;
+			file = fdt->fd[i];
+			if (!file)
+				goto next_fd;
+			if (file->f_flags & O_TMPCLOS) {
+				file->f_flags &= ~O_TMPCLOS;
+				goto close_fd;
+			}
+			if (!checkflock)
+				goto next_fd;
+			if (!vfs_inode_has_locks(file_inode(file)))
+				goto next_fd;
+
+close_fd:
+			fdt->fd[i] = NULL;
+			filp_close(file, files);
+			cond_resched();
+
+next_fd:
+			i++;
+			set >>= 1;
+		}
+	}
+}
+
 struct files_struct init_files = {
 	.count		= ATOMIC_INIT(1),
 	.fdt		= &init_files.fdtab,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c..99b5f219f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
 	.write		= proc_coredump_filter_write,
 	.llseek		= generic_file_llseek,
 };
+
+static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
+					   size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file_inode(file));
+	struct mm_struct *mm;
+	char buffer[PROC_NUMBUF];
+	size_t len;
+	int ret;
+
+	if (!task)
+		return -ESRCH;
+
+	ret = 0;
+	mm = get_task_mm(task);
+	if (mm) {
+		unsigned long flags = __mm_flags_get_dumpable(mm);
+
+		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
+			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
+				MMF_DUMP_PRE_EXIT_SHIFT));
+		mmput(mm);
+		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
+	}
+
+	put_task_struct(task);
+
+	return ret;
+}
+
+static ssize_t proc_coredump_pre_exit_write(struct file *file,
+					    const char __user *buf,
+					    size_t count,
+					    loff_t *ppos)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned int val;
+	int ret;
+	int i;
+	unsigned long mask;
+
+	ret = kstrtouint_from_user(buf, count, 0, &val);
+	if (ret < 0)
+		return ret;
+
+	ret = -ESRCH;
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		goto out_no_task;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_no_mm;
+	ret = 0;
+
+	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
+		if (val & mask)
+			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
+		else
+			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
+	}
+
+	mmput(mm);
+ out_no_mm:
+	put_task_struct(task);
+ out_no_task:
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static const struct file_operations proc_coredump_pre_exit_operations = {
+	.read		= proc_coredump_pre_exit_read,
+	.write		= proc_coredump_pre_exit_write,
+	.llseek		= generic_file_llseek,
+};
 #endif
 
 #ifdef CONFIG_TASK_IO_ACCOUNTING
@@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_ELF_CORE
 	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
+	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
 #endif
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index af23453e9..dfd4717c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
 extern void exit_mmap(struct mm_struct *);
+extern void exit_mmap_mapped_shared(struct mm_struct *mm);
 bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
 				 unsigned long addr, bool write);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c7db35be6..0555aaf50 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1963,6 +1963,15 @@ enum {
 	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
 	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
 
+/* coredump pre-exit bits */
+#define MMF_DUMP_PRE_EXIT_FLOCK	11
+#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
+
+#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
+#define MMF_DUMP_PRE_EXIT_BITS	2
+#define MMF_DUMP_PRE_EXIT_MASK	\
+	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
+
 #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
 # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
 #else
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cf..b4becbf6c 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
 extern __noreturn void do_group_exit(int);
 
 extern void exit_files(struct task_struct *);
+extern void exit_files_pre_exit(struct task_struct *, bool);
 extern void exit_itimers(struct task_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285..360604d65 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -95,6 +95,10 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+#ifndef O_TMPCLOS
+#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/kernel/fork.c b/kernel/fork.c
index a679b2448..84f1ee7f3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
 
 __setup("coredump_filter=", coredump_filter_setup);
 
+static unsigned long default_dump_pre_exit;
+
+static int __init coredump_pre_exit_setup(char *s)
+{
+	default_dump_pre_exit =
+		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
+		MMF_DUMP_PRE_EXIT_MASK;
+	return 1;
+}
+
+__setup("coredump_pre_exit=", coredump_pre_exit_setup);
+
 #include <linux/init_task.h>
 
 static void mm_init_aio(struct mm_struct *mm)
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36..b955c47c0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 }
 
+void exit_mmap_mapped_shared(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	mmap_write_lock(mm);
+	lru_add_drain();
+
+	for_each_vma(vmi, vma) {
+		if (vma->vm_flags & VM_HUGETLB)
+			continue;
+		if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)
+			continue;
+		vma->vm_file->f_flags |= O_TMPCLOS;
+		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+		cond_resched();
+	}
+
+	mmap_write_unlock(mm);
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
-- 
2.34.1



* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24 11:03 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
	Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
	K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <ajuWnKsQR0Z825Wn@gmail.com>

On 2026-06-24 01:37:53 [-0700], Breno Leitao wrote:
Hi Breno,

> Have you considered an approach similar to printk_deferred_enter(),
> where you mark the code region that needs deferral and all WARN() calls
> within that region are automatically deferred?

Doing this at the rq-lock sites is not something the scheduler department
would accept. It bloats the code size more than what we have now.

Not everything is in the __sched section, so we can't check for this from
within printk. So this turd was the only idea I had.

> The current proposal requires changing individual WARN() call sites,
> but whether they need deferral might depend on the calling context. This
> means you'd need to convert many call sites and ensure all nested
> warnings are also converted to the deferred variant.

I am hoping for forced-threaded legacy consoles to become the default, but
this camp does not have a lot of members. It would increase the pressure to
provide nbcon, so it could be a good thing.

To accept this series and make it more bullet-proof we could do
s/WARN_ON\>/WARN_ON_DEFERRED/ for all of sched/ and require it regardless
of whether the rq-lock is held. Then you wouldn't have to audit it each and
every time. Due to that preempt-disable thingy it can be used in preemptible
sections without breaking anything.

> 
> Thanks,
> --breno

Sebastian


* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260624093117.GY48970@noisy.programming.kicks-ass.net>

On 2026-06-24 11:31:17 [+0200], Peter Zijlstra wrote:
> On Tue, Jun 23, 2026 at 04:26:49PM +0200, Sebastian Andrzej Siewior wrote:
> 
> > +#ifndef WARN_ON_DEFERRED
> > +#define WARN_ON_DEFERRED(condition) ({					\
> > +	int __ret_warn_on = !!(condition);				\
> > +	if (unlikely(__ret_warn_on)) {					\
> > +		guard(preempt)();					\
> > +		printk_deferred_enter()					\
> > +		__WARN();						\
> > +		printk_deferred_exit()					\
> > +	}								\
> > +	unlikely(__ret_warn_on);					\
> > +})
> > +#endif
> 
> This will generate atrocious shite at the WARN sites.

You mean the missing semicolon and the huge size increase?
On x86 with the guard+deferred combination from the variant above, before:
    text    data     bss     dec   filename
   93910   37424     832  132166   kernel/sched/core.o
   61802    4945     152   66899   kernel/sched/fair.o
  215108   24453    3768  243329   kernel/sched/build_policy.o
   86128   30092   12704  128924   kernel/sched/build_utility.o
  456948   96914   17456  571318   total
After:
   96140   37408     832  134380   kernel/sched/core.o
   64490    4937     152   69579   kernel/sched/fair.o
  222980   24157    3768  250905   kernel/sched/build_policy.o
   86544   30100   12704  129348   kernel/sched/build_utility.o
  470154   96602   17456  584212   total + 2.3%

The total went up by about 2.3% or 12.59KiB.
This affects: alpha, arc, arm, csky, hexagon, m68k, microblaze, mips,
nios2, openrisc, sparc, um, xtensa
and could motivate them to implement __WARN_FLAGS, which would lower the
size in general, and then this stunt would have no effect.

Just looked at arm and it has support for invalid opcodes somehow but
not for this.

Sebastian
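
To make the shape of the macro under discussion easier to follow, here is
a userspace sketch with every statement terminated. It is an illustration
only: deferred_enter(), deferred_exit() and emit_warn() are hypothetical
stand-ins for printk_deferred_enter(), printk_deferred_exit() and __WARN(),
the guard(preempt)() and unlikely() parts are left out, and the
statement-expression syntax needs GCC or Clang.

```c
#include <stdio.h>

static int deferred_depth;		/* stand-in for the printk deferral state */
static int warnings_emitted;
static int last_warn_was_deferred;

static void deferred_enter(void) { deferred_depth++; }
static void deferred_exit(void)  { deferred_depth--; }

/* Stand-in for __WARN(): count the warning and note whether deferral was on. */
static void emit_warn(const char *file, int line)
{
	warnings_emitted++;
	last_warn_was_deferred = deferred_depth > 0;
	fprintf(stderr, "WARNING at %s:%d%s\n", file, line,
		last_warn_was_deferred ? " (deferred)" : "");
}

/*
 * Same structure as the quoted macro: evaluate the condition once, emit
 * the warning inside an enter/exit pair, and yield the condition's truth
 * value so the macro can be used inside an if ().
 */
#define WARN_ON_DEFERRED_SKETCH(condition) ({			\
	int warn_on_ = !!(condition);				\
	if (warn_on_) {						\
		deferred_enter();				\
		emit_warn(__FILE__, __LINE__);			\
		deferred_exit();				\
	}							\
	warn_on_;						\
})
```

Every call site expands to the enter/warn/exit sequence, which is where
the per-site code growth measured above comes from on architectures that
do not implement __WARN_FLAGS.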


* Re: [PATCH 0/2] sched: Introduce and use deferred WARNs in sched
From: Peter Zijlstra @ 2026-06-24  9:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-1-bigeasy@linutronix.de>

On Tue, Jun 23, 2026 at 04:26:48PM +0200, Sebastian Andrzej Siewior wrote:
> This is a follow-up to the netconsole lockup reported
> 	https://lore.kernel.org/all/20260610183621.3915271-1-vlad.wing@gmail.com/
> 
> The idea is to use deferred printing for WARNs and use them in sched. I
> tried to use only where it looks that the rq lock acquired instead a
> plain s/WARN_ON/WARN_ON_DEFFERED which would be simpler.
> 
> This unholy deferred mess can be removed once we don't have legacy
> consoles anymore _or_ force force_legacy_kthread=true.

So I really don't see why we should do this. This has been a 'problem'
forever, and printk() is actually being fixed.




* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Peter Zijlstra @ 2026-06-24  9:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

On Tue, Jun 23, 2026 at 04:26:49PM +0200, Sebastian Andrzej Siewior wrote:

> +#ifndef WARN_ON_DEFERRED
> +#define WARN_ON_DEFERRED(condition) ({					\
> +	int __ret_warn_on = !!(condition);				\
> +	if (unlikely(__ret_warn_on)) {					\
> +		guard(preempt)();					\
> +		printk_deferred_enter()					\
> +		__WARN();						\
> +		printk_deferred_exit()					\
> +	}								\
> +	unlikely(__ret_warn_on);					\
> +})
> +#endif

This will generate atrocious shite at the WARN sites.


* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Petr Mladek @ 2026-06-24  9:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: K Prateek Nayak, linux-arch, linux-kernel, sched-ext, netdev,
	David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, Paolo Abeni, Peter Zijlstra,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260624062642.5DER6vrP@linutronix.de>

On Wed 2026-06-24 08:26:42, Sebastian Andrzej Siewior wrote:
> On 2026-06-23 20:24:02 [+0530], K Prateek Nayak wrote:
> > Hello Sebastian,
> Hi Prateek,
> 
> > nit.
> > 
> > Instead of replicating these bits, can we replace that return with a
> > "goto out" ...
> 
> sure
> 
> …
> > ... and replace this return with a:
> > 
> >     return (warning) ? BUG_TRAP_TYPE_WARN : BUG_TRAP_TYPE_BUG;
> > 
> > Looks a tab bit cleaner to my eyes. Thoughts?
> 
> It sure does.
> I wait for PeterZ' executive order to either do this and sprinkle sched/
> _or_ make legacy consoles deferred as it is done on RT.
> 
> Petr, was there a big push back doing it unconditionally?

For Linus, it was a no-go, definitely.

The problem is situations where the system gets stuck and panic()
is not called. This is why nbcon consoles switch to atomic mode
in some emergency situations; see nbcon_cpu_emergency_enter(),
which is called, for example, in the __warn(), oops_enter(), RCU stall,
and lockdep code.

Moving legacy consoles to a kthread would prevent stalls in situations
where printk() is called from the scheduler code. But it would cause
some other stalls to become silent.

In my opinion, we should not move the legacy consoles to a kthread
by default. I believe that the rest of the kernel is a bigger
source of possible stalls than the scheduler. So, the overall
experience will be better if we keep the status quo.

I would vote for adding WARN_*DEFERRED() to the scheduler code, at least
until the majority of console drivers are converted to the nbcon API.

Best Regards,
Petr

