Generic Linux architectural discussions
* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Lorenzo Stoakes @ 2026-06-25 12:48 UTC
  To: Xin Zhao
  Cc: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
	allen.lkml, linux-fsdevel, linux-kernel, linux-arch,
	Jonathan Corbet, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Liam R. Howlett,
	linux-doc, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

+cc missing maintainers, lists.

NAK.

This is un-upstreamable for numerous reasons.

The stuff you're doing in mm is broken, wrong and invasive and you've not
even bothered to cc mm people. I'm annoyed by this.

You're also making incredibly silly mistakes at v4 of something that should have
been an RFC.

You don't seem to understand the concept of patch _series_ (break it up into
smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
you're radically altering.

I'm annoyed as you have a history where you were told not to add insane hacks
before ([0], my reply at [1]).

[0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
[1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/

Was I wasting my time there? Am I wasting my time responding now?

And how hard is it to run a simple perl script?

Let me run it for you for _just_ the maintainers:

$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor your_patch.patch
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Arnd Bergmann <arnd@arndb.de> (maintainer:GENERIC INCLUDE/ASM HEADER FILES)
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
Juri Lelli <juri.lelli@redhat.com> (maintainer:SCHEDULER)
Vincent Guittot <vincent.guittot@linaro.org> (maintainer:SCHEDULER)
Kees Cook <kees@kernel.org> (maintainer:EXEC & BINFMT API, ELF)
"Liam R. Howlett" <liam@infradead.org> (maintainer:MEMORY MAPPING)
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-fsdevel@vger.kernel.org (open list:PROC FILESYSTEM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-arch@vger.kernel.org (open list:GENERIC INCLUDE/ASM HEADER FILES)
EXEC & BINFMT API, ELF status: Supported

You're missing the majority of these. That's _not OK_.

On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
>
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can

This is a horrible idea.

> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.

What, people set this ahead of time? For a dynamic thing like files?

>
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.

This sentence doesn't even make sense?

And also !VM_SHARED means !vma->vm_file so your code would NULL deref if you
didn't check that. But !VM_SHARED VMAs can absolutely be file-backed...
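
The last point is easy to check from userspace. A minimal sketch, assuming a
Linux system with /proc mounted and a writable /tmp: it maps a scratch file
with MAP_PRIVATE and looks the mapping up in /proc/self/maps. The function
name and the scratch-file template are made up for illustration; nothing here
comes from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a scratch file MAP_PRIVATE and find the mapping in /proc/self/maps.
 * Returns 1 if the entry is private ('p') and names a backing file,
 * 0 if it does not, -1 if the setup failed.
 */
int private_mapping_is_file_backed(void)
{
	char tmpl[] = "/tmp/privmapXXXXXX";	/* hypothetical scratch file */
	char line[512];
	long pagesz = sysconf(_SC_PAGESIZE);
	int fd, result = 0;
	void *p;
	FILE *maps;

	assert(pagesz > 0);
	fd = mkstemp(tmpl);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, pagesz) < 0) {
		close(fd);
		unlink(tmpl);
		return -1;
	}
	p = mmap(NULL, (size_t)pagesz, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping holds its own reference to the file */
	if (p == MAP_FAILED) {
		unlink(tmpl);
		return -1;
	}

	maps = fopen("/proc/self/maps", "r");
	if (!maps) {
		result = -1;
		goto out;
	}
	while (fgets(line, sizeof(line), maps)) {
		unsigned long start, end;
		char perms[5] = "", path[256] = "";

		/* Line format: start-end perms offset dev inode [path] */
		if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
			   &start, &end, perms, path) < 3)
			continue;
		if (start == (unsigned long)p) {
			/* 4th permission character: 'p' private, 's' shared */
			result = perms[3] == 'p' && path[0] == '/';
			break;
		}
	}
	fclose(maps);
out:
	munmap(p, (size_t)pagesz);
	unlink(tmpl);
	return result;
}
```

A private mapping like this has a backing file but is not a shared mapping, so
a VM_SHARED test and a "has a file" test answer different questions.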

>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
>
> Change in v4:
> - Christian pointed out that the coredump process will traverse file
>   descriptors (fd), so certain fds should not be closed by default.
>   Rework the whole feature, add /proc/<pid>/coredump_pre_exit for user
>   pre-exit resources selection, default is NOT pre-exit anything.
> - Mateusz suggested that walking the fd table and release the file-lock is
>   reasonable. No longer release all the fd(s). Based on user config, only
>   the flock fd(s) and the fd(s) correspondent to file-backed shared memory
>   will be released at most.
>
> Change in v3:
> - Add comment and commit-log to explain why do the MMF_DUMP_MAPPED_SHARED
>   mm_flags_test() check, note that memory mapped files keep their own
>   separate references to the files. The case to work around is that early
>   unlocking a flock on a file allows other processes to lock and modify
>   the mapped data protected by the flock,
>   as suggested by Pedro Falcato.
> - Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/
>
> Change in v2:
> - Get rid of the implement of adding new fcntl API, the issue does not
>   worth inflicting the cost on everyone,
>   as suggested by Al Viro.
> - Call exit_files() in coredump_wait(),
>   as suggested by Eric W. Biederman.
>   Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
>   need to dump file-backed shared memory.
> - Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/
>
> v1:
> - Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 ++
>  Documentation/filesystems/proc.rst            | 58 +++++++++-----
>  fs/coredump.c                                 | 23 ++++++
>  fs/file.c                                     | 46 +++++++++++
>  fs/proc/base.c                                | 78 +++++++++++++++++++
>  include/linux/mm.h                            |  1 +

No.

>  include/linux/mm_types.h                      |  9 +++

No.

>  include/linux/sched/task.h                    |  1 +
>  include/uapi/asm-generic/fcntl.h              |  4 +
>  kernel/fork.c                                 | 12 +++
>  mm/mmap.c                                     | 21 +++++

No.

>  11 files changed, 238 insertions(+), 20 deletions(-)

This is a completely insane diffstat for a single patch. Ridiculous.

AND YOU HAVEN'T ADDED A SINGLE TEST.

>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d4508..bc6d3859f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.
> +
>  	coresight_cpu_debug.enable
>  			[ARM,ARM64]
>  			Format: <bool>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167bef..6a637d31d 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -39,16 +39,17 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
>    3.2	/proc/<pid>/oom_score - Display current oom-killer score
>    3.3	/proc/<pid>/io - Display the IO accounting fields
>    3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> -  3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> -  3.7   /proc/<pid>/task/<tid>/children - Information about task children
> -  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
> -  3.9   /proc/<pid>/map_files - Information about memory mapped files
> -  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> -  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> -  3.12	/proc/<pid>/arch_status - Task architecture specific information
> -  3.13  /proc/<pid>/fd - List of symlinks to open files
> -  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status.
> +  3.5  /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
> +  3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +  3.8   /proc/<pid>/task/<tid>/children - Information about task children
> +  3.9   /proc/<pid>/fdinfo/<fd> - Information about opened file
> +  3.10   /proc/<pid>/map_files - Information about memory mapped files
> +  3.11  /proc/<pid>/timerslack_ns - Task timerslack value
> +  3.12	/proc/<pid>/patch_state - Livepatch patch operation state
> +  3.13	/proc/<pid>/arch_status - Task architecture specific information
> +  3.14  /proc/<pid>/fd - List of symlinks to open files
> +  3.15  /proc/<pid>/ksm_stat - Information about the process's ksm status.
>
>    4	Configuring procfs
>    4.1	Mount options
> @@ -1961,7 +1962,24 @@ For example::
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>
> -3.5	/proc/<pid>/mountinfo - Information about mounts
> +3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +---------------------------------------------------------------
> +A coredump typically takes some time to complete. If we happen to hold a write
> +lock with flock just before triggering the coredump, that write lock will not
> +be released during the entire coredump process. As a result, other processes
> +attempting to acquire the same write lock may experience significant delays.
> +Another typical scenario is that shared memory, such as dma-buf, remains
> +occupied and is not released for a long time due to core dumps.
> +
> +/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
> +dumping core.
> +
> +The following two types are supported:
> +
> +  - (bit 0) flock files
> +  - (bit 1) file-backed shared memory
> +
> +3.6	/proc/<pid>/mountinfo - Information about mounts
>  --------------------------------------------------------
>
>  This file contains lines of the form::
> @@ -2001,7 +2019,7 @@ For more information on mount propagation see:
>    Documentation/filesystems/sharedsubtree.rst
>
>
> -3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
>  --------------------------------------------------------
>  These files provide a method to access a task's comm value. It also allows for
>  a task to set its own or one of its thread siblings comm value. The comm value
> @@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
>  terminator) will result in a truncated comm value.
>
>
> -3.7	/proc/<pid>/task/<tid>/children - Information about task children
> +3.8	/proc/<pid>/task/<tid>/children - Information about task children
>  -------------------------------------------------------------------------
>  This file provides a fast way to retrieve first level children pids
>  of a task pointed by <pid>/<tid> pair. The format is a space separated
> @@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
>  if precise results are needed.
>
>
> -3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
> +3.9	/proc/<pid>/fdinfo/<fd> - Information about opened file
>  ---------------------------------------------------------------
>  This file provides information associated with an opened file. The regular
>  files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
> @@ -2198,7 +2216,7 @@ VFIO Device files
>  where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
>  file.
>
> -3.9	/proc/<pid>/map_files - Information about memory mapped files
> +3.10	/proc/<pid>/map_files - Information about memory mapped files
>  ---------------------------------------------------------------------
>  This directory contains symbolic links which represent memory mapped files
>  the process is maintaining.  Example output::
> @@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
>  comparing their inode numbers to figure out which anonymous memory areas
>  are actually shared.
>
> -3.10	/proc/<pid>/timerslack_ns - Task timerslack value
> +3.11	/proc/<pid>/timerslack_ns - Task timerslack value
>  ---------------------------------------------------------
>  This file provides the value of the task's timerslack value in nanoseconds.
>  This value specifies an amount of time that normal timers may be deferred
> @@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
>  An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
>  permissions on the task specified to change its timerslack_ns value.
>
> -3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> +3.12	/proc/<pid>/patch_state - Livepatch patch operation state
>  -----------------------------------------------------------------
>  When CONFIG_LIVEPATCH is enabled, this file displays the value of the
>  patch state for the task.
> @@ -2253,7 +2271,7 @@ patched.  If the patch is being enabled, then the task has already been
>  patched.  If the patch is being disabled, then the task hasn't been
>  unpatched yet.
>
> -3.12 /proc/<pid>/arch_status - task architecture specific status
> +3.13 /proc/<pid>/arch_status - task architecture specific status
>  -------------------------------------------------------------------
>  When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
>  architecture specific status of the task.
> @@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>
> -3.13 /proc/<pid>/fd - List of symlinks to open files
> +3.14 /proc/<pid>/fd - List of symlinks to open files
>  -------------------------------------------------------
>  This directory contains symbolic links which represent open files
>  the process is maintaining.  Example output::
> @@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
>  of stat() output for /proc/<pid>/fd for fast access.
>  -------------------------------------------------------
>
> -3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
> +3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
>  ----------------------------------------------------------------------
>  When CONFIG_KSM is enabled, each process has this file which displays
>  the information of ksm merging status.
> diff --git a/fs/coredump.c b/fs/coredump.c
> index bb6fdb1f4..e08a8a6c4 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
>  	return nr;
>  }
>
> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);

What the hell are you doing?

This is not where we unmap VMAs?

This is likely broken in subtle ways.

> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b16..a58ffffcc 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>
>  #include "internal.h"
>
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}

This code hurts my eyes.

> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c..99b5f219f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +

No comment, obviously.

> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +

Yeah who needs a comment...

> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {

What?

> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9..dfd4717c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);

You don't use extern.

>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6..0555aaf50 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12

Err do we have space for this?

You really want to add 2 more bits to mm_struct flags for this insanity?

> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)

So are these dumpable bits or not? Why are you not just incrementing
MMF_DUMPABLE_BITS?
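
To make the layout concrete, here is the arithmetic as a standalone sketch.
The values of MMF_DUMPABLE_BITS (2) and MMF_DUMP_FILTER_BITS (9) are assumed
from mainline headers and do not appear in the quoted hunk; the two helper
functions are hypothetical and only mirror what the proc read/write handlers
in the patch compute.

```c
#include <assert.h>

/* Assumed from mainline headers; not part of the quoted hunk. */
#define MMF_DUMPABLE_BITS	2
#define MMF_DUMP_FILTER_BITS	9

/* Copied from the quoted hunk. */
#define MMF_DUMP_PRE_EXIT_FLOCK	11
#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
#define MMF_DUMP_PRE_EXIT_BITS	2
#define MMF_DUMP_PRE_EXIT_MASK	\
	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)

/*
 * The hard-coded bit numbers only line up with the computed shift as long
 * as the two assumed widths above stay at 2 and 9.
 */
static_assert(MMF_DUMP_PRE_EXIT_SHIFT == MMF_DUMP_PRE_EXIT_FLOCK,
	      "first pre-exit bit must sit at the computed shift");
static_assert(MMF_DUMP_PRE_EXIT_MASK == 0x1800,
	      "mask covers bits 11 and 12");

/* Hypothetical helper: user value (bit 0 flock, bit 1 shared) to flag bits. */
unsigned long pre_exit_to_flags(unsigned long val)
{
	return (val << MMF_DUMP_PRE_EXIT_SHIFT) & MMF_DUMP_PRE_EXIT_MASK;
}

/* Hypothetical helper: flag bits back to the user-visible value. */
unsigned long pre_exit_from_flags(unsigned long flags)
{
	return (flags & MMF_DUMP_PRE_EXIT_MASK) >> MMF_DUMP_PRE_EXIT_SHIFT;
}
```

Under those assumptions the new bits sit directly above the existing filter
bits, with the bit numbers and the shift defined independently of each other.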

> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cf..b4becbf6c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285..360604d65 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif
> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448..84f1ee7f3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>
>  __setup("coredump_filter=", coredump_filter_setup);
>
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> +
>  #include <linux/init_task.h>
>
>  static void mm_init_aio(struct mm_struct *mm)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36..b955c47c0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  }
>
> +void exit_mmap_mapped_shared(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +	VMA_ITERATOR(vmi, mm, 0);
> +
> +	mmap_write_lock(mm);
> +	lru_add_drain();

Why?

> +
> +	for_each_vma(vmi, vma) {

Literally every single VMA? Including the gate VMA too?

No VMA locks... so that's already broken.

> +		if (vma->vm_flags & VM_HUGETLB)
> +			continue;

That's not how you test for hugetlb.

> +		if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)

This isn't how we work with flags any more.

> +			continue;
> +		vma->vm_file->f_flags |= O_TMPCLOS;


Not sure directly manipulating file flags like this is valid in any way, shape,
or form.

> +		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);

This is utterly broken, the outer loop will be invalidated by you removing
these, do_munmap() has its own iterator...

And this is just madly inefficient. Why wouldn't you just loop over the VMAs to
alter flags then unmap the whole range?

But this is also introducing a completely separate, duplicative, version of
exit_mmap().

You're not doing any of what that function does. You're just very inefficiently
unmapping everything?
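
The general shape of "tag first, remove afterwards" can be shown with a toy
userspace model. This is only an analogy under loud assumptions: struct region
is a made-up stand-in for a VMA, a singly linked list stands in for the VMA
tree, and none of these functions exist in the kernel. It is one reading of
the suggestion above, not a proposed implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a VMA; not a kernel structure. */
struct region {
	int shared;		/* models VM_SHARED */
	int tagged;		/* models the tag the patch puts on the file */
	struct region *next;
};

/* Toy model only: allocation failure is treated as fatal. */
struct region *push_region(struct region *head, int shared)
{
	struct region *r = calloc(1, sizeof(*r));

	assert(r != NULL);
	r->shared = shared;
	r->next = head;
	return r;
}

/* Pass 1: walk the list, tag matching regions, remove nothing. */
int tag_shared(struct region *head)
{
	int n = 0;

	for (struct region *r = head; r; r = r->next) {
		if (r->shared) {
			r->tagged = 1;
			n++;
		}
	}
	return n;
}

/*
 * Pass 2: unlink and free tagged regions. The link pointer advances only
 * when nothing was removed, so the walk never reads a freed node.
 */
int remove_tagged(struct region **headp)
{
	struct region **link = headp;
	int n = 0;

	while (*link) {
		struct region *r = *link;

		if (r->tagged) {
			*link = r->next;
			free(r);
			n++;
		} else {
			link = &r->next;
		}
	}
	return n;
}
```

The point of the split is that the tagging walk never modifies the structure
it is iterating over, and the removal step owns its own traversal.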

> +		cond_resched();

Of course!

> +	}
> +
> +	mmap_write_unlock(mm);

And VMAs can be mapped again now?

> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> --
> 2.34.1
>

I'm not sure if this idea can be made upstreamable in any way. But this patch or
anything that looks like it or fundamentally alters mm is just not acceptable,
sorry.

Lorenzo


* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: David Hildenbrand (Arm) @ 2026-06-25 11:43 UTC
  To: Pedro Falcato
  Cc: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	ebiederm, viro, jack, jlayton, chuck.lever, alex.aring, arnd,
	keescook, mcgrof, j.granados, allen.lkml, linux-fsdevel,
	linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <aj0Mr3e9yt0kU-Qj@pedro-suse>

On 6/25/26 13:18, Pedro Falcato wrote:
> On Thu, Jun 25, 2026 at 12:57:02PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> This makes no sense. I think you really need to sit down and think about
>>> a design for this that doesn't introduce state machinery for boot, mm,
>>> and the VFS in one shot to solve a fringe problem...
>>
>> Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
>> munmap and set some magical flags").
>>
>> We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
>> in the coredump". And for some reason someone should configure that, that's a
>> rather weird toggle tbh.
>>
>> And the granularity ("file-backed shared memory") is completely odd.
>>
>>
>> Aren't there other ways we could optimize this internally?
>>
>> Like, if we know that a process is dead and cannot run anymore, downgrade writes
>> to reads (and make sure we block GUP write attempts accordingly), or would that
>> also not be sufficient?
>>
>>
>> Another thought:
>>
>> fs/coredump.c calls get_dump_page().
>>
>> get_dump_page() will not fault in any memory. So if a page is not in the page
>> tables at the time of the dump, it will not get included in the coredump. Which
>> means, that whether most non-anonymous memory will be included in a coredump is
>> already like playing the lottery.
>>
>> This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
>> private modifications.
>>
>> Which makes me wonder: How much is tooling relying on file-backed pages to end
>> up in a coredump?
> 
> FWIW this mechanism already exists, see /proc/self/coredump_filter. The
> default is bits 0, 1, 4 and 5 (see core(5)), which maps back to no file pages
> being dumped to a core dump, apart from ELF headers (these help the debugger
> trace back the mapped binary to the debug info using the buildid).
> 
> So the answer to this question is "approximately none" :)
> 

Ah, thanks! vma_dump_size() honors this, and I am sure through some magical
routing the information stored in m->dump_size will end up not dumping these pages.

Staring at elf_core_dump(), this "unmap some stuff" part is really, really
nasty, as it effectively removes the VMAs->segments from the dump. (unless I am
missing something important)

-- 
Cheers,

David


* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Pedro Falcato @ 2026-06-25 11:18 UTC
  To: David Hildenbrand (Arm)
  Cc: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	ebiederm, viro, jack, jlayton, chuck.lever, alex.aring, arnd,
	keescook, mcgrof, j.granados, allen.lkml, linux-fsdevel,
	linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <9105c433-44a7-4e8f-bacb-def93d11a7f2@kernel.org>

On Thu, Jun 25, 2026 at 12:57:02PM +0200, David Hildenbrand (Arm) wrote:
> >> +
> >>  #define F_DUPFD		0	/* dup */
> >>  #define F_GETFD		1	/* get close_on_exec */
> >>  #define F_SETFD		2	/* set/clear close_on_exec */
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index a679b2448234..84f1ee7f32cf 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
> >>  
> >>  __setup("coredump_filter=", coredump_filter_setup);
> >>  
> >> +static unsigned long default_dump_pre_exit;
> >> +
> >> +static int __init coredump_pre_exit_setup(char *s)
> >> +{
> >> +	default_dump_pre_exit =
> >> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> >> +		MMF_DUMP_PRE_EXIT_MASK;
> >> +	return 1;
> >> +}
> >> +
> >> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> > 
> > This makes no sense. I think you really need to sit down and think about
> > a design for this that doesn't introduce state machinery for boot, mm,
> > and the VFS in one shot to solve a fringe problem...
> 
> Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
> munmap and set some magical flags").
> 
> We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
> in the coredump". And for some reason someone should configure that, that's a
> rather weird toggle tbh.
> 
> And the granularity ("file-backed shared memory") is completely odd.
> 
> 
> Aren't there other ways we could optimize this internally?
> 
> Like, if we know that a process is dead and cannot run anymore, downgrade writes
> to reads (and make sure we block GUP write attempts accordingly), or would that
> also not be sufficient?
> 
> 
> Another thought:
> 
> fs/coredump.c calls get_dump_page().
> 
> get_dump_page() will not fault in any memory. So if a page is not in the page
> tables at the time of the dump, it will not get included in the coredump. Which
> means, that whether most non-anonymous memory will be included in a coredump is
> already like playing the lottery.
> 
> This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
> private modifications.
> 
> Which makes me wonder: How much is tooling relying on file-backed pages to end
> up in a coredump?

FWIW this mechanism already exists, see /proc/self/coredump_filter. The
default is bits 0, 1, 4 and 5 (see core(5)), which maps back to no file pages
being dumped to a core dump, apart from ELF headers (these help the debugger
trace back the mapped binary to the debug info using the buildid).

So the answer to this question is "approximately none" :)
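
The bit arithmetic behind that default can be sketched in userspace as
follows. The CF_* names and both functions are made up for illustration; the
bit meanings are taken from core(5), and reading /proc/self/coredump_filter
assumes a Linux system with /proc mounted.

```c
#include <assert.h>
#include <stdio.h>

/* Bit meanings per core(5); the CF_* names are illustrative only. */
enum {
	CF_ANON_PRIVATE    = 1u << 0,
	CF_ANON_SHARED     = 1u << 1,
	CF_FILE_PRIVATE    = 1u << 2,
	CF_FILE_SHARED     = 1u << 3,
	CF_ELF_HEADERS     = 1u << 4,
	CF_HUGETLB_PRIVATE = 1u << 5,
};

/* The default quoted above: bits 0, 1, 4 and 5. */
#define CF_DEFAULT \
	(CF_ANON_PRIVATE | CF_ANON_SHARED | CF_ELF_HEADERS | CF_HUGETLB_PRIVATE)

static_assert(CF_DEFAULT == 0x33, "bits 0, 1, 4 and 5 add up to 0x33");

/* Returns nonzero if the filter asks for file-backed pages to be dumped. */
int cf_dumps_file_pages(unsigned int filter)
{
	return (filter & (CF_FILE_PRIVATE | CF_FILE_SHARED)) != 0;
}

/* Reads the calling process's filter; returns 0 on success, -1 on failure. */
int cf_read_self(unsigned int *filter)
{
	FILE *f = fopen("/proc/self/coredump_filter", "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%x", filter) == 1;	/* the file holds a hex value */
	fclose(f);
	return ok ? 0 : -1;
}
```

With the default value neither file-backed bit is set, which is the
"approximately none" case; writing 0x7 (as in the proc.rst example) would turn
on file-backed private pages.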

-- 
Pedro


* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: David Hildenbrand (Arm) @ 2026-06-25 10:57 UTC
  To: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	pfalcato, ebiederm, viro, jack, jlayton, chuck.lever, alex.aring,
	arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

>> +
>>  #define F_DUPFD		0	/* dup */
>>  #define F_GETFD		1	/* get close_on_exec */
>>  #define F_SETFD		2	/* set/clear close_on_exec */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index a679b2448234..84f1ee7f32cf 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>>  
>>  __setup("coredump_filter=", coredump_filter_setup);
>>  
>> +static unsigned long default_dump_pre_exit;
>> +
>> +static int __init coredump_pre_exit_setup(char *s)
>> +{
>> +	default_dump_pre_exit =
>> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
>> +		MMF_DUMP_PRE_EXIT_MASK;
>> +	return 1;
>> +}
>> +
>> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
munmap and set some magical flags").

We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
in the coredump". And for some reason someone should configure that, that's a
rather weird toggle tbh.

And the granularity ("file-backed shared memory") is completely odd.


Aren't there other ways we could optimize this internally?

Like, if we know that a process is dead and cannot run anymore, downgrade writes
to reads (and make sure we block GUP write attempts accordingly), or would that
also not be sufficient?


Another thought:

fs/coredump.c calls get_dump_page().

get_dump_page() will not fault in any memory. So if a page is not in the page
tables at the time of the dump, it will not get included in the coredump. Which
means, that whether most non-anonymous memory will be included in a coredump is
already like playing the lottery.

This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
private modifications.

Which makes me wonder: How much is tooling relying on file-backed pages to end
up in a coredump?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  8:50 UTC (permalink / raw)
  To: brauner
  Cc: alex.aring, allen.lkml, arnd, chuck.lever, david, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, linux-mm, ljs, mcgrof, mjguzik,
	pfalcato, rppt, viro
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

On Thu, 25 Jun 2026 09:28:08 +0200 Christian Brauner <brauner@kernel.org> wrote:

> > +	coredump_pre_exit=
> > +			[KNL] Change the default value for
> > +			/proc/<pid>/coredump_pre_exit.
> > +			See also Documentation/filesystems/proc.rst.
> 
> Nah, we're not doing a separate file for this. That makes no sense
> whatsoever. I've already explained this in the first mail. There are
> effectively three modes:
> 
> (1) dump to a file
> (2) spawn a super-privileged usermode helper process and connect the
>     coredumping process and said helper via a pipe
> (3) coredumping process connects to AF_UNIX socket
> 
> Parameterize (1) and (2) via command line arguments. I strongly
> suspect you're using some AI tooling so it should be able to figure out
> how this was done in the past.
> 
> (3) can be extended by just introducing a new flag value for struct
>     coredump_req. That is also illustrated by previous work.
> 
> We're not spreading procfs files. It's terrible api design especially
> for security sensitive changes.

The coredump socket approach is easier to implement because it allows for
interaction between the server and client, enabling the customization of
protocols. However, for the coredump file method, I can only think of
defining "r" and "R" through core_pattern to release flock and file-backed
shared data in advance. I'm unsure if this is feasible, as it changes the
original definition of core_pattern.

Regarding the coredump pipe, there is also a lack of a mechanism for the
pipe program to notify the coredump process, so it might still require
adding "r" and "R" at the end of core_pattern to indicate this, allowing
the coredump process to handle the early release on its own. I'm not sure
if my understanding is correct.

Even if the coredump pipe program obtains the file pointer from the process
that generated the coredump, it cannot reduce the reference count of the
file (which I understand is a very bad attempt). Since it cannot decrease
the reference count of the file, the early release must still be performed
by the task that generated the coredump. Given this situation, it seems
that we indeed need to use core_pattern for marking. I've thought for a
long time about more suitable solutions, but I haven't come up with any.


> > +#ifndef O_TMPCLOS
> > +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> > +#endif
> 
> Sorry, not going to happen. This doesn't justify the addition of a
> new uapi value at all.

OK, if I end up using it, I will not put it in a user header file.

> > +
> > +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

I'll get rid of the attempt to add a new boot-up argument for this feature.

> [Severity: High]
> Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
> iteration invalidate the outer iterator? The loop traverses the maple tree
> using the iterator vmi. However, do_munmap() creates its own internal
> VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
> iterator is not updated to reflect these structural changes, its cached
> state becomes stale, which can lead to a use-after-free when vma_next()
> is subsequently called.
> 
> via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

When executing this traversal logic, we have already acquired a lock, and
the process has been frozen. The traversal logic goes from start to finish.
Are you sure that this approach could still have issues?

> [Severity: High]
> Is it safe to iterate the file descriptor table without holding
> rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
> kills other threads, concurrent threads can still trigger expand_files(),
> which replaces the fdt and frees the old one after an RCU grace period.

Since the process has already been frozen, we shouldn't need to consider
such concurrency issues, should we?

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
> 
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

Currently, this flag will only be used by the logic we added, so I believe
there won't be any issues.

Thanks
Xin Zhao


^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Christian Brauner @ 2026-06-25  7:28 UTC (permalink / raw)
  To: David Hildenbrand, Mike Rapoport, Lorenzo Stoakes, brauner,
	mjguzik, pfalcato, ebiederm, viro, jack, jlayton, chuck.lever,
	alex.aring, arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
> 
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can
> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.
> 
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d450861e..bc6d3859f874 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>  
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.

Nah, we're not doing a separate file for this. That makes no sense
whatsoever. I've already explained this in the first mail. There are
effectively three modes:

(1) dump to a file
(2) spawn a super-privileged usermode helper process and connect the
    coredumping process and said helper via a pipe
(3) coredumping process connects to AF_UNIX socket

Parameterize (1) and (2) via command line arguments. I strongly
suspect you're using some AI tooling so it should be able to figure out
how this was done in the past.

(3) can be extended by just introducing a new flag value for struct
    coredump_req. That is also illustrated by previous work.

We're not spreading procfs files. It's terrible api design especially
for security sensitive changes.
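
For readers less familiar with the three modes, the sketch below shows how
they can be told apart from the core_pattern string alone. It assumes the
prefix convention of current kernels ('|' selects the pipe helper, '@'
selects the AF_UNIX socket, anything else is a file name template). The enum
and function names are hypothetical and this is not the parser in
fs/coredump.c.

```c
#include <stddef.h>

/* Hypothetical names for the three modes listed above. */
enum dump_mode {
	DUMP_TO_FILE,	/* (1) dump to a file */
	DUMP_TO_PIPE,	/* (2) usermode helper connected via pipe */
	DUMP_TO_SOCKET,	/* (3) coredumping process connects to AF_UNIX socket */
};

/*
 * Classify a core_pattern string by its first character.
 * NULL and empty patterns are treated as file mode here for simplicity.
 */
static enum dump_mode classify_core_pattern(const char *pattern)
{
	if (pattern == NULL || pattern[0] == '\0')
		return DUMP_TO_FILE;

	switch (pattern[0]) {
	case '|':
		return DUMP_TO_PIPE;
	case '@':
		return DUMP_TO_SOCKET;
	default:
		return DUMP_TO_FILE;
	}
}
```

Because the mode is already encoded in core_pattern, any per-mode parameter
can ride along in the same string or protocol instead of a new procfs file.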

> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);
> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>  
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b162d0..a58ffffcc31d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>  
>  #include "internal.h"
>  
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>  
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}
> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c894..99b5f219f7fa 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +
> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +
> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>  
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9dbd..dfd4717c7e3e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);
>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6a30..0555aaf50001 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>  
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cffc9..b4becbf6c0eb 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>  
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..360604d653b4 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>  
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif

Sorry, not going to happen. This doesn't justify the addition of a
new uapi value at all.

I'm also including various Sashiko comments:

sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Is it safe to expose an internal, temporary kernel flag in a UAPI header?
Userspace applications could intentionally or accidentally pass O_TMPCLOS to
open(), which might permanently pollute the userspace ABI and trigger
unexpected behavior during a coredump.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448234..84f1ee7f32cf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>  
>  __setup("coredump_filter=", coredump_filter_setup);
>  
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);

This makes no sense. I think you really need to sit down and think about
a design for this that doesn't introduce state machinery for boot, mm,
and the VFS in one shot to solve a fringe problem...





sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
iteration invalidate the outer iterator? The loop traverses the maple tree
using the iterator vmi. However, do_munmap() creates its own internal
VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
iterator is not updated to reflect these structural changes, its cached
state becomes stale, which can lead to a use-after-free when vma_next()
is subsequently called.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com
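
The hazard described above is the general "remove while iterating" problem.
The sketch below is a userland analogy only: it uses a plain singly linked
list, not the maple tree or the VMA iterator API, and all names are
hypothetical. It shows a removal loop that never caches a cursor across a
removal: after unlinking a node it re-reads the link it came from.

```c
#include <stdlib.h>

struct node {
	int shared;		/* stand-in for "this mapping is shared" */
	struct node *next;
};

/* Prepend a node; returns the new head (or the old one if malloc fails). */
static struct node *push(struct node *head, int shared)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->shared = shared;
	n->next = head;
	return n;
}

/* Remove every node with shared != 0; returns the number removed. */
static int remove_shared(struct node **head)
{
	struct node **link = head;
	int removed = 0;

	while (*link) {
		struct node *cur = *link;

		if (cur->shared) {
			*link = cur->next;	/* unlink before freeing */
			free(cur);
			removed++;
			/* link is unchanged: the next pass re-reads *link */
		} else {
			link = &cur->next;
		}
	}
	return removed;
}

static int list_len(const struct node *head)
{
	int len = 0;

	for (; head; head = head->next)
		len++;
	return len;
}

static void free_list(struct node *head)
{
	while (head) {
		struct node *next = head->next;

		free(head);
		head = next;
	}
}
```

Holding a lock keeps other threads out, but it does not help when the
traversal itself removes entries behind a cursor that caches tree state.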


sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Is it safe to iterate the file descriptor table without holding
rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
kills other threads, concurrent threads can still trigger expand_files(),
which replaces the fdt and frees the old one after an RCU grace period.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
of file->f_flags risks losing concurrent fcntl() updates since it doesn't
hold file->f_lock.

Also, if a file has duplicated file descriptors (e.g., via dup()), will
clearing O_TMPCLOS here prematurely skip the closure of the remaining
descriptors? When encountering the duplicated descriptor later, the flag
will already be cleared, leaving the shared file actively referenced.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com
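
The dup() concern can be observed from userspace with flock(2). The sketch
below takes a lock, duplicates the descriptor, closes the original, and then
probes the lock through a separate open(). The helper names and the /tmp
path template are hypothetical. This only illustrates the documented flock(2)
semantics, not the patch's code path.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Probe through a fresh open file description.
 * Returns 1 if an exclusive lock can be taken now, 0 if it is held
 * elsewhere, -1 on error.
 */
static int lock_is_free(const char *path)
{
	int fd = open(path, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) == 0)
		ret = 1;
	else
		ret = (errno == EWOULDBLOCK) ? 0 : -1;
	close(fd);	/* closing the only descriptor drops the probe lock */
	return ret;
}

/*
 * Lock a temp file, dup() the descriptor, then close the descriptors one
 * at a time, probing the lock after each close. Returns 0 on success.
 */
static int demo_dup_keeps_flock(int *free_after_first_close,
				int *free_after_last_close)
{
	char path[] = "/tmp/flock-dup-XXXXXX";
	int fd = mkstemp(path);
	int dupfd;

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX) != 0 || (dupfd = dup(fd)) < 0) {
		close(fd);
		unlink(path);
		return -1;
	}

	close(fd);		/* one descriptor gone, the duplicate remains */
	*free_after_first_close = lock_is_free(path);

	close(dupfd);		/* last reference to the open file description */
	*free_after_last_close = lock_is_free(path);

	unlink(path);
	return 0;
}
```

Per flock(2), the lock belongs to the open file description and is dropped
only once every duplicate is closed, so closing just one descriptor of a
dup()ed pair does not release it.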

-- 
Christian Brauner <brauner@kernel.org>

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-25  3:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <20260625011023.GM2636677@ZenIV>

Ah, I started replying to your first email, but this is better, this
gets to the heart of the matter. Please don't mind me responding to your
two questions in reverse.

On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> What's the fundamental difference between CWD and any open descriptor
> for a directory?  Why does it make sense to ban the former, but allow
> the equivalents done via the latter?

Yes! These two notions are very close --- but that's the *problem*, not
a reason to not care about the existence of the CWD and root FS. I want
to get rid of CWD in my processes not because it is fundamentally
different (it isn't), but because it is superfluous.

If one is capability-minded like me, it's a bad mistake that we ever had
this "working directory" notion to begin with, and yet another example
of the folks at Bell Labs sticking something in the kernel that was
really only needed by the shell, and that could have just been done in
userland.

The current working directory, roughly, is *just* some global state
holding a directory file descriptor. But I don't want that global state.
If I am writing my userland program (that is not a shell), I would not
create the global variable. I do not appreciate the fact that the kernel
foists that state upon me whether I like it or not.

Now obviously we cannot have a giant breaking change removing the notion
of a current working directory altogether. But we can allow individual
processes which don't want it to opt out, and that is what nulling out
these fields (and updating the path resolution code to cope with that)
allows.

There is no loss of expressive power doing this, because one can (and
should!) just use the `*at` and file descriptors. But there is, however,
the imposition of discipline. The programmer (or coding agent) is
encouraged to do everything with file descriptors rather than path
concatenations etc., because they need to use `*at` anyways, and then
voilà, without browbeating anyone in security seminars or code review, a
bunch of TOCTOU issues disappear simply because doing the right thing is
now the path of least resistance.
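
As a rough illustration of that discipline, the sketch below creates and
re-inspects a file purely relative to a directory descriptor, with no path
concatenation. The function name, the file name and the /tmp template are
made up for the example. The only path-based calls are the ones that create
and remove the scratch directory.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Write msg to "note.txt" inside a fresh scratch directory, addressing the
 * file only through the directory descriptor.
 * Returns the size reported by fstatat(), or -1 on error.
 */
static long write_relative_to_dirfd(const char *msg)
{
	char dir[] = "/tmp/dirfd-demo-XXXXXX";
	struct stat st;
	long size = -1;
	int dirfd, fd;

	if (mkdtemp(dir) == NULL)
		return -1;

	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		goto out_rmdir;

	/* Every later lookup starts from dirfd, not from a rebuilt path. */
	fd = openat(dirfd, "note.txt", O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd >= 0) {
		size_t len = strlen(msg);

		if (write(fd, msg, len) == (ssize_t)len &&
		    fstatat(dirfd, "note.txt", &st, 0) == 0)
			size = (long)st.st_size;
		close(fd);
		unlinkat(dirfd, "note.txt", 0);
	}
	close(dirfd);
out_rmdir:
	rmdir(dir);
	return size;
}
```

Since the directory is pinned by the descriptor, renaming or replacing the
path it was opened from does not redirect the later openat() and fstatat()
calls.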

> Please, start with explaining what, in your opinion, a mount namespace
> _is_, and where does "mount X is attached at path P relative to mount
> Y" belong.

Let's take a pathological example:

- Process A has `/foo` bind-mounted at `/bar/foo`

- Process B has `/bar` without that bind mount, and `/foo` mounted at
  `/baz/foo`, as is possible because it is in a different mount
  namespace.

If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
because `..` is resolved according to the mount referenced in the open
file. (This is, by the way, very good! Directory file descriptors would
be perilous to use if this were not the case!)

The moral of the story is that "mount X is attached at path P relative
to mount Y" is information accessed in the mounts themselves (maybe via
their containing mount namespace, per the `mnt_ns` field, or maybe not,
I am not sure, but it is immaterial). In contrast, the mount namespace
of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
doesn't matter at all for this purpose.

I am not on a crusade against `struct mnt_namespace` in general; I am
just trying to null out `(struct nsproxy)::mnt_ns` in particular. (This
is just as I am not on a crusade against `struct path`, just `root` and
`pwd` of `struct fs_struct`.)

These days, `current->nsproxy->mnt_ns` is, to me, first and foremost,
there for the legacy mount API. Again, just like our CWD example above,
this is mostly just global state.

The new mount API drastically [^1] reduces the need for it, since it
allows referring to mounts explicitly via file descriptors. That's OK!
The argument is the same as the above --- I am *not* trying to limit
what can be done if one has all the right files open with the right
perms. I am just trying to limit what works out of the box --- to reduce
the default set of privileges, *especially* where the resources involved
are implicit and/or stateful.

[^1]: It doesn't *quite* eliminate the need for `nsproxy->mnt_ns`
    entirely, since (as I understand it, from reading the `move_mount`
    man page) it is still used for some authorization checks, since
    `O_PATH` file descriptors do not grant privileges other than mere
    discoverability. But that's a problem that could be solved later
    with an `O_MOUNT` option analogous to `O_RDONLY` or `O_WRONLY`. In
    the meantime, I am perfectly happy if my processes with null mount
    namespaces get `move_mount` permission errors.

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  2:51 UTC (permalink / raw)
  To: viro
  Cc: alex.aring, allen.lkml, arnd, brauner, chuck.lever, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, mcgrof, mjguzik, pfalcato
In-Reply-To: <20260624162844.GK2636677@ZenIV>

On Wed, 24 Jun 2026 17:28:44 +0100 Al Viro <viro@zeniv.linux.org.uk> wrote:

> > +			if (file->f_flags & O_TMPCLOS) {
> > +				file->f_flags &= ~O_TMPCLOS;
> > +				goto close_fd;
> > +			}
> 
> *blink*
> 
> 	How could that possibly make sense?  Many descriptors
> may refer to the same file; what's more, many descriptor tables
> may contain such descriptors, so... just what is that code
> trying to do?

This is yet another serious mistake. Perhaps my test scenarios were not
complex enough, or I was overly confident in removing the logic that
cleared the O_TMPCLOS flag and performed debug printing only when the
reference count dropped to zero during that single close operation,
without conducting further tests.

In v5, I plan to avoid clearing the O_TMPCLOS flag to handle the situation
where multiple file descriptors map to a single file. Of course, there are
some cases where the lifecycle of this file may extend beyond the process
exit, but AFAICT such situations either cannot last long or do not involve
memory in the case where i_nlink != 0. Therefore, keeping this flag seems
unlikely to cause any issues.
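
The point that several descriptors can refer to one file is visible from
userspace through the file status flags. The sketch below sets O_NONBLOCK
through one descriptor and reads it back through its dup(). The function
name is hypothetical, and O_NONBLOCK merely stands in for any flag kept in
the open file description. Userspace cannot see a kernel-internal bit.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a status flag set through one descriptor is visible through
 * its duplicate, 0 if it is not, -1 on error.
 */
static int flags_shared_across_dup(void)
{
	int p[2];
	int dupfd, fl, ret = -1;

	if (pipe(p) != 0)
		return -1;

	dupfd = dup(p[0]);
	if (dupfd < 0)
		goto out;

	/* Set the flag through the original descriptor only. */
	fl = fcntl(p[0], F_GETFL);
	if (fl < 0 || fcntl(p[0], F_SETFL, fl | O_NONBLOCK) != 0)
		goto out_dup;

	/* Read it back through the duplicate. */
	fl = fcntl(dupfd, F_GETFL);
	if (fl >= 0)
		ret = (fl & O_NONBLOCK) ? 1 : 0;

out_dup:
	close(dupfd);
out:
	close(p[0]);
	close(p[1]);
	return ret;
}
```

Any per-file marker therefore applies to every descriptor, in every
descriptor table, that references the same open file description.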

Since this flag is no longer used temporarily (it will never be cleared),
I would like to rename it to O_PRECLOS.

Thanks
Xin Zhao


^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Al Viro @ 2026-06-25  1:10 UTC (permalink / raw)
  To: John Ericson
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <103524f8-1658-41df-88e9-cf49c628a721@app.fastmail.com>

On Wed, Jun 24, 2026 at 07:53:53PM -0400, John Ericson wrote:
> I wanted to discuss a bit about each type of namespace to indicate that
> this is a concept I think works across the board --- it wouldn't be such
> a good solution for the process spawning API if it was only applicable
> to some but not all namespace types. But the truth is that I have
> thought about the FS cases the most, as I think you have picked up on.
> 
> If there is interest in landing
> 
>   1. null CWD
>   2. null root fs
>   3. null mount namespace
> 
> in isolation, and then returning to the other namespaces to iron out
> their details, that would be fantastic. It would be much nicer for me to
> get some momentum that way, without having to design everything all at
> once first before getting to implement anything.

Please, start with explaining what, in your opinion, a mount namespace _is_,
and where does "mount X is attached at path P relative to mount Y" belong.

What's the fundamental difference between CWD and any open descriptor for
a directory?  Why does it make sense to ban the former, but allow the
equivalents done via the latter?

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-24 23:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan, Al Viro, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrU3bgUxp0k1y-U-uL0-fW2016Gmsyu9O_=830czEUGMcQ@mail.gmail.com>

On Wed, Jun 24, 2026, at 7:20 PM, Andy Lutomirski wrote:
> I think I like this, but some comments:

Thanks, that's really nice to hear!

While arguably this is just the culmination of a direction Linux has
been going in for a while, it could also be seen as a very "out there"
idea. That at least one person likes the rough sound of things makes me
feel a lot better!

> On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> > >   - null current working directory: relative paths with traditional,
> > >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> >
> > It's perfectly valid to cd to a directory that does not belong to
> > one's namespace.  We have fchdir.  What's wrong with letting it
> > continue working?
> >
> > Regardless of that, the current directory either needs to be a
> > directory or to be nothing at all, and if we support the latter, we
> > need to figure out what /proc will show.
>
> Thinking about this more: I think that handling CWD might actually be
> a prerequisite for the series and has little to do with namespaces.
> Maybe try adding, as a standalone feature, the ability to have a null
> CWD.  Define semantics and see what the implementation looks like.
>
> Then, if you add null namespaces, you could optionally make
> transitioning to a null namespace set a null CWD.  Or those features
> could be orthogonal.

Hehe, I had the same thought after working on the filesystem patches,
along with the analogous thought for the root filesystem. It had been so
long since I had done a `chroot` without also doing a mount namespace
`unshare` --- despite the former being much older --- that I had
forgotten this separation of concerns.

My apologies for forgetting to include this insight in the original
email.

> Maybe the way to go is to implement the ones that have clearer
> semantics and to defer the others.

I would much prefer this, actually.

I wanted to discuss a bit about each type of namespace to indicate that
this is a concept I think works across the board --- it wouldn't be such
a good solution for the process spawning API if it was only applicable
to some but not all namespace types. But the truth is that I have
thought about the FS cases the most, as I think you have picked up on.

If there is interest in landing

  1. null CWD
  2. null root fs
  3. null mount namespace

in isolation, and then returning to the other namespaces to iron out
their details, that would be fantastic. It would be much nicer for me to
get some momentum that way, without having to design everything all at
once first before getting to implement anything.

John


* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-24 23:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Ericson, Li Chen, Cong Wang, Christian Brauner, linux-arch,
	linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrWhXNetw-BsAaoyT31suMmjYLdMh9MAuLB2Lvk2ac-31g@mail.gmail.com>

On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:

> >   - null current working directory: relative paths with traditional,
> >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
>
> It's perfectly valid to cd to a directory that does not belong to
> one's namespace.  We have fchdir.  What's wrong with letting it
> continue working?
>
> Regardless of that, the current directory either needs to be a
> directory or to be nothing at all, and if we support the latter, we
> need to figure out what /proc will show.

Thinking about this more: I think that handling CWD might actually be
a prerequisite for the series and has little to do with namespaces.
Maybe try adding, as a standalone feature, the ability to have a null
CWD.  Define semantics and see what the implementation looks like.

Then, if you add null namespaces, you could optionally make
transitioning to a null namespace set a null CWD.  Or those features
could be orthogonal.


* Re: [RFC] Null Namespaces
From: Al Viro @ 2026-06-24 23:12 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <a49ce818-f38d-41b0-bbf7-80b8aad998b1@app.fastmail.com>

On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:

> #### Null mount namespace
> 
> - requires:
> 
>   - null root file system: absolute paths don't work.
> 
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> 
> - All operations relating to the "ambient" mount tree don't work.
> 
> - `*at` operations with a file descriptor do work.

Huh?  The last bit looks like it contradicts the previous one - if you
have an opened directory in a mount from some namespace, those `*at`
operations with that descriptor *will* be seeing the mount tree of that
namespace, whatever the hell "ambient" is supposed to mean.  Either
that, or you will be exposing whatever's overmounted in that mount,
which is a huge can of worms.


* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-24 23:06 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria
In-Reply-To: <a49ce818-f38d-41b0-bbf7-80b8aad998b1@app.fastmail.com>

On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> Hello, I am hoping to discuss an idea I've had for a while, that I am
> calling "null namespaces" that has become more relevant with some recent
> other discussions. First I'll discuss null namespaces in general terms,
> and then I'll link those recent discussions and relate null namespaces
> to them.
>
> ### Null namespaces
>
> The essence of null namespaces is trying to give processes as little
> ambient authority as possible, so they are lighter weight and allowed to
> do even less than fully unshared processes today.
>
> Namespaces as they exist today are frequently described as an isolation
> mechanism, but I think this is the conflation of two different things.
> *Removing* a new process from its parent's namespaces unquestionably is
> increasing isolation --- no disagreement there. But putting the process
> in new namespaces is something else; I would call it supporting
> "delusions of grandeur" of that process. For example, namespaces allow a
> process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
> look up other processes by PID, etc.
>
> Conceptually, to remove a process from one ambient authority scope (the
> very name "namespaces" indicates they are about ambient authority)
> should not require putting it in some ambient authority scope. Just
> because, for example, the process cannot see one mount tree, doesn't
> mean it needs to see another.

I think I like this, but some comments:

>
> Here's what I am thinking would happen concretely:
>
> First, the simpler cases:
>
> #### Null mount namespace
>
> - requires:
>
>   - null root file system: absolute paths don't work.
>
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.

It's perfectly valid to cd to a directory that does not belong to
one's namespace.  We have fchdir.  What's wrong with letting it
continue working?

Regardless of that, the current directory either needs to be a
directory or to be nothing at all, and if we support the latter, we
need to figure out what /proc will show.

> #### Null user namespace

A user namespace is kind of about how *non-current* uids and gids work
for the process and how it perceives its own uid and gid and not so
much about what uid and gid it has when accessing outside resources.
So...

>
> - Process has no user or group ids

What does that mean?  What does ps show?



Maybe the way to go is to implement the ones that have clearer
semantics and to defer the others.


* [RFC] Null Namespaces
From: John Ericson @ 2026-06-24 22:51 UTC (permalink / raw)
  To: Li Chen, Cong Wang, Christian Brauner, linux-arch
  Cc: linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan, Alexander Viro, Kees Cook, Sergei Zimmerman,
	Farid Zakaria

Hello, I am hoping to discuss an idea I've had for a while, that I am
calling "null namespaces" that has become more relevant with some recent
other discussions. First I'll discuss null namespaces in general terms,
and then I'll link those recent discussions and relate null namespaces
to them.

### Null namespaces

The essence of null namespaces is trying to give processes as little
ambient authority as possible, so they are lighter weight and allowed to
do even less than fully unshared processes today.

Namespaces as they exist today are frequently described as an isolation
mechanism, but I think this is the conflation of two different things.
*Removing* a new process from its parent's namespaces unquestionably is
increasing isolation --- no disagreement there. But putting the process
in new namespaces is something else; I would call it supporting
"delusions of grandeur" of that process. For example, namespaces allow a
process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
look up other processes by PID, etc.

Conceptually, to remove a process from one ambient authority scope (the
very name "namespaces" indicates they are about ambient authority)
should not require putting it in some ambient authority scope. Just
because, for example, the process cannot see one mount tree, doesn't
mean it needs to see another.

Here's what I am thinking would happen concretely:

First, the simpler cases:

#### Null mount namespace

- requires:

  - null root file system: absolute paths don't work.

  - null current working directory: relative paths with traditional,
    non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.

- All operations relating to the "ambient" mount tree don't work.

- `*at` operations with a file descriptor do work.

- The new fd-based mount APIs with detached mounts do work, modulo
  the calling process having enough permissions (as usual).
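
As an illustration of the distinction drawn in this list, here is a
minimal userspace sketch using today's `openat(2)`. It is not part of
the proposal, and the helper names `open_beneath` and `open_ambient` are
made up for this example. The first helper resolves a relative name
against an explicit directory descriptor, which is the style that would
be expected to keep working with a null root and a null CWD; the second
goes through `AT_FDCWD`, which is the style that would be expected to
fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `name` relative to an already-open directory
 * descriptor. An absolute name would make the kernel ignore dirfd and use
 * the ambient root, so it is rejected here.
 */
int open_beneath(int dirfd, const char *name)
{
	/* AT_FDCWD would silently fall back to the ambient CWD. */
	assert(dirfd != AT_FDCWD);

	if (name[0] == '/') {
		errno = EINVAL;
		return -1;
	}
	return openat(dirfd, name, O_RDONLY | O_CLOEXEC);
}

/*
 * Hypothetical helper: the same call, but resolved against the ambient
 * current working directory via AT_FDCWD.
 */
int open_ambient(const char *name)
{
	return openat(AT_FDCWD, name, O_RDONLY | O_CLOEXEC);
}
```

Note that even the fd-relative form can still reach ambient state today,
for example through `..` components or symlinks with absolute targets;
`openat2(2)` with `RESOLVE_BENEATH` is the existing way to forbid that.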

#### Null network namespace

- No network interfaces

- No abstract Unix sockets

#### Null IPC namespace

- cannot create or look up either type of message queue

#### Null UTS namespace

- no hostname or domainname: `uname`, `gethostname`/`sethostname`, and the
  related `/proc/sys/kernel` sysctls all fail.

#### Null user namespace

- Process has no user or group ids

- All uid/gid-based authorization lookups return "denied"

- -1 / "nobody" IDs for operations we don't want to fail (like `fstat`)
  can be used.

Note how in each of these, the question of whether there "exists" a
"single" null namespace is degenerate --- processes with a null
namespace field are as isolated from one another (along the axis that
namespace regulates) as they are from processes in other namespaces. It
is truly a minimal permission level, and (as we shall see) cheap too,
because it is just a null pointer in `task_struct`.

Then for the nested ones --- PID and cgroup --- we cannot have quite a
null namespace in the same sense, because it is an important property
that these namespaces are hierarchical up to the root namespaces.
Instead of having a disjoint null namespace, we need a null namespace
with a parent.

#### Null PID namespace

- cannot look up other processes by PID

- current process ID lookup fails

- current process's parent process ID lookup fails

- current process still assigned IDs in parent PID namespaces, per usual

#### Null cgroup namespace

- Process still can have resources restricted according to parent cgroup

- Process unaware of cgroup hierarchy though --- blind to who/how it is
  constrained

In these cases, we cannot just implement this with a null pointer,
because we still need a valid parent namespace. However, we shouldn't
need any info *but* the parent namespace. A pointer paired with a bool
should implement this: the bool says whether the pointer refers to the
parent of a null namespace or to an actual namespace the process is a
member of, and some sort of helper gets the parent namespace in either
case (since an actual namespace has its parent).
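
A minimal sketch of that pointer-plus-bool idea, written as standalone C
rather than kernel code. All names here (`hier_ns`, `hier_ns_ref`,
`ns_ref_member`, `ns_ref_parent`) are hypothetical and do not come from
the draft patches; the real representation could well look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hierarchical namespace (PID or cgroup). */
struct hier_ns {
	struct hier_ns *parent;	/* NULL only for the root namespace */
};

/*
 * Hypothetical per-task reference. If is_null is false, ns is the
 * namespace the task is a member of. If is_null is true, the task is in
 * a null namespace and ns is that null namespace's parent.
 */
struct hier_ns_ref {
	struct hier_ns *ns;
	bool is_null;
};

/* Namespace the task is a member of, or NULL for a null namespace. */
static inline struct hier_ns *ns_ref_member(const struct hier_ns_ref *ref)
{
	return ref->is_null ? NULL : ref->ns;
}

/* Parent namespace, obtained the same way in both cases. */
static inline struct hier_ns *ns_ref_parent(const struct hier_ns_ref *ref)
{
	assert(ref->ns != NULL);
	return ref->is_null ? ref->ns : ref->ns->parent;
}
```

The point of the helper is that code which only cares about the
hierarchy (for example, assigning IDs in ancestor PID namespaces) would
not need to know which of the two cases it is dealing with.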

Finally there is the time namespace. Conceptually a null time namespace
is simple enough --- you cannot look up the time! --- but the
implementation is a bit more complex to get right because of the vDSO
for certain timing operations.

### General Motivation

Why am I so interested in this stuff?

Firstly it is because I have always been interested in a more strictly
object-capability-based userland, and projects like
Capsicum/CloudABI/WASI. I think going all in on file descriptors is
generally the direction that Linux has been going in, and it creates a
genuinely better programming model than the traditional Unix one with
all its ambient authority, and the TOCTOU and other issues that attend
it.

Today's container idioms and the "delusions of grandeur" that namespaces
provide are great for retrofitting existing software to run in a more
isolated environment. But I don't want that to be the ceiling of our
ambitions. Especially in this age of LLM refactoring, it is very easy to
get both new and existing software to abide by the more limited set of
allowed operations that null-namespace processes allow. And the
modifications that that entails (more `openat`, more socket activation,
etc.) make that software (in my view) simply *better* --- I would want
it to work that way with or without these constraints forcing the issue.

Secondly, and more concretely/imminently as a Nix developer, I am very
interested in the performance and overhead of process isolation. It is
very much my ambition to move Nix into the Bazel/Buck space of ever more
numerous and fine-grained atomic build steps (i.e. small compilation
units, not "packages"), but to do this *without* sacrificing Nix's
strong sandboxing guarantees that make our build plans so self-contained
and thus the ease of onboarding new Nix users.

I think this "null namespace" sandboxing will likely be simpler and more
performant than creating and destroying a bunch of regular namespaces
for each compilation unit. And while it will no doubt take some compiler
/ other tool patching to fix up any assumptions that get in the way of
running processes with so few permissions, I am happy to take a stab at
that too. Nix is, after all, for "tool-assisted yak shaves" as someone
put it --- patching GCC / Clang / whatever and then rebuilding the world
is something we are quite good at.

Lastly, I'll add that the traditional way people have approached things
like Capsicum/CloudABI is with custom personalities or seccomp rules,
but IMO trying to tackle the massive UAPI surface area so shallowly is
ugly and unmaintainable. Nulling out namespace fields in `task_struct`,
conversely, attacks the problem at its core, much more elegantly, and
makes it easy to handle both current *and future* syscalls in a
minimally invasive and maintainable manner.

### Null namespaces and process spawning

Why bring this up now?

Recently [1], Li Chen took a stab at the venerable old goal of making a
better process spawning UAPI than fork/clone + exec. I am quite excited
to see this happen, as it generally dovetails very nicely with the
object capability goals I have above. (E.g. making it performant and
idiomatic to opt-in, rather than opt-out of sharing file descriptors
with a child process is very good for a world where all
resource/privilege sharing is done with file descriptors.)

One problem with clone that has not yet come up is that its defaults are
not good from a security perspective: sharing by default, with unsharing
as the opt-in, means that one must remember and take active measures to
ensure that child processes get *fewer* privileges. This is very bad ---
secure practices mean that the "lazy programmer" and the "smallest
program" must always err on the side of giving the child process *fewer*
privileges. This is the only way economics and the "principle of least
privilege" will work together, rather than against each other (and
economics is quite likely to win when they are working against each
other).

The reason that clone *doesn't* work that way is, of course,
performance: it would be wasteful to unshare and create new namespaces
when they are just going to be thrown away because the user wants to
share after all.

Null namespaces I think elegantly work around this performance/security
trade-off, while also avoiding the need for gazillion-parameter syscalls
like clone. This is because, as the most secure option, and a cheap
option, they are the rightful default for a new process creation API.

1. When an "embryonic" (under construction, not yet ready to be
   scheduled) task is first created, it should have all null namespaces.

2. Separate syscalls (`io_uring` exists for batching, we don't need to
   reinvent an ad-hoc batch solution) can exist for setting the
   namespaces on the process, where either "sharing" (use parent process
   namespace) or "unsharing" (use fresh namespace, usually derived from
   the parent process namespace but perhaps derived from a different
   one) are choices that can be opted into instead of the null namespace
   default.

3. After all state is initialized (arguments, environment variables,
   file descriptors, namespaces, etc.), the process can be "birthed",
   and submitted as ready to be scheduled.
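
The three steps above can be modelled in a few lines of plain C. This is
only a toy model of the defaults, under the assumption that each flat
namespace slot is a pointer where NULL means "null namespace"; it makes
no system calls, and every name in it (`embryo`, `embryo_init`,
`embryo_share_ns`, `embryo_birth`) is hypothetical rather than a
proposed UAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flat namespace types only; PID and cgroup need a parent, see above. */
enum ns_kind { NS_MNT, NS_NET, NS_IPC, NS_UTS, NS_USER, NS_KIND_COUNT };

/* Opaque stand-in for any namespace object. */
struct ns {
	int id;
};

/* Hypothetical model of a task that is still under construction. */
struct embryo {
	const struct ns *ns[NS_KIND_COUNT];	/* NULL means null namespace */
	bool born;
};

/* Step 1: a fresh embryo holds no namespaces at all. */
void embryo_init(struct embryo *e)
{
	for (int i = 0; i < NS_KIND_COUNT; i++)
		e->ns[i] = NULL;
	e->born = false;
}

/* Step 2: sharing is an explicit opt-in, one namespace at a time. */
int embryo_share_ns(struct embryo *e, enum ns_kind kind,
		    const struct ns *parent_ns)
{
	if (e->born)
		return -1;	/* configuration is closed after birth */
	e->ns[kind] = parent_ns;
	return 0;
}

/* Step 3: mark the task as ready to be scheduled. */
void embryo_birth(struct embryo *e)
{
	assert(!e->born);
	e->born = true;
}
```

A caller that does nothing between `embryo_init()` and `embryo_birth()`
ends up with the least-privileged child, which is the property the
"lazy programmer" argument above asks for.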

This design is very natural to me, but its full naturalness is *only*
available with the null namespace option. Otherwise we are stuck in a
place of no good defaults, and the "builder pattern" seems more awkward.

Also in [2], I bring up a design for unix sockets without the file
system or the "abstract" socket namespace, and how I want to avoid both
in order to firmly rule out TOCTOU and other ambient authority issues. I
think those arguments stand on their own, but the possibility of a null
network namespace sharpens the issue: it forces the `O_PATH` FD stuff I
discuss to be the only viable option.

### Implementation

I've "LLM'd" out some draft patches [3] for this. I'm not submitting
them because I still need to review and test them, and I don't want
what is, before those steps, low-quality slop to tarnish this proposal.
What this initial exploration did confirm for me, however, is that these
changes should be quite lightweight to implement. (Also, what I propose
here differs slightly from my implementation draft in a few cases where
I think the design proposed here is better than the draft.)

If the discussion here starts moving towards consensus, I'll clean up
and rework those patches along the lines of the consensus. Ideally I
would submit them one at a time, I figure, since the implementations for
different namespaces are necessarily changes to different subsystems.

Cheers!

John

[1]: https://lore.kernel.org/all/20260528095235.2491226-1-me@linux.beauty/

[2]: https://lore.kernel.org/all/455281ec-3ee1-4f27-989b-c239f0690d8b@app.fastmail.com/

[3]: https://github.com/Ericson2314/linux/commits/null-namespace


* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Al Viro @ 2026-06-24 16:28 UTC (permalink / raw)
  To: Xin Zhao
  Cc: brauner, mjguzik, pfalcato, ebiederm, jack, jlayton, chuck.lever,
	alex.aring, arnd, keescook, mcgrof, j.granados, allen.lkml,
	linux-fsdevel, linux-kernel, linux-arch
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}

*blink*

	How could that possibly make sense?  Many descriptors
may refer to the same file; what's more, many descriptor tables
may contain such descriptors, so... just what is that code
trying to do?
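
The aliasing problem can be seen from userspace without any kernel
changes. The sketch below is an illustration only, not code from the
patch, and the function name `flag_visible_through_dup` is made up for
it. It sets a file status flag through one descriptor and reads it back
through a duplicate; since both descriptors refer to the same open file,
the flag is expected to be visible through both, which is why a marker
bit stored in `f_flags` cannot identify one particular descriptor.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Returns true if a status flag set through one descriptor is observed
 * through a duplicate of it. Uses a pipe so no files are touched.
 */
bool flag_visible_through_dup(void)
{
	int p[2];
	bool visible = false;

	if (pipe(p) != 0)
		return false;

	int alias = dup(p[0]);	/* second descriptor, same open file */
	if (alias >= 0) {
		int before = fcntl(alias, F_GETFL);

		if (before != -1 &&
		    fcntl(alias, F_SETFL, before | O_NONBLOCK) == 0) {
			/* Read the flags back through the original fd. */
			int seen = fcntl(p[0], F_GETFL);

			visible = seen != -1 && (seen & O_NONBLOCK) != 0;
		}
		close(alias);
	}
	close(p[0]);
	close(p[1]);
	return visible;
}
```

The same sharing applies across `fork()`, where the child's descriptor
table refers to the same open files as the parent's.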


* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24 15:24 UTC (permalink / raw)
  To: Petr Mladek
  Cc: K Prateek Nayak, linux-arch, linux-kernel, sched-ext, netdev,
	David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, Paolo Abeni, Peter Zijlstra,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <ajugq8VAciqtMx9F@pathway.suse.cz>

On 2026-06-24 11:17:31 [+0200], Petr Mladek wrote:
> For Linus, it was a no-go, definitely.
> I would vote for adding the WARN_*DEFERRED() into the scheduler code
> at least until majority of console drivers are converted to nbcon API.

I see four nbcon serial console drivers (+netconsole, + drm_log). We
have at least four times that many console drivers. What is the
majority from your point of view? The 8250 should cover all of x86.

> Best Regards,
> Petr

Sebastian


* [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-24 14:55 UTC (permalink / raw)
  To: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
	allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao

A coredump typically takes some time to complete. If we happen to hold a
write lock with flock just before triggering the coredump, that write lock
will not be released during the entire coredump process. As a result,
other processes attempting to acquire the same write lock may experience
significant delays. Another typical scenario is that shared memory, such
as dma-buf, remains occupied and is not released for a long time due to
core dumps.

To address this, add a /proc/<pid>/coredump_pre_exit node so that users
can specify which resources they want released before the core is
dumped. This patch implements the early release of two types of
resources: flock files and file-backed shared memory. By default,
nothing is released early.

A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
the execution of the newly introduced exit_mmap_mapped_shared() function.
In this way, the subsequent exit_files_pre_exit() function does not need
to find the corresponding vma through the file to check for the VM_SHARED
attribute, thereby reducing the traversal cost.

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---

Change in v4:
- Christian pointed out that the coredump process will traverse file
  descriptors (fd), so certain fds should not be closed by default.
  Rework the whole feature: add /proc/<pid>/coredump_pre_exit so users
  can select which resources to pre-exit; the default is to NOT pre-exit
  anything.
- Mateusz suggested that walking the fd table and releasing the file
  locks is reasonable. No longer release all the fd(s). Based on user
  config, at most the flock fd(s) and the fd(s) corresponding to
  file-backed shared memory will be released.

Change in v3:
- Add a comment and commit-log text explaining why the
  MMF_DUMP_MAPPED_SHARED mm_flags_test() check is done. Note that memory
  mapped files keep their own separate references to the files. The case
  to work around is that unlocking a flock on a file early allows other
  processes to lock and modify the mapped data protected by the flock,
  as suggested by Pedro Falcato.
- Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/

Change in v2:
- Drop the new fcntl API; the issue is not worth inflicting the cost on
  everyone,
  as suggested by Al Viro.
- Call exit_files() in coredump_wait(),
  as suggested by Eric W. Biederman.
  Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
  need to dump file-backed shared memory.
- Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/

v1:
- Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
---
 .../admin-guide/kernel-parameters.txt         |  5 ++
 Documentation/filesystems/proc.rst            | 58 +++++++++-----
 fs/coredump.c                                 | 23 ++++++
 fs/file.c                                     | 46 +++++++++++
 fs/proc/base.c                                | 78 +++++++++++++++++++
 include/linux/mm.h                            |  1 +
 include/linux/mm_types.h                      |  9 +++
 include/linux/sched/task.h                    |  1 +
 include/uapi/asm-generic/fcntl.h              |  4 +
 kernel/fork.c                                 | 12 +++
 mm/mmap.c                                     | 21 +++++
 11 files changed, 238 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f575d4508..bc6d3859f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1024,6 +1024,11 @@ Kernel parameters
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.rst.
 
+	coredump_pre_exit=
+			[KNL] Change the default value for
+			/proc/<pid>/coredump_pre_exit.
+			See also Documentation/filesystems/proc.rst.
+
 	coresight_cpu_debug.enable
 			[ARM,ARM64]
 			Format: <bool>
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index db6167bef..6a637d31d 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -39,16 +39,17 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
-  3.5	/proc/<pid>/mountinfo - Information about mounts
-  3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
-  3.7   /proc/<pid>/task/<tid>/children - Information about task children
-  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
-  3.9   /proc/<pid>/map_files - Information about memory mapped files
-  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
-  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
-  3.12	/proc/<pid>/arch_status - Task architecture specific information
-  3.13  /proc/<pid>/fd - List of symlinks to open files
-  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status.
+  3.5  /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
+  3.6	/proc/<pid>/mountinfo - Information about mounts
+  3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+  3.8   /proc/<pid>/task/<tid>/children - Information about task children
+  3.9   /proc/<pid>/fdinfo/<fd> - Information about opened file
+  3.10   /proc/<pid>/map_files - Information about memory mapped files
+  3.11  /proc/<pid>/timerslack_ns - Task timerslack value
+  3.12	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.13	/proc/<pid>/arch_status - Task architecture specific information
+  3.14  /proc/<pid>/fd - List of symlinks to open files
+  3.15  /proc/<pid>/ksm_stat - Information about the process's ksm status.
 
   4	Configuring procfs
   4.1	Mount options
@@ -1961,7 +1962,24 @@ For example::
   $ echo 0x7 > /proc/self/coredump_filter
   $ ./some_program
 
-3.5	/proc/<pid>/mountinfo - Information about mounts
+3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
+---------------------------------------------------------------
+A coredump typically takes some time to complete. If we happen to hold a write
+lock with flock just before triggering the coredump, that write lock will not
+be released during the entire coredump process. As a result, other processes
+attempting to acquire the same write lock may experience significant delays.
+Another typical scenario is that shared memory, such as dma-buf, remains
+occupied and is not released for a long time due to core dumps.
+
+/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
+dumping core.
+
+The following two types are supported:
+
+  - (bit 0) flock files
+  - (bit 1) file-backed shared memory
+
+3.6	/proc/<pid>/mountinfo - Information about mounts
 --------------------------------------------------------
 
 This file contains lines of the form::
@@ -2001,7 +2019,7 @@ For more information on mount propagation see:
   Documentation/filesystems/sharedsubtree.rst
 
 
-3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
 --------------------------------------------------------
 These files provide a method to access a task's comm value. It also allows for
 a task to set its own or one of its thread siblings comm value. The comm value
@@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
 terminator) will result in a truncated comm value.
 
 
-3.7	/proc/<pid>/task/<tid>/children - Information about task children
+3.8	/proc/<pid>/task/<tid>/children - Information about task children
 -------------------------------------------------------------------------
 This file provides a fast way to retrieve first level children pids
 of a task pointed by <pid>/<tid> pair. The format is a space separated
@@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
 if precise results are needed.
 
 
-3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
+3.9	/proc/<pid>/fdinfo/<fd> - Information about opened file
 ---------------------------------------------------------------
 This file provides information associated with an opened file. The regular
 files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
@@ -2198,7 +2216,7 @@ VFIO Device files
 where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
 file.
 
-3.9	/proc/<pid>/map_files - Information about memory mapped files
+3.10	/proc/<pid>/map_files - Information about memory mapped files
 ---------------------------------------------------------------------
 This directory contains symbolic links which represent memory mapped files
 the process is maintaining.  Example output::
@@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
 comparing their inode numbers to figure out which anonymous memory areas
 are actually shared.
 
-3.10	/proc/<pid>/timerslack_ns - Task timerslack value
+3.11	/proc/<pid>/timerslack_ns - Task timerslack value
 ---------------------------------------------------------
 This file provides the value of the task's timerslack value in nanoseconds.
 This value specifies an amount of time that normal timers may be deferred
@@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
 An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
 permissions on the task specified to change its timerslack_ns value.
 
-3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+3.12	/proc/<pid>/patch_state - Livepatch patch operation state
 -----------------------------------------------------------------
 When CONFIG_LIVEPATCH is enabled, this file displays the value of the
 patch state for the task.
@@ -2253,7 +2271,7 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
-3.12 /proc/<pid>/arch_status - task architecture specific status
+3.13 /proc/<pid>/arch_status - task architecture specific status
 -------------------------------------------------------------------
 When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
 architecture specific status of the task.
@@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
   the task is unlikely an AVX512 user, but depends on the workload and the
   scheduling scenario, it also could be a false negative mentioned above.
 
-3.13 /proc/<pid>/fd - List of symlinks to open files
+3.14 /proc/<pid>/fd - List of symlinks to open files
 -------------------------------------------------------
 This directory contains symbolic links which represent open files
 the process is maintaining.  Example output::
@@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
 of stat() output for /proc/<pid>/fd for fast access.
 -------------------------------------------------------
 
-3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
+3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
 ----------------------------------------------------------------------
 When CONFIG_KSM is enabled, each process has this file which displays
 the information of ksm merging status.
diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f4..e08a8a6c4 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
 	return nr;
 }
 
+static void coredump_pre_exit(void)
+{
+	struct task_struct *tsk = current;
+	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
+
+	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
+		return;
+
+	/*
+	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
+	 */
+	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
+	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
+		exit_mmap_mapped_shared(tsk->mm);
+
+	/*
+	 * Check O_TMPCLOS of file f_flags to close file and clear it.
+	 */
+	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
+}
+
 static int coredump_wait(int exit_code, struct core_state *core_state)
 {
 	struct task_struct *tsk = current;
@@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
 		return;
 	}
 
+	coredump_pre_exit();
+
 	switch (cn->core_type) {
 	case COREDUMP_FILE:
 		if (!coredump_file(cn, cprm, binfmt))
diff --git a/fs/file.c b/fs/file.c
index 2c81c0b16..a58ffffcc 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -23,6 +23,7 @@
 #include <linux/file_ref.h>
 #include <net/sock.h>
 #include <linux/init_task.h>
+#include <linux/filelock.h>
 
 #include "internal.h"
 
@@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
 	}
 }
 
+void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
+{
+	struct files_struct *files = tsk->files;
+	struct fdtable *fdt;
+	struct file *file;
+	unsigned int i, j = 0;
+
+	if (!files)
+		return;
+
+	fdt = rcu_dereference_raw(files->fdt);
+	for (;;) {
+		unsigned long set;
+
+		i = j * BITS_PER_LONG;
+		if (i >= fdt->max_fds)
+			break;
+		set = fdt->open_fds[j++];
+		while (set) {
+			if (!(set & 1))
+				goto next_fd;
+			file = fdt->fd[i];
+			if (!file)
+				goto next_fd;
+			if (file->f_flags & O_TMPCLOS) {
+				file->f_flags &= ~O_TMPCLOS;
+				goto close_fd;
+			}
+			if (!checkflock)
+				goto next_fd;
+			if (!vfs_inode_has_locks(file_inode(file)))
+				goto next_fd;
+
+close_fd:
+			fdt->fd[i] = NULL;
+			filp_close(file, files);
+			cond_resched();
+
+next_fd:
+			i++;
+			set >>= 1;
+		}
+	}
+}
+
 struct files_struct init_files = {
 	.count		= ATOMIC_INIT(1),
 	.fdt		= &init_files.fdtab,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c..99b5f219f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
 	.write		= proc_coredump_filter_write,
 	.llseek		= generic_file_llseek,
 };
+
+static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
+					   size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file_inode(file));
+	struct mm_struct *mm;
+	char buffer[PROC_NUMBUF];
+	size_t len;
+	int ret;
+
+	if (!task)
+		return -ESRCH;
+
+	ret = 0;
+	mm = get_task_mm(task);
+	if (mm) {
+		unsigned long flags = __mm_flags_get_dumpable(mm);
+
+		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
+			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
+				MMF_DUMP_PRE_EXIT_SHIFT));
+		mmput(mm);
+		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
+	}
+
+	put_task_struct(task);
+
+	return ret;
+}
+
+static ssize_t proc_coredump_pre_exit_write(struct file *file,
+					    const char __user *buf,
+					    size_t count,
+					    loff_t *ppos)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned int val;
+	int ret;
+	int i;
+	unsigned long mask;
+
+	ret = kstrtouint_from_user(buf, count, 0, &val);
+	if (ret < 0)
+		return ret;
+
+	ret = -ESRCH;
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		goto out_no_task;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_no_mm;
+	ret = 0;
+
+	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
+		if (val & mask)
+			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
+		else
+			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
+	}
+
+	mmput(mm);
+ out_no_mm:
+	put_task_struct(task);
+ out_no_task:
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static const struct file_operations proc_coredump_pre_exit_operations = {
+	.read		= proc_coredump_pre_exit_read,
+	.write		= proc_coredump_pre_exit_write,
+	.llseek		= generic_file_llseek,
+};
 #endif
 
 #ifdef CONFIG_TASK_IO_ACCOUNTING
@@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_ELF_CORE
 	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
+	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
 #endif
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index af23453e9..dfd4717c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
 extern void exit_mmap(struct mm_struct *);
+extern void exit_mmap_mapped_shared(struct mm_struct *mm);
 bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
 				 unsigned long addr, bool write);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c7db35be6..0555aaf50 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1963,6 +1963,15 @@ enum {
 	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
 	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
 
+/* coredump pre-exit bits */
+#define MMF_DUMP_PRE_EXIT_FLOCK	11
+#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
+
+#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
+#define MMF_DUMP_PRE_EXIT_BITS	2
+#define MMF_DUMP_PRE_EXIT_MASK	\
+	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
+
 #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
 # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
 #else
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cf..b4becbf6c 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
 extern __noreturn void do_group_exit(int);
 
 extern void exit_files(struct task_struct *);
+extern void exit_files_pre_exit(struct task_struct *, bool);
 extern void exit_itimers(struct task_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285..360604d65 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -95,6 +95,10 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+#ifndef O_TMPCLOS
+#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
+#endif
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
diff --git a/kernel/fork.c b/kernel/fork.c
index a679b2448..84f1ee7f3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
 
 __setup("coredump_filter=", coredump_filter_setup);
 
+static unsigned long default_dump_pre_exit;
+
+static int __init coredump_pre_exit_setup(char *s)
+{
+	default_dump_pre_exit =
+		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
+		MMF_DUMP_PRE_EXIT_MASK;
+	return 1;
+}
+
+__setup("coredump_pre_exit=", coredump_pre_exit_setup);
+
 #include <linux/init_task.h>
 
 static void mm_init_aio(struct mm_struct *mm)
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36..b955c47c0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 }
 
+void exit_mmap_mapped_shared(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	mmap_write_lock(mm);
+	lru_add_drain();
+
+	for_each_vma(vmi, vma) {
+		if (vma->vm_flags & VM_HUGETLB)
+			continue;
+		if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)
+			continue;
+		vma->vm_file->f_flags |= O_TMPCLOS;
+		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+		cond_resched();
+	}
+
+	mmap_write_unlock(mm);
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24 11:03 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
	Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
	K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <ajuWnKsQR0Z825Wn@gmail.com>

On 2026-06-24 01:37:53 [-0700], Breno Leitao wrote:
Hi Breno,

> Have you considered an approach similar to printk_deferred_enter(),
> where you mark the code region that needs deferral and all WARN() calls
> within that region are automatically deferred?

Doing this at the rq-lock sites is not something the scheduler department
would take. It bloats the code size more than what we have now.

Not everything is in __sched section so we can't check for this from
within printk. So this turd was the only idea I had.

> The current proposal requires changing individual WARN() call sites,
> but whether they need deferral might depend on the calling context. This
> means you'd need to convert many call sites and ensure all nested
> warnings are also converted to the deferred variant.

I am hoping for forced-threaded legacy consoles to become the default, but
this camp does not have a lot of members. It would increase the pressure to
provide nbcon, so it could be a good thing.

To accept this series and make it more bullet-proof we could do
s/WARN_ON\>/WARN_ON_DEFERRED/ for all of sched/ and require it regardless
of whether the rq-lock is held. So you wouldn't have to audit it each and
every time. Due to that preempt-disable thingy it can be used in preemptible
sections without breaking anything.

> 
> Thanks,
> --breno

Sebastian

^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260624093117.GY48970@noisy.programming.kicks-ass.net>

On 2026-06-24 11:31:17 [+0200], Peter Zijlstra wrote:
> On Tue, Jun 23, 2026 at 04:26:49PM +0200, Sebastian Andrzej Siewior wrote:
> 
> > +#ifndef WARN_ON_DEFERRED
> > +#define WARN_ON_DEFERRED(condition) ({					\
> > +	int __ret_warn_on = !!(condition);				\
> > +	if (unlikely(__ret_warn_on)) {					\
> > +		guard(preempt)();					\
> > +		printk_deferred_enter()					\
> > +		__WARN();						\
> > +		printk_deferred_exit()					\
> > +	}								\
> > +	unlikely(__ret_warn_on);					\
> > +})
> > +#endif
> 
> This will generate atrocious shite at the WARN sites.

You mean the missing semicolon and huge size increase?
On x86 with this guard+deferred in the upper variant, before:
    text    data     bss     dec   filename
   93910   37424     832  132166   kernel/sched/core.o
   61802    4945     152   66899   kernel/sched/fair.o
  215108   24453    3768  243329   kernel/sched/build_policy.o
   86128   30092   12704  128924   kernel/sched/build_utility.o
  456948   96914   17456  571318   total
After:
   96140   37408     832  134380   kernel/sched/core.o
   64490    4937     152   69579   kernel/sched/fair.o
  222980   24157    3768  250905   kernel/sched/build_policy.o
   86544   30100   12704  129348   kernel/sched/build_utility.o
  470154   96602   17456  584212   total + 2.3%

total went up by 2.3% or 12.59KiB.
This affects: alpha, arc, arm, csky, hexagon, m68k, microblaze, mips,
nios2, openrisc, sparc, um, xtensa
and could motivate them to implement __WARN_FLAGS, which would lower the size
in general, and then this stunt would have no effect.

Just looked at arm and it has support for invalid opcodes somehow but
not for this.

Sebastian

^ permalink raw reply

* Re: [PATCH 0/2] sched: Introduce and use deferred WARNs in sched
From: Peter Zijlstra @ 2026-06-24  9:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-1-bigeasy@linutronix.de>

On Tue, Jun 23, 2026 at 04:26:48PM +0200, Sebastian Andrzej Siewior wrote:
> This is a follow-up to the netconsole lockup reported
> 	https://lore.kernel.org/all/20260610183621.3915271-1-vlad.wing@gmail.com/
> 
> The idea is to use deferred printing for WARNs and use them in sched. I
> tried to use them only where it looks like the rq lock is acquired, instead
> of a plain s/WARN_ON/WARN_ON_DEFERRED, which would be simpler.
> 
> This unholy deferred mess can be removed once we don't have legacy
> consoles anymore _or_ force force_legacy_kthread=true.

So I really don't see why we should do this. This has been a 'problem'
forever, and printk() is actually being fixed.



^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Peter Zijlstra @ 2026-06-24  9:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

On Tue, Jun 23, 2026 at 04:26:49PM +0200, Sebastian Andrzej Siewior wrote:

> +#ifndef WARN_ON_DEFERRED
> +#define WARN_ON_DEFERRED(condition) ({					\
> +	int __ret_warn_on = !!(condition);				\
> +	if (unlikely(__ret_warn_on)) {					\
> +		guard(preempt)();					\
> +		printk_deferred_enter()					\
> +		__WARN();						\
> +		printk_deferred_exit()					\
> +	}								\
> +	unlikely(__ret_warn_on);					\
> +})
> +#endif

This will generate atrocious shite at the WARN sites.

^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Petr Mladek @ 2026-06-24  9:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: K Prateek Nayak, linux-arch, linux-kernel, sched-ext, netdev,
	David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, Paolo Abeni, Peter Zijlstra,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260624062642.5DER6vrP@linutronix.de>

On Wed 2026-06-24 08:26:42, Sebastian Andrzej Siewior wrote:
> On 2026-06-23 20:24:02 [+0530], K Prateek Nayak wrote:
> > Hello Sebastian,
> Hi Prateek,
> 
> > nit.
> > 
> > Instead of replicating these bits, can we replace that return with a
> > "goto out" ...
> 
> sure
> 
> …
> > ... and replace this return with a:
> > 
> >     return (warning) ? BUG_TRAP_TYPE_WARN : BUG_TRAP_TYPE_BUG;
> > 
> > Looks a tad bit cleaner to my eyes. Thoughts?
> 
> It sure does.
> I wait for PeterZ' executive order to either do this and sprinkle sched/
> _or_ make legacy consoles deferred as it is done on RT.
> 
> Petr, was there a big push back doing it unconditionally?

For Linus, it was a no-go, definitely.

The problem is situations where the system gets stuck and panic()
is not called. This is why nbcon consoles switch to the atomic
mode in some emergency situations, see nbcon_cpu_emergency_enter(),
for example, in __warn(), oops_enter(), rcu stall, and lockdep
calls.

Moving legacy consoles to a kthread would prevent stalls in situations
where printk() is called from the scheduler code. But it would cause
some other stalls to become silent.

In my opinion, we should not move the legacy consoles to a kthread
by default. I believe that the rest of the kernel is a bigger
source of possible stalls than the scheduler. So, the overall
experience will be better if we keep the status quo.

I would vote for adding the WARN_*DEFERRED() into the scheduler code
at least until the majority of console drivers are converted to the nbcon API.

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Breno Leitao @ 2026-06-24  8:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
	Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
	K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

Hello Sebastian,

First of all thanks for working on it.

On Tue, Jun 23, 2026 at 04:26:49PM +0200, Sebastian Andrzej Siewior wrote:
> Provide a deferred version of the WARN_ON() macro. It will delay
> flushing the console until a later context. It is needed in a context
> where the caller holds locks which can lead to a deadlock if content is
> flushed to the console driver.
> An example would be a warning from within the scheduler resulting in a
> wake-up of a task.
> 
> Deferring the output works by using printk_deferred_enter()/
> printk_deferred_exit() around the printing output. This must be used in a
> context where the task can't migrate to another CPU. This should be the case
> usually, since the scheduler would acquire the rq lock with disabled
> interrupts, but to be safe preemption is disabled to guarantee this.
> 
> In order not to bloat the code on architectures which provide an
> optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> __report_bug() and does not increase the code size.
> 
> Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> macros. Extend __report_bug() to handle the deferred case.

Have you considered an approach similar to printk_deferred_enter(),
where you mark the code region that needs deferral and all WARN() calls
within that region are automatically deferred?

The current proposal requires changing individual WARN() call sites,
but whether they need deferral might depend on the calling context. This
means you'd need to convert many call sites and ensure all nested
warnings are also converted to the deferred variant.


Thanks,
--breno

^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24  6:26 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <038a11a7-4ced-49ae-b605-2058733e841a@amd.com>

On 2026-06-23 20:24:02 [+0530], K Prateek Nayak wrote:
> Hello Sebastian,
Hi Prateek,

> nit.
> 
> Instead of replicating these bits, can we replace that return with a
> "goto out" ...

sure

…
> ... and replace this return with a:
> 
>     return (warning) ? BUG_TRAP_TYPE_WARN : BUG_TRAP_TYPE_BUG;
> 
> Looks a tad bit cleaner to my eyes. Thoughts?

It sure does.
I wait for PeterZ' executive order to either do this and sprinkle sched/
_or_ make legacy consoles deferred as it is done on RT.

Petr, was there a big push back doing it unconditionally?

> >  }
> >  

Sebastian

^ permalink raw reply

* RE: [PATCH v9 2/2] i3c: master: Add driver for AMD AXI I3C master controller
From: Guntupalli, Manikanta @ 2026-06-24  6:06 UTC (permalink / raw)
  To: Patil, Shubham Sanjay, git (AMD-Xilinx), Simek, Michal,
	alexandre.belloni@bootlin.com, Frank.Li@nxp.com, robh@kernel.org,
	krzk+dt@kernel.org, conor+dt@kernel.org, pgaj@cadence.com,
	wsa+renesas@sang-engineering.com,
	tommaso.merciai.xr@bp.renesas.com, arnd@arndb.de,
	quic_msavaliy@quicinc.com, S-k, Shyam-sundar,
	sakari.ailus@linux.intel.com, billy_tsai@aspeedtech.com,
	kees@kernel.org, gustavoars@kernel.org,
	jarkko.nikula@linux.intel.com, jorge.marques@analog.com,
	linux-i3c@lists.infradead.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-hardening@vger.kernel.org
  Cc: Pandey, Radhey Shyam, Goud, Srinivas, Datta, Shubhrajyoti,
	Patil, Shubham Sanjay
In-Reply-To: <20260623114417.2578189-3-shubhamsanjay.patil@amd.com>

Hi,

> -----Original Message-----
> From: Shubham Patil <shubhamsanjay.patil@amd.com>
> Sent: Tuesday, June 23, 2026 5:14 PM
> To: git (AMD-Xilinx) <git@amd.com>; Simek, Michal <michal.simek@amd.com>;
> alexandre.belloni@bootlin.com; Frank.Li@nxp.com; robh@kernel.org;
> krzk+dt@kernel.org; conor+dt@kernel.org; pgaj@cadence.com;
> wsa+renesas@sang-engineering.com; tommaso.merciai.xr@bp.renesas.com;
> arnd@arndb.de; quic_msavaliy@quicinc.com; S-k, Shyam-sundar
> <Shyam-sundar.S-k@amd.com>; sakari.ailus@linux.intel.com; billy_tsai@aspeedtech.com;
> kees@kernel.org; gustavoars@kernel.org; jarkko.nikula@linux.intel.com;
> jorge.marques@analog.com; linux-i3c@lists.infradead.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-arch@vger.kernel.org; linux-hardening@vger.kernel.org
> Cc: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Goud, Srinivas
> <srinivas.goud@amd.com>; Datta, Shubhrajyoti <shubhrajyoti.datta@amd.com>;
> Patil, Shubham Sanjay <ShubhamSanjay.Patil@amd.com>; Guntupalli, Manikanta
> <manikanta.guntupalli@amd.com>
> Subject: [PATCH v9 2/2] i3c: master: Add driver for AMD AXI I3C master controller
>
> From: Manikanta Guntupalli <manikanta.guntupalli@amd.com>
>
> Add an I3C master driver and maintainers fragment for the AMD I3C bus controller.
>
> The driver currently supports the I3C bus operating in SDR mode, with features
> including Dynamic Address Assignment, private data transfers, and CCC transfers in
> both broadcast and direct modes. It also supports operation in I2C mode.
>
> The controller's data FIFOs are accessed big-endian; the driver performs this
> conversion locally using ioread32be()/iowrite32be() with the helpers, so it does not
> depend on any core FIFO-endianness helpers.
>
> Signed-off-by: Manikanta Guntupalli <manikanta.guntupalli@amd.com>
> Co-developed-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> Co-developed-by: Shubham Patil <shubhamsanjay.patil@amd.com>
> Signed-off-by: Shubham Patil <shubhamsanjay.patil@amd.com>
> ---
> Changes for V9:
> Updated commit description to note that the driver performs big-endian FIFO
> accesses locally (the v8 core-helper patches were dropped).
> Dropped the big-endian MMIO infrastructure patches from the series
> ("asm-generic/io.h: Add big-endian MMIO accessors", "i3c: fix big-endian FIFO
> transfers", and "i3c: master: Add endianness support for
> i3c_readl_fifo()/i3c_writel_fifo()"). The driver now performs big-endian FIFO
> accesses locally using ioread32be()/iowrite32be() with
> get_unaligned()/put_unaligned(), so the series is self-contained and no longer
> includes internals.h.
> Replaced the async completion/transfer-queue machinery with a simple
> synchronous transfer path under the existing mutex.
> Reworked response handling: added enum i3c_error_code to struct xi3c_cmd,
> named the response codes, return -ENODEV/-EIO as appropriate and set err =
> I3C_ERROR_M2/M0 so the i3c core and callers can tell a NACK apart from a bus
> error; propagate err to CCC commands and to each priv xfer (including actual_len).
> Switched from .priv_xfers to the new .i3c_xfers op; reject non-SDR modes with
> -EOPNOTSUPP and report actual_len.
> Reworked DAA: assign addresses incrementally, bound the device count
> (-ENOSPC), detect end-of-enumeration via -ENODEV, zero-initialize the PID buffers,
> and check i3c_master_add_i3c_dev_locked().
> Avoid busy-spinning: sleep with usleep_range() in the FIFO drain/fill loops.
> Use FIELD_PREP() with named command-FIFO field masks instead of open-coded
> shifts, and convert the register-accessor macros to inline functions.
> Split the overloaded timeout macro into XI3C_RESP_TIMEOUT_US and
> XI3C_XFER_TIMEOUT_MS with documented units, and add
> XI3C_POLL_INTERVAL_US.
> xi3c_clk_cfg(): use NSEC_PER_SEC and named timing constants, guard against
> unsigned underflow, and handle I3C_BUS_MODE_MIXED_SLOW.
> Dropped ENTHDR from supports_ccc_cmd() (SDR-only), and dispatch CCCs using
> the I3C_CCC_DIRECT bit.
> Use const for TX buffers and drop the related casts; use parity8() for the DAA parity
> bit.
> Updated MODULE_DESCRIPTION and authors, the copyright year, renamed the
> Kconfig symbol to AMD_AXI_I3C_MASTER, and fixed the MAINTAINERS entry
> (title, mailing list, and the correct binding filename).
>
> Changes for V8:
> Used time_left instead of timeout.
> Used __free(kfree) for xfer to simplify err path in multiple places.
>
> Changes for V7:
> Updated timeout macro name.
> Updated xi3c_master_wr_to_tx_fifo() and xi3c_master_rd_from_rx_fifo() to use
> i3c_writel_fifo() and i3c_readl_fifo().
>
> Changes for V6:
> Removed typecast for xi3c_getrevisionnumber(), xi3c_wrfifolevel(), and
> xi3c_rdfifolevel().
> Replaced dynamic allocation with a static variable for pid_bcr_dcr.
> Fixed sparse warning in do_daa by typecasting the address parity value to u8.
> Fixed sparse warning in xi3c_master_bus_init by typecasting the pid value to u64 in
> info.pid calculation.
>
> Changes for V5:
> Used GENMASK_ULL for PID mask as it's 64bit mask.
>
> Changes for V4:
> Updated timeout macros.
> Removed type casting for xi3c_is_resp_available() macro.
> Used ioread32() and iowrite32() instead of readl() and writel() to keep consistency.
> Read XI3C_RESET_OFFSET reg before udelay().
> Removed xi3c_master_free_xfer() and directly used kfree().
> Skipped checking return value of i3c_master_add_i3c_dev_locked().
> Used devm_mutex_init() instead of mutex_init().
>
> Changes for V3:
> Resolved merge conflicts.
>
> Changes for V2:
> Updated commit description.
> Added mixed mode support with clock configuration.
> Converted smaller functions into inline functions.
> Used FIELD_GET() in xi3c_get_response().
> Updated xi3c_master_rd_from_rx_fifo() to use cmd->rx_buf.
> Used parity8() for address parity calculation.
> Added guards for locks.
> Dropped num_targets and updated xi3c_master_do_daa().
> Used __free(kfree) in xi3c_master_send_bdcast_ccc_cmd().
> Dropped PM runtime support.
> Updated xi3c_master_read() and xi3c_master_write() with
> xi3c_is_resp_available() check.
> Created separate functions: xi3c_master_init() and xi3c_master_reinit().
> Used xi3c_master_init() in bus initialization and xi3c_master_reinit() in error paths.
> Added DAA structure to xi3c_master structure.
> ---
>  MAINTAINERS                         |    8 +
>  drivers/i3c/master/Kconfig          |   15 +
>  drivers/i3c/master/Makefile         |    1 +
>  drivers/i3c/master/amd-i3c-master.c | 1060 +++++++++++++++++++++++++++
>  4 files changed, 1084 insertions(+)
>  create mode 100644 drivers/i3c/master/amd-i3c-master.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 461a3eed6129..bfaa6999913c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1035,6 +1035,14 @@ L:     linux-sound@vger.kernel.org
>  S:   Supported
>  F:   sound/soc/amd/
>
> +AMD AXI I3C MASTER DRIVER
> +M:   Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> +M:   Shubham Patil <shubhamsanjay.patil@amd.com>
> +L:   linux-i3c@lists.infradead.org
> +S:   Maintained
> +F:   Documentation/devicetree/bindings/i3c/xlnx,axi-i3c-1.0.yaml
> +F:   drivers/i3c/master/amd-i3c-master.c
> +
>  AMD AXI W1 DRIVER
>  M:   Kris Chaplin <kris.chaplin@amd.com>
>  R:   Thomas Delev <thomas.delev@amd.com>
> diff --git a/drivers/i3c/master/Kconfig b/drivers/i3c/master/Kconfig
> index 2609f2b18e0a..da96d2aaa399 100644
> --- a/drivers/i3c/master/Kconfig
> +++ b/drivers/i3c/master/Kconfig
> @@ -86,3 +86,18 @@ config RENESAS_I3C
>
>         This driver can also be built as a module. If so, the module will be
>         called renesas-i3c.
> +
> +config AMD_AXI_I3C_MASTER
> +     tristate "AMD AXI I3C Master driver"
> +     depends on HAS_IOMEM
> +     help
> +       Support for the AMD AXI I3C master controller, a soft IP used on
> +       AMD (Xilinx) FPGAs and adaptive SoCs with ARM or MicroBlaze
> +       processors.
> +
> +       The controller currently supports Standard Data Rate (SDR) mode.
> +       Features include Dynamic Address Assignment, private transfers,
> +       and CCC transfers in both broadcast and direct modes.
> +
> +       This driver can also be built as a module. If so, the module
> +       will be called amd-i3c-master.
> diff --git a/drivers/i3c/master/Makefile b/drivers/i3c/master/Makefile
> index 816a227b6f7a..8d82196dcf83 100644
> --- a/drivers/i3c/master/Makefile
> +++ b/drivers/i3c/master/Makefile
> @@ -6,3 +6,4 @@ obj-$(CONFIG_AST2600_I3C_MASTER)      += ast2600-i3c-master.o
>  obj-$(CONFIG_SVC_I3C_MASTER)         += svc-i3c-master.o
>  obj-$(CONFIG_MIPI_I3C_HCI)           += mipi-i3c-hci/
>  obj-$(CONFIG_RENESAS_I3C)            += renesas-i3c.o
> +obj-$(CONFIG_AMD_AXI_I3C_MASTER)     += amd-i3c-master.o
> diff --git a/drivers/i3c/master/amd-i3c-master.c b/drivers/i3c/master/amd-i3c-master.c
> new file mode 100644
> index 000000000000..34ab1028c3ce
> --- /dev/null
> +++ b/drivers/i3c/master/amd-i3c-master.c
> @@ -0,0 +1,1060 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * I3C master driver for the AMD I3C controller.
> + *
> + * Copyright (C) 2026, Advanced Micro Devices, Inc.
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/bitops.h>
> +#include <linux/cleanup.h>
> +#include <linux/clk.h>
> +#include <linux/delay.h>
> +#include <linux/err.h>
> +#include <linux/i3c/master.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/of.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +#include <linux/time.h>
> +#include <linux/unaligned.h>
> +
> +#define XI3C_VERSION_OFFSET                  0x00    /* Version Register */
> +#define XI3C_RESET_OFFSET                    0x04    /* Soft Reset Register */
> +#define XI3C_CR_OFFSET                               0x08    /* Control Register */
> +#define XI3C_ADDRESS_OFFSET                  0x0C    /* Target Address Register */
> +#define XI3C_SR_OFFSET                               0x10    /* Status Register */
> +#define XI3C_CMD_FIFO_OFFSET                 0x20    /* I3C Command FIFO Register */
> +#define XI3C_WR_FIFO_OFFSET                  0x24    /* I3C Write Data FIFO Register */
> +#define XI3C_RD_FIFO_OFFSET                  0x28    /* I3C Read Data FIFO Register */
> +#define XI3C_RESP_STATUS_FIFO_OFFSET         0x2C    /* I3C Response status FIFO Register */
> +#define XI3C_FIFO_LVL_STATUS_OFFSET          0x30    /* CMD slots free | WR-FIFO free (words) */
> +#define XI3C_FIFO_LVL_STATUS_1_OFFSET                0x34    /* RESP fill | RD-FIFO fill level (words) */
> +#define XI3C_SCL_HIGH_TIME_OFFSET            0x38    /* I3C SCL HIGH Register */
> +#define XI3C_SCL_LOW_TIME_OFFSET             0x3C    /* I3C SCL LOW Register */
> +#define XI3C_SDA_HOLD_TIME_OFFSET            0x40    /* I3C SDA HOLD Register */
> +#define XI3C_TSU_START_OFFSET                        0x48    /* I3C START SETUP Register */
> +#define XI3C_THD_START_OFFSET                        0x4C    /* I3C START HOLD Register */
> +#define XI3C_TSU_STOP_OFFSET                 0x50    /* I3C STOP Setup Register */
> +#define XI3C_OD_SCL_HIGH_TIME_OFFSET         0x54    /* I3C OD SCL HIGH Register */
> +#define XI3C_OD_SCL_LOW_TIME_OFFSET          0x58    /* I3C OD SCL LOW Register */
> +#define XI3C_PID0_OFFSET                     0x6C    /* LSB 4 bytes of the PID */
> +#define XI3C_PID1_BCR_DCR                    0x70    /* MSB 2 bytes of the PID, BCR and DCR */
> +
> +#define XI3C_CR_EN_MASK                              BIT(0)  /* Core Enable */
> +#define XI3C_CR_RESUME_MASK                  BIT(2)  /* Core Resume */
> +#define XI3C_SR_RESP_NOT_EMPTY_MASK          BIT(4)  /* Resp Fifo not empty status mask */
> +#define XI3C_RD_FIFO_NOT_EMPTY_MASK          BIT(15) /* Read Fifo not empty status mask */
> +
> +#define XI3C_BCR_MASK                                GENMASK(23, 16)
> +#define XI3C_DCR_MASK                                GENMASK(31, 24)
> +#define XI3C_PID_MASK                                GENMASK_ULL(63, 16)
> +#define XI3C_TIMING_MASK                     GENMASK(17, 0)
> +#define XI3C_REV_NUM_MASK                    GENMASK(15, 8)
> +#define XI3C_PID1_MASK                               GENMASK(15, 0)
> +#define XI3C_FIFO_LEVEL_MASK                 GENMASK(15, 0)
> +#define XI3C_RESP_CODE_MASK                  GENMASK(8, 5)
> +#define XI3C_RESP_CODE_SUCCESS                       0       /* Transfer completed OK */
> +#define XI3C_RESP_CODE_NO_TARGET             2       /* 7E NACK: no target on bus */
> +#define XI3C_RESP_CODE_NACK                  3       /* Target NACK / CE2 / DAA end */
> +#define XI3C_ADDR_MASK                               GENMASK(6, 0)
> +#define XI3C_FIFOS_RST_MASK                  GENMASK(4, 1)
> +
> +/* Command FIFO word layout (bit ranges encoded in the GENMASK/BIT args) */
> +#define XI3C_CMD_TYPE                                GENMASK(3, 0)   /* command type */
> +#define XI3C_CMD_TERMINATE                   BIT(4)          /* terminate (last cmd of xfer) */
> +#define XI3C_CMD_ADDR                                GENMASK(15, 8)  /* target address << 1 | RnW */
> +#define XI3C_CMD_LEN                         GENMASK(27, 16) /* payload length in bytes */
> +#define XI3C_CMD_TID                         GENMASK(31, 28) /* transfer ID */
> +
> +#define XI3C_OD_TLOW_NS                              500000
> +#define XI3C_OD_THIGH_NS                     41000
> +#define XI3C_I2C_TCASMIN_NS                  600000
> +#define XI3C_TCASMIN_NS                              260000
> +#define XI3C_MAXDATA_LENGTH                  4095
> +#define XI3C_MAX_DEVS                                32
> +#define XI3C_DAA_SLAVEINFO_READ_BYTECOUNT    8
> +
> +#define XI3C_THOLD_MIN_REV0                  5       /* Min SDA hold cycles, rev 0 IP */
> +#define XI3C_THOLD_MIN_REV1                  6       /* Min SDA hold cycles, rev >= 1 IP */
> +#define XI3C_CYCLE_ADJUST                    2       /* SCL/SDA pre-bias for HW pipeline */
> +#define XI3C_FIFO_RESET_DELAY_US             10      /* HW settling time after FIFO reset */
> +#define XI3C_POLL_INTERVAL_US                        10      /* readl_poll_timeout() sleep slice */
> +
> +#define XI3C_I2C_MODE                                0
> +#define XI3C_I2C_TID                         0
> +#define XI3C_SDR_MODE                                1
> +#define XI3C_SDR_TID                         1
> +
> +#define XI3C_WORD_LEN                                4
> +
> +/*
> + * XI3C_RESP_TIMEOUT_US is in microseconds because it is passed as the
> + * timeout_us argument of readl_poll_timeout(). XI3C_XFER_TIMEOUT_MS is in
> + * milliseconds because it feeds msecs_to_jiffies(). Keep the two units
> + * distinct in the names so callers cannot mix them up.
> + */
> +#define XI3C_RESP_TIMEOUT_US                 500000
> +#define XI3C_XFER_TIMEOUT_MS                 1000
> +
> +struct xi3c_cmd {
> +     const void *tx_buf;
> +     void *rx_buf;
> +     u16 tx_len;
> +     u16 rx_len;
> +     u8 addr;
> +     u8 type;
> +     u8 tid;
> +     bool rnw;
> +     bool is_daa;
> +     bool continued;
> +     enum i3c_error_code err;
> +};
> +
> +struct xi3c_xfer {
> +     unsigned int ncmds;
> +     struct xi3c_cmd cmds[] __counted_by(ncmds);
> +};
> +
> +/**
> + * struct xi3c_master - I3C master controller state.
> + * @base: I3C master controller embedded by the framework.
> + * @dev: Pointer to the backing device structure.
> + * @membase: Memory base of the HW registers.
> + * @pclk: Input clock driving the controller.
> + * @lock: Serializes transfers and CCC submission.
> + * @daa: ENTDAA enumeration state.
> + * @daa.addrs: Dynamic addresses assigned in enumeration order.
> + * @daa.index: Number of responders enumerated so far.
> + */
> +struct xi3c_master {
> +     struct i3c_master_controller base;
> +     struct device *dev;
> +     void __iomem *membase;
> +     struct clk *pclk;
> +     struct mutex lock; /* serializes transfers and CCC submission */
> +     struct {
> +             u8 addrs[XI3C_MAX_DEVS];
> +             u8 index;
> +     } daa;
> +};
> +
> +static inline struct xi3c_master *
> +to_xi3c_master(struct i3c_master_controller *master)
> +{
> +     return container_of(master, struct xi3c_master, base);
> +}
> +
> +static inline u8 xi3c_get_revision_number(struct xi3c_master *master)
> +{
> +     return FIELD_GET(XI3C_REV_NUM_MASK,
> +                      ioread32(master->membase + XI3C_VERSION_OFFSET));
> +}
> +
> +static inline u16 xi3c_wr_fifo_level(struct xi3c_master *master)
> +{
> +     return ioread32(master->membase + XI3C_FIFO_LVL_STATUS_OFFSET) &
> +            XI3C_FIFO_LEVEL_MASK;
> +}
> +
> +static inline u16 xi3c_rd_fifo_level(struct xi3c_master *master)
> +{
> +     return ioread32(master->membase + XI3C_FIFO_LVL_STATUS_1_OFFSET) &
> +            XI3C_FIFO_LEVEL_MASK;
> +}
> +
> +static inline bool xi3c_is_resp_available(struct xi3c_master *master)
> +{
> +     return FIELD_GET(XI3C_SR_RESP_NOT_EMPTY_MASK,
> +                      ioread32(master->membase + XI3C_SR_OFFSET));
> +}
> +
> +static int xi3c_get_response(struct xi3c_master *master, struct xi3c_cmd *cmd)
> +{
> +     u32 response_data;
> +     u32 resp_reg;
> +     u8 code;
> +     int ret;
> +
> +     ret = readl_poll_timeout(master->membase + XI3C_SR_OFFSET,
> +                              resp_reg,
> +                              resp_reg & XI3C_SR_RESP_NOT_EMPTY_MASK,
> +                              XI3C_POLL_INTERVAL_US, XI3C_RESP_TIMEOUT_US);
> +     if (ret) {
> +             dev_err(master->dev, "XI3C response timeout\n");
> +             return ret;
> +     }
> +
> +     response_data = ioread32(master->membase + XI3C_RESP_STATUS_FIFO_OFFSET);
> +     code = FIELD_GET(XI3C_RESP_CODE_MASK, response_data);
> +
> +     switch (code) {
> +     case XI3C_RESP_CODE_SUCCESS:
> +             cmd->err = I3C_ERROR_UNKNOWN;
> +             return 0;
> +     case XI3C_RESP_CODE_NO_TARGET:
> +     case XI3C_RESP_CODE_NACK:
> +             /*
> +              * Target did not ACK. Record it as I3C_ERROR_M2 so callers
> +              * (and the i3c core, which keys on err == I3C_ERROR_M2) can
> +              * tell a NACK apart from other failures. A normal transfer
> +              * surfaces this as -EIO per the i3c_xfer contract; the DAA
> +              * path instead expects -ENODEV as its enumeration terminator.
> +              */
> +             cmd->err = I3C_ERROR_M2;
> +             return cmd->is_daa ? -ENODEV : -EIO;
> +     default:
> +             cmd->err = I3C_ERROR_M0;
> +             dev_err(master->dev, "XI3C transfer error, response code %u\n",
> +                     code);
> +             return -EIO;
> +     }
> +}
> +
> +static inline void xi3c_writesl_be(void __iomem *addr, const void *buffer,
> +                                unsigned int count)
> +{
> +     const u32 *buf = buffer;
> +
> +     while (count--)
> +             iowrite32be(get_unaligned(buf++), addr);
> +}
> +
> +static inline void xi3c_readsl_be(const void __iomem *addr, void *buffer,
> +                               unsigned int count)
> +{
> +     u32 *buf = buffer;
> +
> +     while (count--)
> +             put_unaligned(ioread32be(addr), buf++);
> +}
> +
> +static inline void xi3c_writel_fifo(void __iomem *addr, const void *buf,
> +                                 int nbytes)
> +{
> +     xi3c_writesl_be(addr, buf, nbytes / 4);
> +     if (nbytes & 3) {
> +             u32 tmp = 0;
> +
> +             memcpy(&tmp, (const u8 *)buf + (nbytes & ~3), nbytes & 3);
> +             xi3c_writesl_be(addr, &tmp, 1);
> +     }
> +}
> +
> +static inline void xi3c_readl_fifo(const void __iomem *addr, void *buf,
> +                                int nbytes)
> +{
> +     xi3c_readsl_be(addr, buf, nbytes / 4);
> +     if (nbytes & 3) {
> +             u32 tmp;
> +
> +             xi3c_readsl_be(addr, &tmp, 1);
> +             memcpy((u8 *)buf + (nbytes & ~3), &tmp, nbytes & 3);
> +     }
> +}
> +
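
For anyone following the FIFO helpers above, here is a minimal user-space sketch of the byte-to-word packing they appear to implement: each 32-bit FIFO word carries the first buffer byte in its most significant byte, and a trailing partial word is zero-padded. This is my reading of the get_unaligned()/iowrite32be() pair, not something the patch states. The name pack_fifo_words() is hypothetical and only models the arithmetic; it touches no hardware.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the write-FIFO packing: split nbytes of buf into
 * 32-bit words, first byte in bits 31:24, zero-padding the final word.
 * Returns the number of words written to words[].
 */
static size_t pack_fifo_words(const uint8_t *buf, size_t nbytes,
                              uint32_t *words)
{
    size_t nwords = (nbytes + 3) / 4;

    for (size_t w = 0; w < nwords; w++) {
        uint32_t word = 0;

        for (size_t b = 0; b < 4; b++) {
            size_t idx = w * 4 + b;
            /* Bytes past the end of the buffer are treated as zero. */
            uint8_t byte = idx < nbytes ? buf[idx] : 0;

            word |= (uint32_t)byte << (24 - 8 * b);
        }
        words[w] = word;
    }

    return nwords;
}
```

Under that reading, a 5-byte payload 11 22 33 44 55 occupies two FIFO words, 0x11223344 followed by 0x55000000.
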
> +static void xi3c_master_write_to_cmdfifo(struct xi3c_master *master,
> +                                      struct xi3c_cmd *cmd, u16 len)
> +{
> +     u32 transfer_cmd;
> +     u8 addr;
> +
> +     addr = ((cmd->addr & XI3C_ADDR_MASK) << 1) | (u8)cmd->rnw;
> +
> +     transfer_cmd  = FIELD_PREP(XI3C_CMD_TYPE, cmd->type);
> +     transfer_cmd |= FIELD_PREP(XI3C_CMD_TERMINATE, !cmd->continued);
> +     transfer_cmd |= FIELD_PREP(XI3C_CMD_ADDR, addr);
> +     transfer_cmd |= FIELD_PREP(XI3C_CMD_TID, cmd->tid);
> +
> +     /*
> +      * For dynamic addressing, an additional 1-byte length must be added
> +      * to the command FIFO to account for the address present in the TX FIFO
> +      */
> +     if (cmd->is_daa) {
> +             xi3c_writel_fifo(master->membase + XI3C_WR_FIFO_OFFSET,
> +                              cmd->tx_buf, cmd->tx_len);
> +
> +             len++;
> +     }
> +
> +     transfer_cmd |= FIELD_PREP(XI3C_CMD_LEN, len);
> +     iowrite32(transfer_cmd, master->membase + XI3C_CMD_FIFO_OFFSET);
> +}
> +
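
To make the command word layout easier to check by eye, a small stand-alone sketch of the packing done in xi3c_master_write_to_cmdfifo() follows. The shifts are taken from the XI3C_CMD_* masks quoted above; model_cmd_word() is a hypothetical name, and the sketch replaces FIELD_PREP() with plain shifts so it builds outside the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the command FIFO word:
 *   bits  3:0   command type
 *   bit   4     terminate (last command of the transfer)
 *   bits 15:8   (7-bit target address << 1) | RnW
 *   bits 27:16  payload length in bytes
 *   bits 31:28  transfer ID
 */
static uint32_t model_cmd_word(uint8_t type, bool last, uint8_t addr7,
                               bool rnw, uint16_t len, uint8_t tid)
{
    uint32_t addr = ((uint32_t)(addr7 & 0x7f) << 1) | (rnw ? 1u : 0u);

    return (uint32_t)(type & 0xf) |
           ((last ? 1u : 0u) << 4) |
           (addr << 8) |
           ((uint32_t)(len & 0xfff) << 16) |
           ((uint32_t)(tid & 0xf) << 28);
}
```

With these assumptions, a terminating one-byte SDR write to the broadcast address 0x7E with TID 1 packs to 0x1001FC11. A continued DAA read of 8 bytes, where the driver adds 1 to the length, packs to 0x1009FD01.
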
> +static inline void xi3c_master_enable(struct xi3c_master *master)
> +{
> +     iowrite32(ioread32(master->membase + XI3C_CR_OFFSET) | XI3C_CR_EN_MASK,
> +               master->membase + XI3C_CR_OFFSET);
> +}
> +
> +static inline void xi3c_master_disable(struct xi3c_master *master)
> +{
> +     iowrite32(ioread32(master->membase + XI3C_CR_OFFSET) & ~XI3C_CR_EN_MASK,
> +               master->membase + XI3C_CR_OFFSET);
> +}
> +
> +static inline void xi3c_master_resume(struct xi3c_master *master)
> +{
> +     iowrite32(ioread32(master->membase + XI3C_CR_OFFSET) |
> +               XI3C_CR_RESUME_MASK, master->membase + XI3C_CR_OFFSET);
> +}
> +
> +static void xi3c_master_reset_fifos(struct xi3c_master *master)
> +{
> +     u32 data;
> +
> +     /* Assert FIFO reset. */
> +     data = ioread32(master->membase + XI3C_RESET_OFFSET);
> +     data |= XI3C_FIFOS_RST_MASK;
> +     iowrite32(data, master->membase + XI3C_RESET_OFFSET);
> +     /* Read-back flushes the posted write before the settling delay below. */
> +     ioread32(master->membase + XI3C_RESET_OFFSET);
> +     udelay(XI3C_FIFO_RESET_DELAY_US);
> +
> +     /* De-assert FIFO reset, then wait for the FIFOs to come back up. */
> +     data &= ~XI3C_FIFOS_RST_MASK;
> +     iowrite32(data, master->membase + XI3C_RESET_OFFSET);
> +     ioread32(master->membase + XI3C_RESET_OFFSET);
> +     udelay(XI3C_FIFO_RESET_DELAY_US);
> +}
> +
> +static inline void xi3c_master_init(struct xi3c_master *master)
> +{
> +     /* Reset fifos */
> +     xi3c_master_reset_fifos(master);
> +
> +     /* Enable controller */
> +     xi3c_master_enable(master);
> +}
> +
> +static inline void xi3c_master_reinit(struct xi3c_master *master)
> +{
> +     /* Reset fifos */
> +     xi3c_master_reset_fifos(master);
> +
> +     /* Resume controller */
> +     xi3c_master_resume(master);
> +}
> +
> +static struct xi3c_xfer *xi3c_master_alloc_xfer(unsigned int ncmds)
> +{
> +     struct xi3c_xfer *xfer;
> +
> +     xfer = kzalloc(struct_size(xfer, cmds, ncmds), GFP_KERNEL);
> +     if (!xfer)
> +             return NULL;
> +
> +     xfer->ncmds = ncmds;
> +
> +     return xfer;
> +}
> +
> +static void xi3c_master_rd_from_rx_fifo(struct xi3c_master *master,
> +                                     struct xi3c_cmd *cmd)
> +{
> +     u16 rx_data_available;
> +     u16 copy_len;
> +     u16 len;
> +
> +     rx_data_available = xi3c_rd_fifo_level(master);
> +     len = rx_data_available * XI3C_WORD_LEN;
> +
> +     if (!len)
> +             return;
> +
> +     copy_len = min_t(u16, len, cmd->rx_len);
> +     xi3c_readl_fifo(master->membase + XI3C_RD_FIFO_OFFSET,
> +                     (u8 *)cmd->rx_buf, copy_len);
> +
> +     cmd->rx_buf = (u8 *)cmd->rx_buf + copy_len;
> +     cmd->rx_len -= copy_len;
> +}
> +
> +static int xi3c_master_read(struct xi3c_master *master, struct xi3c_cmd *cmd)
> +{
> +     unsigned long timeout;
> +     u32 status_reg;
> +     int ret;
> +
> +     if (!cmd->rx_buf || cmd->rx_len > XI3C_MAXDATA_LENGTH)
> +             return -EINVAL;
> +
> +     /* Fill command fifo */
> +     xi3c_master_write_to_cmdfifo(master, cmd, cmd->rx_len);
> +
> +     if (!cmd->rx_len)
> +             return 0;
> +
> +     ret = readl_poll_timeout(master->membase + XI3C_SR_OFFSET,
> +                              status_reg,
> +                              status_reg & (XI3C_RD_FIFO_NOT_EMPTY_MASK |
> +                                            XI3C_SR_RESP_NOT_EMPTY_MASK),
> +                              XI3C_POLL_INTERVAL_US, XI3C_RESP_TIMEOUT_US);
> +     if (ret) {
> +             dev_err(master->dev, "XI3C read timeout\n");
> +             return ret;
> +     }
> +
> +     if (!(status_reg & XI3C_RD_FIFO_NOT_EMPTY_MASK))
> +             return 0;
> +
> +     timeout = jiffies + msecs_to_jiffies(XI3C_XFER_TIMEOUT_MS);
> +
> +     /* Read data from rx fifo */
> +     while (cmd->rx_len > 0 && !xi3c_is_resp_available(master)) {
> +             if (time_after(jiffies, timeout)) {
> +                     dev_err(master->dev, "XI3C read timeout\n");
> +                     return -EIO;
> +             }
> +             xi3c_master_rd_from_rx_fifo(master, cmd);
> +             usleep_range(XI3C_POLL_INTERVAL_US, 2 * XI3C_POLL_INTERVAL_US);
> +     }
> +
> +     /* Read remaining data */
> +     xi3c_master_rd_from_rx_fifo(master, cmd);
> +
> +     return 0;
> +}
> +
> +static void xi3c_master_wr_to_tx_fifo(struct xi3c_master *master,
> +                                   struct xi3c_cmd *cmd)
> +{
> +     u16 wrfifo_space;
> +     u16 len;
> +
> +     wrfifo_space = xi3c_wr_fifo_level(master);
> +     if (cmd->tx_len > wrfifo_space * XI3C_WORD_LEN)
> +             len = wrfifo_space * XI3C_WORD_LEN;
> +     else
> +             len = cmd->tx_len;
> +
> +     if (len) {
> +             xi3c_writel_fifo(master->membase + XI3C_WR_FIFO_OFFSET, cmd->tx_buf,
> +                              len);
> +
> +             cmd->tx_buf = (const u8 *)cmd->tx_buf + len;
> +             cmd->tx_len -= len;
> +     }
> +}
> +
> +static int xi3c_master_write(struct xi3c_master *master, struct xi3c_cmd *cmd)
> +{
> +     unsigned long timeout;
> +     u16 cmd_len;
> +
> +     if (!cmd->tx_buf || cmd->tx_len > XI3C_MAXDATA_LENGTH)
> +             return -EINVAL;
> +
> +     cmd_len = cmd->tx_len;
> +
> +     /* Fill Tx fifo */
> +     xi3c_master_wr_to_tx_fifo(master, cmd);
> +
> +     /* Write to command fifo */
> +     xi3c_master_write_to_cmdfifo(master, cmd, cmd_len);
> +
> +     timeout = jiffies + msecs_to_jiffies(XI3C_XFER_TIMEOUT_MS);
> +     /* Fill if any remaining data to tx fifo */
> +     while (cmd->tx_len > 0 && !xi3c_is_resp_available(master)) {
> +             if (time_after(jiffies, timeout)) {
> +                     dev_err(master->dev, "XI3C write timeout\n");
> +                     return -EIO;
> +             }
> +
> +             xi3c_master_wr_to_tx_fifo(master, cmd);
> +             usleep_range(XI3C_POLL_INTERVAL_US, 2 * XI3C_POLL_INTERVAL_US);
> +     }
> +
> +     return 0;
> +}
> +
> +static int xi3c_master_xfer(struct xi3c_master *master, struct xi3c_cmd *cmd)
> +{
> +     int ret;
> +
> +     if (cmd->rnw)
> +             ret = xi3c_master_read(master, cmd);
> +     else
> +             ret = xi3c_master_write(master, cmd);
> +
> +     if (ret)
> +             goto err_xfer_out;
> +
> +     ret = xi3c_get_response(master, cmd);
> +     if (ret)
> +             goto err_xfer_out;
> +
> +     return 0;
> +
> +err_xfer_out:
> +     xi3c_master_reinit(master);
> +     return ret;
> +}
> +
> +static int xi3c_master_common_xfer(struct xi3c_master *master,
> +                                struct xi3c_xfer *xfer)
> +{
> +     unsigned int i;
> +     int ret;
> +
> +     guard(mutex)(&master->lock);
> +
> +     for (i = 0; i < xfer->ncmds; i++) {
> +             ret = xi3c_master_xfer(master, &xfer->cmds[i]);
> +             if (ret)
> +                     return ret;
> +     }
> +
> +     return 0;
> +}
> +
> +static int xi3c_master_do_daa(struct i3c_master_controller *m)
> +{
> +     u8 pid_bufs[XI3C_MAX_DEVS][XI3C_DAA_SLAVEINFO_READ_BYTECOUNT] = {};
> +     struct xi3c_master *master = to_xi3c_master(m);
> +     struct xi3c_xfer *xfer __free(kfree) = NULL;
> +     struct xi3c_cmd *daa_cmd;
> +     int addr, ret, i;
> +     u8 last_addr = 0;
> +     u8 *pid_buf;
> +     u8 ccc_id;
> +
> +     xfer = xi3c_master_alloc_xfer(1);
> +     if (!xfer)
> +             return -ENOMEM;
> +
> +     /* Fill ENTDAA CCC */
> +     ccc_id = I3C_CCC_ENTDAA;
> +     daa_cmd = &xfer->cmds[0];
> +     daa_cmd->addr = I3C_BROADCAST_ADDR;
> +     daa_cmd->rnw = false;
> +     daa_cmd->tx_buf = &ccc_id;
> +     daa_cmd->tx_len = 1;
> +     daa_cmd->type = XI3C_SDR_MODE;
> +     daa_cmd->tid = XI3C_SDR_TID;
> +     daa_cmd->continued = true;
> +
> +     ret = xi3c_master_common_xfer(master, xfer);
> +     /*
> +      * A NACK on the ENTDAA broadcast (I3C_ERROR_M2) means no slaves are
> +      * present to enter DAA. Treat as a successful no-op after letting
> +      * err_daa reinitialize the controller.
> +      */
> +     if (ret && daa_cmd->err == I3C_ERROR_M2) {
> +             ret = 0;
> +             goto err_daa;
> +     }
> +     if (ret)
> +             goto err_daa;
> +
> +     master->daa.index = 0;
> +
> +     while (true) {
> +             struct xi3c_cmd *cmd = &xfer->cmds[0];
> +             u8 daa_byte;
> +
> +             if (master->daa.index >= XI3C_MAX_DEVS) {
> +                     ret = -ENOSPC;
> +                     goto err_daa;
> +             }
> +
> +             addr = i3c_master_get_free_addr(m, last_addr + 1);
> +             if (addr < 0) {
> +                     ret = addr;
> +                     goto err_daa;
> +             }
> +
> +             pid_buf = pid_bufs[master->daa.index];
> +
> +             daa_byte = (addr << 1) | (parity8(addr) ^ 1);
> +
> +             cmd->tx_buf = &daa_byte;
> +             cmd->tx_len = 1;
> +             cmd->addr = I3C_BROADCAST_ADDR;
> +             cmd->rnw = true;
> +             cmd->rx_buf = pid_buf;
> +             cmd->rx_len = XI3C_DAA_SLAVEINFO_READ_BYTECOUNT;
> +             cmd->is_daa = true;
> +             cmd->type = XI3C_SDR_MODE;
> +             cmd->tid = XI3C_SDR_TID;
> +             cmd->continued = true;
> +
> +             ret = xi3c_master_common_xfer(master, xfer);
> +
> +             /*
> +              * End of enumeration: the next responder NACK'd the
> +              * dynamic-address grant, surfaced as -ENODEV.
> +              * xi3c_master_xfer() has already reset the FIFOs and
> +              * resumed the core for us; just exit the loop and
> +              * register the responders collected so far.
> +              */
> +             if (ret == -ENODEV) {
> +                     ret = 0;
> +                     break;
> +             }
> +             if (ret)
> +                     goto err_daa;
> +
> +             master->daa.addrs[master->daa.index] = addr;
> +             last_addr = addr;
> +             master->daa.index++;
> +     }
> +
> +     for (i = 0; i < master->daa.index; i++) {
> +             u64 pid;
> +
> +             ret = i3c_master_add_i3c_dev_locked(m, master->daa.addrs[i]);
> +             if (ret)
> +                     goto err_daa;
> +
> +             pid = FIELD_GET(XI3C_PID_MASK,
> +                             get_unaligned_be64(pid_bufs[i]));
> +             dev_dbg(master->dev, "Client %d: PID: 0x%llx\n", i, pid);
> +     }
> +
> +     return 0;
> +
> +err_daa:
> +     xi3c_master_reinit(master);
> +     return ret;
> +}
> +
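
The daa_byte expression in the loop above is compact, so here is a stand-alone sketch of what it computes: the 7-bit dynamic address shifted left by one, with bit 0 chosen so that the whole byte has an odd number of set bits. model_parity8() is a hypothetical stand-in for the kernel's parity8(), assumed here to return 1 when its argument has an odd number of set bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for parity8(): 1 if v has an odd number of set bits. */
static unsigned int model_parity8(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Address byte sent during DAA: (addr << 1) | odd-parity bit. */
static uint8_t model_daa_byte(uint8_t addr7)
{
    return (uint8_t)((addr7 << 1) | (model_parity8(addr7) ^ 1u));
}
```

For example, address 0x08 has one set bit, so the parity bit stays 0 and the byte is 0x10. Address 0x09 has two set bits, so the parity bit is 1 and the byte is 0x13.
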
> +static bool
> +xi3c_master_supports_ccc_cmd(struct i3c_master_controller *master,
> +                          const struct i3c_ccc_cmd *cmd)
> +{
> +     if (cmd->ndests > 1)
> +             return false;
> +
> +     switch (cmd->id) {
> +     case I3C_CCC_ENEC(true):
> +     case I3C_CCC_ENEC(false):
> +     case I3C_CCC_DISEC(true):
> +     case I3C_CCC_DISEC(false):
> +     case I3C_CCC_ENTAS(0, true):
> +     case I3C_CCC_ENTAS(0, false):
> +     case I3C_CCC_RSTDAA(true):
> +     case I3C_CCC_RSTDAA(false):
> +     case I3C_CCC_ENTDAA:
> +     case I3C_CCC_SETMWL(true):
> +     case I3C_CCC_SETMWL(false):
> +     case I3C_CCC_SETMRL(true):
> +     case I3C_CCC_SETMRL(false):
> +     case I3C_CCC_SETDASA:
> +     case I3C_CCC_SETNEWDA:
> +     case I3C_CCC_GETMWL:
> +     case I3C_CCC_GETMRL:
> +     case I3C_CCC_GETPID:
> +     case I3C_CCC_GETBCR:
> +     case I3C_CCC_GETDCR:
> +     case I3C_CCC_GETSTATUS:
> +     case I3C_CCC_GETMXDS:
> +             return true;
> +     default:
> +             return false;
> +     }
> +}
> +
> +static int xi3c_master_send_bdcast_ccc_cmd(struct xi3c_master *master,
> +                                        struct i3c_ccc_cmd *ccc)
> +{
> +     struct xi3c_xfer *xfer __free(kfree) = NULL;
> +     u8 *buf __free(kfree) = NULL;
> +     struct xi3c_cmd *cmd;
> +     u16 xfer_len;
> +     int ret;
> +
> +     if (ccc->dests[0].payload.len >= XI3C_MAXDATA_LENGTH)
> +             return -EINVAL;
> +
> +     xfer_len = ccc->dests[0].payload.len + 1;
> +
> +     xfer = xi3c_master_alloc_xfer(1);
> +     if (!xfer)
> +             return -ENOMEM;
> +
> +     buf = kmalloc(xfer_len, GFP_KERNEL);
> +     if (!buf)
> +             return -ENOMEM;
> +
> +     buf[0] = ccc->id;
> +     memcpy(&buf[1], ccc->dests[0].payload.data, ccc->dests[0].payload.len);
> +
> +     cmd = &xfer->cmds[0];
> +     cmd->addr = ccc->dests[0].addr;
> +     cmd->rnw = ccc->rnw;
> +     cmd->tx_buf = buf;
> +     cmd->tx_len = xfer_len;
> +     cmd->type = XI3C_SDR_MODE;
> +     cmd->tid = XI3C_SDR_TID;
> +     cmd->continued = false;
> +
> +     ret = xi3c_master_common_xfer(master, xfer);
> +     ccc->err = cmd->err;
> +
> +     return ret;
> +}
> +
> +static int xi3c_master_send_direct_ccc_cmd(struct xi3c_master *master,
> +                                        struct i3c_ccc_cmd *ccc)
> +{
> +     struct xi3c_xfer *xfer __free(kfree) = NULL;
> +     struct xi3c_cmd *cmd;
> +     int ret;
> +
> +     if (ccc->dests[0].payload.len > XI3C_MAXDATA_LENGTH)
> +             return -EINVAL;
> +
> +     xfer = xi3c_master_alloc_xfer(2);
> +     if (!xfer)
> +             return -ENOMEM;
> +
> +     /* Broadcasted message */
> +     cmd = &xfer->cmds[0];
> +     cmd->addr = I3C_BROADCAST_ADDR;
> +     cmd->rnw = false;
> +     cmd->tx_buf = &ccc->id;
> +     cmd->tx_len = 1;
> +     cmd->type = XI3C_SDR_MODE;
> +     cmd->tid = XI3C_SDR_TID;
> +     cmd->continued = true;
> +
> +     /* Directed message */
> +     cmd = &xfer->cmds[1];
> +     cmd->addr = ccc->dests[0].addr;
> +     cmd->rnw = ccc->rnw;
> +     if (cmd->rnw) {
> +             cmd->rx_buf = ccc->dests[0].payload.data;
> +             cmd->rx_len = ccc->dests[0].payload.len;
> +     } else {
> +             cmd->tx_buf = ccc->dests[0].payload.data;
> +             cmd->tx_len = ccc->dests[0].payload.len;
> +     }
> +     cmd->type = XI3C_SDR_MODE;
> +     cmd->tid = XI3C_SDR_TID;
> +     cmd->continued = false;
> +
> +     ret = xi3c_master_common_xfer(master, xfer);
> +
> +     /*
> +      * Report the broadcast command's error if it failed, otherwise the
> +      * directed command's, so a NACK on either phase reaches the caller.
> +      */
> +     ccc->err = xfer->cmds[0].err ? xfer->cmds[0].err : xfer->cmds[1].err;
> +
> +     return ret;
> +}
> +
> +static int xi3c_master_send_ccc_cmd(struct i3c_master_controller *m,
> +                                 struct i3c_ccc_cmd *cmd)
> +{
> +     struct xi3c_master *master = to_xi3c_master(m);
> +
> +     if (cmd->id & I3C_CCC_DIRECT)
> +             return xi3c_master_send_direct_ccc_cmd(master, cmd);
> +
> +     return xi3c_master_send_bdcast_ccc_cmd(master, cmd);
> +}
> +
> +static int xi3c_master_i3c_xfers(struct i3c_dev_desc *dev,
> +                              struct i3c_xfer *xfers,
> +                              int nxfers, enum i3c_xfer_mode mode)
> +{
> +     struct i3c_master_controller *m = i3c_dev_get_master(dev);
> +     struct xi3c_master *master = to_xi3c_master(m);
> +     struct xi3c_xfer *xfer __free(kfree) = NULL;
> +     int i, ret;
> +
> +     if (!nxfers)
> +             return 0;
> +
> +     if (mode != I3C_SDR)
> +             return -EOPNOTSUPP;
> +
> +     for (i = 0; i < nxfers; i++)
> +             if (xfers[i].len > XI3C_MAXDATA_LENGTH)
> +                     return -EINVAL;
> +
> +     xfer = xi3c_master_alloc_xfer(nxfers);
> +     if (!xfer)
> +             return -ENOMEM;
> +
> +     for (i = 0; i < nxfers; i++) {
> +             struct xi3c_cmd *cmd = &xfer->cmds[i];
> +
> +             cmd->addr = dev->info.dyn_addr;
> +             cmd->rnw = xfers[i].rnw;
> +
> +             if (cmd->rnw) {
> +                     cmd->rx_buf = xfers[i].data.in;
> +                     cmd->rx_len = xfers[i].len;
> +             } else {
> +                     cmd->tx_buf = xfers[i].data.out;
> +                     cmd->tx_len = xfers[i].len;
> +             }
> +
> +             cmd->type = XI3C_SDR_MODE;
> +             cmd->tid = XI3C_SDR_TID;
> +             cmd->continued = (i + 1) < nxfers;
> +     }
> +
> +     ret = xi3c_master_common_xfer(master, xfer);
> +
> +     for (i = 0; i < nxfers; i++) {
> +             xfers[i].err = xfer->cmds[i].err;
> +             if (xfers[i].rnw)
> +                     xfers[i].actual_len = xfers[i].len - xfer->cmds[i].rx_len;
> +     }
> +
> +     return ret;
> +}
> +
> +static int xi3c_master_i2c_xfers(struct i2c_dev_desc *dev,
> +                              struct i2c_msg *xfers,
> +                              int nxfers)
> +{
> +     struct i3c_master_controller *m = i2c_dev_get_master(dev);
> +     struct xi3c_master *master = to_xi3c_master(m);
> +     struct xi3c_xfer *xfer __free(kfree) = NULL;
> +     int i;
> +
> +     if (!nxfers)
> +             return 0;
> +
> +     for (i = 0; i < nxfers; i++)
> +             if (xfers[i].len > XI3C_MAXDATA_LENGTH)
> +                     return -EINVAL;
> +
> +     xfer = xi3c_master_alloc_xfer(nxfers);
> +     if (!xfer)
> +             return -ENOMEM;
> +
> +     for (i = 0; i < nxfers; i++) {
> +             struct xi3c_cmd *cmd = &xfer->cmds[i];
> +
> +             cmd->addr = xfers[i].addr & XI3C_ADDR_MASK;
> +             cmd->rnw = !!(xfers[i].flags & I2C_M_RD);
> +
> +             if (cmd->rnw) {
> +                     cmd->rx_buf = xfers[i].buf;
> +                     cmd->rx_len = xfers[i].len;
> +             } else {
> +                     cmd->tx_buf = xfers[i].buf;
> +                     cmd->tx_len = xfers[i].len;
> +             }
> +
> +             cmd->type = XI3C_I2C_MODE;
> +             cmd->tid = XI3C_I2C_TID;
> +             cmd->continued = (i + 1) < nxfers;
> +     }
> +
> +     return xi3c_master_common_xfer(master, xfer);
> +}
> +
> +static int xi3c_clk_cfg(struct xi3c_master *master, unsigned long sclhz, u8 mode)
> +{
> +     unsigned long core_rate, core_periodns;
> +     u32 tcasmin, tsustart, tsustop, thdstart;
> +     u32 thigh, tlow, thold;
> +     u32 odthigh, odtlow;
> +
> +     core_rate = clk_get_rate(master->pclk);
> +     if (!core_rate)
> +             return -EINVAL;
> +
> +     if (!sclhz)
> +             return -EINVAL;
> +
> +     core_periodns = DIV_ROUND_UP(NSEC_PER_SEC, core_rate);
> +
> +     thigh = DIV_ROUND_UP(core_rate, sclhz) >> 1;
> +     tlow = thigh;
> +
> +     if (thigh <= XI3C_CYCLE_ADJUST)
> +             return -EINVAL;
> +
> +     /* Hold time : 40% of tlow time */
> +     thold = (tlow * 4) / 10;
> +
> +     if (xi3c_get_revision_number(master) == 0)
> +             thold = max_t(u32, thold, XI3C_THOLD_MIN_REV0);
> +     else
> +             thold = max_t(u32, thold, XI3C_THOLD_MIN_REV1);
> +
> +     iowrite32((thigh - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +               master->membase + XI3C_SCL_HIGH_TIME_OFFSET);
> +     iowrite32((tlow - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +               master->membase + XI3C_SCL_LOW_TIME_OFFSET);
> +     iowrite32((thold - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +               master->membase + XI3C_SDA_HOLD_TIME_OFFSET);
> +
> +     if (mode == XI3C_I2C_MODE) {
> +             iowrite32((thigh - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +                       master->membase + XI3C_OD_SCL_HIGH_TIME_OFFSET);
> +             iowrite32((tlow - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +                       master->membase + XI3C_OD_SCL_LOW_TIME_OFFSET);
> +
> +             tcasmin = DIV_ROUND_UP(XI3C_I2C_TCASMIN_NS, core_periodns);
> +     } else {
> +             odtlow = DIV_ROUND_UP(XI3C_OD_TLOW_NS, core_periodns);
> +             odthigh = DIV_ROUND_UP(XI3C_OD_THIGH_NS, core_periodns);
> +
> +             odtlow = max(tlow, odtlow);
> +             odthigh = min(thigh, odthigh);
> +
> +             if (odthigh <= XI3C_CYCLE_ADJUST)
> +                     return -EINVAL;
> +
> +             iowrite32((odthigh - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +                       master->membase + XI3C_OD_SCL_HIGH_TIME_OFFSET);
> +             iowrite32((odtlow - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +                       master->membase + XI3C_OD_SCL_LOW_TIME_OFFSET);
> +
> +             tcasmin = DIV_ROUND_UP(XI3C_TCASMIN_NS, core_periodns);
> +     }
> +
> +     thdstart = max(thigh, tcasmin);
> +     tsustart = max(tlow, tcasmin);
> +     tsustop = max(tlow, tcasmin);
> +
> +     iowrite32((tsustart - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +               master->membase + XI3C_TSU_START_OFFSET);
> +     iowrite32((thdstart - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +               master->membase + XI3C_THD_START_OFFSET);
> +     iowrite32((tsustop - XI3C_CYCLE_ADJUST) & XI3C_TIMING_MASK,
> +               master->membase + XI3C_TSU_STOP_OFFSET);
> +
> +     return 0;
> +}
> +
> +static int xi3c_master_bus_init(struct i3c_master_controller *m)
> +{
> +     struct xi3c_master *master = to_xi3c_master(m);
> +     struct i3c_bus *bus = i3c_master_get_bus(m);
> +     struct i3c_device_info info = {};
> +     unsigned long sclhz;
> +     u32 pid1_bcr_dcr;
> +     u8 mode;
> +     int ret;
> +
> +     switch (bus->mode) {
> +     case I3C_BUS_MODE_MIXED_FAST:
> +     case I3C_BUS_MODE_MIXED_LIMITED:
> +     case I3C_BUS_MODE_MIXED_SLOW:
> +             mode = XI3C_I2C_MODE;
> +             sclhz = bus->scl_rate.i2c;
> +             break;
> +     case I3C_BUS_MODE_PURE:
> +             mode = XI3C_SDR_MODE;
> +             sclhz = bus->scl_rate.i3c;
> +             break;
> +     default:
> +             return -EINVAL;
> +     }
> +
> +     ret = xi3c_clk_cfg(master, sclhz, mode);
> +     if (ret)
> +             return ret;
> +
> +     xi3c_master_init(master);
> +
> +     /* Get an address for the master. */
> +     ret = i3c_master_get_free_addr(m, 0);
> +     if (ret < 0)
> +             return ret;
> +
> +     info.dyn_addr = ret;
> +
> +     /* Write the dynamic address value to the address register. */
> +     iowrite32(info.dyn_addr, master->membase + XI3C_ADDRESS_OFFSET);
> +
> +     /* Read PID, BCR and DCR values, and assign to i3c device info. */
> +     pid1_bcr_dcr = ioread32(master->membase + XI3C_PID1_BCR_DCR);
> +     info.pid = ((u64)FIELD_GET(XI3C_PID1_MASK, pid1_bcr_dcr) << 32) |
> +                ioread32(master->membase + XI3C_PID0_OFFSET);
> +     info.bcr = FIELD_GET(XI3C_BCR_MASK, pid1_bcr_dcr);
> +     info.dcr = FIELD_GET(XI3C_DCR_MASK, pid1_bcr_dcr);
> +
> +     return i3c_master_set_info(&master->base, &info);
> +}
> +
> +static void xi3c_master_bus_cleanup(struct i3c_master_controller *m)
> +{
> +     struct xi3c_master *master = to_xi3c_master(m);
> +
> +     xi3c_master_disable(master);
> +}
> +
> +static const struct i3c_master_controller_ops xi3c_master_ops = {
> +     .bus_init = xi3c_master_bus_init,
> +     .bus_cleanup = xi3c_master_bus_cleanup,
> +     .do_daa = xi3c_master_do_daa,
> +     .supports_ccc_cmd = xi3c_master_supports_ccc_cmd,
> +     .send_ccc_cmd = xi3c_master_send_ccc_cmd,
> +     .i3c_xfers = xi3c_master_i3c_xfers,
> +     .i2c_xfers = xi3c_master_i2c_xfers,
> +};
> +
> +static int xi3c_master_probe(struct platform_device *pdev)
> +{
> +     struct xi3c_master *master;
> +     int ret;
> +
> +     master = devm_kzalloc(&pdev->dev, sizeof(*master), GFP_KERNEL);
> +     if (!master)
> +             return -ENOMEM;
> +
> +     master->dev = &pdev->dev;
> +
> +     master->membase = devm_platform_ioremap_resource(pdev, 0);
> +     if (IS_ERR(master->membase))
> +             return dev_err_probe(master->dev, PTR_ERR(master->membase),
> +                                  "Failed to map registers\n");
> +
> +     master->pclk = devm_clk_get_enabled(master->dev, NULL);
> +     if (IS_ERR(master->pclk))
> +             return dev_err_probe(master->dev, PTR_ERR(master->pclk),
> +                                  "Failed to get and enable clock\n");
> +
> +     ret = devm_mutex_init(master->dev, &master->lock);
> +     if (ret)
> +             return ret;
> +
> +     platform_set_drvdata(pdev, master);
> +
> +     return i3c_master_register(&master->base, master->dev,
> +                                &xi3c_master_ops, false);
> +}
> +
> +static void xi3c_master_remove(struct platform_device *pdev)
> +{
> +     struct xi3c_master *master = platform_get_drvdata(pdev);
> +
> +     i3c_master_unregister(&master->base);
> +}
> +
> +static const struct of_device_id xi3c_master_of_ids[] = {
> +     { .compatible = "xlnx,axi-i3c-1.0" },
> +     { },
> +};
> +MODULE_DEVICE_TABLE(of, xi3c_master_of_ids);
> +
> +static struct platform_driver xi3c_master_driver = {
> +     .probe = xi3c_master_probe,
> +     .remove = xi3c_master_remove,
> +     .driver = {
> +             .name = "axi-i3c-master",
> +             .of_match_table = xi3c_master_of_ids,
> +     },
> +};
> +module_platform_driver(xi3c_master_driver);
> +
> +MODULE_AUTHOR("Manikanta Guntupalli <manikanta.guntupalli@amd.com>");
> +MODULE_AUTHOR("Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>");
> +MODULE_AUTHOR("Shubham Patil <shubhamsanjay.patil@amd.com>");
I don't agree with adding new authors in V9.

This driver is already part of the downstream kernel and is being used:
https://github.com/Xilinx/linux-xlnx/blob/master/drivers/i3c/master/amd-i3c-master.c

The main purpose of V9 is to drop the framework-level support added in recent versions. As a result, the V9 patch is mostly aligned with the initial patch versions, which carried no framework support changes.

Thanks,
Manikanta

> +MODULE_DESCRIPTION("AMD AXI I3C master driver");
> +MODULE_LICENSE("GPL");
> --
> 2.34.1

