* [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

This series relocates the dumpable mode and the user_namespace
captured at execve() from mm_struct onto a new per-task
task_exec_state structure that stays attached to the task for its
full lifetime.

__ptrace_may_access() and several /proc owner / visibility checks
need to consult two pieces of state for any observable task,
including zombies that have already gone through exit_mm(): the
dumpable mode and the user namespace captured at execve(). Both
live on mm_struct today, which exit_mm() clears from the task long
before the task is reaped.

A reader that races with do_exit() observes task->mm == NULL and
either fails the check or falls back to init_user_ns - which denies
legitimate access to non-dumpable zombies that were running in a
nested user namespace.

mm_struct loses ->user_ns and the dumpability bits in ->flags.
MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed
via /proc/<pid>/coredump_filter stays stable. task->user_dumpable and
its exit_mm() snapshot are removed.

task_exec_state is the privilege domain established by an execve()
[1]. Within a thread group it is shared via refcount; across thread
groups each task has its own:

  - CLONE_VM siblings (thread-group members, io_uring workers)
    refcount-share the parent's exec_state.

  - Non-CLONE_VM clones (fork(), CLONE_VFORK without CLONE_VM)
    allocate a fresh exec_state inheriting the parent's dumpable
    mode and user_ns.

  - execve() in the child allocates a fresh instance and installs
    it under task_lock + exec_update_lock via
    task_exec_state_replace().

  - Credential changes (setresuid, capset, ...) and
    prctl(PR_SET_DUMPABLE) update dumpability on the current
    task's exec_state, i.e. on the thread group's shared instance.
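
The sharing rules in the list above can be modelled in user space. What follows is a minimal sketch under stated assumptions, not the code from this series: `struct exec_state`, `exec_state_alloc()`, `exec_state_put()` and `clone_exec_state()` are hypothetical stand-ins, a plain integer replaces refcount_t, and the user namespace is reduced to an opaque id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct task_exec_state. */
struct exec_state {
	int count;	/* plain int instead of refcount_t */
	int dumpable;	/* 0 = off, 1 = owner, 2 = root */
	int user_ns_id;	/* opaque id instead of struct user_namespace * */
};

static struct exec_state *exec_state_alloc(int dumpable, int user_ns_id)
{
	struct exec_state *es = malloc(sizeof(*es));

	if (!es)
		return NULL;
	es->count = 1;
	es->dumpable = dumpable;
	es->user_ns_id = user_ns_id;
	return es;
}

/* Drop one reference; free the instance when the last one goes away. */
static void exec_state_put(struct exec_state *es)
{
	if (es && --es->count == 0)
		free(es);
}

/*
 * Model of the clone-time rule: CLONE_VM siblings share the parent's
 * instance by taking a reference; every other clone gets a fresh
 * instance that inherits the dumpable mode and the user namespace.
 */
static struct exec_state *clone_exec_state(struct exec_state *parent,
					   bool clone_vm)
{
	assert(parent);
	if (clone_vm) {
		parent->count++;
		return parent;
	}
	return exec_state_alloc(parent->dumpable, parent->user_ns_id);
}
```

In this model a change of `dumpable` through a CLONE_VM sibling is visible to the whole thread group, while a forked child can change its own copy without affecting the parent, which is the property the v3 changelog cites for the zygote64 case.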

Behavioral change:

Kernel threads that briefly use a user mm via kthread_use_mm() no
longer inherit dumpability from the borrowed mm. Kthreads are not
ptraceable (PF_KTHREAD short-circuits __ptrace_may_access), so this
is observable only via /proc surfaces that a sufficiently privileged
reader can reach.

[1] https://lore.kernel.org/r/CAHk-=wj+NgoDH3GSicJ140SV8OoDd71pLmL3fgFEsTcgoMC6Og@mail.gmail.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Changes in v3:
- Restore alloc-fresh-and-inherit semantics for non-CLONE_VM clones.
  CLONE_VM siblings still refcount-share; fork() and other
  non-CLONE_VM clones get a fresh exec_state that inherits the
  parent's dumpable mode and user_ns. The v2 "every clone
  refcount-shares" model would have let any forked process in an
  Android zygote64 subtree influence dumpability of its siblings
  via prctl(PR_SET_DUMPABLE).
- Link to v2: https://patch.msgid.link/20260520-work-task_exec_state-v2-0-9ea88ceb09e6@kernel.org

Changes in v2:
- Drop dup-on-fork for non-CLONE_VM clones: every clone() variant
  refcount-shares the parent's task_exec_state; only execve()
  allocates a fresh one. See "Behavioral changes" in the cover
  letter for the implications.
- Switch commit_creds() to update dumpability on the new
  task_exec_state (instead of dropping the set_dumpable() call
  entirely as in v1). Drops the explicit smp_wmb()/smp_rmb() pair;
  RCU acquire/release on the cred pointer provides the ordering.
- Link to v1: https://patch.msgid.link/20260516-work-exit_mm-v1-1-76bcc7c2439d@kernel.org
---
Christian Brauner (Amutable) (4):
sched/coredump: introduce enum task_dumpable
exec: introduce struct task_exec_state
ptrace: add ptracer_access_allowed()
exec_state: relocate dumpable information
arch/arm64/kernel/mte.c | 6 +-
drivers/firmware/efi/efi.c | 1 -
fs/coredump.c | 22 +++-----
fs/exec.c | 39 ++++++-------
fs/pidfs.c | 23 +++-----
fs/proc/base.c | 39 ++++++-------
include/linux/binfmts.h | 2 +
include/linux/coredump.h | 4 ++
include/linux/mm_types.h | 9 ++-
include/linux/ptrace.h | 1 +
include/linux/sched.h | 6 +-
include/linux/sched/coredump.h | 47 ++++------------
include/linux/sched/exec_state.h | 29 ++++++++++
init/init_task.c | 10 ++++
kernel/Makefile | 2 +-
kernel/cred.c | 3 +-
kernel/exec_state.c | 116 +++++++++++++++++++++++++++++++++++++++
kernel/exit.c | 1 -
kernel/fork.c | 32 +++++++++--
kernel/kthread.c | 1 -
kernel/ptrace.c | 53 ++++++++++++------
kernel/sys.c | 6 +-
mm/init-mm.c | 1 -
23 files changed, 301 insertions(+), 152 deletions(-)
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260520-work-task_exec_state-83209d8b3e53

* [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
enum task_dumpable. Numeric values are preserved (fs.suid_dumpable
sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
no behavioral change.

Subsequent commits relocate dumpability onto a per-task structure
where the enum type will allow stronger type-checking on the new API.

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 arch/arm64/kernel/mte.c        |  2 +-
 fs/coredump.c                  |  4 ++--
 fs/exec.c                      |  8 ++++----
 fs/pidfs.c                     |  6 +++---
 fs/proc/base.c                 |  2 +-
 include/linux/mm_types.h       |  2 +-
 include/linux/sched/coredump.h | 15 +++++++++++----
 kernel/exit.c                  |  2 +-
 kernel/ptrace.c                |  4 ++--
 kernel/sys.c                   |  2 +-
 10 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 6874b16d0657..904ac41f93bc 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -538,7 +538,7 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr,
 		return -EPERM;

 	if (!tsk->ptrace || (current != tsk->parent) ||
-	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
 	     !ptracer_capable(tsk, mm->user_ns))) {
 		mmput(mm);
 		return -EPERM;
diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f458e..f5348d5bc441 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -873,7 +873,7 @@ static inline bool coredump_socket(struct core_name *cn, struct coredump_params
 static inline bool coredump_force_suid_safe(const struct coredump_params *cprm)
 {
 	/* Require nonrelative corefile path and be extra careful. */
-	return __get_dumpable(cprm->mm_flags) == SUID_DUMP_ROOT;
+	return __get_dumpable(cprm->mm_flags) == TASK_DUMPABLE_ROOT;
 }

 static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
@@ -1419,7 +1419,7 @@ EXPORT_SYMBOL(dump_align);

 void validate_coredump_safety(void)
 {
-	if (suid_dumpable == SUID_DUMP_ROOT &&
+	if (suid_dumpable == TASK_DUMPABLE_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|' &&
 	    core_pattern[0] != '@') {
 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
diff --git a/fs/exec.c b/fs/exec.c
index ba12b4c466f6..f5663bb607d3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1212,7 +1212,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	      gid_eq(current_egid(), current_gid())))
 		set_dumpable(current->mm, suid_dumpable);
 	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
+		set_dumpable(current->mm, TASK_DUMPABLE_OWNER);

 	perf_event_exec();

@@ -1261,7 +1261,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * wait until new credentials are committed
 	 * by commit_creds() above
 	 */
-	if (get_dumpable(me->mm) != SUID_DUMP_USER)
+	if (get_dumpable(me->mm) != TASK_DUMPABLE_OWNER)
 		perf_event_exit_task(me);
 	/*
 	 * cred_guard_mutex must be held at least to this point to prevent
@@ -1906,11 +1906,11 @@ void set_binfmt(struct linux_binfmt *new)
 EXPORT_SYMBOL(set_binfmt);

 /*
- * set_dumpable stores three-value SUID_DUMP_* into mm->flags.
+ * set_dumpable stores three-value TASK_DUMPABLE_* into mm->flags.
  */
 void set_dumpable(struct mm_struct *mm, int value)
 {
-	if (WARN_ON((unsigned)value > SUID_DUMP_ROOT))
+	if (WARN_ON((unsigned)value > TASK_DUMPABLE_ROOT))
 		return;

 	__mm_flags_set_mask_dumpable(mm, value);
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 1cce4f34a051..9cd12f2f004c 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -341,11 +341,11 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 static __u32 pidfs_coredump_mask(unsigned long mm_flags)
 {
 	switch (__get_dumpable(mm_flags)) {
-	case SUID_DUMP_USER:
+	case TASK_DUMPABLE_OWNER:
 		return PIDFD_COREDUMP_USER;
-	case SUID_DUMP_ROOT:
+	case TASK_DUMPABLE_ROOT:
 		return PIDFD_COREDUMP_ROOT;
-	case SUID_DUMP_DISABLE:
+	case TASK_DUMPABLE_OFF:
 		return PIDFD_COREDUMP_SKIP;
 	default:
 		WARN_ON_ONCE(true);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c894..da0b316befb8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1909,7 +1909,7 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	mm = task->mm;
 	/* Make non-dumpable tasks owned by some root */
 	if (mm) {
-		if (get_dumpable(mm) != SUID_DUMP_USER) {
+		if (get_dumpable(mm) != TASK_DUMPABLE_OWNER) {
 			struct user_namespace *user_ns = mm->user_ns;

 			uid = make_kuid(user_ns, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..51ea37b2a0aa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1908,7 +1908,7 @@ enum {

 /*
  * The first two bits represent core dump modes for set-user-ID,
- * the modes are SUID_DUMP_* defined in linux/sched/coredump.h
+ * the modes are TASK_DUMPABLE_* defined in linux/sched/coredump.h
  */
 #define MMF_DUMPABLE_BITS 2
 #define MMF_DUMPABLE_MASK (BIT(MMF_DUMPABLE_BITS) - 1)
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 624fda17a785..ed6547692b61 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -4,9 +4,16 @@

 #include <linux/mm_types.h>

-#define SUID_DUMP_DISABLE	0	/* No setuid dumping */
-#define SUID_DUMP_USER		1	/* Dump as user of process */
-#define SUID_DUMP_ROOT		2	/* Dump as root */
+/*
+ * Task dumpability mode. Gates core dump production and ptrace_attach()
+ * authorization. The numeric values are stable ABI (suid_dumpable
+ * sysctl, prctl(PR_SET_DUMPABLE)); do not renumber.
+ */
+enum task_dumpable {
+	TASK_DUMPABLE_OFF	= 0,	/* no dump; ptrace needs CAP_SYS_PTRACE */
+	TASK_DUMPABLE_OWNER	= 1,	/* default; dump and ptrace by uid match */
+	TASK_DUMPABLE_ROOT	= 2,	/* dump as root; ptrace needs CAP_SYS_PTRACE */
+};

 static inline unsigned long __mm_flags_get_dumpable(const struct mm_struct *mm)
 {
@@ -26,7 +33,7 @@ extern void set_dumpable(struct mm_struct *mm, int value);
 /*
  * This returns the actual value of the suid_dumpable flag. For things
  * that are using this for checking for privilege transitions, it must
- * test against SUID_DUMP_USER rather than treating it as a boolean
+ * test against TASK_DUMPABLE_OWNER rather than treating it as a boolean
  * value.
  */
 static inline int __get_dumpable(unsigned long mm_flags)
diff --git a/kernel/exit.c b/kernel/exit.c
index f50d73c272d6..507eda655e8d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -571,7 +571,7 @@ static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
-	current->user_dumpable = (get_dumpable(mm) == SUID_DUMP_USER);
+	current->user_dumpable = (get_dumpable(mm) == TASK_DUMPABLE_OWNER);
 	current->mm = NULL;
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 130043bfc209..07398c9c8fe3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -53,7 +53,7 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,

 	if (!tsk->ptrace ||
 	    (current != tsk->parent) ||
-	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
 	     !ptracer_capable(tsk, mm->user_ns))) {
 		mmput(mm);
 		return 0;
@@ -276,7 +276,7 @@ static bool task_still_dumpable(struct task_struct *task, unsigned int mode)
 {
 	struct mm_struct *mm = task->mm;
 	if (mm) {
-		if (get_dumpable(mm) == SUID_DUMP_USER)
+		if (get_dumpable(mm) == TASK_DUMPABLE_OWNER)
 			return true;
 		return ptrace_has_cap(mm->user_ns, mode);
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..f1189f719db5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2568,7 +2568,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = get_dumpable(me->mm);
 		break;
 	case PR_SET_DUMPABLE:
-		if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
+		if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER) {
 			error = -EINVAL;
 			break;
 		}
--
2.47.3
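
The ABI constraint that the rename must preserve can be observed from user space. The following is a small sketch, assuming a Linux system with <sys/prctl.h>; `set_and_check_dumpable()` is a hypothetical helper name, and the numeric arguments correspond to TASK_DUMPABLE_OFF (0), TASK_DUMPABLE_OWNER (1) and TASK_DUMPABLE_ROOT (2), which are kernel-internal names not visible to user space.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: try to set the calling process' dumpable mode
 * and read it back. Returns 0 on success, -errno if the kernel
 * rejects the value, or -EIO if the value read back differs.
 */
static int set_and_check_dumpable(unsigned long mode)
{
	if (prctl(PR_SET_DUMPABLE, mode, 0, 0, 0) < 0)
		return -errno;
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == (int)mode ? 0 : -EIO;
}
```

Per the PR_SET_DUMPABLE hunk in kernel/sys.c above, only 0 and 1 are accepted from user space; passing 2 is expected to fail with EINVAL, both before and after the rename.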

* Re: [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable
From: David Hildenbrand (Arm) @ 2026-05-22 22:14 UTC
To: Christian Brauner (Amutable), Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On 5/20/26 23:48, Christian Brauner (Amutable) wrote:
> Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
> enum task_dumpable. Numeric values are preserved (fs.suid_dumpable
> sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
> no behavioral change.
>
> Subsequent commits relocate dumpability onto a per-task structure
> where the enum type will allow stronger type-checking on the new API.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

--
Cheers,

David

* [PATCH RFC v3 2/4] exec: introduce struct task_exec_state
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

Introduce struct task_exec_state, a per-task RCU-protected structure
that holds the dumpable mode and stays attached to the task for its
full lifetime.

task_exec_state_rcu() is the canonical reader: asserts RCU or
task_lock is held, WARNs on a NULL state, returns the
rcu_dereference()'d pointer.

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/sched.h            |   2 +
 include/linux/sched/exec_state.h |  31 +++++++++++
 kernel/Makefile                  |   2 +-
 kernel/exec_state.c              | 116 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..6674dbf960b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -962,6 +962,8 @@ struct task_struct {
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;

+	struct task_exec_state __rcu	*exec_state;
+
 	int				exit_state;
 	int				exit_code;
 	int				exit_signal;
diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
new file mode 100644
index 000000000000..dc5a795cbfe2
--- /dev/null
+++ b/include/linux/sched/exec_state.h
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#ifndef _LINUX_SCHED_EXEC_STATE_H
+#define _LINUX_SCHED_EXEC_STATE_H
+
+#include <linux/init.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/sched/coredump.h>
+#include <linux/user_namespace.h>
+
+struct task_exec_state {
+	refcount_t count;
+	enum task_dumpable dumpable;
+	struct user_namespace *user_ns;
+	struct rcu_head rcu;
+};
+
+struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns);
+void put_task_exec_state(struct task_exec_state *exec_state);
+struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
+struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
+						struct task_exec_state *exec_state);
+void task_exec_state_set_dumpable(enum task_dumpable value);
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
+int task_exec_state_copy(struct task_struct *tsk);
+void __init exec_state_init(void);
+
+DEFINE_FREE(put_task_exec_state, struct task_exec_state *, put_task_exec_state(_T))
+
+#endif /* _LINUX_SCHED_EXEC_STATE_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..1e1a31673577 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -3,7 +3,7 @@
 # Makefile for the linux kernel.
 #

-obj-y     = fork.o exec_domain.o panic.o \
+obj-y     = fork.o exec_domain.o exec_state.o panic.o \
 	    cpu.o exit.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
diff --git a/kernel/exec_state.c b/kernel/exec_state.c
new file mode 100644
index 000000000000..a0ca5d913900
--- /dev/null
+++ b/kernel/exec_state.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#include <linux/init.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/sched.h>
+#include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+
+static struct kmem_cache *task_exec_state_cachep;
+
+static void __free_task_exec_state(struct rcu_head *rcu)
+{
+	struct task_exec_state *exec_state = container_of(rcu, struct task_exec_state, rcu);
+
+	put_user_ns(exec_state->user_ns);
+	kmem_cache_free(task_exec_state_cachep, exec_state);
+}
+
+void put_task_exec_state(struct task_exec_state *exec_state)
+{
+	if (exec_state && refcount_dec_and_test(&exec_state->count))
+		call_rcu(&exec_state->rcu, __free_task_exec_state);
+}
+
+struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns)
+{
+	struct task_exec_state *exec_state;
+
+	exec_state = kmem_cache_alloc(task_exec_state_cachep, GFP_KERNEL);
+	if (!exec_state)
+		return NULL;
+	refcount_set(&exec_state->count, 1);
+	exec_state->dumpable = TASK_DUMPABLE_OFF;
+	exec_state->user_ns = get_user_ns(user_ns);
+	return exec_state;
+}
+
+struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk)
+{
+	struct task_exec_state *exec_state;
+
+	exec_state = rcu_dereference_check(tsk->exec_state,
+					   lockdep_is_held(&tsk->alloc_lock));
+	WARN_ON_ONCE(!exec_state);
+	return exec_state;
+}
+
+struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
+						struct task_exec_state *exec_state)
+{
+	/*
+	 * Updates must hold both locks so callers needing a consistent
+	 * snapshot of mm + dumpability are covered.
+	 */
+	lockdep_assert_held(&tsk->alloc_lock);
+	lockdep_assert_held_write(&tsk->signal->exec_update_lock);
+
+	return rcu_replace_pointer(tsk->exec_state, exec_state, true);
+}
+
+/*
+ * The non-CLONE_VM clone path: allocate a fresh exec_state and
+ * inherit the parent's dumpable mode and user_ns reference. CLONE_VM
+ * siblings refcount-share via copy_exec_state() in fork.c; only this
+ * path and execve() ever allocate.
+ */
+int task_exec_state_copy(struct task_struct *tsk)
+{
+	struct task_exec_state *src, *dst;
+
+	src = rcu_dereference_protected(current->exec_state, true);
+	dst = alloc_task_exec_state(src->user_ns);
+	if (!dst)
+		return -ENOMEM;
+	dst->dumpable = src->dumpable;
+	rcu_assign_pointer(tsk->exec_state, dst);
+	return 0;
+}
+
+/*
+ * Store TASK_DUMPABLE_* on current->exec_state. All callers
+ * (commit_creds, begin_new_exec, prctl(PR_SET_DUMPABLE)) act on the
+ * running task, which guarantees ->exec_state is allocated and cannot
+ * be replaced under us.
+ */
+void task_exec_state_set_dumpable(enum task_dumpable value)
+{
+	struct task_exec_state *exec_state;
+
+	if (WARN_ON(value > TASK_DUMPABLE_ROOT))
+		value = TASK_DUMPABLE_OFF;
+
+	exec_state = rcu_dereference_protected(current->exec_state, true);
+	WRITE_ONCE(exec_state->dumpable, value);
+}
+
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task)
+{
+	struct task_exec_state *exec_state;
+
+	guard(rcu)();
+	exec_state = rcu_dereference(task->exec_state);
+	return READ_ONCE(exec_state->dumpable);
+}
+
+void __init exec_state_init(void)
+{
+	task_exec_state_cachep = kmem_cache_create("task_exec_state",
+				sizeof(struct task_exec_state), 0,
+				SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT,
+				NULL);
+}
--
2.47.3

* Re: [PATCH RFC v3 2/4] exec: introduce struct task_exec_state
From: Oleg Nesterov @ 2026-05-22 15:00 UTC
To: Christian Brauner (Amutable)
Cc: Jann Horn, Linus Torvalds, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On 05/20, Christian Brauner (Amutable) wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -962,6 +962,8 @@ struct task_struct {
>  	struct mm_struct		*mm;
>  	struct mm_struct		*active_mm;
>
> +	struct task_exec_state __rcu	*exec_state;

Sorry if this was already discussed...

Can't we (later) move exec_state into signal_struct?

AFAICS, the only complication is that task_still_dumpable/etc can't use
tsk->signal->exec_state if tsk was already reaped (another thread can do
exec_mmap() after that). But perhaps we can rely on pid_alive() check
and return an error if the task has already passed __unhash_process() ?

Oleg.

* Re: [PATCH RFC v3 2/4] exec: introduce struct task_exec_state
From: Christian Brauner @ 2026-05-26 7:16 UTC
To: Oleg Nesterov
Cc: Jann Horn, Linus Torvalds, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On Fri, May 22, 2026 at 05:00:10PM +0200, Oleg Nesterov wrote:
> On 05/20, Christian Brauner (Amutable) wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -962,6 +962,8 @@ struct task_struct {
> >  	struct mm_struct		*mm;
> >  	struct mm_struct		*active_mm;
> >
> > +	struct task_exec_state __rcu	*exec_state;
>
> Sorry if this was already discussed...
>
> Can't we (later) move exec_state into signal_struct?
>
> AFAICS, the only complication is that task_still_dumpable/etc can't use
> tsk->signal->exec_state if tsk was already reaped (another thread can do
> exec_mmap() after that). But perhaps we can rely on pid_alive() check
> and return an error if the task has already passed __unhash_process() ?

So Jann pointed out a problem with this in
https://lore.kernel.org/CAG48ez0Gz_GghVeVzaixAQRNYBdWHYEj3K6FXBSzc+8WNsFxtA@mail.gmail.com

I quote:

  I think signal_struct is not unshared on exec; so in this sequence of events:

  - task T1 is a non-dumpable task
  - task T1 creates another thread T2
  - T2 exits
  - T1 goes through execve and becomes dumpable

  I believe T1 and T2 are still associated with the same signal_struct,
  which means that even though T2 is part of the pre-execve process, it
  shares state with the post-execve process and it would wrongly be
  considered dumpable.

* Re: [PATCH RFC v3 2/4] exec: introduce struct task_exec_state
From: Oleg Nesterov @ 2026-05-26 8:17 UTC
To: Christian Brauner
Cc: Jann Horn, Linus Torvalds, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On 05/26, Christian Brauner wrote:
>
> On Fri, May 22, 2026 at 05:00:10PM +0200, Oleg Nesterov wrote:
> >
> > Can't we (later) move exec_state into signal_struct?
> >
> > AFAICS, the only complication is that task_still_dumpable/etc can't use
> > tsk->signal->exec_state if tsk was already reaped (another thread can do
> > exec_mmap() after that). But perhaps we can rely on pid_alive() check
> > and return an error if the task has already passed __unhash_process() ?
>
> So Jann pointed out a problem with this in
> https://lore.kernel.org/CAG48ez0Gz_GghVeVzaixAQRNYBdWHYEj3K6FXBSzc+8WNsFxtA@mail.gmail.com

Aha, thanks. This is basically what I have said above.

> I quote:
>
> I think signal_struct is not unshared on exec; so in this sequence of events:
>
> - task T1 is a non-dumpable task
> - task T1 creates another thread T2
> - T2 exits
> - T1 goes through execve and becomes dumpable

Note that T1 can call exec_mmap/etc only after T2 is already reaped,
that is why I said we need to use something like pid_alive() check.

OK, let's forget it for now, this needs some changes in release_task()
path, perhaps makes no sense.

FWIW, your series + additional fixlets you sent in reply to 0/4 look
good to me, I see nothing wrong.

Oleg.

* Re: [PATCH RFC v3 2/4] exec: introduce struct task_exec_state
From: David Hildenbrand (Arm) @ 2026-05-22 22:21 UTC
To: Christian Brauner (Amutable), Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On 5/20/26 23:48, Christian Brauner (Amutable) wrote:
> Introduce struct task_exec_state, a per-task RCU-protected structure
> that holds the dumpable mode

... and the user namespace, and ...

> and stays attached to the task for its
> full lifetime.
>
> task_exec_state_rcu() is the canonical reader: asserts RCU or
> task_lock is held, WARNs on a NULL state, returns the
> rcu_dereference()'d pointer.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---

[...]

> +
> +/*
> + * Store TASK_DUMPABLE_* on current->exec_state. All callers
> + * (commit_creds, begin_new_exec, prctl(PR_SET_DUMPABLE)) act on the
> + * running task, which guarantees ->exec_state is allocated and cannot
> + * be replaced under us.
> + */
> +void task_exec_state_set_dumpable(enum task_dumpable value)
> +{
> +	struct task_exec_state *exec_state;
> +
> +	if (WARN_ON(value > TASK_DUMPABLE_ROOT))

WARN_ON_ONCE() ? Not that I think this would really trigger ;)

> +		value = TASK_DUMPABLE_OFF;
> +
> +	exec_state = rcu_dereference_protected(current->exec_state, true);
> +	WRITE_ONCE(exec_state->dumpable, value);
> +}
> +

Nothing jumped at me ... but it's just after midnight ... :)

--
Cheers,

David
* [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed()
2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
2026-05-20 21:48 ` [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable Christian Brauner (Amutable)
2026-05-20 21:48 ` [PATCH RFC v3 2/4] exec: introduce struct task_exec_state Christian Brauner (Amutable)
@ 2026-05-20 21:48 ` Christian Brauner (Amutable)
2026-05-22 15:08 ` Oleg Nesterov
2026-05-22 22:32 ` David Hildenbrand (Arm)
2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
3 siblings, 2 replies; 16+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

Add a helper that encapsulates all of the logic for checking ptrace
access and remove open-coded versions in follow-up patches.

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/ptrace.h |  1 +
 kernel/ptrace.c        | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 90507d4afcd6..ef314f7a9ecc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -17,6 +17,7 @@ struct syscall_info {
 	struct seccomp_data	data;
 };
 
+bool ptracer_access_allowed(struct task_struct *tsk);
 extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 			    void *buf, int len, unsigned int gup_flags);
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 07398c9c8fe3..0e1f80f73a7f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/task.h>
 #include <linux/errno.h>
 #include <linux/mm.h>
@@ -36,6 +37,32 @@
 
 #include <asm/syscall.h>	/* for syscall_get_* */
 
+/**
+ * ptracer_access_allowed - may current peek/poke @tsk's address space?
+ * @tsk: tracee
+ *
+ * Per-access check used by ptrace_access_vm() and architecture-specific
+ * tag/register accessors. Returns true iff current is the registered
+ * ptracer of @tsk and either @tsk is owner-dumpable or current holds
+ * CAP_SYS_PTRACE in @tsk's exec namespace. Lighter than
+ * __ptrace_may_access(): it re-validates only dumpability and
+ * capability on every access, without re-running LSM hooks or
+ * cred_cap_issubset() checks performed at attach time.
+ */
+bool ptracer_access_allowed(struct task_struct *tsk)
+{
+	const struct task_exec_state *es;
+
+	if (!tsk->ptrace)
+		return false;
+	if (current != tsk->parent)
+		return false;
+	guard(rcu)();
+	es = task_exec_state_rcu(tsk);
+	return READ_ONCE(es->dumpable) == TASK_DUMPABLE_OWNER ||
+	       ptracer_capable(tsk, es->user_ns);
+}
+
 /*
  * Access another process' address space via ptrace.
  * Source/target buffer must be kernel space,
-- 
2.47.3

^ permalink raw reply related [flat|nested] 16+ messages in thread
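The decision order in ptracer_access_allowed() can be modelled in userspace to make the truth table easy to check. This is a minimal sketch, not kernel code: struct model_task, its fields, model_access_allowed() and model_selfcheck() are hypothetical stand-ins, the ptracer_capable() result is reduced to a boolean, and RCU protection of the exec_state read is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric values as the kernel's enum task_dumpable. */
enum task_dumpable {
	TASK_DUMPABLE_OFF = 0,
	TASK_DUMPABLE_OWNER = 1,
	TASK_DUMPABLE_ROOT = 2,
};

/* Hypothetical userspace stand-in for the fields the helper reads. */
struct model_task {
	const struct model_task *parent;  /* models tsk->parent */
	bool ptraced;                     /* models tsk->ptrace != 0 */
	enum task_dumpable dumpable;      /* models exec_state->dumpable */
	bool tracer_capable;              /* models ptracer_capable(tsk, es->user_ns) */
};

/* Same decision order as ptracer_access_allowed(); no RCU or locking here. */
static bool model_access_allowed(const struct model_task *cur,
				 const struct model_task *tsk)
{
	if (!tsk->ptraced)
		return false;
	if (cur != tsk->parent)
		return false;
	return tsk->dumpable == TASK_DUMPABLE_OWNER || tsk->tracer_capable;
}

/* Walks the four interesting cases; aborts via assert() on a mismatch. */
static void model_selfcheck(void)
{
	struct model_task tracer = { .parent = NULL };
	struct model_task other = { .parent = NULL };
	struct model_task tracee = {
		.parent = &tracer,
		.ptraced = true,
		.dumpable = TASK_DUMPABLE_OWNER,
	};

	/* Registered tracer of an owner-dumpable tracee: allowed. */
	assert(model_access_allowed(&tracer, &tracee));
	/* Some other task that is not the registered tracer: denied. */
	assert(!model_access_allowed(&other, &tracee));

	/* Tracee is no longer owner-dumpable and tracer has no capability. */
	tracee.dumpable = TASK_DUMPABLE_ROOT;
	assert(!model_access_allowed(&tracer, &tracee));

	/* The capability check makes up for the missing dumpability. */
	tracee.tracer_capable = true;
	assert(model_access_allowed(&tracer, &tracee));
}
```

Calling model_selfcheck() from a main() exercises all four cases. The point of the model is only that dumpability and capability are re-evaluated on every access, while the tracer registration checks come first and short-circuit the rest.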
* Re: [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed()
2026-05-20 21:48 ` [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed() Christian Brauner (Amutable)
@ 2026-05-22 15:08 ` Oleg Nesterov
2026-05-22 22:32 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2026-05-22 15:08 UTC (permalink / raw)
To: Christian Brauner (Amutable)
Cc: Jann Horn, Linus Torvalds, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On 05/20, Christian Brauner (Amutable) wrote:
>
> +bool ptracer_access_allowed(struct task_struct *tsk)
> +{
> +	const struct task_exec_state *es;
> +
> +	if (!tsk->ptrace)
> +		return false;
> +	if (current != tsk->parent)
> +		return false;
> +	guard(rcu)();

Really minor nit, feel free to ignore...

	guard(rcu)();
	if (ptrace_parent(tsk) != current)
		return false;
	...

With or without this series, I don't really understand why
ptrace_access_vm() needs these security checks... And
ptrace_parent(tsk) == current should always be true?

Oleg.

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed()
2026-05-20 21:48 ` [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed() Christian Brauner (Amutable)
2026-05-22 15:08 ` Oleg Nesterov
@ 2026-05-22 22:32 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-22 22:32 UTC (permalink / raw)
To: Christian Brauner (Amutable), Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On 5/20/26 23:48, Christian Brauner (Amutable) wrote:
> Add a helper that encapsulates all of the logic for checking ptrace
> access and remove open-coded versions in follow-up patches.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  include/linux/ptrace.h |  1 +
>  kernel/ptrace.c        | 27 +++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
>
> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
> index 90507d4afcd6..ef314f7a9ecc 100644
> --- a/include/linux/ptrace.h
> +++ b/include/linux/ptrace.h
> @@ -17,6 +17,7 @@ struct syscall_info {
>  	struct seccomp_data	data;
>  };
>
> +bool ptracer_access_allowed(struct task_struct *tsk);
>  extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
>  			    void *buf, int len, unsigned int gup_flags);
>
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 07398c9c8fe3..0e1f80f73a7f 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -13,6 +13,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/coredump.h>
> +#include <linux/sched/exec_state.h>
>  #include <linux/sched/task.h>
>  #include <linux/errno.h>
>  #include <linux/mm.h>
> @@ -36,6 +37,32 @@
>
>  #include <asm/syscall.h>	/* for syscall_get_* */
>
> +/**
> + * ptracer_access_allowed - may current peek/poke @tsk's address space?
> + * @tsk: tracee
> + *
> + * Per-access check used by ptrace_access_vm() and architecture-specific
> + * tag/register accessors. Returns true iff current is the registered
> + * ptracer of @tsk and either @tsk is owner-dumpable or current holds
> + * CAP_SYS_PTRACE in @tsk's exec namespace. Lighter than
> + * __ptrace_may_access(): it re-validates only dumpability and
> + * capability on every access, without re-running LSM hooks or
> + * cred_cap_issubset() checks performed at attach time.
> + */
> +bool ptracer_access_allowed(struct task_struct *tsk)
> +{
> +	const struct task_exec_state *es;
> +
> +	if (!tsk->ptrace)
> +		return false;
> +	if (current != tsk->parent)
> +		return false;
> +	guard(rcu)();
> +	es = task_exec_state_rcu(tsk);
> +	return READ_ONCE(es->dumpable) == TASK_DUMPABLE_OWNER ||
> +	       ptracer_capable(tsk, es->user_ns);
> +}
> +
>  /*
>   * Access another process' address space via ptrace.
>   * Source/target buffer must be kernel space,
>

Besides the new RCU + old MM handling, this matches what we do in
ptrace_access_vm().

Reviewed-by: David Hildenbrand (arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH RFC v3 4/4] exec_state: relocate dumpable information
2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
` (2 preceding siblings ...)
2026-05-20 21:48 ` [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed() Christian Brauner (Amutable)
@ 2026-05-20 21:48 ` Christian Brauner (Amutable)
2026-05-21 10:05 ` Christian Brauner
` (2 more replies)
3 siblings, 3 replies; 16+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks. It
lives on mm_struct today, which exit_mm() clears from the task long
before the task itself is reaped.

exec_state is anchored to the execve() that established the current
privilege domain. CLONE_VM siblings refcount-share the parent's
exec_state via copy_exec_state(); non-CLONE_VM clones allocate a fresh
exec_state inheriting the parent's dumpable mode and user_ns reference
via task_exec_state_copy(). execve() allocates a fresh instance (via
alloc_task_exec_state() in begin_new_exec()) and installs it under
task_lock + exec_update_lock with task_exec_state_replace(). init_task
uses a static instance.

The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc/<pid>/coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.

coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.

The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().

bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.

mm_struct loses ->user_ns entirely. Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 arch/arm64/kernel/mte.c          |  6 ++----
 drivers/firmware/efi/efi.c       |  1 -
 fs/coredump.c                    | 20 +++++++-------------
 fs/exec.c                        | 39 ++++++++++++++++++++-------------------
 fs/pidfs.c                       | 17 ++++++-----------
 fs/proc/base.c                   | 39 ++++++++++++++++-----------------------
 include/linux/binfmts.h          |  2 ++
 include/linux/coredump.h         |  4 ++++
 include/linux/mm_types.h         |  9 ++++-----
 include/linux/sched.h            |  4 +---
 include/linux/sched/coredump.h   | 36 ++----------------------------------
 include/linux/sched/exec_state.h |  2 --
 init/init_task.c                 | 10 ++++++++++
 kernel/cred.c                    |  3 +--
 kernel/exit.c                    |  1 -
 kernel/fork.c                    | 32 ++++++++++++++++++++++++++------
 kernel/kthread.c                 |  1 -
 kernel/ptrace.c                  | 26 ++++++++------------------
 kernel/sys.c                     |  4 ++--
 mm/init-mm.c                     |  1 -
 20 files changed, 111 insertions(+), 146 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 904ac41f93bc..1a9aad6ef22a 100644
---
a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -8,6 +8,7 @@ #include <linux/kernel.h> #include <linux/mm.h> #include <linux/prctl.h> +#include <linux/ptrace.h> #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/string.h> @@ -537,16 +538,13 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr, if (!mm) return -EPERM; - if (!tsk->ptrace || (current != tsk->parent) || - ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) && - !ptracer_capable(tsk, mm->user_ns))) { + if (!ptracer_access_allowed(tsk)) { mmput(mm); return -EPERM; } ret = __access_remote_tags(mm, addr, kiov, gup_flags); mmput(mm); - return ret; } diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index d04be38f1750..ae78bc021b41 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -73,7 +73,6 @@ struct mm_struct efi_mm = { MMAP_LOCK_INITIALIZER(efi_mm) .page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock), .mmlist = LIST_HEAD_INIT(efi_mm.mmlist), - .user_ns = &init_user_ns, #ifdef CONFIG_SCHED_MM_CID .mm_cid.lock = __RAW_SPIN_LOCK_UNLOCKED(efi_mm.mm_cid.lock), #endif diff --git a/fs/coredump.c b/fs/coredump.c index f5348d5bc441..e943569e9b6d 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -395,8 +395,7 @@ static bool coredump_parse(struct core_name *cn, struct coredump_params *cprm, cred->gid)); break; case 'd': - err = cn_printf(cn, "%d", - __get_dumpable(cprm->mm_flags)); + err = cn_printf(cn, "%d", cprm->dumpable); break; /* signal that caused the coredump */ case 's': @@ -869,11 +868,11 @@ static inline void coredump_sock_shutdown(struct file *file) { } static inline bool coredump_socket(struct core_name *cn, struct coredump_params *cprm) { return false; } #endif -/* cprm->mm_flags contains a stable snapshot of dumpability flags. */ +/* cprm->dumpable is the snapshot of task dumpability at dump start. 
*/ static inline bool coredump_force_suid_safe(const struct coredump_params *cprm) { /* Require nonrelative corefile path and be extra careful. */ - return __get_dumpable(cprm->mm_flags) == TASK_DUMPABLE_ROOT; + return cprm->dumpable == TASK_DUMPABLE_ROOT; } static bool coredump_file(struct core_name *cn, struct coredump_params *cprm, @@ -1085,7 +1084,7 @@ static inline bool coredump_skip(const struct coredump_params *cprm, return true; if (!binfmt->core_dump) return true; - if (!__get_dumpable(cprm->mm_flags)) + if (cprm->dumpable == TASK_DUMPABLE_OFF) return true; return false; } @@ -1170,14 +1169,9 @@ void vfs_coredump(const kernel_siginfo_t *siginfo) struct coredump_params cprm = { .siginfo = siginfo, .limit = rlimit(RLIMIT_CORE), - /* - * We must use the same mm->flags while dumping core to avoid - * inconsistency of bit flags, since this flag is not protected - * by any locks. - * - * Note that we only care about MMF_DUMP* flags. - */ - .mm_flags = __mm_flags_get_dumpable(mm), + /* Snapshot MMF_DUMP_FILTER_* (unlocked) and dumpable for the dump. */ + .mm_flags = __mm_flags_get_word(mm), + .dumpable = task_exec_state_get_dumpable(current), .vma_meta = NULL, .cpu = raw_smp_processor_id(), }; diff --git a/fs/exec.c b/fs/exec.c index f5663bb607d3..9e7f25e2cd41 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -35,6 +35,7 @@ #include <linux/init.h> #include <linux/sched/mm.h> #include <linux/sched/coredump.h> +#include <linux/sched/exec_state.h> #include <linux/sched/signal.h> #include <linux/sched/numa_balancing.h> #include <linux/sched/task.h> @@ -263,6 +264,9 @@ static int bprm_mm_init(struct linux_binprm *bprm) if (!mm) goto err; + /* Staged for would_dump() narrowing; consumed by begin_new_exec(). */ + bprm->user_ns = get_user_ns(current_user_ns()); + /* Save current stack limit for all calculations made during exec. 
*/ task_lock(current->group_leader); bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK]; @@ -834,12 +838,17 @@ EXPORT_SYMBOL(read_code); * On success, this function returns with exec_update_lock * held for writing. */ -static int exec_mmap(struct mm_struct *mm) +static int exec_mmap(struct mm_struct *mm, struct user_namespace *user_ns) { + struct task_exec_state *exec_state __free(put_task_exec_state) = NULL; struct task_struct *tsk; struct mm_struct *old_mm, *active_mm; int ret; + exec_state = alloc_task_exec_state(user_ns); + if (!exec_state) + return -ENOMEM; + /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; @@ -870,6 +879,7 @@ static int exec_mmap(struct mm_struct *mm) tsk->active_mm = mm; tsk->mm = mm; mm_init_cid(mm, tsk); + exec_state = task_exec_state_replace(tsk, exec_state); /* * This prevents preemption while active_mm is being loaded and * it and mm are being updated, which could cause problems for @@ -1145,7 +1155,7 @@ int begin_new_exec(struct linux_binprm * bprm) * Release all of the old mmap stuff */ acct_arg_size(bprm, 0); - retval = exec_mmap(bprm->mm); + retval = exec_mmap(bprm->mm, bprm->user_ns); if (retval) goto out; @@ -1210,9 +1220,9 @@ int begin_new_exec(struct linux_binprm * bprm) if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || !(uid_eq(current_euid(), current_uid()) && gid_eq(current_egid(), current_gid()))) - set_dumpable(current->mm, suid_dumpable); + task_exec_state_set_dumpable(suid_dumpable); else - set_dumpable(current->mm, TASK_DUMPABLE_OWNER); + task_exec_state_set_dumpable(TASK_DUMPABLE_OWNER); perf_event_exec(); @@ -1261,7 +1271,7 @@ int begin_new_exec(struct linux_binprm * bprm) * wait until new credentials are committed * by commit_creds() above */ - if (get_dumpable(me->mm) != TASK_DUMPABLE_OWNER) + if (task_exec_state_get_dumpable(me) != TASK_DUMPABLE_OWNER) perf_event_exit_task(me); /* * cred_guard_mutex must be held at least to this point to prevent @@ 
-1298,14 +1308,14 @@ void would_dump(struct linux_binprm *bprm, struct file *file) struct user_namespace *old, *user_ns; bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP; - /* Ensure mm->user_ns contains the executable */ - user_ns = old = bprm->mm->user_ns; + /* Ensure bprm->user_ns contains the executable. */ + user_ns = old = bprm->user_ns; while ((user_ns != &init_user_ns) && !privileged_wrt_inode_uidgid(user_ns, idmap, inode)) user_ns = user_ns->parent; if (old != user_ns) { - bprm->mm->user_ns = get_user_ns(user_ns); + bprm->user_ns = get_user_ns(user_ns); put_user_ns(old); } } @@ -1375,6 +1385,8 @@ static void free_bprm(struct linux_binprm *bprm) acct_arg_size(bprm, 0); mmput(bprm->mm); } + if (bprm->user_ns) + put_user_ns(bprm->user_ns); free_arg_pages(bprm); if (bprm->cred) { /* in case exec fails before de_thread() succeeds */ @@ -1905,17 +1917,6 @@ void set_binfmt(struct linux_binfmt *new) } EXPORT_SYMBOL(set_binfmt); -/* - * set_dumpable stores three-value TASK_DUMPABLE_* into mm->flags. 
- */ -void set_dumpable(struct mm_struct *mm, int value) -{ - if (WARN_ON((unsigned)value > TASK_DUMPABLE_ROOT)) - return; - - __mm_flags_set_mask_dumpable(mm, value); -} - static inline struct user_arg_ptr native_arg(const char __user *const __user *p) { return (struct user_arg_ptr){.ptr.native = p}; diff --git a/fs/pidfs.c b/fs/pidfs.c index 9cd12f2f004c..b2ff950a096e 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -338,9 +338,9 @@ static inline bool pid_in_current_pidns(const struct pid *pid) return false; } -static __u32 pidfs_coredump_mask(unsigned long mm_flags) +static __u32 pidfs_coredump_mask(enum task_dumpable dumpable) { - switch (__get_dumpable(mm_flags)) { + switch (dumpable) { case TASK_DUMPABLE_OWNER: return PIDFD_COREDUMP_USER; case TASK_DUMPABLE_ROOT: @@ -433,14 +433,9 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) return -ESRCH; if ((mask & PIDFD_INFO_COREDUMP) && !kinfo.coredump_mask) { - guard(task_lock)(task); - if (task->mm) { - unsigned long flags = __mm_flags_get_dumpable(task->mm); - - kinfo.coredump_mask = pidfs_coredump_mask(flags); - kinfo.mask |= PIDFD_INFO_COREDUMP; - /* No coredump actually took place, so no coredump signal. */ - } + kinfo.coredump_mask = pidfs_coredump_mask(task_exec_state_get_dumpable(task)); + kinfo.mask |= PIDFD_INFO_COREDUMP; + /* No coredump actually took place, so no coredump signal. */ } /* Unconditionally return identifiers and credentials, the rest only on request */ @@ -779,7 +774,7 @@ void pidfs_coredump(const struct coredump_params *cprm) VFS_WARN_ON_ONCE(attr == PIDFS_PID_DEAD); /* Note how we were coredumped and that we coredumped. */ - attr->coredump_mask = pidfs_coredump_mask(cprm->mm_flags) | + attr->coredump_mask = pidfs_coredump_mask(cprm->dumpable) | PIDFD_COREDUMPED; /* If coredumping is set to skip we should never end up here. 
*/ VFS_WARN_ON_ONCE(attr->coredump_mask & PIDFD_COREDUMP_SKIP); diff --git a/fs/proc/base.c b/fs/proc/base.c index da0b316befb8..65f56136ec3f 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -91,6 +91,7 @@ #include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/debug.h> +#include <linux/sched/exec_state.h> #include <linux/sched/stat.h> #include <linux/posix-timers.h> #include <linux/time_namespace.h> @@ -1893,7 +1894,6 @@ void task_dump_owner(struct task_struct *task, umode_t mode, cred = __task_cred(task); uid = cred->euid; gid = cred->egid; - rcu_read_unlock(); /* * Before the /proc/pid/status file was created the only way to read @@ -1903,29 +1903,22 @@ void task_dump_owner(struct task_struct *task, umode_t mode, * made this apply to all per process world readable and executable * directories. */ - if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { - struct mm_struct *mm; - task_lock(task); - mm = task->mm; - /* Make non-dumpable tasks owned by some root */ - if (mm) { - if (get_dumpable(mm) != TASK_DUMPABLE_OWNER) { - struct user_namespace *user_ns = mm->user_ns; - - uid = make_kuid(user_ns, 0); - if (!uid_valid(uid)) - uid = GLOBAL_ROOT_UID; - - gid = make_kgid(user_ns, 0); - if (!gid_valid(gid)) - gid = GLOBAL_ROOT_GID; - } - } else { - uid = GLOBAL_ROOT_UID; - gid = GLOBAL_ROOT_GID; + if (mode != (S_IFDIR | S_IRUGO | S_IXUGO)) { + struct task_exec_state *exec_state; + + exec_state = task_exec_state_rcu(task); + if (READ_ONCE(exec_state->dumpable) != TASK_DUMPABLE_OWNER) { + uid = make_kuid(exec_state->user_ns, 0); + if (!uid_valid(uid)) + uid = GLOBAL_ROOT_UID; + + gid = make_kgid(exec_state->user_ns, 0); + if (!gid_valid(gid)) + gid = GLOBAL_ROOT_GID; } - task_unlock(task); } + rcu_read_unlock(); + *ruid = uid; *rgid = gid; } @@ -2965,7 +2958,7 @@ static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf, ret = 0; mm = get_task_mm(task); if (mm) { - unsigned long flags = __mm_flags_get_dumpable(mm); + unsigned long 
flags = __mm_flags_get_word(mm); len = snprintf(buffer, sizeof(buffer), "%08lx\n", ((flags & MMF_DUMP_FILTER_MASK) >> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 65abd5ab8836..a8379f4eee61 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -25,6 +25,8 @@ struct linux_binprm { struct page *page[MAX_ARG_PAGES]; #endif struct mm_struct *mm; + /* user_ns published to task->exec_state at execve, narrowed by would_dump(). */ + struct user_namespace *user_ns; unsigned long p; /* current top of mem */ unsigned int /* Should an execfd be passed to userspace? */ diff --git a/include/linux/coredump.h b/include/linux/coredump.h index 68861da4cf7c..7b38ee2e7913 100644 --- a/include/linux/coredump.h +++ b/include/linux/coredump.h @@ -5,6 +5,7 @@ #include <linux/types.h> #include <linux/mm.h> #include <linux/fs.h> +#include <linux/sched/coredump.h> #include <asm/siginfo.h> #ifdef CONFIG_COREDUMP @@ -20,7 +21,10 @@ struct coredump_params { const kernel_siginfo_t *siginfo; struct file *file; unsigned long limit; + /* MMF_DUMP_FILTER_* bits, snapshot of mm->flags at dump start. */ unsigned long mm_flags; + /* Snapshot of dumpable at dump start. */ + enum task_dumpable dumpable; int cpu; loff_t written; loff_t pos; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 51ea37b2a0aa..9588ce3b16df 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1342,7 +1342,6 @@ struct mm_struct { */ struct task_struct __rcu *owner; #endif - struct user_namespace *user_ns; /* store ref to file /proc/<pid>/exe symlink points to */ struct file __rcu *exe_file; @@ -1907,11 +1906,11 @@ enum { /* mm flags */ /* - * The first two bits represent core dump modes for set-user-ID, - * the modes are TASK_DUMPABLE_* defined in linux/sched/coredump.h + * Bits 0 and 1 were dumpability; that moved to task->exec_state. Reserve + * the bits so MMF_DUMP_FILTER_* positions stay stable for the + * /proc/<pid>/coredump_filter ABI. 
*/ #define MMF_DUMPABLE_BITS 2 -#define MMF_DUMPABLE_MASK (BIT(MMF_DUMPABLE_BITS) - 1) /* coredump filter bits */ #define MMF_DUMP_ANON_PRIVATE 2 #define MMF_DUMP_ANON_SHARED 3 @@ -1972,7 +1971,7 @@ enum { #define MMF_TOPDOWN 31 /* mm searches top down by default */ #define MMF_TOPDOWN_MASK BIT(MMF_TOPDOWN) -#define MMF_INIT_LEGACY_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ +#define MMF_INIT_LEGACY_MASK (MMF_DUMP_FILTER_MASK |\ MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\ MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6674dbf960b5..258cb075478d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -85,6 +85,7 @@ struct seq_file; struct sighand_struct; struct signal_struct; struct task_delay_info; +struct task_exec_state; struct task_group; struct task_struct; struct timespec64; @@ -1004,9 +1005,6 @@ struct task_struct { unsigned sched_rt_mutex:1; #endif - /* Save user-dumpable when mm goes away */ - unsigned user_dumpable:1; - /* Bit to tell TOMOYO we're in execve(): */ unsigned in_execve:1; unsigned in_iowait:1; diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h index ed6547692b61..20957ccde3b5 100644 --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -2,8 +2,6 @@ #ifndef _LINUX_SCHED_COREDUMP_H #define _LINUX_SCHED_COREDUMP_H -#include <linux/mm_types.h> - /* * Task dumpability mode. Gates core dump production and ptrace_attach() * authorization. The numeric values are stable ABI (suid_dumpable @@ -15,37 +13,7 @@ enum task_dumpable { TASK_DUMPABLE_ROOT = 2, /* dump as root; ptrace needs CAP_SYS_PTRACE */ }; -static inline unsigned long __mm_flags_get_dumpable(const struct mm_struct *mm) -{ - /* - * By convention, dumpable bits are contained in first 32 bits of the - * bitmap, so we can simply access this first unsigned long directly. 
- */ - return __mm_flags_get_word(mm); -} - -static inline void __mm_flags_set_mask_dumpable(struct mm_struct *mm, int value) -{ - __mm_flags_set_mask_bits_word(mm, MMF_DUMPABLE_MASK, value); -} - -extern void set_dumpable(struct mm_struct *mm, int value); -/* - * This returns the actual value of the suid_dumpable flag. For things - * that are using this for checking for privilege transitions, it must - * test against TASK_DUMPABLE_OWNER rather than treating it as a boolean - * value. - */ -static inline int __get_dumpable(unsigned long mm_flags) -{ - return mm_flags & MMF_DUMPABLE_MASK; -} - -static inline int get_dumpable(struct mm_struct *mm) -{ - unsigned long flags = __mm_flags_get_dumpable(mm); - - return __get_dumpable(flags); -} +void task_exec_state_set_dumpable(enum task_dumpable value); +enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task); #endif /* _LINUX_SCHED_COREDUMP_H */ diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h index dc5a795cbfe2..23fe4b55e010 100644 --- a/include/linux/sched/exec_state.h +++ b/include/linux/sched/exec_state.h @@ -21,8 +21,6 @@ void put_task_exec_state(struct task_exec_state *exec_state); struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk); struct task_exec_state *task_exec_state_replace(struct task_struct *tsk, struct task_exec_state *exec_state); -void task_exec_state_set_dumpable(enum task_dumpable value); -enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task); int task_exec_state_copy(struct task_struct *tsk); void __init exec_state_init(void); diff --git a/init/init_task.c b/init/init_task.c index b5f48ebdc2b6..47a651b05058 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -7,6 +7,8 @@ #include <linux/sched/rt.h> #include <linux/sched/task.h> #include <linux/sched/ext.h> +#include <linux/sched/exec_state.h> +#include <linux/user_namespace.h> #include <linux/init.h> #include <linux/fs.h> #include <linux/mm.h> @@ -56,6 
+58,13 @@ static struct sighand_struct init_sighand = { .signalfd_wqh = __WAIT_QUEUE_HEAD_INITIALIZER(init_sighand.signalfd_wqh), }; +/* init to 2 - one for init_task, one to ensure it is never freed */ +static struct task_exec_state init_task_exec_state = { + .count = REFCOUNT_INIT(2), + .dumpable = TASK_DUMPABLE_OWNER, + .user_ns = &init_user_ns, +}; + #ifdef CONFIG_SHADOW_CALL_STACK unsigned long init_shadow_call_stack[SCS_SIZE / sizeof(long)] = { [(SCS_SIZE / sizeof(long)) - 1] = SCS_END_MAGIC @@ -113,6 +122,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = { .nr_cpus_allowed= NR_CPUS, .mm = NULL, .active_mm = &init_mm, + .exec_state = &init_task_exec_state, .restart_block = { .fn = do_no_restart_syscall, }, diff --git a/kernel/cred.c b/kernel/cred.c index 12a7b1ce5131..dceb9fa4a4b4 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -384,8 +384,7 @@ int commit_creds(struct cred *new) !uid_eq(old->fsuid, new->fsuid) || !gid_eq(old->fsgid, new->fsgid) || !cred_cap_issubset(old, new)) { - if (task->mm) - set_dumpable(task->mm, suid_dumpable); + task_exec_state_set_dumpable(suid_dumpable); task->pdeath_signal = 0; /* * If a task drops privileges and becomes nondumpable, diff --git a/kernel/exit.c b/kernel/exit.c index 507eda655e8d..9a909993ab1d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -571,7 +571,6 @@ static void exit_mm(void) */ smp_mb__after_spinlock(); local_irq_disable(); - current->user_dumpable = (get_dumpable(mm) == TASK_DUMPABLE_OWNER); current->mm = NULL; membarrier_update_current_mm(NULL); enter_lazy_tlb(mm, current); diff --git a/kernel/fork.c b/kernel/fork.c index 5f3fdfdb14c7..b8b651abce8b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -23,6 +23,7 @@ #include <linux/sched/task_stack.h> #include <linux/sched/cputime.h> #include <linux/sched/ext.h> +#include <linux/sched/exec_state.h> #include <linux/seq_file.h> #include <linux/rtmutex.h> #include <linux/init.h> @@ -555,6 +556,7 @@ void free_task(struct task_struct *tsk) if 
(tsk->flags & PF_KTHREAD) free_kthread_struct(tsk); bpf_task_storage_free(tsk); + put_task_exec_state(tsk->exec_state); free_task_struct(tsk); } EXPORT_SYMBOL(free_task); @@ -731,7 +733,6 @@ void __mmdrop(struct mm_struct *mm) destroy_context(mm); mmu_notifier_subscriptions_destroy(mm); check_mm(mm); - put_user_ns(mm->user_ns); mm_pasid_drop(mm); mm_destroy_cid(mm); percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS); @@ -1072,8 +1073,7 @@ static void mmap_init_lock(struct mm_struct *mm) #endif } -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, - struct user_namespace *user_ns) +static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) { mt_init_flags(&mm->mm_mt, MM_MT_FLAGS); mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock); @@ -1132,7 +1132,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, NR_MM_COUNTERS)) goto fail_pcpu; - mm->user_ns = get_user_ns(user_ns); lru_gen_init_mm(mm); return mm; @@ -1163,7 +1162,7 @@ struct mm_struct *mm_alloc(void) return NULL; memset(mm, 0, sizeof(*mm)); - return mm_init(mm, current, current_user_ns()); + return mm_init(mm, current); } EXPORT_SYMBOL_IF_KUNIT(mm_alloc); @@ -1527,7 +1526,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk, memcpy(mm, oldmm, sizeof(*mm)); - if (!mm_init(mm, tsk, mm->user_ns)) + if (!mm_init(mm, tsk)) goto fail_nomem; uprobe_start_dup_mmap(); @@ -1593,6 +1592,23 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk) return 0; } +static int copy_exec_state(u64 clone_flags, struct task_struct *tsk) +{ + int ret; + struct task_exec_state *exec_state; + + exec_state = rcu_access_pointer(tsk->exec_state); + if (clone_flags & CLONE_VM) { + refcount_inc(&exec_state->count); + return 0; + } + + ret = task_exec_state_copy(tsk); + if (ret) + RCU_INIT_POINTER(tsk->exec_state, NULL); + return ret; +} + static int copy_fs(u64 clone_flags, struct task_struct *tsk) { struct fs_struct *fs = current->fs; @@ -2090,6 
+2106,9 @@ __latent_entropy struct task_struct *copy_process( p = dup_task_struct(current, node); if (!p) goto fork_out; + retval = copy_exec_state(clone_flags, p); + if (retval) + goto bad_fork_free; p->flags &= ~PF_KTHREAD; if (args->kthread) p->flags |= PF_KTHREAD; @@ -3098,6 +3117,7 @@ void __init proc_caches_init(void) sizeof(struct signal_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); + exec_state_init(); files_cachep = kmem_cache_create("files_cache", sizeof(struct files_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, diff --git a/kernel/kthread.c b/kernel/kthread.c index 791210daf8b4..63beb59b7a3d 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1619,7 +1619,6 @@ void kthread_use_mm(struct mm_struct *mm) WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD)); WARN_ON_ONCE(tsk->mm); - WARN_ON_ONCE(!mm->user_ns); /* * It is possible for mm to be the same as tsk->active_mm, but diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 0e1f80f73a7f..ea8a682e837d 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -72,21 +72,14 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, unsigned int gup_flags) { struct mm_struct *mm; - int ret; + int ret = 0; mm = get_task_mm(tsk); if (!mm) return 0; - if (!tsk->ptrace || - (current != tsk->parent) || - ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) && - !ptracer_capable(tsk, mm->user_ns))) { - mmput(mm); - return 0; - } - - ret = access_remote_vm(mm, addr, buf, len, gup_flags); + if (ptracer_access_allowed(tsk)) + ret = access_remote_vm(mm, addr, buf, len, gup_flags); mmput(mm); return ret; @@ -301,16 +294,13 @@ static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode) static bool task_still_dumpable(struct task_struct *task, unsigned int mode) { - struct mm_struct *mm = task->mm; - if (mm) { - if (get_dumpable(mm) == TASK_DUMPABLE_OWNER) - return true; - return ptrace_has_cap(mm->user_ns, mode); - } + const struct task_exec_state *exec_state; - if 
(task->user_dumpable) + guard(rcu)(); + exec_state = task_exec_state_rcu(task); + if (READ_ONCE(exec_state->dumpable) == TASK_DUMPABLE_OWNER) return true; - return ptrace_has_cap(&init_user_ns, mode); + return ptrace_has_cap(exec_state->user_ns, mode); } /* Returns 0 on success, -errno on denial. */ diff --git a/kernel/sys.c b/kernel/sys.c index f1189f719db5..df69bd71de03 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2565,14 +2565,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = put_user(me->pdeath_signal, (int __user *)arg2); break; case PR_GET_DUMPABLE: - error = get_dumpable(me->mm); + error = task_exec_state_get_dumpable(me); break; case PR_SET_DUMPABLE: if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER) { error = -EINVAL; break; } - set_dumpable(me->mm, arg2); + task_exec_state_set_dumpable(arg2); break; case PR_SET_UNALIGN: diff --git a/mm/init-mm.c b/mm/init-mm.c index c5556bb9d5f0..3e792aad7626 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -43,7 +43,6 @@ struct mm_struct init_mm = { .vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait), .mm_lock_seq = SEQCNT_ZERO(init_mm.mm_lock_seq), #endif - .user_ns = &init_user_ns, #ifdef CONFIG_SCHED_MM_CID .mm_cid.lock = __RAW_SPIN_LOCK_UNLOCKED(init_mm.mm_cid.lock), #endif -- 2.47.3 ^ permalink raw reply related [flat|nested] 16+ messages in thread
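The sharing rule that copy_exec_state() applies at fork time (CLONE_VM children take another reference on the parent's exec_state, everything else gets a private copy of the same values) can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the kernel implementation: every sketch_* name is hypothetical, the reference count is a plain int rather than refcount_t, the user namespace is reduced to an integer id, and the model assigns the child's pointer explicitly where the kernel relies on dup_task_struct() having already copied it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct task_exec_state. */
struct sketch_exec_state {
	int count;       /* plain int here; the kernel uses refcount_t */
	int dumpable;    /* TASK_DUMPABLE_* value */
	int user_ns_id;  /* placeholder for the user_ns reference */
};

/* Hypothetical stand-in for the one task_struct field that matters here. */
struct sketch_task {
	struct sketch_exec_state *exec_state;
};

/* Allocate a state with a single reference, or NULL on allocation failure. */
static struct sketch_exec_state *sketch_alloc(int dumpable, int user_ns_id)
{
	struct sketch_exec_state *es = malloc(sizeof(*es));

	if (!es)
		return NULL;
	es->count = 1;
	es->dumpable = dumpable;
	es->user_ns_id = user_ns_id;
	return es;
}

/* Drop one reference and free the state when the last one goes away. */
static void sketch_put(struct sketch_exec_state *es)
{
	if (es && --es->count == 0)
		free(es);
}

/*
 * Model of the fork-time decision: share on CLONE_VM, otherwise give the
 * child its own copy that inherits the parent's dumpable mode and user_ns.
 */
static int sketch_copy_exec_state(bool clone_vm,
				  const struct sketch_task *parent,
				  struct sketch_task *child)
{
	struct sketch_exec_state *es = parent->exec_state;

	if (clone_vm) {
		es->count++;
		child->exec_state = es;
		return 0;
	}

	child->exec_state = sketch_alloc(es->dumpable, es->user_ns_id);
	return child->exec_state ? 0 : -ENOMEM;
}
```

In this model a change made through one CLONE_VM sharer is seen by all of them, while a forked child can diverge freely; each task drops its reference with sketch_put() when it is freed, mirroring the put_task_exec_state() call added to free_task().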
* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
@ 2026-05-21 10:05 ` Christian Brauner
2026-05-21 11:16 ` Jann Horn
2026-05-26 13:07 ` David Hildenbrand (Arm)
2 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-05-21 10:05 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko

> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -384,8 +384,7 @@ int commit_creds(struct cred *new)
>  	    !uid_eq(old->fsuid, new->fsuid) ||
>  	    !gid_eq(old->fsgid, new->fsgid) ||
>  	    !cred_cap_issubset(old, new)) {
> -		if (task->mm)
> -			set_dumpable(task->mm, suid_dumpable);
> +		task_exec_state_set_dumpable(suid_dumpable);

When looking at this I wondered how the hell I ended up removing the mm
check and that was from one of the prior versions. So this check should
stay and I want to leave an explanation why.

So the check is obviously needed for two cases:

(1) kthreads

    Afaict, we don't have any kthreads that do commit_creds(). I think
    that is system call path only.

    (1.1) But kthreads are created with CLONE_VM and thus all start out
          with kthread->mm == NULL and with task->exec_state shared as
          well. So having them end up in commit_creds() with the
          task->mm check is fine as we won't do anything.

    (1.2) kthreads that make use of kthread_use_mm() may _not_ call
          commit_creds() in any form because they would alter
          dumpability for all other kernel threads because while they
          have assumed a new mm, they have not assumed a new exec_state.

(2) user mode helpers

    User mode helpers are created with CLONE_VM and are created as a
    child of a kernel thread but aren't actual kernel threads (in the
    sense that they aren't marked as such, +/- a few other details
    irrelevant to this). So at fork() time their umh->mm == NULL and the
    exec_state is shared with all other kthreads as well.

    User mode helpers _do_ call commit_creds(), but before they have
    gone through exec, so umh->mm is still NULL and the exec_state
    shared with other kthreads is unchanged.

    All umh's go through exec and afterwards they will have both a
    separate mm and a separate exec state and so it's all fine.

So I'm going to fold the following diff which asserts the invariant
that altering global exec_state is not supported:

diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
index 23fe4b55e010..9b61782510b8 100644
--- a/include/linux/sched/exec_state.h
+++ b/include/linux/sched/exec_state.h
@@ -16,6 +16,8 @@ struct task_exec_state {
 	struct rcu_head rcu;
 };
 
+extern struct task_exec_state init_task_exec_state;
+
 struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns);
 void put_task_exec_state(struct task_exec_state *exec_state);
 struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
diff --git a/init/init_task.c b/init/init_task.c
index 47a651b05058..8cad78da469c 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -59,7 +59,7 @@ static struct sighand_struct init_sighand = {
 };
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
-static struct task_exec_state init_task_exec_state = {
+struct task_exec_state init_task_exec_state = {
 	.count = REFCOUNT_INIT(2),
 	.dumpable = TASK_DUMPABLE_OWNER,
 	.user_ns = &init_user_ns,
diff --git a/kernel/cred.c b/kernel/cred.c
index dceb9fa4a4b4..3df4e15bd67f 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,7 +384,9 @@ int commit_creds(struct cred *new)
 	    !uid_eq(old->fsuid, new->fsuid) ||
 	    !gid_eq(old->fsgid, new->fsgid) ||
 	    !cred_cap_issubset(old, new)) {
-		task_exec_state_set_dumpable(suid_dumpable);
+		/* mm-less tasks share init_task's exec_state */
+		if (task->mm)
+			task_exec_state_set_dumpable(suid_dumpable);
 		task->pdeath_signal = 0;
 		/*
 		 * If a task drops privileges and becomes nondumpable,
diff --git a/kernel/exec_state.c b/kernel/exec_state.c
index 814a475fc786..2b7d0262d0f4 100644
--- a/kernel/exec_state.c
+++ b/kernel/exec_state.c
@@ -95,6 +95,9 @@ void task_exec_state_set_dumpable(enum task_dumpable value)
 		value = TASK_DUMPABLE_OFF;
 
 	exec_state = rcu_dereference_protected(current->exec_state, true);
+	/* mm-less tasks share init_task's exec_state; never mutate it */
+	if (WARN_ON_ONCE(exec_state == &init_task_exec_state))
+		return;
 	WRITE_ONCE(exec_state->dumpable, value);
 }
 
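The two layers of protection in that fold (a caller-side task->mm check plus a guard inside the setter) can be sketched as a small userspace model. This is only an illustration under stated assumptions: the *_model names are hypothetical, the setter returns a bool where the kernel version returns void and warns, and the kthread_use_mm() case is modelled by simply flipping has_mm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: simplified stand-ins, not kernel definitions. */
enum dumpable_model { MODEL_DUMPABLE_OFF = 0, MODEL_DUMPABLE_OWNER = 1 };

struct exec_state_model { enum dumpable_model dumpable; };

/* Stand-in for init_task_exec_state, shared by every mm-less task. */
static struct exec_state_model init_exec_state_model = { MODEL_DUMPABLE_OWNER };

struct task_model {
	bool has_mm;				/* models task->mm != NULL */
	struct exec_state_model *exec_state;
};

/*
 * Setter-side guard: refuse to write to the shared static instance.
 * Returns true if the write happened.
 */
static bool set_dumpable_model(struct task_model *t, enum dumpable_model value)
{
	assert(t != NULL && t->exec_state != NULL);
	if (t->exec_state == &init_exec_state_model)
		return false;
	t->exec_state->dumpable = value;
	return true;
}

/* Caller-side check, mirroring the restored "if (task->mm)" in commit_creds(). */
static void commit_creds_model(struct task_model *t, enum dumpable_model suid_dumpable)
{
	if (t->has_mm)
		set_dumpable_model(t, suid_dumpable);
}
```

The caller-side check covers cases (1.1) and (2), where the task has no mm yet. The setter-side guard covers case (1.2): a task that has borrowed an mm but still points at the shared instance reaches the setter and is stopped there.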
* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
2026-05-21 10:05 ` Christian Brauner
@ 2026-05-21 11:16 ` Jann Horn
2026-05-21 13:08 ` Christian Brauner
2026-05-26 13:07 ` David Hildenbrand (Arm)
2 siblings, 1 reply; 16+ messages in thread
From: Jann Horn @ 2026-05-21 11:16 UTC (permalink / raw)
To: Christian Brauner (Amutable)
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 11:49 PM Christian Brauner (Amutable)
<brauner@kernel.org> wrote:
> @@ -2090,6 +2106,9 @@ __latent_entropy struct task_struct *copy_process(
>  	p = dup_task_struct(current, node);
>  	if (!p)
>  		goto fork_out;
> +	retval = copy_exec_state(clone_flags, p);
> +	if (retval)
> +		goto bad_fork_free;

AFAICS for state like this that is torn down in free_task(), normally
dup_task_struct() NULLs out pointers that require refcounting, and
then copy_process() initializes them properly, so that in
copy_process() we can bail out in the middle and have the task_struct
in a sufficiently clean state to go through more or less the normal
free_task() path.

In particular, I'm thinking of the handling of tsk->seccomp.filter -
dup_task_struct() sets `tsk->seccomp.filter = NULL`, and later
copy_process() calls copy_seccomp().

With your implementation, the error handling would break if anyone
tried to add another bailout between dup_task_struct() and
copy_exec_state().

(Sidenote: Ugh, the way dup_task_struct() just copies the entire
task_struct is so ugly...)

>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
>  		p->flags |= PF_KTHREAD;

* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
2026-05-21 11:16 ` Jann Horn
@ 2026-05-21 13:08 ` Christian Brauner
0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2026-05-21 13:08 UTC (permalink / raw)
To: Jann Horn
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Thu, May 21, 2026 at 01:16:23PM +0200, Jann Horn wrote:
> On Wed, May 20, 2026 at 11:49 PM Christian Brauner (Amutable)
> <brauner@kernel.org> wrote:
> > @@ -2090,6 +2106,9 @@ __latent_entropy struct task_struct *copy_process(
> >  	p = dup_task_struct(current, node);
> >  	if (!p)
> >  		goto fork_out;
> > +	retval = copy_exec_state(clone_flags, p);
> > +	if (retval)
> > +		goto bad_fork_free;
>
> AFAICS for state like this that is torn down in free_task(), normally
> dup_task_struct() NULLs out pointers that require refcounting, and
> then copy_process() initializes them properly, so that in
> copy_process() we can bail out in the middle and have the task_struct
> in a sufficiently clean state to go through more or less the normal
> free_task() path.
>
> In particular, I'm thinking of the handling of tsk->seccomp.filter -
> dup_task_struct() sets `tsk->seccomp.filter = NULL`, and later
> copy_process() calls copy_seccomp().
>
> With your implementation, the error handling would break if anyone
> tried to add another bailout between dup_task_struct() and
> copy_exec_state().

I folded:

diff --git a/kernel/fork.c b/kernel/fork.c
index 61a44b33da28..91545ed6463f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -947,6 +947,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->seccomp.filter = NULL;
 #endif
 
+	RCU_INIT_POINTER(tsk->exec_state, NULL);
+
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
@@ -1594,19 +1596,18 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
 
 static int copy_exec_state(u64 clone_flags, struct task_struct *tsk)
 {
-	int ret;
 	struct task_exec_state *exec_state;
 
-	exec_state = rcu_access_pointer(tsk->exec_state);
+	/* CLONE_VM siblings refcount-share the parent's exec_state. */
 	if (clone_flags & CLONE_VM) {
+		exec_state = rcu_dereference_protected(current->exec_state, true);
 		refcount_inc(&exec_state->count);
+		rcu_assign_pointer(tsk->exec_state, exec_state);
 		return 0;
 	}
 
-	ret = task_exec_state_copy(tsk);
-	if (ret)
-		RCU_INIT_POINTER(tsk->exec_state, NULL);
-	return ret;
+	/* Everyone else inherits a fresh copy. */
+	return task_exec_state_copy(tsk);
 }

>
> (Sidenote: Ugh, the way dup_task_struct() just copies the entire
> task_struct is so ugly...)

Yes, I agree.

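The "NULL the pointer in the dup step, take the reference in the copy step" convention discussed above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: the *_model names and MODEL_CLONE_VM are hypothetical, a plain int stands in for refcount_t, and no RCU or locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model only: simplified stand-ins, not kernel definitions. */
struct exec_state_model { int count; };
struct task_model { struct exec_state_model *exec_state; };

#define MODEL_CLONE_VM 0x1u

/* Drop one reference; tolerate NULL so half-built tasks can be torn down. */
static void put_exec_state_model(struct exec_state_model *es)
{
	if (!es)
		return;
	assert(es->count > 0);
	if (--es->count == 0)
		free(es);
}

/* dup step: copy the parent wholesale, then clear the pointer that needs its own reference. */
static void dup_task_model(struct task_model *child, const struct task_model *parent)
{
	*child = *parent;
	child->exec_state = NULL;
}

/* copy step: share for MODEL_CLONE_VM, otherwise allocate a fresh object. Returns 0 or -1. */
static int copy_exec_state_model(unsigned int flags, struct task_model *child,
				 const struct task_model *parent)
{
	struct exec_state_model *es;

	if (flags & MODEL_CLONE_VM) {
		parent->exec_state->count++;
		child->exec_state = parent->exec_state;
		return 0;
	}

	es = malloc(sizeof(*es));
	if (!es)
		return -1;
	es->count = 1;
	child->exec_state = es;
	return 0;
}

/* Teardown is safe at any point after dup_task_model(), even if the copy step never ran. */
static void free_task_model(struct task_model *t)
{
	put_exec_state_model(t->exec_state);
	t->exec_state = NULL;
}
```

Because the child never holds a borrowed pointer, a bailout added between the dup and copy steps would tear the child down without dropping a reference it never took, which is the property Jann asks for.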
* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
2026-05-21 10:05 ` Christian Brauner
2026-05-21 11:16 ` Jann Horn
@ 2026-05-26 13:07 ` David Hildenbrand (Arm)
2 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-26 13:07 UTC (permalink / raw)
To: Christian Brauner (Amutable), Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On 5/20/26 23:48, Christian Brauner (Amutable) wrote:
> The dumpable flag captured at execve() is consulted by
> __ptrace_may_access() and several /proc owner / visibility checks.
> It lives on mm_struct today, which exit_mm() clears from the task
> long before the task itself is reaped.
>
> exec_state is anchored to the execve() that established the current
> privilege domain. CLONE_VM siblings refcount-share the parent's
> exec_state via copy_exec_state(); non-CLONE_VM clones allocate a
> fresh exec_state inheriting the parent's dumpable mode and user_ns
> reference via task_exec_state_copy(). execve() allocates a fresh
> instance (via alloc_task_exec_state() in begin_new_exec()) and
> installs it under task_lock + exec_update_lock with
> task_exec_state_replace(). init_task uses a static instance.
>
> The dumpable mode now lives on task->exec_state->dumpable.
> task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
> removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
> positions remain stable for the /proc/<pid>/coredump_filter ABI. The
> task->user_dumpable cache bit and its assignment in exit_mm() are
> removed; readers go through get_dumpable(task) directly.
>
> coredump_params gains a snapshot field cprm.dumpable, populated from
> get_dumpable(current) at vfs_coredump() entry, replacing the previous
> __get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
> fs/pidfs.c.
>
> The user namespace recorded at execve() is consulted by
> __ptrace_may_access() and by /proc/PID/* owner derivation. Move the
> captured user_ns onto task_exec_state, which stays attached to the task
> past exit_mm() and across exit_files().
>
> bprm grows a user_ns field staged in bprm_mm_init() with the caller's
> user_ns, narrowed by would_dump() to the closest privileged ancestor,
> and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
> free_bprm() releases the staging reference.
>
> mm_struct loses ->user_ns entirely. Initializers in init-mm, efi_mm,
> and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
> __mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
> WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  arch/arm64/kernel/mte.c          |  6 ++----
>  drivers/firmware/efi/efi.c       |  1 -
>  fs/coredump.c                    | 20 +++++++-------------
>  fs/exec.c                        | 39 ++++++++++++++++++++-------------------
>  fs/pidfs.c                       | 17 ++++++-----------
>  fs/proc/base.c                   | 39 ++++++++++++++++-----------------------
>  include/linux/binfmts.h          |  2 ++
>  include/linux/coredump.h         |  4 ++++
>  include/linux/mm_types.h         |  9 ++++-----
>  include/linux/sched.h            |  4 +---
>  include/linux/sched/coredump.h   | 36 ++----------------------------------
>  include/linux/sched/exec_state.h |  2 --
>  init/init_task.c                 | 10 ++++++++++
>  kernel/cred.c                    |  3 +--
>  kernel/exit.c                    |  1 -
>  kernel/fork.c                    | 32 ++++++++++++++++++++++++++------
>  kernel/kthread.c                 |  1 -
>  kernel/ptrace.c                  | 26 ++++++++------------------
>  kernel/sys.c                     |  4 ++--
>  mm/init-mm.c                     |  1 -
>  20 files changed, 111 insertions(+), 146 deletions(-)
>
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 904ac41f93bc..1a9aad6ef22a 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -8,6 +8,7 @@
>  #include <linux/kernel.h>
>  #include <linux/mm.h>
>  #include <linux/prctl.h>
> +#include <linux/ptrace.h>
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
>  #include <linux/string.h>
> @@ -537,16 +538,13 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr,
>  	if (!mm)
>  		return -EPERM;
>
> -	if (!tsk->ptrace || (current != tsk->parent) ||
> -	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
> -	     !ptracer_capable(tsk, mm->user_ns))) {
> +	if (!ptracer_access_allowed(tsk)) {
>  		mmput(mm);
>  		return -EPERM;
>  	}

This reads much nicer. But given that we obtain the MM before
ptracer_access_allowed(), but don't actually need it for the check, it
raises the question whether that order will be strictly required?

IOW, would

	if (!ptracer_access_allowed(tsk))
		return -EPERM;

	mm = get_task_mm(tsk);
	if (!mm)
		return -EPERM;
	...

be similarly valid? I would assume so, which cleans this up further.

>
>  	ret = __access_remote_tags(mm, addr, kiov, gup_flags);
>  	mmput(mm);
> -
>  	return ret;
>  }
>

[...]

>
> -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> -	struct user_namespace *user_ns)
> +static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  {
>  	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
>  	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
> @@ -1132,7 +1132,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  				NR_MM_COUNTERS))
>  		goto fail_pcpu;
>
> -	mm->user_ns = get_user_ns(user_ns);
>  	lru_gen_init_mm(mm);
>  	return mm;
>
> @@ -1163,7 +1162,7 @@ struct mm_struct *mm_alloc(void)
>  		return NULL;
>
>  	memset(mm, 0, sizeof(*mm));
> -	return mm_init(mm, current, current_user_ns());
> +	return mm_init(mm, current);
>  }

That's a very nice cleanup IMHO.

From an MM POV that conceptually looks good to me.

Unfortunately, I have to shut down my computer now to go back to the beach ;)

-- 
Cheers,

David

end of thread, other threads:[~2026-05-26 13:07 UTC | newest]

Thread overview: 16+ messages
2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
2026-05-20 21:48 ` [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable Christian Brauner (Amutable)
2026-05-22 22:14 ` David Hildenbrand (Arm)
2026-05-20 21:48 ` [PATCH RFC v3 2/4] exec: introduce struct task_exec_state Christian Brauner (Amutable)
2026-05-22 15:00 ` Oleg Nesterov
2026-05-26 7:16 ` Christian Brauner
2026-05-26 8:17 ` Oleg Nesterov
2026-05-22 22:21 ` David Hildenbrand (Arm)
2026-05-20 21:48 ` [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed() Christian Brauner (Amutable)
2026-05-22 15:08 ` Oleg Nesterov
2026-05-22 22:32 ` David Hildenbrand (Arm)
2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
2026-05-21 10:05 ` Christian Brauner
2026-05-21 11:16 ` Jann Horn
2026-05-21 13:08 ` Christian Brauner
2026-05-26 13:07 ` David Hildenbrand (Arm)