* [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm()
@ 2026-05-20 14:42 Christian Brauner
2026-05-20 14:42 ` [PATCH RFC v2 1/5] sched/coredump: introduce enum task_dumpable Christian Brauner
` (6 more replies)
0 siblings, 7 replies; 24+ messages in thread
From: Christian Brauner @ 2026-05-20 14:42 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)
This series relocates the dumpable mode and the user_namespace
captured at execve() from mm_struct onto a new per-task
task_exec_state structure that stays attached to the task for its
full lifetime.
__ptrace_may_access() and several /proc owner / visibility checks
need to consult two pieces of state for any observable task,
including zombies that have already gone through exit_mm(): the
dumpable mode and the user namespace captured at execve(). Both
live on mm_struct today, which exit_mm() clears from the task long
before the task is reaped.
A reader that races with do_exit() observes task->mm == NULL and
either fails the check or falls back to init_user_ns - which denies
legitimate access to non-dumpable zombies that were running in a
nested user namespace.
task_exec_state is RCU-protected, refcounted, freed via call_rcu()
from free_task(). init_task uses a static instance with refcount 2
so it is never freed.
mm_struct loses ->user_ns and the dumpability bits in ->flags.
MMF_DUMPABLE_BITS stays reserved so that the MMF_DUMP_FILTER_* layout
exposed via /proc/<pid>/coredump_filter stays stable. task->user_dumpable
and its exit_mm() snapshot are removed.
task_exec_state is the privilege domain established by an execve(), not
a property of the address space. Following the model Linus sketched in
[1]:
- Every clone() variant - thread, process, vfork(), io_uring
worker - refcount-shares the parent's exec_state. No
dup-on-fork.
- Only execve() in the child allocates a fresh instance.
- Credential changes (setresuid, capset, ...) and
prctl(PR_SET_DUMPABLE) update dumpability on the shared
exec_state.
The entire fork subtree of one execve shares one exec_state; a
child enters a new privilege domain only by execve()ing into one.
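The sharing model above can be sketched in user space. This is only an
illustration under stated assumptions, not code from the series: struct
exec_state, model_clone(), model_execve() and model_set_dumpable() are
hypothetical names, a plain int stands in for refcount_t, and the
RCU-deferred free is reduced to a direct free().

```c
#include <stdlib.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum dumpable { DUMPABLE_OFF = 0, DUMPABLE_OWNER = 1, DUMPABLE_ROOT = 2 };

struct exec_state {
	int count;		/* stands in for refcount_t */
	enum dumpable dumpable;
};

struct task {
	struct exec_state *es;
};

/* Allocate a state holding a single reference. */
static struct exec_state *exec_state_alloc(enum dumpable mode)
{
	struct exec_state *es = malloc(sizeof(*es));

	if (!es)
		abort();
	es->count = 1;
	es->dumpable = mode;
	return es;
}

/* Drop one reference; the kernel version defers the free via call_rcu(). */
static void exec_state_put(struct exec_state *es)
{
	if (es && --es->count == 0)
		free(es);
}

/* clone() in any flavour: share the parent's state, no copy. */
static void model_clone(struct task *child, const struct task *parent)
{
	parent->es->count++;
	child->es = parent->es;
}

/* execve(): swap in a fresh state, then drop the reference to the old one. */
static void model_execve(struct task *t, enum dumpable mode)
{
	struct exec_state *old = t->es;

	t->es = exec_state_alloc(mode);
	exec_state_put(old);
}

/* Credential change or prctl(PR_SET_DUMPABLE): write to the shared state. */
static void model_set_dumpable(struct task *t, enum dumpable mode)
{
	t->es->dumpable = mode;
}
```

Cloning twice from one parent leaves three tasks on one state, so a
model_set_dumpable() call on any of them is visible through all three.
After model_execve() on one task, that task holds its own state and no
longer observes later writes by the others.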
Behavioral changes:
(1) Dumpability lowering on credential changes now propagates
across the fork subtree.
Pre-series, set_dumpable() on commit_creds() targeted
mm->flags, which was per-mm: shared by CLONE_VM threads but
private to fork()-without-CLONE_VM children. Under the new
model the write targets the shared task_exec_state, so a
privilege drop in any task in the subtree lowers dumpability
for the entire subtree, including non-CLONE_VM siblings.
Same-uid ptrace shedding and /proc visibility for the
"root-launched daemon drops to a service uid" pattern (sshd,
polkitd, dbus-daemon, NetworkManager, ...) is preserved.
(3) Kernel threads that briefly use a user mm via
kthread_use_mm() no longer inherit dumpability from the
borrowed mm. Kthreads are not ptraceable (PF_KTHREAD
short-circuits __ptrace_may_access), so this is observable
only via /proc surfaces that a sufficiently privileged reader
can reach.
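Behavioral change (1) should be observable from user space. The probe
below is a rough sketch, not a selftest from the series: the function
name is hypothetical, and it uses prctl(PR_SET_DUMPABLE) rather than a
credential change because the cover letter lists both as writers of the
shared exec_state. It forks a child without CLONE_VM, lets the child
lower its dumpability, and then reads the parent's value.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns the parent's PR_GET_DUMPABLE value after a
 * fork()ed child has called prctl(PR_SET_DUMPABLE, 0), or -1 on error.
 */
static int parent_dumpable_after_child_drop(void)
{
	int status;
	pid_t pid;

	/* Start from a known state in the parent. */
	if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: lower dumpability and report success via exit status. */
		_exit(prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0 ? 1 : 0);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* Parent: has the child's write become visible here? */
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

On a kernel without this series the parent keeps its own per-mm value,
so the function returns 1. With the series applied, and assuming the
prctl() write really lands on the state shared across fork(), it would
be expected to return 0. That expectation is a reading of the cover
letter and has not been verified against the patches.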
[1] https://lore.kernel.org/r/CAHk-=wj+NgoDH3GSicJ140SV8OoDd71pLmL3fgFEsTcgoMC6Og@mail.gmail.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Changes in v2:
- Drop dup-on-fork for non-CLONE_VM clones: every clone() variant
refcount-shares the parent's task_exec_state; only execve()
allocates a fresh one. See "Behavioral changes" in the cover
letter for the implications.
- Switch commit_creds() to update dumpability on the new
task_exec_state (instead of dropping the set_dumpable() call
entirely as in v1). Drops the explicit smp_wmb()/smp_rmb() pair;
RCU acquire/release on the cred pointer provides the ordering.
- Link to v1: https://patch.msgid.link/20260516-work-exit_mm-v1-1-76bcc7c2439d@kernel.org
---
Christian Brauner (5):
sched/coredump: introduce enum task_dumpable
exec: introduce struct task_exec_state and relocate dumpable
ptrace: add ptracer_access_allowed()
exec_state: relocate dumpable information
cred: switch dumpability lowering to task_exec_state
arch/arm64/kernel/mte.c | 6 +--
drivers/firmware/efi/efi.c | 1 -
fs/coredump.c | 22 +++-----
fs/exec.c | 39 +++++++-------
fs/pidfs.c | 22 ++++----
fs/proc/base.c | 39 ++++++--------
include/linux/binfmts.h | 2 +
include/linux/coredump.h | 4 ++
include/linux/mm_types.h | 9 ++--
include/linux/ptrace.h | 1 +
include/linux/sched.h | 7 +--
include/linux/sched/coredump.h | 47 ++++-------------
include/linux/sched/exec_state.h | 31 +++++++++++
init/init_task.c | 10 ++++
kernel/Makefile | 2 +-
kernel/cred.c | 25 +++++----
kernel/exec_state.c | 108 +++++++++++++++++++++++++++++++++++++++
kernel/exit.c | 1 -
kernel/fork.c | 15 +++---
kernel/kthread.c | 1 -
kernel/ptrace.c | 62 ++++++++++++----------
kernel/sys.c | 6 +--
mm/init-mm.c | 1 -
23 files changed, 289 insertions(+), 172 deletions(-)
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260520-work-task_exec_state-83209d8b3e53
* [PATCH RFC v2 1/5] sched/coredump: introduce enum task_dumpable
From: Christian Brauner @ 2026-05-20 14:42 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
enum task_dumpable. Numeric values are preserved (fs.suid_dumpable
sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
no behavioral change.

Subsequent commits relocate dumpability onto a per-task structure
where the enum type will allow stronger type-checking on the new API.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 arch/arm64/kernel/mte.c        |  2 +-
 fs/coredump.c                  |  4 ++--
 fs/exec.c                      |  8 ++++----
 fs/pidfs.c                     |  6 +++---
 fs/proc/base.c                 |  2 +-
 include/linux/mm_types.h       |  2 +-
 include/linux/sched/coredump.h | 15 +++++++++++----
 kernel/exit.c                  |  2 +-
 kernel/ptrace.c                |  4 ++--
 kernel/sys.c                   |  2 +-
 10 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 6874b16d0657..904ac41f93bc 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -538,7 +538,7 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr,
 		return -EPERM;

 	if (!tsk->ptrace || (current != tsk->parent) ||
-	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
 	     !ptracer_capable(tsk, mm->user_ns))) {
 		mmput(mm);
 		return -EPERM;
diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f458e..f5348d5bc441 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -873,7 +873,7 @@ static inline bool coredump_socket(struct core_name *cn, struct coredump_params
 static inline bool coredump_force_suid_safe(const struct coredump_params *cprm)
 {
 	/* Require nonrelative corefile path and be extra careful. */
-	return __get_dumpable(cprm->mm_flags) == SUID_DUMP_ROOT;
+	return __get_dumpable(cprm->mm_flags) == TASK_DUMPABLE_ROOT;
 }

 static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
@@ -1419,7 +1419,7 @@ EXPORT_SYMBOL(dump_align);

 void validate_coredump_safety(void)
 {
-	if (suid_dumpable == SUID_DUMP_ROOT &&
+	if (suid_dumpable == TASK_DUMPABLE_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {

 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
diff --git a/fs/exec.c b/fs/exec.c
index ba12b4c466f6..f5663bb607d3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1212,7 +1212,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	      gid_eq(current_egid(), current_gid())))
 		set_dumpable(current->mm, suid_dumpable);
 	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
+		set_dumpable(current->mm, TASK_DUMPABLE_OWNER);

 	perf_event_exec();

@@ -1261,7 +1261,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * wait until new credentials are committed
 	 * by commit_creds() above
 	 */
-	if (get_dumpable(me->mm) != SUID_DUMP_USER)
+	if (get_dumpable(me->mm) != TASK_DUMPABLE_OWNER)
 		perf_event_exit_task(me);
 	/*
 	 * cred_guard_mutex must be held at least to this point to prevent
@@ -1906,11 +1906,11 @@ void set_binfmt(struct linux_binfmt *new)
 EXPORT_SYMBOL(set_binfmt);

 /*
- * set_dumpable stores three-value SUID_DUMP_* into mm->flags.
+ * set_dumpable stores three-value TASK_DUMPABLE_* into mm->flags.
  */
 void set_dumpable(struct mm_struct *mm, int value)
 {
-	if (WARN_ON((unsigned)value > SUID_DUMP_ROOT))
+	if (WARN_ON((unsigned)value > TASK_DUMPABLE_ROOT))
 		return;

 	__mm_flags_set_mask_dumpable(mm, value);
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 1cce4f34a051..9cd12f2f004c 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -341,11 +341,11 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 static __u32 pidfs_coredump_mask(unsigned long mm_flags)
 {
 	switch (__get_dumpable(mm_flags)) {
-	case SUID_DUMP_USER:
+	case TASK_DUMPABLE_OWNER:
 		return PIDFD_COREDUMP_USER;
-	case SUID_DUMP_ROOT:
+	case TASK_DUMPABLE_ROOT:
 		return PIDFD_COREDUMP_ROOT;
-	case SUID_DUMP_DISABLE:
+	case TASK_DUMPABLE_OFF:
 		return PIDFD_COREDUMP_SKIP;
 	default:
 		WARN_ON_ONCE(true);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c894..da0b316befb8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1909,7 +1909,7 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	mm = task->mm;
 	/* Make non-dumpable tasks owned by some root */
 	if (mm) {
-		if (get_dumpable(mm) != SUID_DUMP_USER) {
+		if (get_dumpable(mm) != TASK_DUMPABLE_OWNER) {
 			struct user_namespace *user_ns = mm->user_ns;

 			uid = make_kuid(user_ns, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..51ea37b2a0aa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1908,7 +1908,7 @@ enum {

 /*
  * The first two bits represent core dump modes for set-user-ID,
- * the modes are SUID_DUMP_* defined in linux/sched/coredump.h
+ * the modes are TASK_DUMPABLE_* defined in linux/sched/coredump.h
  */
 #define MMF_DUMPABLE_BITS 2
 #define MMF_DUMPABLE_MASK (BIT(MMF_DUMPABLE_BITS) - 1)
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 624fda17a785..ed6547692b61 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -4,9 +4,16 @@

 #include <linux/mm_types.h>

-#define SUID_DUMP_DISABLE	0	/* No setuid dumping */
-#define SUID_DUMP_USER		1	/* Dump as user of process */
-#define SUID_DUMP_ROOT		2	/* Dump as root */
+/*
+ * Task dumpability mode. Gates core dump production and ptrace_attach()
+ * authorization. The numeric values are stable ABI (suid_dumpable
+ * sysctl, prctl(PR_SET_DUMPABLE)); do not renumber.
+ */
+enum task_dumpable {
+	TASK_DUMPABLE_OFF	= 0,	/* no dump; ptrace needs CAP_SYS_PTRACE */
+	TASK_DUMPABLE_OWNER	= 1,	/* default; dump and ptrace by uid match */
+	TASK_DUMPABLE_ROOT	= 2,	/* dump as root; ptrace needs CAP_SYS_PTRACE */
+};

 static inline unsigned long __mm_flags_get_dumpable(const struct mm_struct *mm)
 {
@@ -26,7 +33,7 @@ extern void set_dumpable(struct mm_struct *mm, int value);
 /*
  * This returns the actual value of the suid_dumpable flag. For things
  * that are using this for checking for privilege transitions, it must
- * test against SUID_DUMP_USER rather than treating it as a boolean
+ * test against TASK_DUMPABLE_OWNER rather than treating it as a boolean
  * value.
  */
 static inline int __get_dumpable(unsigned long mm_flags)
diff --git a/kernel/exit.c b/kernel/exit.c
index f50d73c272d6..507eda655e8d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -571,7 +571,7 @@ static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
-	current->user_dumpable = (get_dumpable(mm) == SUID_DUMP_USER);
+	current->user_dumpable = (get_dumpable(mm) == TASK_DUMPABLE_OWNER);
 	current->mm = NULL;
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 130043bfc209..07398c9c8fe3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -53,7 +53,7 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,

 	if (!tsk->ptrace ||
 	    (current != tsk->parent) ||
-	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
 	     !ptracer_capable(tsk, mm->user_ns))) {
 		mmput(mm);
 		return 0;
@@ -276,7 +276,7 @@ static bool task_still_dumpable(struct task_struct *task, unsigned int mode)
 {
 	struct mm_struct *mm = task->mm;
 	if (mm) {
-		if (get_dumpable(mm) == SUID_DUMP_USER)
+		if (get_dumpable(mm) == TASK_DUMPABLE_OWNER)
 			return true;
 		return ptrace_has_cap(mm->user_ns, mode);
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..f1189f719db5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2568,7 +2568,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = get_dumpable(me->mm);
 		break;
 	case PR_SET_DUMPABLE:
-		if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
+		if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER) {
 			error = -EINVAL;
 			break;
 		}
--
2.47.3
* Re: [PATCH RFC v2 1/5] sched/coredump: introduce enum task_dumpable
From: Jann Horn @ 2026-05-20 16:27 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote:
> Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
> enum task_dumpable. Numeric values are preserved (fs.suid_dumpable
> sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
> no behavioral change.
>
> Subsequent commits relocate dumpability onto a per-task structure
> where the enum type will allow stronger type-checking on the new API.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>
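The ABI constraint stated in patch 1/5, that the numeric values must
not change, can be pinned down with compile-time checks. The sketch
below copies the enum values from the diff into a user-space file;
check_pr_set_dumpable_arg() is a hypothetical helper modelled on the
PR_SET_DUMPABLE check in the kernel/sys.c hunk, not a kernel function.

```c
#include <assert.h>
#include <errno.h>

/* Values copied from the enum added in include/linux/sched/coredump.h. */
enum task_dumpable {
	TASK_DUMPABLE_OFF   = 0,
	TASK_DUMPABLE_OWNER = 1,
	TASK_DUMPABLE_ROOT  = 2,
};

/* The removed SUID_DUMP_* constants were 0, 1 and 2; these must match. */
static_assert(TASK_DUMPABLE_OFF == 0, "was SUID_DUMP_DISABLE");
static_assert(TASK_DUMPABLE_OWNER == 1, "was SUID_DUMP_USER");
static_assert(TASK_DUMPABLE_ROOT == 2, "was SUID_DUMP_ROOT");

/*
 * Hypothetical helper: prctl(PR_SET_DUMPABLE) accepts only OFF and OWNER.
 * ROOT can be reached through the suid_dumpable sysctl but not via prctl().
 */
static int check_pr_set_dumpable_arg(unsigned long arg2)
{
	if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER)
		return -EINVAL;
	return 0;
}
```

Because the checks are static, any renumbering of the enum would fail at
build time rather than silently changing what user space observes.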
* [PATCH RFC v2 2/5] exec: introduce struct task_exec_state and relocate dumpable
From: Christian Brauner @ 2026-05-20 14:42 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

Introduce struct task_exec_state, a per-task RCU-protected structure
that holds the dumpable mode and stays attached to the task for its
full lifetime.

task_exec_state_rcu() is the canonical reader: asserts RCU or
task_lock is held, WARNs on a NULL state, returns the
rcu_dereference()'d pointer.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/sched.h            |   3 ++
 include/linux/sched/exec_state.h |  31 ++++++++++++
 kernel/Makefile                  |   2 +-
 kernel/exec_state.c              | 105 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 140 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..d895c3ff2154 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -962,6 +962,9 @@ struct task_struct {
 	struct mm_struct *mm;
 	struct mm_struct *active_mm;

+	/* Exec-time state outliving exit_mm(); see <linux/sched/exec_state.h>. */
+	struct task_exec_state __rcu *exec_state;
+
 	int exit_state;
 	int exit_code;
 	int exit_signal;
diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
new file mode 100644
index 000000000000..7a267efc34d3
--- /dev/null
+++ b/include/linux/sched/exec_state.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_EXEC_STATE_H
+#define _LINUX_SCHED_EXEC_STATE_H
+
+#include <linux/init.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/sched/coredump.h>
+#include <linux/user_namespace.h>
+
+struct task_exec_state {
+	refcount_t count;
+	enum task_dumpable dumpable;
+	struct user_namespace *user_ns;
+	struct rcu_head rcu;
+};
+
+struct task_exec_state *alloc_task_exec_state(void);
+void put_task_exec_state(struct task_exec_state *es);
+struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
+struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
+						struct task_exec_state *exec_state);
+void task_exec_state_set_dumpable(enum task_dumpable value);
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
+void copy_exec_state(struct task_struct *tsk);
+void __init exec_state_init(void);
+
+DEFINE_FREE(put_task_exec_state, struct task_exec_state *,
+	    if (_T) put_task_exec_state(_T))
+
+#endif /* _LINUX_SCHED_EXEC_STATE_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..1e1a31673577 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -3,7 +3,7 @@
 # Makefile for the linux kernel.
 #

-obj-y = fork.o exec_domain.o panic.o \
+obj-y = fork.o exec_domain.o exec_state.o panic.o \
 	cpu.o exit.o softirq.o resource.o \
 	sysctl.o capability.o ptrace.o user.o \
 	signal.o sys.o umh.o workqueue.o pid.o task_work.o \
diff --git a/kernel/exec_state.c b/kernel/exec_state.c
new file mode 100644
index 000000000000..85178b1d2c57
--- /dev/null
+++ b/kernel/exec_state.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/sched.h>
+#include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+
+static struct kmem_cache *task_exec_state_cachep;
+
+static void __free_task_exec_state(struct rcu_head *rcu)
+{
+	struct task_exec_state *es = container_of(rcu, struct task_exec_state, rcu);
+
+	kmem_cache_free(task_exec_state_cachep, es);
+}
+
+void put_task_exec_state(struct task_exec_state *es)
+{
+	if (es && refcount_dec_and_test(&es->count))
+		call_rcu(&es->rcu, __free_task_exec_state);
+}
+
+struct task_exec_state *alloc_task_exec_state(void)
+{
+	struct task_exec_state *es;
+
+	es = kmem_cache_alloc(task_exec_state_cachep, GFP_KERNEL);
+	if (!es)
+		return NULL;
+	refcount_set(&es->count, 1);
+	es->dumpable = TASK_DUMPABLE_OFF;
+	return es;
+}
+
+struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk)
+{
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && !lockdep_is_held(&tsk->alloc_lock),
+			 "task_exec_state_rcu() requires RCU or task_lock");
+	WARN_ON_ONCE(!tsk->exec_state);
+	return rcu_dereference(tsk->exec_state);
+}
+
+struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
+						struct task_exec_state *exec_state)
+{
+	/*
+	 * Updates must hold both locks so callers needing a consistent
+	 * snapshot of mm + dumpability are covered.
+	 */
+	lockdep_assert_held(&tsk->alloc_lock);
+	lockdep_assert_held_write(&tsk->signal->exec_update_lock);
+
+	return rcu_replace_pointer(tsk->exec_state, exec_state, true);
+}
+
+/*
+ * exec_state is anchored to the execve() that established the current
+ * privilege domain. All clone() variants refcount-share it; only a
+ * subsequent execve() in the child swaps in a fresh one.
+ */
+void copy_exec_state(struct task_struct *tsk)
+{
+	struct task_exec_state *es = current->exec_state;
+
+	refcount_inc(&es->count);
+	rcu_assign_pointer(tsk->exec_state, es);
+}
+
+/*
+ * Store TASK_DUMPABLE_* on current->exec_state. All callers
+ * (commit_creds, begin_new_exec, prctl(PR_SET_DUMPABLE)) act on the
+ * running task, which guarantees ->exec_state is allocated and cannot
+ * be replaced under us.
+ */
+void task_exec_state_set_dumpable(enum task_dumpable value)
+{
+	struct task_exec_state *es;
+
+	if (WARN_ON(value > TASK_DUMPABLE_ROOT))
+		value = TASK_DUMPABLE_OFF;
+
+	es = rcu_dereference_protected(current->exec_state, true);
+	WRITE_ONCE(es->dumpable, value);
+}
+
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task)
+{
+	struct task_exec_state *es;
+
+	guard(rcu)();
+	es = rcu_dereference(task->exec_state);
+	return READ_ONCE(es->dumpable);
+}
+
+void __init exec_state_init(void)
+{
+	task_exec_state_cachep = kmem_cache_create("task_exec_state",
+			sizeof(struct task_exec_state), 0,
+			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT,
+			NULL);
+}
--
2.47.3
* Re: [PATCH RFC v2 2/5] exec: introduce struct task_exec_state and relocate dumpable
From: Linus Torvalds @ 2026-05-20 15:14 UTC (permalink / raw)
To: Christian Brauner
Cc: Jann Horn, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On Wed, 20 May 2026 at 09:43, Christian Brauner <brauner@kernel.org> wrote:
>
> +struct task_exec_state {
> +	refcount_t count;
> +	enum task_dumpable dumpable;
> +	struct user_namespace *user_ns;
> +	struct rcu_head rcu;
> +};

I like the whole series, and it all gets acks from me, and I just
wanted to say that maybe we would want to - possibly later - add some
more of the bprm information into this structure.

I'm thinking of bprm->cred in particular, because in some respects
that's really the core piece of the permission puzzle: "this was the
credentials we got at execve() time".

Because thinking of ptrace_may_access(), wouldn't it be lovely if it
just compared *those* creds against the tracer? The other creds can
change at any time, and we literally have that dumpability check there
because we don't want to compare against some new lowered credentials.

Checking the exec-time credentials would make all of those issues just
go away, and essentially make the whole "set dumpable" almost
irrelevant (it would still remain for people who want to restrict
dumpability manually).

But I think that is all an independent expansion of this series, which
looks good to me.

Ack.

Linus
* Re: [PATCH RFC v2 2/5] exec: introduce struct task_exec_state and relocate dumpable
From: Christian Brauner @ 2026-05-20 15:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 10:14:17AM -0500, Linus Torvalds wrote:
> On Wed, 20 May 2026 at 09:43, Christian Brauner <brauner@kernel.org> wrote:
> >
> > +struct task_exec_state {
> > +	refcount_t count;
> > +	enum task_dumpable dumpable;
> > +	struct user_namespace *user_ns;
> > +	struct rcu_head rcu;
> > +};
>
> I like the whole series, and it all gets acks from me, and I just
> wanted to say that maybe we would want to - possibly later - add some
> more of the bprm information into this structure.
>
> I'm thinking of bprm->cred in particular, because in some respects

Yeah, I spoke to Jann off-list about the same idea of a creds-at-exec
primitive that would come in handy in various ways... So +1 from me.
* Re: [PATCH RFC v2 2/5] exec: introduce struct task_exec_state and relocate dumpable
From: Jann Horn @ 2026-05-20 16:27 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote:
> Introduce struct task_exec_state, a per-task RCU-protected structure
> that holds the dumpable mode and stays attached to the task for its
> full lifetime.
>
> task_exec_state_rcu() is the canonical reader: asserts RCU or
> task_lock is held, WARNs on a NULL state, returns the
> rcu_dereference()'d pointer.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

(you signed off twice with different names, idk if that's intentional)

Reviewed-by: Jann Horn <jannh@google.com>

> ---
>  include/linux/sched.h            |   3 ++
>  include/linux/sched/exec_state.h |  31 ++++++++++++
>  kernel/Makefile                  |   2 +-
>  kernel/exec_state.c              | 105 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 140 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee06cba5c6f5..d895c3ff2154 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -962,6 +962,9 @@ struct task_struct {
>  	struct mm_struct *mm;
>  	struct mm_struct *active_mm;
>
> +	/* Exec-time state outliving exit_mm(); see <linux/sched/exec_state.h>. */
> +	struct task_exec_state __rcu *exec_state;
> +
>  	int exit_state;
>  	int exit_code;
>  	int exit_signal;
> diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
> new file mode 100644
> index 000000000000..7a267efc34d3
> --- /dev/null
> +++ b/include/linux/sched/exec_state.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_SCHED_EXEC_STATE_H
> +#define _LINUX_SCHED_EXEC_STATE_H
> +
> +#include <linux/init.h>
> +#include <linux/rcupdate.h>
> +#include <linux/refcount.h>
> +#include <linux/sched/coredump.h>
> +#include <linux/user_namespace.h>
> +
> +struct task_exec_state {
> +	refcount_t count;
> +	enum task_dumpable dumpable;
> +	struct user_namespace *user_ns;
> +	struct rcu_head rcu;
> +};
> +
> +struct task_exec_state *alloc_task_exec_state(void);
> +void put_task_exec_state(struct task_exec_state *es);
> +struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
> +struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
> +						struct task_exec_state *exec_state);
> +void task_exec_state_set_dumpable(enum task_dumpable value);
> +enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
> +void copy_exec_state(struct task_struct *tsk);
> +void __init exec_state_init(void);
> +
> +DEFINE_FREE(put_task_exec_state, struct task_exec_state *,
> +	    if (_T) put_task_exec_state(_T))

nit: you have an "if (_T)" check here, but put_task_exec_state()
contains another such check, is that intentional?

> diff --git a/kernel/Makefile b/kernel/Makefile
> index 6785982013dc..1e1a31673577 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -3,7 +3,7 @@
>  # Makefile for the linux kernel.
>  #
>
> -obj-y = fork.o exec_domain.o panic.o \
> +obj-y = fork.o exec_domain.o exec_state.o panic.o \
>  	cpu.o exit.o softirq.o resource.o \
>  	sysctl.o capability.o ptrace.o user.o \
>  	signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> diff --git a/kernel/exec_state.c b/kernel/exec_state.c
> new file mode 100644
> index 000000000000..85178b1d2c57
> --- /dev/null
> +++ b/kernel/exec_state.c
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/init.h>
> +#include <linux/rcupdate.h>
> +#include <linux/refcount.h>
> +#include <linux/sched.h>
> +#include <linux/sched/coredump.h>
> +#include <linux/sched/exec_state.h>
> +#include <linux/sched/signal.h>
> +#include <linux/slab.h>
> +
> +static struct kmem_cache *task_exec_state_cachep;
> +
> +static void __free_task_exec_state(struct rcu_head *rcu)
> +{
> +	struct task_exec_state *es = container_of(rcu, struct task_exec_state, rcu);
> +
> +	kmem_cache_free(task_exec_state_cachep, es);
> +}
> +
> +void put_task_exec_state(struct task_exec_state *es)
> +{
> +	if (es && refcount_dec_and_test(&es->count))
> +		call_rcu(&es->rcu, __free_task_exec_state);
> +}

One somewhat weird aspect of the use of RCU in the exit path is that,
with the series applied, the system will probably need to go through
even more RCU grace periods after a task exits to clean it up fully:

1. One RCU grace period for going from put_task_struct_rcu_user() to
   delayed_put_task_struct().
2. One RCU grace period for going from put_task_struct() to
   __put_task_struct_rcu_cb() (which according to the comment in
   put_task_struct() is actually just done as an easy way to defer
   work to task context).
3. One RCU grace period for going from put_task_exec_state() to
   __free_task_exec_state().

Not really a problem, just a weird quirk...

> +struct task_exec_state *alloc_task_exec_state(void)
> +{
> +	struct task_exec_state *es;
> +
> +	es = kmem_cache_alloc(task_exec_state_cachep, GFP_KERNEL);
> +	if (!es)
> +		return NULL;
> +	refcount_set(&es->count, 1);
> +	es->dumpable = TASK_DUMPABLE_OFF;
> +	return es;
> +}
> +
> +struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk)
> +{
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && !lockdep_is_held(&tsk->alloc_lock),
> +			 "task_exec_state_rcu() requires RCU or task_lock");
> +	WARN_ON_ONCE(!tsk->exec_state);

This will generate a second memory load because rcu_dereference() does
a READ_ONCE() under the hood.

> +	return rcu_dereference(tsk->exec_state);

This should be equivalent and nicer:

```
struct task_exec_state *result;

result = rcu_dereference_check(tsk->exec_state,
			       lockdep_is_held(&tsk->alloc_lock));
WARN_ON_ONCE(!result);
return result;
```
* Re: [PATCH RFC v2 2/5] exec: introduce struct task_exec_state and relocate dumpable
From: Christian Brauner @ 2026-05-20 19:47 UTC (permalink / raw)
To: Jann Horn
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 06:27:56PM +0200, Jann Horn wrote:
> On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote:
> > Introduce struct task_exec_state, a per-task RCU-protected structure
> > that holds the dumpable mode and stays attached to the task for its
> > full lifetime.
> >
> > task_exec_state_rcu() is the canonical reader: asserts RCU or
> > task_lock is held, WARNs on a NULL state, returns the
> > rcu_dereference()'d pointer.
> >
> > Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
>
> (you signed off twice with different names, idk if that's intentional)

No, that's just incompetence.
* [PATCH RFC v2 3/5] ptrace: add ptracer_access_allowed()
2026-05-20 14:42 [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner
2026-05-20 14:42 ` [PATCH RFC v2 1/5] sched/coredump: introduce enum task_dumpable Christian Brauner
2026-05-20 14:42 ` [PATCH RFC v2 2/5] exec: introduce struct task_exec_state and relocate dumpable Christian Brauner
@ 2026-05-20 14:42 ` Christian Brauner
2026-05-20 16:28 ` Jann Horn
2026-05-20 14:42 ` [PATCH RFC v2 4/5] exec_state: relocate dumpable information Christian Brauner
` (3 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2026-05-20 14:42 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

Add a helper that encapsulates all of the logic for checking ptrace
access and remove open-coded versions in follow-up patches.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/ptrace.h |  1 +
 kernel/ptrace.c        | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 90507d4afcd6..ef314f7a9ecc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -17,6 +17,7 @@ struct syscall_info {
 	struct seccomp_data	data;
 };
 
+bool ptracer_access_allowed(struct task_struct *tsk);
 extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 			    void *buf, int len, unsigned int gup_flags);
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 07398c9c8fe3..2dc7d01baba0 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/task.h>
 #include <linux/errno.h>
 #include <linux/mm.h>
@@ -36,6 +37,31 @@
 
 #include <asm/syscall.h>	/* for syscall_get_* */
 
+/**
+ * ptracer_access_allowed - may current peek/poke @tsk's address space?
+ * @tsk: tracee
+ *
+ * Per-access check used by ptrace_access_vm() and architecture-specific
+ * tag/register accessors. Returns true iff current is the registered
+ * ptracer of @tsk and either @tsk is owner-dumpable or current holds
+ * CAP_SYS_PTRACE in @tsk's exec namespace. Stricter than the up-front
+ * ptrace_may_access() check at attach time; this re-validates on every
+ * memory access so privilege changes are observed promptly.
+ */
+bool ptracer_access_allowed(struct task_struct *tsk)
+{
+	const struct task_exec_state *es;
+
+	if (!tsk->ptrace)
+		return false;
+	if (current != tsk->parent)
+		return false;
+	guard(rcu)();
+	es = task_exec_state_rcu(tsk);
+	return READ_ONCE(es->dumpable) == TASK_DUMPABLE_OWNER ||
+	       ptracer_capable(tsk, es->user_ns);
+}
+
 /*
  * Access another process' address space via ptrace.
  * Source/target buffer must be kernel space,
-- 
2.47.3
* Re: [PATCH RFC v2 3/5] ptrace: add ptracer_access_allowed()
2026-05-20 14:42 ` [PATCH RFC v2 3/5] ptrace: add ptracer_access_allowed() Christian Brauner
@ 2026-05-20 16:28 ` Jann Horn
0 siblings, 0 replies; 24+ messages in thread
From: Jann Horn @ 2026-05-20 16:28 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote:
> Add a helper that encapsulates all of the logic for checking ptrace
> access and remove open-coded versions in follow-up patches.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>
* [PATCH RFC v2 4/5] exec_state: relocate dumpable information
2026-05-20 14:42 [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner
` (2 preceding siblings ...)
2026-05-20 14:42 ` [PATCH RFC v2 3/5] ptrace: add ptracer_access_allowed() Christian Brauner
@ 2026-05-20 14:42 ` Christian Brauner
2026-05-20 19:21 ` Jann Horn
2026-05-20 14:42 ` [PATCH RFC v2 5/5] cred: switch dumpability lowering to task_exec_state Christian Brauner
` (2 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2026-05-20 14:42 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks.
It lives on mm_struct today, which exit_mm() clears from the task
long before the task itself is reaped.

exec_state is anchored to the execve() that established the current
privilege domain. Every clone() variant refcount-shares the parent's
exec_state via copy_exec_state(); only execve() allocates a fresh
instance (via alloc_task_exec_state() in begin_new_exec()) and
installs it under task_lock + exec_update_lock with
task_exec_state_replace(). init_task uses a static instance.

The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc/<pid>/coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.

coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.

The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().

bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.

mm_struct loses ->user_ns entirely. Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 arch/arm64/kernel/mte.c          |  6 ++----
 drivers/firmware/efi/efi.c       |  1 -
 fs/coredump.c                    | 20 +++++++-------------
 fs/exec.c                        | 39 ++++++++++++++++++++-------------------
 fs/pidfs.c                       | 16 ++++++----------
 fs/proc/base.c                   | 39 ++++++++++++++++-----------------------
 include/linux/binfmts.h          |  2 ++
 include/linux/coredump.h         |  4 ++++
 include/linux/mm_types.h         |  9 ++++-----
 include/linux/sched.h            |  4 +---
 include/linux/sched/coredump.h   | 36 ++----------------------------------
 include/linux/sched/exec_state.h |  6 +++---
 init/init_task.c                 | 10 ++++++++++
 kernel/cred.c                    |  2 +-
 kernel/exec_state.c              |  5 ++++-
 kernel/exit.c                    |  1 -
 kernel/fork.c                    | 15 +++++++++------
 kernel/kthread.c                 |  1 -
 kernel/ptrace.c                  | 26 ++++++++------------------
 kernel/sys.c                     |  4 ++--
 mm/init-mm.c                     |  1 -
 21 files changed, 101 insertions(+), 146 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 904ac41f93bc..1a9aad6ef22a 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/prctl.h>
+#include <linux/ptrace.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/string.h>
@@ -537,16 +538,13 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr,
 	if (!mm)
 		return -EPERM;
 
-	if (!tsk->ptrace || (current != tsk->parent) ||
-	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
-	     !ptracer_capable(tsk, mm->user_ns))) {
+	if (!ptracer_access_allowed(tsk)) {
 		mmput(mm);
 		return -EPERM;
 	}
 
 	ret = __access_remote_tags(mm, addr, kiov, gup_flags);
 	mmput(mm);
-
 	return ret;
 }
 
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d04be38f1750..ae78bc021b41 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -73,7 +73,6 @@ struct mm_struct efi_mm = {
 	MMAP_LOCK_INITIALIZER(efi_mm)
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
-	.user_ns		= &init_user_ns,
 #ifdef CONFIG_SCHED_MM_CID
 	.mm_cid.lock		= __RAW_SPIN_LOCK_UNLOCKED(efi_mm.mm_cid.lock),
 #endif
diff --git a/fs/coredump.c b/fs/coredump.c
index f5348d5bc441..e943569e9b6d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -395,8 +395,7 @@ static bool coredump_parse(struct core_name *cn, struct coredump_params *cprm,
 						       cred->gid));
 				break;
 			case 'd':
-				err = cn_printf(cn, "%d",
-					__get_dumpable(cprm->mm_flags));
+				err = cn_printf(cn, "%d", cprm->dumpable);
 				break;
 			/* signal that caused the coredump */
 			case 's':
@@ -869,11 +868,11 @@ static inline void coredump_sock_shutdown(struct file *file) { }
 static inline bool coredump_socket(struct core_name *cn, struct coredump_params *cprm) { return false; }
 #endif
 
-/* cprm->mm_flags contains a stable snapshot of dumpability flags. */
+/* cprm->dumpable is the snapshot of task dumpability at dump start. */
 static inline bool coredump_force_suid_safe(const struct coredump_params *cprm)
 {
 	/* Require nonrelative corefile path and be extra careful. */
-	return __get_dumpable(cprm->mm_flags) == TASK_DUMPABLE_ROOT;
+	return cprm->dumpable == TASK_DUMPABLE_ROOT;
 }
 
 static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
@@ -1085,7 +1084,7 @@ static inline bool coredump_skip(const struct coredump_params *cprm,
 		return true;
 	if (!binfmt->core_dump)
 		return true;
-	if (!__get_dumpable(cprm->mm_flags))
+	if (cprm->dumpable == TASK_DUMPABLE_OFF)
 		return true;
 	return false;
 }
@@ -1170,14 +1169,9 @@ void vfs_coredump(const kernel_siginfo_t *siginfo)
 	struct coredump_params cprm = {
 		.siginfo = siginfo,
 		.limit = rlimit(RLIMIT_CORE),
-		/*
-		 * We must use the same mm->flags while dumping core to avoid
-		 * inconsistency of bit flags, since this flag is not protected
-		 * by any locks.
-		 *
-		 * Note that we only care about MMF_DUMP* flags.
-		 */
-		.mm_flags = __mm_flags_get_dumpable(mm),
+		/* Snapshot MMF_DUMP_FILTER_* (unlocked) and dumpable for the dump. */
+		.mm_flags = __mm_flags_get_word(mm),
+		.dumpable = task_exec_state_get_dumpable(current),
 		.vma_meta = NULL,
 		.cpu = raw_smp_processor_id(),
 	};
diff --git a/fs/exec.c b/fs/exec.c
index f5663bb607d3..9e7f25e2cd41 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -35,6 +35,7 @@
 #include <linux/init.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -263,6 +264,9 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	if (!mm)
 		goto err;
 
+	/* Staged for would_dump() narrowing; consumed by begin_new_exec(). */
+	bprm->user_ns = get_user_ns(current_user_ns());
+
 	/* Save current stack limit for all calculations made during exec. */
 	task_lock(current->group_leader);
 	bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
@@ -834,12 +838,17 @@ EXPORT_SYMBOL(read_code);
  * On success, this function returns with exec_update_lock
  * held for writing.
  */
-static int exec_mmap(struct mm_struct *mm)
+static int exec_mmap(struct mm_struct *mm, struct user_namespace *user_ns)
 {
+	struct task_exec_state *exec_state __free(put_task_exec_state) = NULL;
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
 	int ret;
 
+	exec_state = alloc_task_exec_state(user_ns);
+	if (!exec_state)
+		return -ENOMEM;
+
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
@@ -870,6 +879,7 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->active_mm = mm;
 	tsk->mm = mm;
 	mm_init_cid(mm, tsk);
+	exec_state = task_exec_state_replace(tsk, exec_state);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1145,7 +1155,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * Release all of the old mmap stuff
 	 */
 	acct_arg_size(bprm, 0);
-	retval = exec_mmap(bprm->mm);
+	retval = exec_mmap(bprm->mm, bprm->user_ns);
 	if (retval)
 		goto out;
 
@@ -1210,9 +1220,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
 	    !(uid_eq(current_euid(), current_uid()) &&
 	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
+		task_exec_state_set_dumpable(suid_dumpable);
 	else
-		set_dumpable(current->mm, TASK_DUMPABLE_OWNER);
+		task_exec_state_set_dumpable(TASK_DUMPABLE_OWNER);
 
 	perf_event_exec();
 
@@ -1261,7 +1271,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * wait until new credentials are committed
 	 * by commit_creds() above
 	 */
-	if (get_dumpable(me->mm) != TASK_DUMPABLE_OWNER)
+	if (task_exec_state_get_dumpable(me) != TASK_DUMPABLE_OWNER)
 		perf_event_exit_task(me);
 	/*
 	 * cred_guard_mutex must be held at least to this point to prevent
@@ -1298,14 +1308,14 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 		struct user_namespace *old, *user_ns;
 		bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
 
-		/* Ensure mm->user_ns contains the executable */
-		user_ns = old = bprm->mm->user_ns;
+		/* Ensure bprm->user_ns contains the executable. */
+		user_ns = old = bprm->user_ns;
 		while ((user_ns != &init_user_ns) &&
 		       !privileged_wrt_inode_uidgid(user_ns, idmap, inode))
 			user_ns = user_ns->parent;
 
 		if (old != user_ns) {
-			bprm->mm->user_ns = get_user_ns(user_ns);
+			bprm->user_ns = get_user_ns(user_ns);
 			put_user_ns(old);
 		}
 	}
@@ -1375,6 +1385,8 @@ static void free_bprm(struct linux_binprm *bprm)
 		acct_arg_size(bprm, 0);
 		mmput(bprm->mm);
 	}
+	if (bprm->user_ns)
+		put_user_ns(bprm->user_ns);
 	free_arg_pages(bprm);
 	if (bprm->cred) {
 		/* in case exec fails before de_thread() succeeds */
@@ -1905,17 +1917,6 @@ void set_binfmt(struct linux_binfmt *new)
 }
 EXPORT_SYMBOL(set_binfmt);
 
-/*
- * set_dumpable stores three-value TASK_DUMPABLE_* into mm->flags.
- */
-void set_dumpable(struct mm_struct *mm, int value)
-{
-	if (WARN_ON((unsigned)value > TASK_DUMPABLE_ROOT))
-		return;
-
-	__mm_flags_set_mask_dumpable(mm, value);
-}
-
 static inline struct user_arg_ptr native_arg(const char __user *const __user *p)
 {
 	return (struct user_arg_ptr){.ptr.native = p};
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 9cd12f2f004c..ba4a729497c9 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -338,9 +338,9 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 	return false;
 }
 
-static __u32 pidfs_coredump_mask(unsigned long mm_flags)
+static __u32 pidfs_coredump_mask(enum task_dumpable dumpable)
 {
-	switch (__get_dumpable(mm_flags)) {
+	switch (dumpable) {
 	case TASK_DUMPABLE_OWNER:
 		return PIDFD_COREDUMP_USER;
 	case TASK_DUMPABLE_ROOT:
@@ -434,13 +434,9 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 
 	if ((mask & PIDFD_INFO_COREDUMP) && !kinfo.coredump_mask) {
 		guard(task_lock)(task);
-		if (task->mm) {
-			unsigned long flags = __mm_flags_get_dumpable(task->mm);
-
-			kinfo.coredump_mask = pidfs_coredump_mask(flags);
-			kinfo.mask |= PIDFD_INFO_COREDUMP;
-			/* No coredump actually took place, so no coredump signal. */
-		}
+		kinfo.coredump_mask = pidfs_coredump_mask(task_exec_state_get_dumpable(task));
+		kinfo.mask |= PIDFD_INFO_COREDUMP;
+		/* No coredump actually took place, so no coredump signal. */
 	}
 
 	/* Unconditionally return identifiers and credentials, the rest only on request */
@@ -779,7 +775,7 @@ void pidfs_coredump(const struct coredump_params *cprm)
 	VFS_WARN_ON_ONCE(attr == PIDFS_PID_DEAD);
 
 	/* Note how we were coredumped and that we coredumped. */
-	attr->coredump_mask = pidfs_coredump_mask(cprm->mm_flags) |
+	attr->coredump_mask = pidfs_coredump_mask(cprm->dumpable) |
 			      PIDFD_COREDUMPED;
 	/* If coredumping is set to skip we should never end up here. */
 	VFS_WARN_ON_ONCE(attr->coredump_mask & PIDFD_COREDUMP_SKIP);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index da0b316befb8..65f56136ec3f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -91,6 +91,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/stat.h>
 #include <linux/posix-timers.h>
 #include <linux/time_namespace.h>
@@ -1893,7 +1894,6 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	cred = __task_cred(task);
 	uid = cred->euid;
 	gid = cred->egid;
-	rcu_read_unlock();
 
 	/*
 	 * Before the /proc/pid/status file was created the only way to read
@@ -1903,29 +1903,22 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	 * made this apply to all per process world readable and executable
 	 * directories.
 	 */
-	if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) {
-		struct mm_struct *mm;
-		task_lock(task);
-		mm = task->mm;
-		/* Make non-dumpable tasks owned by some root */
-		if (mm) {
-			if (get_dumpable(mm) != TASK_DUMPABLE_OWNER) {
-				struct user_namespace *user_ns = mm->user_ns;
-
-				uid = make_kuid(user_ns, 0);
-				if (!uid_valid(uid))
-					uid = GLOBAL_ROOT_UID;
-
-				gid = make_kgid(user_ns, 0);
-				if (!gid_valid(gid))
-					gid = GLOBAL_ROOT_GID;
-			}
-		} else {
-			uid = GLOBAL_ROOT_UID;
-			gid = GLOBAL_ROOT_GID;
+	if (mode != (S_IFDIR | S_IRUGO | S_IXUGO)) {
+		struct task_exec_state *exec_state;
+
+		exec_state = task_exec_state_rcu(task);
+		if (READ_ONCE(exec_state->dumpable) != TASK_DUMPABLE_OWNER) {
+			uid = make_kuid(exec_state->user_ns, 0);
+			if (!uid_valid(uid))
+				uid = GLOBAL_ROOT_UID;
+
+			gid = make_kgid(exec_state->user_ns, 0);
+			if (!gid_valid(gid))
+				gid = GLOBAL_ROOT_GID;
 		}
-		task_unlock(task);
 	}
+	rcu_read_unlock();
+
 	*ruid = uid;
 	*rgid = gid;
 }
@@ -2965,7 +2958,7 @@ static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf,
 	ret = 0;
 	mm = get_task_mm(task);
 	if (mm) {
-		unsigned long flags = __mm_flags_get_dumpable(mm);
+		unsigned long flags = __mm_flags_get_word(mm);
 
 		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
 			       ((flags & MMF_DUMP_FILTER_MASK) >>
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 65abd5ab8836..a8379f4eee61 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -25,6 +25,8 @@ struct linux_binprm {
 	struct page *page[MAX_ARG_PAGES];
 #endif
 	struct mm_struct *mm;
+	/* user_ns published to task->exec_state at execve, narrowed by would_dump(). */
+	struct user_namespace *user_ns;
 	unsigned long p; /* current top of mem */
 	unsigned int
 		/* Should an execfd be passed to userspace? */
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 68861da4cf7c..7b38ee2e7913 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/sched/coredump.h>
 #include <asm/siginfo.h>
 
 #ifdef CONFIG_COREDUMP
@@ -20,7 +21,10 @@ struct coredump_params {
 	const kernel_siginfo_t *siginfo;
 	struct file *file;
 	unsigned long limit;
+	/* MMF_DUMP_FILTER_* bits, snapshot of mm->flags at dump start. */
 	unsigned long mm_flags;
+	/* Snapshot of dumpable at dump start. */
+	enum task_dumpable dumpable;
 	int cpu;
 	loff_t written;
 	loff_t pos;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 51ea37b2a0aa..9588ce3b16df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1342,7 +1342,6 @@ struct mm_struct {
 		 */
 		struct task_struct __rcu *owner;
 #endif
-		struct user_namespace *user_ns;
 
 		/* store ref to file /proc/<pid>/exe symlink points to */
 		struct file __rcu *exe_file;
@@ -1907,11 +1906,11 @@ enum {
 /* mm flags */
 
 /*
- * The first two bits represent core dump modes for set-user-ID,
- * the modes are TASK_DUMPABLE_* defined in linux/sched/coredump.h
+ * Bits 0 and 1 were dumpability; that moved to task->exec_state. Reserve
+ * the bits so MMF_DUMP_FILTER_* positions stay stable for the
+ * /proc/<pid>/coredump_filter ABI.
  */
 #define MMF_DUMPABLE_BITS 2
-#define MMF_DUMPABLE_MASK (BIT(MMF_DUMPABLE_BITS) - 1)
 /* coredump filter bits */
 #define MMF_DUMP_ANON_PRIVATE	2
 #define MMF_DUMP_ANON_SHARED	3
@@ -1972,7 +1971,7 @@ enum {
 #define MMF_TOPDOWN		31	/* mm searches top down by default */
 #define MMF_TOPDOWN_MASK	BIT(MMF_TOPDOWN)
 
-#define MMF_INIT_LEGACY_MASK	(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
+#define MMF_INIT_LEGACY_MASK	(MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
 				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d895c3ff2154..f74350f50901 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -85,6 +85,7 @@ struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
+struct task_exec_state;
 struct task_group;
 struct task_struct;
 struct timespec64;
@@ -1005,9 +1006,6 @@ struct task_struct {
 	unsigned			sched_rt_mutex:1;
 #endif
 
-	/* Save user-dumpable when mm goes away */
-	unsigned			user_dumpable:1;
-
 	/* Bit to tell TOMOYO we're in execve(): */
 	unsigned			in_execve:1;
 	unsigned			in_iowait:1;
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ed6547692b61..20957ccde3b5 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -2,8 +2,6 @@
 #ifndef _LINUX_SCHED_COREDUMP_H
 #define _LINUX_SCHED_COREDUMP_H
 
-#include <linux/mm_types.h>
-
 /*
  * Task dumpability mode. Gates core dump production and ptrace_attach()
  * authorization. The numeric values are stable ABI (suid_dumpable
@@ -15,37 +13,7 @@ enum task_dumpable {
 	TASK_DUMPABLE_ROOT	= 2,	/* dump as root; ptrace needs CAP_SYS_PTRACE */
 };
 
-static inline unsigned long __mm_flags_get_dumpable(const struct mm_struct *mm)
-{
-	/*
-	 * By convention, dumpable bits are contained in first 32 bits of the
-	 * bitmap, so we can simply access this first unsigned long directly.
-	 */
-	return __mm_flags_get_word(mm);
-}
-
-static inline void __mm_flags_set_mask_dumpable(struct mm_struct *mm, int value)
-{
-	__mm_flags_set_mask_bits_word(mm, MMF_DUMPABLE_MASK, value);
-}
-
-extern void set_dumpable(struct mm_struct *mm, int value);
-/*
- * This returns the actual value of the suid_dumpable flag. For things
- * that are using this for checking for privilege transitions, it must
- * test against TASK_DUMPABLE_OWNER rather than treating it as a boolean
- * value.
- */
-static inline int __get_dumpable(unsigned long mm_flags)
-{
-	return mm_flags & MMF_DUMPABLE_MASK;
-}
-
-static inline int get_dumpable(struct mm_struct *mm)
-{
-	unsigned long flags = __mm_flags_get_dumpable(mm);
-
-	return __get_dumpable(flags);
-}
+void task_exec_state_set_dumpable(enum task_dumpable value);
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
index 7a267efc34d3..e06ba3a2c910 100644
--- a/include/linux/sched/exec_state.h
+++ b/include/linux/sched/exec_state.h
@@ -8,6 +8,8 @@
 #include <linux/sched/coredump.h>
 #include <linux/user_namespace.h>
 
+struct user_namespace;
+
 struct task_exec_state {
 	refcount_t count;
 	enum task_dumpable dumpable;
@@ -15,13 +17,11 @@ struct task_exec_state {
 	struct rcu_head rcu;
 };
 
-struct task_exec_state *alloc_task_exec_state(void);
+struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns);
 void put_task_exec_state(struct task_exec_state *es);
 struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
 struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
 						 struct task_exec_state *exec_state);
-void task_exec_state_set_dumpable(enum task_dumpable value);
-enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
 void copy_exec_state(struct task_struct *tsk);
 void __init exec_state_init(void);
 
diff --git a/init/init_task.c b/init/init_task.c
index b5f48ebdc2b6..47a651b05058 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -7,6 +7,8 @@
 #include <linux/sched/rt.h>
 #include <linux/sched/task.h>
 #include <linux/sched/ext.h>
+#include <linux/sched/exec_state.h>
+#include <linux/user_namespace.h>
 #include <linux/init.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
@@ -56,6 +58,13 @@ static struct sighand_struct init_sighand = {
 	.signalfd_wqh	= __WAIT_QUEUE_HEAD_INITIALIZER(init_sighand.signalfd_wqh),
 };
 
+/* init to 2 - one for init_task, one to ensure it is never freed */
+static struct task_exec_state init_task_exec_state = {
+	.count		= REFCOUNT_INIT(2),
+	.dumpable	= TASK_DUMPABLE_OWNER,
+	.user_ns	= &init_user_ns,
+};
+
 #ifdef CONFIG_SHADOW_CALL_STACK
 unsigned long init_shadow_call_stack[SCS_SIZE / sizeof(long)] = {
 	[(SCS_SIZE / sizeof(long)) - 1] = SCS_END_MAGIC
@@ -113,6 +122,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.exec_state	= &init_task_exec_state,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
diff --git a/kernel/cred.c b/kernel/cred.c
index 12a7b1ce5131..51c35ac94787 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -385,7 +385,7 @@ int commit_creds(struct cred *new)
 	    !gid_eq(old->fsgid, new->fsgid) ||
 	    !cred_cap_issubset(old, new)) {
 		if (task->mm)
-			set_dumpable(task->mm, suid_dumpable);
+			task_exec_state_set_dumpable(suid_dumpable);
 		task->pdeath_signal = 0;
 		/*
 		 * If a task drops privileges and becomes nondumpable,
diff --git a/kernel/exec_state.c b/kernel/exec_state.c
index 85178b1d2c57..f125757d7f09 100644
--- a/kernel/exec_state.c
+++ b/kernel/exec_state.c
@@ -8,6 +8,7 @@
 #include <linux/sched/exec_state.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
+#include <linux/user_namespace.h>
 
 static struct kmem_cache *task_exec_state_cachep;
 
@@ -15,6 +16,7 @@ static void __free_task_exec_state(struct rcu_head *rcu)
 {
 	struct task_exec_state *es = container_of(rcu, struct task_exec_state, rcu);
 
+	put_user_ns(es->user_ns);
 	kmem_cache_free(task_exec_state_cachep, es);
 }
 
@@ -24,7 +26,7 @@ void put_task_exec_state(struct task_exec_state *es)
 		call_rcu(&es->rcu, __free_task_exec_state);
 }
 
-struct task_exec_state *alloc_task_exec_state(void)
+struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns)
 {
 	struct task_exec_state *es;
 
@@ -33,6 +35,7 @@ struct task_exec_state *alloc_task_exec_state(void)
 		return NULL;
 	refcount_set(&es->count, 1);
 	es->dumpable = TASK_DUMPABLE_OFF;
+	es->user_ns = get_user_ns(user_ns);
 	return es;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 507eda655e8d..9a909993ab1d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -571,7 +571,6 @@ static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
-	current->user_dumpable = (get_dumpable(mm) == TASK_DUMPABLE_OWNER);
 	current->mm = NULL;
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f3fdfdb14c7..b08532ac1ba6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -23,6 +23,7 @@
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/ext.h>
+#include <linux/sched/exec_state.h>
 #include <linux/seq_file.h>
 #include <linux/rtmutex.h>
 #include <linux/init.h>
@@ -555,6 +556,7 @@ void free_task(struct task_struct *tsk)
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
 	bpf_task_storage_free(tsk);
+	put_task_exec_state(tsk->exec_state);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -731,7 +733,6 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_subscriptions_destroy(mm);
 	check_mm(mm);
-	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
@@ -1072,8 +1073,7 @@ static void mmap_init_lock(struct mm_struct *mm)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
-	struct user_namespace *user_ns)
+static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
@@ -1132,7 +1132,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
 
-	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
@@ -1163,7 +1162,7 @@ struct mm_struct *mm_alloc(void)
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));
-	return mm_init(mm, current, current_user_ns());
+	return mm_init(mm, current);
 }
 EXPORT_SYMBOL_IF_KUNIT(mm_alloc);
 
@@ -1527,7 +1526,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
-	if (!mm_init(mm, tsk, mm->user_ns))
+	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
 	uprobe_start_dup_mmap();
@@ -2090,6 +2089,7 @@ __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
+	RCU_INIT_POINTER(p->exec_state, NULL);
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
@@ -2122,6 +2122,8 @@ __latent_entropy struct task_struct *copy_process(
 #ifdef CONFIG_PROVE_LOCKING
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
+	copy_exec_state(p);
+
 	retval = copy_creds(p, clone_flags);
 	if (retval < 0)
 		goto bad_fork_free;
@@ -3098,6 +3100,7 @@ void __init proc_caches_init(void)
 			sizeof(struct signal_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+	exec_state_init();
 	files_cachep = kmem_cache_create("files_cache",
 			sizeof(struct files_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 791210daf8b4..63beb59b7a3d 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1619,7 +1619,6 @@ void kthread_use_mm(struct mm_struct *mm)
 
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
-	WARN_ON_ONCE(!mm->user_ns);
 
 	/*
 	 * It is possible for mm to be the same as tsk->active_mm, but
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 2dc7d01baba0..a4932ef716c6 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -71,21 +71,14 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 		     void *buf, int len, unsigned int gup_flags)
 {
 	struct mm_struct *mm;
-	int ret;
+	int ret = 0;
 
 	mm = get_task_mm(tsk);
 	if (!mm)
 		return 0;
 
-	if (!tsk->ptrace ||
-	    (current != tsk->parent) ||
-	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
-	     !ptracer_capable(tsk, mm->user_ns))) {
-		mmput(mm);
-		return 0;
-	}
-
-	ret = access_remote_vm(mm, addr, buf, len, gup_flags);
+	if (ptracer_access_allowed(tsk))
+		ret = access_remote_vm(mm, addr, buf, len, gup_flags);
 	mmput(mm);
 
 	return ret;
@@ -300,16 +293,13 @@ static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
 
 static bool task_still_dumpable(struct task_struct *task, unsigned int mode)
 {
-	struct mm_struct *mm = task->mm;
-	if (mm) {
-		if (get_dumpable(mm) == TASK_DUMPABLE_OWNER)
-			return true;
-		return ptrace_has_cap(mm->user_ns, mode);
-	}
+	const struct task_exec_state *exec_state;
 
-	if (task->user_dumpable)
+	guard(rcu)();
+	exec_state = task_exec_state_rcu(task);
+	if (READ_ONCE(exec_state->dumpable) == TASK_DUMPABLE_OWNER)
 		return true;
-	return ptrace_has_cap(&init_user_ns, mode);
+	return ptrace_has_cap(exec_state->user_ns, mode);
 }
 
 /* Returns 0 on success, -errno on denial. */
diff --git a/kernel/sys.c b/kernel/sys.c
index f1189f719db5..df69bd71de03 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2565,14 +2565,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = put_user(me->pdeath_signal, (int __user *)arg2);
 		break;
 	case PR_GET_DUMPABLE:
-		error = get_dumpable(me->mm);
+		error = task_exec_state_get_dumpable(me);
 		break;
 	case PR_SET_DUMPABLE:
 		if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER) {
 			error = -EINVAL;
 			break;
 		}
-		set_dumpable(me->mm, arg2);
+		task_exec_state_set_dumpable(arg2);
 		break;
 
 	case PR_SET_UNALIGN:
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c5556bb9d5f0..3e792aad7626 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -43,7 +43,6 @@ struct mm_struct init_mm = {
 	.vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
-	.user_ns	= &init_user_ns,
 #ifdef CONFIG_SCHED_MM_CID
 	.mm_cid.lock	= __RAW_SPIN_LOCK_UNLOCKED(init_mm.mm_cid.lock),
 #endif
-- 
2.47.3
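The behavioural change in the task_dump_owner() hunk is easiest to see
for a zombie. The following is a userspace sketch under stated
assumptions, not the kernel implementation: struct userns_model,
struct task_model, owner_before() and owner_after() are hypothetical
names, uids are plain unsigned integers, and a boolean replaces the
make_kuid()/uid_valid() pair. It models only the branch taken when
mode differs from S_IFDIR|S_IRUGO|S_IXUGO, and the enum numbering is
assumed to follow prctl(PR_SET_DUMPABLE).

```c
#include <assert.h>
#include <stdbool.h>

#define GLOBAL_ROOT_UID_MODEL	0u
#define INVALID_UID_MODEL	((unsigned int)-1)

/* Assumed numbering of the three dumpable modes. */
enum task_dumpable {
	TASK_DUMPABLE_OFF   = 0,
	TASK_DUMPABLE_OWNER = 1,
	TASK_DUMPABLE_ROOT  = 2,
};

/* Hypothetical namespace: where, if anywhere, its uid 0 maps to. */
struct userns_model {
	bool root_mapped;
	unsigned int root_kuid;
};

struct task_model {
	unsigned int euid;
	bool has_mm;			/* false once exit_mm() has run */
	enum task_dumpable dumpable;
	const struct userns_model *user_ns;
};

/* Stand-in for make_kuid(ns, 0) followed by the uid_valid() fallback. */
static unsigned int ns_root_or_global(const struct userns_model *ns)
{
	unsigned int uid = ns->root_mapped ? ns->root_kuid : INVALID_UID_MODEL;

	return uid == INVALID_UID_MODEL ? GLOBAL_ROOT_UID_MODEL : uid;
}

/* Old shape: state hangs off the mm, so a task without mm is global root. */
static unsigned int owner_before(const struct task_model *t)
{
	assert(t && t->user_ns);
	if (!t->has_mm)
		return GLOBAL_ROOT_UID_MODEL;
	if (t->dumpable != TASK_DUMPABLE_OWNER)
		return ns_root_or_global(t->user_ns);
	return t->euid;
}

/* New shape: state hangs off exec_state, which outlives the mm. */
static unsigned int owner_after(const struct task_model *t)
{
	assert(t && t->user_ns);
	if (t->dumpable != TASK_DUMPABLE_OWNER)
		return ns_root_or_global(t->user_ns);
	return t->euid;
}
```

With these models a live task gets the same owner either way, while a
zombie stops collapsing to global root and keeps the owner its
dumpable mode and exec namespace imply.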
* Re: [PATCH RFC v2 4/5] exec_state: relocate dumpable information 2026-05-20 14:42 ` [PATCH RFC v2 4/5] exec_state: relocate dumpable information Christian Brauner @ 2026-05-20 19:21 ` Jann Horn 2026-05-20 19:47 ` Christian Brauner 0 siblings, 1 reply; 24+ messages in thread From: Jann Horn @ 2026-05-20 19:21 UTC (permalink / raw) To: Christian Brauner Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote: > The dumpable flag captured at execve() is consulted by > __ptrace_may_access() and several /proc owner / visibility checks. > It lives on mm_struct today, which exit_mm() clears from the task > long before the task itself is reaped. > > exec_state is anchored to the execve() that established the current > privilege domain. Every clone() variant refcount-shares the parent's > exec_state via copy_exec_state(); only execve() allocates a fresh > instance (via alloc_task_exec_state() in begin_new_exec()) and > installs it under task_lock + exec_update_lock with > task_exec_state_replace(). init_task uses a static instance. > > The dumpable mode now lives on task->exec_state->dumpable. > task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is > removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit > positions remain stable for the /proc/<pid>/coredump_filter ABI. The > task->user_dumpable cache bit and its assignment in exit_mm() are > removed; readers go through get_dumpable(task) directly. > > coredump_params gains a snapshot field cprm.dumpable, populated from > get_dumpable(current) at vfs_coredump() entry, replacing the previous > __get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and > fs/pidfs.c. 
> > The user namespace recorded at execve() is consulted by > __ptrace_may_access() and by /proc/PID/* owner derivation. Move the > captured user_ns onto task_exec_state, which stays attached to the task > past exit_mm() and across exit_files(). > > bprm grows a user_ns field staged in bprm_mm_init() with the caller's > user_ns, narrowed by would_dump() to the closest privileged ancestor, > and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns). > free_bprm() releases the staging reference. > > mm_struct loses ->user_ns entirely. Initializers in init-mm, efi_mm, > and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed; > __mmdrop() drops the matching put_user_ns(). The kthread_use_mm() > WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too. > > Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> Except for the discussion on the cover letter about whether to unshare on fork(): Reviewed-by: Jann Horn <jannh@google.com> > @@ -434,13 +434,9 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) > > if ((mask & PIDFD_INFO_COREDUMP) && !kinfo.coredump_mask) { > guard(task_lock)(task); nit: this can become guard(rcu)() now? > - if (task->mm) { > - unsigned long flags = __mm_flags_get_dumpable(task->mm); > - > - kinfo.coredump_mask = pidfs_coredump_mask(flags); > - kinfo.mask |= PIDFD_INFO_COREDUMP; > - /* No coredump actually took place, so no coredump signal. */ > - } > + kinfo.coredump_mask = pidfs_coredump_mask(task_exec_state_get_dumpable(task)); > + kinfo.mask |= PIDFD_INFO_COREDUMP; > + /* No coredump actually took place, so no coredump signal. */ > } ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH RFC v2 4/5] exec_state: relocate dumpable information 2026-05-20 19:21 ` Jann Horn @ 2026-05-20 19:47 ` Christian Brauner 0 siblings, 0 replies; 24+ messages in thread From: Christian Brauner @ 2026-05-20 19:47 UTC (permalink / raw) To: Jann Horn Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, May 20, 2026 at 09:21:19PM +0200, Jann Horn wrote: > On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote: > > The dumpable flag captured at execve() is consulted by > > __ptrace_may_access() and several /proc owner / visibility checks. > > It lives on mm_struct today, which exit_mm() clears from the task > > long before the task itself is reaped. > > > > exec_state is anchored to the execve() that established the current > > privilege domain. Every clone() variant refcount-shares the parent's > > exec_state via copy_exec_state(); only execve() allocates a fresh > > instance (via alloc_task_exec_state() in begin_new_exec()) and > > installs it under task_lock + exec_update_lock with > > task_exec_state_replace(). init_task uses a static instance. > > > > The dumpable mode now lives on task->exec_state->dumpable. > > task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is > > removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit > > positions remain stable for the /proc/<pid>/coredump_filter ABI. The > > task->user_dumpable cache bit and its assignment in exit_mm() are > > removed; readers go through get_dumpable(task) directly. > > > > coredump_params gains a snapshot field cprm.dumpable, populated from > > get_dumpable(current) at vfs_coredump() entry, replacing the previous > > __get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and > > fs/pidfs.c. 
> > > > The user namespace recorded at execve() is consulted by > > __ptrace_may_access() and by /proc/PID/* owner derivation. Move the > > captured user_ns onto task_exec_state, which stays attached to the task > > past exit_mm() and across exit_files(). > > > > bprm grows a user_ns field staged in bprm_mm_init() with the caller's > > user_ns, narrowed by would_dump() to the closest privileged ancestor, > > and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns). > > free_bprm() releases the staging reference. > > > > mm_struct loses ->user_ns entirely. Initializers in init-mm, efi_mm, > > and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed; > > __mmdrop() drops the matching put_user_ns(). The kthread_use_mm() > > WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too. > > > > Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> > > Except for the discussion on the cover letter about whether to unshare > on fork(): > > Reviewed-by: Jann Horn <jannh@google.com> > > > @@ -434,13 +434,9 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) > > > > if ((mask & PIDFD_INFO_COREDUMP) && !kinfo.coredump_mask) { > > guard(task_lock)(task); > > nit: this can become guard(rcu)() now? Yes. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH RFC v2 5/5] cred: switch dumpability lowering to task_exec_state
2026-05-20 14:42 [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner
` (3 preceding siblings ...)
2026-05-20 14:42 ` [PATCH RFC v2 4/5] exec_state: relocate dumpable information Christian Brauner
@ 2026-05-20 14:42 ` Christian Brauner
2026-05-20 18:44 ` Jann Horn
2026-05-20 15:08 ` [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner
2026-05-20 16:27 ` Jann Horn
6 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2026-05-20 14:42 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds, Oleg Nesterov
Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Christian Brauner (Amutable)

commit_creds() has historically called set_dumpable(suid_dumpable) on
every effective uid/gid/cap change, paired with an smp_wmb()/smp_rmb()
fence against __ptrace_may_access() reading the credentials.

Switch the call to task_exec_state_set_dumpable() so the dumpability
lowering targets the new per-task exec_state rather than mm->flags.
Drop the open-coded "if (task->mm)" guard - exec_state is always
allocated for any observable task - and drop the explicit
smp_wmb()/smp_rmb() pair: the new model relies on RCU acquire/release
on the cred pointer. WRITE_ONCE() on es->dumpable inside
task_exec_state_set_dumpable() happens-before rcu_assign_pointer() of
the new cred in commit_creds(), so a reader that observes the new
cred via rcu_dereference(task->real_cred) in __ptrace_may_access() is
guaranteed to observe the new dumpable via READ_ONCE(es->dumpable).

The same-uid ptrace shedding and /proc visibility behavior that
long-running daemons launched as root (sshd, dbus-daemon, polkitd,
NetworkManager, postfix workers, ...) rely on when they setresuid()
to a service uid is preserved. No userspace audit cycle is required.

Behavioral change: dumpability propagates across the fork subtree
=================================================================

exec_state is refcount-shared across every clone() variant - thread,
fork(), vfork(), io_uring worker - so this write is observed by every
task still sharing the same exec_state. Pre-series, set_dumpable()
targeted mm->flags, which was per-mm: shared by CLONE_VM threads but
private to fork()-without-CLONE_VM children. Under the new model a
privilege drop in any task in the subtree lowers dumpability for the
entire subtree, including non-CLONE_VM siblings.

This matches the model the series codifies: the entire fork subtree
of one execve shares one exec_state, and dumpability is a property of
that domain.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/cred.c   | 25 ++++++++++++-------------
 kernel/ptrace.c | 10 ----------
 2 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 51c35ac94787..335d8da1c43b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -378,25 +378,24 @@ int commit_creds(struct cred *new)
 
 	get_cred(new); /* we will require a ref for the subj creds too */
 
-	/* dumpability changes */
+	/*
+	 * Lower dumpability on euid/egid/fsuid/fsgid/capability changes.
+	 * Long-running daemons launched as root (sshd, dbus-daemon,
+	 * polkitd, NetworkManager, postfix workers, ...) rely on this to
+	 * shed /proc visibility and same-uid ptrace exposure of
+	 * root-acquired secrets when they setresuid() to a service uid.
+	 *
+	 * exec_state is shared across the whole fork subtree of the
+	 * establishing execve(), so this write is observed by every task
+	 * still sharing the same exec_state.
+	 */
 	if (!uid_eq(old->euid, new->euid) ||
 	    !gid_eq(old->egid, new->egid) ||
 	    !uid_eq(old->fsuid, new->fsuid) ||
 	    !gid_eq(old->fsgid, new->fsgid) ||
 	    !cred_cap_issubset(old, new)) {
-		if (task->mm)
-			task_exec_state_set_dumpable(suid_dumpable);
+		task_exec_state_set_dumpable(suid_dumpable);
 		task->pdeath_signal = 0;
-		/*
-		 * If a task drops privileges and becomes nondumpable,
-		 * the dumpability change must become visible before
-		 * the credential change; otherwise, a __ptrace_may_access()
-		 * racing with this change may be able to attach to a task it
-		 * shouldn't be able to attach to (as if the task had dropped
-		 * privileges without becoming nondumpable).
-		 * Pairs with a read barrier in __ptrace_may_access().
-		 */
-		smp_wmb();
 	}
 
 	/* alter the thread keyring */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index a4932ef716c6..c340a741e76a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -356,16 +356,6 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 		return -EPERM;
 ok:
 	rcu_read_unlock();
-	/*
-	 * If a task drops privileges and becomes nondumpable (through a syscall
-	 * like setresuid()) while we are trying to access it, we must ensure
-	 * that the dumpability is read after the credentials; otherwise,
-	 * we may be able to attach to a task that we shouldn't be able to
-	 * attach to (as if the task had dropped privileges without becoming
-	 * nondumpable).
-	 * Pairs with a write barrier in commit_creds().
-	 */
-	smp_rmb();
 	if (!task_still_dumpable(task, mode))
 		return -EPERM;
 
-- 
2.47.3

* Re: [PATCH RFC v2 5/5] cred: switch dumpability lowering to task_exec_state
2026-05-20 14:42 ` [PATCH RFC v2 5/5] cred: switch dumpability lowering to task_exec_state Christian Brauner
@ 2026-05-20 18:44 ` Jann Horn
0 siblings, 0 replies; 24+ messages in thread
From: Jann Horn @ 2026-05-20 18:44 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote:
> commit_creds() has historically called set_dumpable(suid_dumpable) on
> every effective uid/gid/cap change, paired with an smp_wmb()/smp_rmb()
> fence against __ptrace_may_access() reading the credentials.
>
> Switch the call to task_exec_state_set_dumpable() so the dumpability
> lowering targets the new per-task exec_state rather than mm->flags.
> Drop the open-coded "if (task->mm)" guard - exec_state is always
> allocated for any observable task - and drop the explicit
> smp_wmb()/smp_rmb() pair: the new model relies on RCU acquire/release

RCU isn't ACQUIRE/RELEASE; RCU is CONSUME/RELEASE, which is weaker on
the reader side. (While ACQUIRE orders against following loads,
CONSUME only orders against following loads with an address
dependency, and is implemented as a plain load on anything other than
alpha. It's a bit cursed because the compiler doesn't understand the
invariants that must be preserved for CONSUME, which can cause
compiler optimizations to break CONSUME ordering if you're not extra
careful, which is why arm64 builds with LTO do a memory barrier on
every READ_ONCE()...)

> on the cred pointer. WRITE_ONCE() on es->dumpable inside
> task_exec_state_set_dumpable() happens-before rcu_assign_pointer() of

Yes.

> the new cred in commit_creds(), so a reader that observes the new
> cred via rcu_dereference(task->real_cred) in __ptrace_may_access() is
> guaranteed to observe the new dumpable via READ_ONCE(es->dumpable).

No, the two loads in __ptrace_may_access() are effectively both just
READ_ONCE(), which only provides RELAXED or CONSUME semantics, while
you'd need ACQUIRE to order this:

__task_cred
  rcu_dereference
    rcu_dereference_check
      __rcu_dereference_check
        READ_ONCE
[...]
task_still_dumpable
  [...]
  task_exec_state_rcu
    rcu_dereference
      rcu_dereference_check
        __rcu_dereference_check
          READ_ONCE

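The ordering requirement raised in this review can be modelled in userspace with C11 atomics. What follows is only a sketch under stated assumptions, not kernel code: `cred_model`, `drop_privileges()` and `may_attach()` are hypothetical names, a plain integer stands in for exec_state->dumpable, and the acquire load stands in for whatever read-side ordering the kernel path would need (the role the removed smp_rmb() used to play). The point it illustrates is that the dumpable load has no address dependency on the cred pointer, so only an acquire-style first load orders the two reads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel objects discussed above. */
enum { DUMPABLE_OFF = 0, DUMPABLE_OWNER = 1 };

struct cred_model {
	int euid;
};

static struct cred_model root_cred = { .euid = 0 };
static struct cred_model user_cred = { .euid = 1000 };

/* Models task->real_cred and exec_state->dumpable as independent objects. */
static _Atomic(struct cred_model *) real_cred = &root_cred;
static atomic_int dumpable = DUMPABLE_OWNER;

/* Writer side: lower dumpability first, then publish the new credentials. */
static void drop_privileges(void)
{
	atomic_store_explicit(&dumpable, DUMPABLE_OFF, memory_order_relaxed);
	/* Release: the dumpable store above is visible before the new cred. */
	atomic_store_explicit(&real_cred, &user_cred, memory_order_release);
}

/* Reader side: check credentials first, then dumpability. */
static bool may_attach(int tracer_euid)
{
	/*
	 * Acquire: if this load observes the new cred, the dumpable load
	 * below observes the lowered value. A relaxed or dependency-only
	 * load would not guarantee that, because the second load does not
	 * depend on the pointer returned here.
	 */
	struct cred_model *cred =
		atomic_load_explicit(&real_cred, memory_order_acquire);

	assert(cred != NULL);
	if (cred->euid != tracer_euid)
		return false;
	return atomic_load_explicit(&dumpable, memory_order_relaxed) ==
	       DUMPABLE_OWNER;
}
```

In this model a tracer with euid 1000 is refused both before the privilege drop (the uid does not match) and after it (the uid matches but dumpability has already been lowered), which is the invariant the barrier pair protected.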
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 14:42 [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner ` (4 preceding siblings ...) 2026-05-20 14:42 ` [PATCH RFC v2 5/5] cred: switch dumpability lowering to task_exec_state Christian Brauner @ 2026-05-20 15:08 ` Christian Brauner 2026-05-20 16:27 ` Jann Horn 6 siblings, 0 replies; 24+ messages in thread From: Christian Brauner @ 2026-05-20 15:08 UTC (permalink / raw) To: Jann Horn, Linus Torvalds, Oleg Nesterov Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko Not the best title for the series fwiw. I thought I had fixed that. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm()
2026-05-20 14:42 [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner
` (5 preceding siblings ...)
2026-05-20 15:08 ` [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() Christian Brauner
@ 2026-05-20 16:27 ` Jann Horn
2026-05-20 16:52 ` Linus Torvalds
6 siblings, 1 reply; 24+ messages in thread
From: Jann Horn @ 2026-05-20 16:27 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton,
Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm,
Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 4:43 PM Christian Brauner <brauner@kernel.org> wrote:
> task_exec_state is the privilege domain established by an execve(), not
> a property of the address space. Following the model Linus sketched in
> [1]:
>
>   - Every clone() variant - thread, process, vfork(), io_uring
>     worker - refcount-shares the parent's exec_state. No
>     dup-on-fork.
>   - Only execve() in the child allocates a fresh instance.
>   - Credential changes (setresuid, capset, ...) and
>     prctl(PR_SET_DUMPABLE) update dumpability on the shared
>     exec_state.
>
> The entire fork subtree of one execve shares one exec_state; a
> child enters a new privilege domain only by execve()ing into one.

Hm, thinking about this more... sorry I didn't realize this sooner...

On Android, there is a process "zygote64" from which all apps are
forked without exec, and these apps are supposed to not interfere with
each other:

frankel:/ # ps -AZ | grep zygote64
u:r:zygote:s0 root 871 1 15352744 230812 do_sys_poll 0 S zygote64
frankel:/ # cut -d' ' -f28 < /proc/871/stat
549023406592
frankel:/ # cut -d' ' -f28 < /proc/$(pgrep calculator)/stat
549023406592
frankel:/ # cut -d' ' -f28 < /proc/$(pgrep photos)/stat
549023406592

So I think with this series applied, any app could use prctl() to
influence the dumpability of all other apps, which seems
undesirable...

Android also has code in SpecializeCommon(), which runs in a freshly
forked zygote child and selectively makes some processes dumpable (if
the corresponding app has been installed in a mode with some debugging
enabled).

I think this means that Linus' suggestion of always sharing
dumpability across all forked children won't work.

* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 16:27 ` Jann Horn @ 2026-05-20 16:52 ` Linus Torvalds 2026-05-20 16:55 ` Linus Torvalds 2026-05-20 17:29 ` Jann Horn 0 siblings, 2 replies; 24+ messages in thread From: Linus Torvalds @ 2026-05-20 16:52 UTC (permalink / raw) To: Jann Horn Cc: Christian Brauner, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, 20 May 2026 at 11:27, Jann Horn <jannh@google.com> wrote: > > On Android, there is a process "zygote64" from which all apps are > forked without exec, and these apps are supposed to not interfere with > each other: Oh, yeah, I remember that horrid hack. I think it has come up before because it makes fork() very expensive when you have a huge process that you fork for every little thing... "Zygote: the GNU emacs of Android". That said, I think it's mostly harmless, because: > So I think with this series applied, any app could use prctl() to > influence the dumpability of all other apps, which seems > undesirable... Right you are. How about just modifying the rule to saying that prctl(PR_SET_DUMPABLE) just always creates a new exec_state for the current thread? (Insert obvious optimizations: if the state doesn't change, don't split it, and if the current exec state has a refcount of 1, don't split it). It's still not the same thing as "always create a new one at fork", since there would be no "shared mm" component to it, and it would be thread-local. Hmm? Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
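The rule proposed in this message (PR_SET_DUMPABLE gives the caller a private exec_state, except when nothing changes or the state is already unshared) can be sketched as a copy-on-write step. This is a single-threaded userspace model under assumptions: `exec_state_model`, `task_model` and the helper names are hypothetical, the refcount is a plain int instead of an atomic, and the RCU publication, locking and user_ns copy that a kernel version would need are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a refcounted, shareable exec_state. */
struct exec_state_model {
	int refcount;
	int dumpable;
};

struct task_model {
	struct exec_state_model *es;
};

static struct exec_state_model *es_alloc(int dumpable)
{
	struct exec_state_model *es = malloc(sizeof(*es));

	if (es) {
		es->refcount = 1;
		es->dumpable = dumpable;
	}
	return es;
}

static void es_put(struct exec_state_model *es)
{
	assert(es->refcount > 0);
	if (--es->refcount == 0)
		free(es);
}

/* fork()/clone() in the series as posted: share the parent's state. */
static void task_fork(struct task_model *child, const struct task_model *parent)
{
	child->es = parent->es;
	child->es->refcount++;
}

/* Proposed rule: split on write, with the two optimizations from the mail. */
static int task_set_dumpable(struct task_model *t, int value)
{
	struct exec_state_model *fresh;

	if (t->es->dumpable == value)
		return 0;		/* no state change: do not split */
	if (t->es->refcount == 1) {
		t->es->dumpable = value;	/* sole owner: update in place */
		return 0;
	}
	fresh = es_alloc(value);
	if (!fresh)
		return -1;
	es_put(t->es);			/* drop our share of the old state */
	t->es = fresh;
	return 0;
}
```

With a zygote-like parent and two forked children, a write in one child leaves the parent and the sibling untouched, which is the isolation property the Android case needs. The multithreaded signal-handler case raised later in the thread is not covered by this model.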
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 16:52 ` Linus Torvalds @ 2026-05-20 16:55 ` Linus Torvalds 2026-05-20 18:09 ` Jann Horn 2026-05-20 17:29 ` Jann Horn 1 sibling, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2026-05-20 16:55 UTC (permalink / raw) To: Jann Horn Cc: Christian Brauner, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, 20 May 2026 at 11:52, Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > Right you are. How about just modifying the rule to saying that > prctl(PR_SET_DUMPABLE) just always creates a new exec_state for the > current thread? Obviously we could also make some clone() flag to do this, although I'd rather not expand on that interface.. I guess the good news here is that Android may have more of a "upstream first" model, but I think Android still has lots of its own required infrastructure anyway. Is there some Android kernel person who could pipe up on this all? Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 16:55 ` Linus Torvalds @ 2026-05-20 18:09 ` Jann Horn 2026-05-20 18:12 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Jann Horn @ 2026-05-20 18:09 UTC (permalink / raw) To: Linus Torvalds, Christian Brauner Cc: Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, May 20, 2026 at 7:01 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > On Wed, 20 May 2026 at 11:52, Linus Torvalds > <torvalds@linuxfoundation.org> wrote: > > > > Right you are. How about just modifying the rule to saying that > > prctl(PR_SET_DUMPABLE) just always creates a new exec_state for the > > current thread? > > Obviously we could also make some clone() flag to do this, although > I'd rather not expand on that interface.. This patch series always unshares state with alloc_task_exec_state() on execve(), and never on clone(). I think compared to that, it would be strictly better to also do alloc_task_exec_state() on clone() without CLONE_VM for now, to preserve the existing semantics. ^ permalink raw reply [flat|nested] 24+ messages in thread
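The alternative suggested here, duplicating the state at clone() time unless CLONE_VM is set, can be sketched in the same style. Again this is a hypothetical userspace model: `copy_exec_state_model()` is not the series' copy_exec_state(), `MODEL_CLONE_VM` is a local constant that merely mirrors the value of CLONE_VM, and refcounting is non-atomic.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-in with the same value as CLONE_VM; an assumption of this sketch. */
#define MODEL_CLONE_VM 0x00000100UL

struct exec_state_model {
	int refcount;
	int dumpable;
};

/*
 * Threads (CLONE_VM) keep sharing the parent's state, matching the old
 * per-mm behaviour; a fork() without CLONE_VM gets a private copy that
 * starts out with the parent's dumpable value.
 */
static struct exec_state_model *
copy_exec_state_model(struct exec_state_model *parent, unsigned long clone_flags)
{
	struct exec_state_model *es;

	assert(parent && parent->refcount > 0);
	if (clone_flags & MODEL_CLONE_VM) {
		parent->refcount++;
		return parent;
	}

	es = malloc(sizeof(*es));
	if (!es)
		return NULL;
	es->refcount = 1;
	es->dumpable = parent->dumpable;
	return es;
}
```

Under this rule a later dumpability change in a forked child never reaches the parent or its siblings, so the pre-series semantics are preserved at the cost of one allocation per fork().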
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 18:09 ` Jann Horn @ 2026-05-20 18:12 ` Linus Torvalds 2026-05-20 19:46 ` Christian Brauner 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2026-05-20 18:12 UTC (permalink / raw) To: Jann Horn Cc: Christian Brauner, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, 20 May 2026 at 13:09, Jann Horn <jannh@google.com> wrote: > This patch series always unshares state with alloc_task_exec_state() > on execve(), and never on clone(). I think compared to that, it would > be strictly better to also do alloc_task_exec_state() on clone() > without CLONE_VM for now, to preserve the existing semantics. Christian did do that originally, but I convinced him that we should try to aim for simpler semantics (and imho *better* semantics) Oh well. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 18:12 ` Linus Torvalds @ 2026-05-20 19:46 ` Christian Brauner 0 siblings, 0 replies; 24+ messages in thread From: Christian Brauner @ 2026-05-20 19:46 UTC (permalink / raw) To: Linus Torvalds Cc: Jann Horn, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, May 20, 2026 at 01:12:49PM -0500, Linus Torvalds wrote: > On Wed, 20 May 2026 at 13:09, Jann Horn <jannh@google.com> wrote: > > This patch series always unshares state with alloc_task_exec_state() > > on execve(), and never on clone(). I think compared to that, it would > > be strictly better to also do alloc_task_exec_state() on clone() > > without CLONE_VM for now, to preserve the existing semantics. > > Christian did do that originally, but I convinced him that we should > try to aim for simpler semantics (and imho *better* semantics) Tbh: I always expected that we would end up at the original dup-on-clone. Either now or when rolling this out. There's just too much crap that could rely on this - but we should've definitely tried. And I very very much agree with you that the non-dup-on-clone semantics are fundamentally nicer. I wish that dumpability could've been a userspace only concept. I really dislike that the kernel magically resets various things during commit_creds() transition. It is such a mess and leads to really broken behavior for stuff like pdeath_signal which magically disappears on setuid() which makes it so useless in practice. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm()
2026-05-20 16:52 ` Linus Torvalds
2026-05-20 16:55 ` Linus Torvalds
@ 2026-05-20 17:29 ` Jann Horn
2026-05-20 18:11 ` Linus Torvalds
1 sibling, 1 reply; 24+ messages in thread
From: Jann Horn @ 2026-05-20 17:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, Oleg Nesterov, David Hildenbrand (Arm),
Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 6:58 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote:
> On Wed, 20 May 2026 at 11:27, Jann Horn <jannh@google.com> wrote:
> > On Android, there is a process "zygote64" from which all apps are
> > forked without exec, and these apps are supposed to not interfere with
> > each other:
>
> Oh, yeah, I remember that horrid hack. I think it has come up before
> because it makes fork() very expensive when you have a huge process
> that you fork for every little thing...
>
> "Zygote: the GNU emacs of Android".

FWIW, while the Zygote concept is something I've only seen in Google
products (Android and Chrome), the idea that fork() children are
supposed to be less privileged also exists in other stuff; for
example, I believe ssh's "sshd-session" binary does privilege
separation between the root-privileged part and the rest that way.
Though I think dumpability is not relevant for its security.

> That said, I think it's mostly harmless, because:
>
> > So I think with this series applied, any app could use prctl() to
> > influence the dumpability of all other apps, which seems
> > undesirable...
>
> Right you are. How about just modifying the rule to saying that
> prctl(PR_SET_DUMPABLE) just always creates a new exec_state for the
> current thread?
>
> (Insert obvious optimizations: if the state doesn't change, don't
> split it, and if the current exec state has a refcount of 1, don't
> split it).
>
> It's still not the same thing as "always create a new one at fork",
> since there would be no "shared mm" component to it, and it would be
> thread-local. Hmm?

For this case it might work, since that... probably?... runs before
more threads get spawned.

There is also this:
https://cs.android.com/android/platform/superproject/+/android-latest-release:bionic/libc/bionic/android_profiling_dynamic.cpp;l=140;drc=2557f73c05f256db3ffa9ac9892b13e226b6ea4c

That's a signal handler for signal 36, which probably usually runs in
a multithreaded process, and it includes this gross hack:
```
// If the process is undumpable, /proc/self/mem will be owned by root:root, and therefore
// inaccessible to the process itself (see man 5 proc). We temporarily mark the process as
// dumpable to allow for the open. Note: prctl is not async signal safe per posix, but bionic's
// implementation is. Error checking on prctls is omitted due to them being trivial.
int orig_dumpable = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
if (!orig_dumpable) {
  prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);
}
ScopedFd maps_fd{ open("/proc/self/maps", O_RDONLY | O_CLOEXEC) };
ScopedFd mem_fd{ open("/proc/self/mem", O_RDONLY | O_CLOEXEC) };
if (!orig_dumpable) {
  prctl(PR_SET_DUMPABLE, orig_dumpable, 0, 0, 0);
}
```

I think that's a process-directed signal, so this might happen on some
random non-main thread, while /proc/self/ refers to the main thread...

* Re: [PATCH RFC v2 0/5] ptrace: keep mm metadata accessible past exit_mm() 2026-05-20 17:29 ` Jann Horn @ 2026-05-20 18:11 ` Linus Torvalds 0 siblings, 0 replies; 24+ messages in thread From: Linus Torvalds @ 2026-05-20 18:11 UTC (permalink / raw) To: Jann Horn Cc: Christian Brauner, Oleg Nesterov, David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko On Wed, 20 May 2026 at 12:29, Jann Horn <jannh@google.com> wrote: > > That's a signal handler for signal 36, which probably usually runs in > a multithreaded process, and it includes this gross hack: > ``` > // If the process is undumpable, /proc/self/mem will be owned by > root:root, and therefore > // inaccessible to the process itself (see man 5 proc). Well, to be fair to that "gross hack", I think that's an example of people working around dumpability being kind of a kernel hack, and then /proc using it oddly. I *think*, for example, that had we had this whole exec_state model, we (a) wouldn't have needed to make setreuid do dumpability games and (b) we could make task_dump_owner build the inode permissions from that exec state, instead of "use the current state, except make it root only for non-dumpable" and that would probably have then made user space not have to play odd games like this (and probably made a lot of people not play dumpability hacks in general). So yes, it's kind of a gross hack, but I suspect we forced people's hands here. But even without that, I do think we could probably always have added a 'permission()' callback that allowed self-introspection. So self-inflicted damage to some degree... Linus ^ permalink raw reply [flat|nested] 24+ messages in thread