Linux-mm Archive on lore.kernel.org
* [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata
@ 2026-05-20 21:48 Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable Christian Brauner (Amutable)
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
  To: Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
	Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Christian Brauner (Amutable)

This series relocates the dumpable mode and the user_namespace
captured at execve() from mm_struct onto a new per-task
task_exec_state structure that stays attached to the task for its
full lifetime.

__ptrace_may_access() and several /proc owner / visibility checks
need to consult two pieces of state for any observable task,
including zombies that have already gone through exit_mm(): the
dumpable mode and the user namespace captured at execve(). Both
live on mm_struct today, which exit_mm() clears from the task long
before the task is reaped.

A reader that races with do_exit() observes task->mm == NULL and
either fails the check or falls back to init_user_ns - which denies
legitimate access to non-dumpable zombies that were running in a
nested user namespace.

mm_struct loses ->user_ns and the dumpability bits in ->flags.
MMF_DUMPABLE_BITS stays reserved so that the MMF_DUMP_FILTER_* layout
exposed via /proc/<pid>/coredump_filter remains stable.
task->user_dumpable and its exit_mm() snapshot are removed.
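
As a rough illustration of why the two low bits stay reserved, the
userspace sketch below models how a coredump_filter value maps onto the
flags word. It assumes MMF_DUMP_FILTER_SHIFT equals MMF_DUMPABLE_BITS
and that the filter is 9 bits wide, as in the mm_types.h headers this
series is based on; filter_from_flags() and flags_with_filter() are
hypothetical helper names, not kernel functions.

```c
#include <assert.h>

/* Assumed to match include/linux/mm_types.h at the base of this series. */
#define MMF_DUMPABLE_BITS      2	/* reserved: formerly the dumpable mode */
#define MMF_DUMP_FILTER_SHIFT  MMF_DUMPABLE_BITS
#define MMF_DUMP_FILTER_BITS   9
#define MMF_DUMP_FILTER_MASK \
	(((1UL << MMF_DUMP_FILTER_BITS) - 1) << MMF_DUMP_FILTER_SHIFT)

/* Hypothetical helper: the value a read of coredump_filter would show. */
static unsigned long filter_from_flags(unsigned long mm_flags)
{
	return (mm_flags & MMF_DUMP_FILTER_MASK) >> MMF_DUMP_FILTER_SHIFT;
}

/* Hypothetical helper: store a filter value without touching other bits. */
static unsigned long flags_with_filter(unsigned long mm_flags,
				       unsigned long filter)
{
	unsigned long shifted = filter << MMF_DUMP_FILTER_SHIFT;

	/* The filter must fit inside its 9-bit field. */
	assert((shifted & ~MMF_DUMP_FILTER_MASK) == 0);
	return (mm_flags & ~MMF_DUMP_FILTER_MASK) | shifted;
}
```

If the reserved bits were reclaimed and the shift changed, every filter
bit would move, which is the layout change the reservation avoids.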

task_exec_state is the privilege domain established by an execve()
[1]. Within a thread group it is shared via refcount; across thread
groups each task has its own:

  - CLONE_VM siblings (thread-group members, io_uring workers)
    refcount-share the parent's exec_state.
  - Non-CLONE_VM clones (fork(), clone(CLONE_VFORK) without CLONE_VM)
    allocate a fresh exec_state that inherits the parent's dumpable
    mode and user_ns.
  - execve() in the child allocates a fresh instance and installs
    it under task_lock + exec_update_lock via
    task_exec_state_replace().
  - Credential changes (setresuid, capset, ...) and
    prctl(PR_SET_DUMPABLE) update dumpability on the current
    task's exec_state, i.e. on the thread group's shared instance.
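
The sharing rules in the list above can be sketched as a small
userspace model. This is not the kernel implementation: it has no RCU
and no locking, user_ns is reduced to an integer id, and every model_*
name is hypothetical. It only shows which tasks observe a dumpability
update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum model_dumpable { MODEL_OFF = 0, MODEL_OWNER = 1, MODEL_ROOT = 2 };

struct model_exec_state {
	int refs;			/* stand-in for refcount_t */
	enum model_dumpable dumpable;
	int user_ns_id;			/* stand-in for struct user_namespace * */
};

struct model_task {
	struct model_exec_state *es;
};

static struct model_exec_state *model_alloc(enum model_dumpable d, int ns)
{
	struct model_exec_state *es = malloc(sizeof(*es));

	if (!es)
		return NULL;
	es->refs = 1;
	es->dumpable = d;
	es->user_ns_id = ns;
	return es;
}

static void model_put(struct model_exec_state *es)
{
	if (es && --es->refs == 0)
		free(es);
}

/* clone(): CLONE_VM shares the instance, anything else copies it. */
static bool model_clone(const struct model_task *parent,
			struct model_task *child, bool clone_vm)
{
	if (clone_vm) {
		parent->es->refs++;
		child->es = parent->es;
		return true;
	}
	child->es = model_alloc(parent->es->dumpable, parent->es->user_ns_id);
	return child->es != NULL;
}

/* execve(): always install a fresh instance and drop the old one. */
static bool model_exec(struct model_task *t, enum model_dumpable d, int ns)
{
	struct model_exec_state *fresh = model_alloc(d, ns);

	if (!fresh)
		return false;
	model_put(t->es);
	t->es = fresh;
	return true;
}

/* prctl(PR_SET_DUMPABLE) or a credential change on the calling task. */
static void model_set_dumpable(struct model_task *t, enum model_dumpable d)
{
	assert(d <= MODEL_ROOT);
	t->es->dumpable = d;
}
```

In this model an update made through the parent is visible to a
CLONE_VM sibling but not to a forked child, which is the isolation the
v3 changelog restores.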

Behavioral change:

Kernel threads that briefly use a user mm via kthread_use_mm() no
longer inherit dumpability from the borrowed mm. Kthreads are not
ptraceable (PF_KTHREAD short-circuits __ptrace_may_access), so this
is observable only via /proc surfaces that a sufficiently privileged
reader can reach.

[1] https://lore.kernel.org/r/CAHk-=wj+NgoDH3GSicJ140SV8OoDd71pLmL3fgFEsTcgoMC6Og@mail.gmail.com

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Changes in v3:
- Restore alloc-fresh-and-inherit semantics for non-CLONE_VM clones.
  CLONE_VM siblings still refcount-share; fork() and other
  non-CLONE_VM clones get a fresh exec_state that inherits the
  parent's dumpable mode and user_ns. The v2 "every clone
  refcount-shares" model would have let any forked process in an
  Android zygote64 subtree influence dumpability of its siblings
  via prctl(PR_SET_DUMPABLE).
- Link to v2: https://patch.msgid.link/20260520-work-task_exec_state-v2-0-9ea88ceb09e6@kernel.org

Changes in v2:
- Drop dup-on-fork for non-CLONE_VM clones: every clone() variant
  refcount-shares the parent's task_exec_state; only execve()
  allocates a fresh one.  See "Behavioral changes" in the cover
  letter for the implications.
- Switch commit_creds() to update dumpability on the new
  task_exec_state (instead of dropping the set_dumpable() call
  entirely as in v1).  Drops the explicit smp_wmb()/smp_rmb() pair:
  RCU acquire/release on the cred pointer provides the ordering.
- Link to v1: https://patch.msgid.link/20260516-work-exit_mm-v1-1-76bcc7c2439d@kernel.org

---
Christian Brauner (Amutable) (4):
      sched/coredump: introduce enum task_dumpable
      exec: introduce struct task_exec_state
      ptrace: add ptracer_access_allowed()
      exec_state: relocate dumpable information

 arch/arm64/kernel/mte.c          |   6 +-
 drivers/firmware/efi/efi.c       |   1 -
 fs/coredump.c                    |  22 +++-----
 fs/exec.c                        |  39 ++++++-------
 fs/pidfs.c                       |  23 +++-----
 fs/proc/base.c                   |  39 ++++++-------
 include/linux/binfmts.h          |   2 +
 include/linux/coredump.h         |   4 ++
 include/linux/mm_types.h         |   9 ++-
 include/linux/ptrace.h           |   1 +
 include/linux/sched.h            |   6 +-
 include/linux/sched/coredump.h   |  47 ++++------------
 include/linux/sched/exec_state.h |  29 ++++++++++
 init/init_task.c                 |  10 ++++
 kernel/Makefile                  |   2 +-
 kernel/cred.c                    |   3 +-
 kernel/exec_state.c              | 116 +++++++++++++++++++++++++++++++++++++++
 kernel/exit.c                    |   1 -
 kernel/fork.c                    |  32 +++++++++--
 kernel/kthread.c                 |   1 -
 kernel/ptrace.c                  |  53 ++++++++++++------
 kernel/sys.c                     |   6 +-
 mm/init-mm.c                     |   1 -
 23 files changed, 301 insertions(+), 152 deletions(-)
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260520-work-task_exec_state-83209d8b3e53




* [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable
  2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
@ 2026-05-20 21:48 ` Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 2/4] exec: introduce struct task_exec_state Christian Brauner (Amutable)
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
  To: Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
	Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Christian Brauner (Amutable)

Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
enum task_dumpable.  Numeric values are preserved (kernel.suid_dumpable
sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
no behavioral change.

Subsequent commits relocate dumpability onto a per-task structure
where the enum type will allow stronger type-checking on the new API.
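
A minimal userspace sketch of the "pure rename" claim: the enum values
are checked at compile time against the old SUID_DUMP_* numbers, and
the PR_SET_DUMPABLE argument check from kernel/sys.c in this patch is
modeled as a plain function. model_prctl_set_dumpable() is a
hypothetical name used only for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Legacy constants removed by this patch. */
#define SUID_DUMP_DISABLE	0
#define SUID_DUMP_USER		1
#define SUID_DUMP_ROOT		2

enum task_dumpable {
	TASK_DUMPABLE_OFF	= 0,
	TASK_DUMPABLE_OWNER	= 1,
	TASK_DUMPABLE_ROOT	= 2,
};

/* The numeric values are ABI, so the rename must not change them. */
_Static_assert(TASK_DUMPABLE_OFF == SUID_DUMP_DISABLE, "ABI value changed");
_Static_assert(TASK_DUMPABLE_OWNER == SUID_DUMP_USER, "ABI value changed");
_Static_assert(TASK_DUMPABLE_ROOT == SUID_DUMP_ROOT, "ABI value changed");

/*
 * Model of the PR_SET_DUMPABLE check: userspace may select OFF or
 * OWNER only; ROOT is rejected with -EINVAL.
 */
static int model_prctl_set_dumpable(unsigned long arg2,
				    enum task_dumpable *slot)
{
	assert(slot);
	if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER)
		return -EINVAL;
	*slot = (enum task_dumpable)arg2;
	return 0;
}
```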

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 arch/arm64/kernel/mte.c        |  2 +-
 fs/coredump.c                  |  4 ++--
 fs/exec.c                      |  8 ++++----
 fs/pidfs.c                     |  6 +++---
 fs/proc/base.c                 |  2 +-
 include/linux/mm_types.h       |  2 +-
 include/linux/sched/coredump.h | 15 +++++++++++----
 kernel/exit.c                  |  2 +-
 kernel/ptrace.c                |  4 ++--
 kernel/sys.c                   |  2 +-
 10 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 6874b16d0657..904ac41f93bc 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -538,7 +538,7 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr,
 		return -EPERM;
 
 	if (!tsk->ptrace || (current != tsk->parent) ||
-	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
 	     !ptracer_capable(tsk, mm->user_ns))) {
 		mmput(mm);
 		return -EPERM;
diff --git a/fs/coredump.c b/fs/coredump.c
index bb6fdb1f458e..f5348d5bc441 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -873,7 +873,7 @@ static inline bool coredump_socket(struct core_name *cn, struct coredump_params
 static inline bool coredump_force_suid_safe(const struct coredump_params *cprm)
 {
 	/* Require nonrelative corefile path and be extra careful. */
-	return __get_dumpable(cprm->mm_flags) == SUID_DUMP_ROOT;
+	return __get_dumpable(cprm->mm_flags) == TASK_DUMPABLE_ROOT;
 }
 
 static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
@@ -1419,7 +1419,7 @@ EXPORT_SYMBOL(dump_align);
 
 void validate_coredump_safety(void)
 {
-	if (suid_dumpable == SUID_DUMP_ROOT &&
+	if (suid_dumpable == TASK_DUMPABLE_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
 
 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
diff --git a/fs/exec.c b/fs/exec.c
index ba12b4c466f6..f5663bb607d3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1212,7 +1212,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	      gid_eq(current_egid(), current_gid())))
 		set_dumpable(current->mm, suid_dumpable);
 	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
+		set_dumpable(current->mm, TASK_DUMPABLE_OWNER);
 
 	perf_event_exec();
 
@@ -1261,7 +1261,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * wait until new credentials are committed
 	 * by commit_creds() above
 	 */
-	if (get_dumpable(me->mm) != SUID_DUMP_USER)
+	if (get_dumpable(me->mm) != TASK_DUMPABLE_OWNER)
 		perf_event_exit_task(me);
 	/*
 	 * cred_guard_mutex must be held at least to this point to prevent
@@ -1906,11 +1906,11 @@ void set_binfmt(struct linux_binfmt *new)
 EXPORT_SYMBOL(set_binfmt);
 
 /*
- * set_dumpable stores three-value SUID_DUMP_* into mm->flags.
+ * set_dumpable stores three-value TASK_DUMPABLE_* into mm->flags.
  */
 void set_dumpable(struct mm_struct *mm, int value)
 {
-	if (WARN_ON((unsigned)value > SUID_DUMP_ROOT))
+	if (WARN_ON((unsigned)value > TASK_DUMPABLE_ROOT))
 		return;
 
 	__mm_flags_set_mask_dumpable(mm, value);
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 1cce4f34a051..9cd12f2f004c 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -341,11 +341,11 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 static __u32 pidfs_coredump_mask(unsigned long mm_flags)
 {
 	switch (__get_dumpable(mm_flags)) {
-	case SUID_DUMP_USER:
+	case TASK_DUMPABLE_OWNER:
 		return PIDFD_COREDUMP_USER;
-	case SUID_DUMP_ROOT:
+	case TASK_DUMPABLE_ROOT:
 		return PIDFD_COREDUMP_ROOT;
-	case SUID_DUMP_DISABLE:
+	case TASK_DUMPABLE_OFF:
 		return PIDFD_COREDUMP_SKIP;
 	default:
 		WARN_ON_ONCE(true);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c894..da0b316befb8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1909,7 +1909,7 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 		mm = task->mm;
 		/* Make non-dumpable tasks owned by some root */
 		if (mm) {
-			if (get_dumpable(mm) != SUID_DUMP_USER) {
+			if (get_dumpable(mm) != TASK_DUMPABLE_OWNER) {
 				struct user_namespace *user_ns = mm->user_ns;
 
 				uid = make_kuid(user_ns, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..51ea37b2a0aa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1908,7 +1908,7 @@ enum {
 
 /*
  * The first two bits represent core dump modes for set-user-ID,
- * the modes are SUID_DUMP_* defined in linux/sched/coredump.h
+ * the modes are TASK_DUMPABLE_* defined in linux/sched/coredump.h
  */
 #define MMF_DUMPABLE_BITS 2
 #define MMF_DUMPABLE_MASK (BIT(MMF_DUMPABLE_BITS) - 1)
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 624fda17a785..ed6547692b61 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -4,9 +4,16 @@
 
 #include <linux/mm_types.h>
 
-#define SUID_DUMP_DISABLE	0	/* No setuid dumping */
-#define SUID_DUMP_USER		1	/* Dump as user of process */
-#define SUID_DUMP_ROOT		2	/* Dump as root */
+/*
+ * Task dumpability mode.  Gates core dump production and ptrace_attach()
+ * authorization.  The numeric values are stable ABI (suid_dumpable
+ * sysctl, prctl(PR_SET_DUMPABLE)); do not renumber.
+ */
+enum task_dumpable {
+	TASK_DUMPABLE_OFF	= 0,	/* no dump; ptrace needs CAP_SYS_PTRACE */
+	TASK_DUMPABLE_OWNER	= 1,	/* default; dump and ptrace by uid match */
+	TASK_DUMPABLE_ROOT	= 2,	/* dump as root; ptrace needs CAP_SYS_PTRACE */
+};
 
 static inline unsigned long __mm_flags_get_dumpable(const struct mm_struct *mm)
 {
@@ -26,7 +33,7 @@ extern void set_dumpable(struct mm_struct *mm, int value);
 /*
  * This returns the actual value of the suid_dumpable flag. For things
  * that are using this for checking for privilege transitions, it must
- * test against SUID_DUMP_USER rather than treating it as a boolean
+ * test against TASK_DUMPABLE_OWNER rather than treating it as a boolean
  * value.
  */
 static inline int __get_dumpable(unsigned long mm_flags)
diff --git a/kernel/exit.c b/kernel/exit.c
index f50d73c272d6..507eda655e8d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -571,7 +571,7 @@ static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
-	current->user_dumpable = (get_dumpable(mm) == SUID_DUMP_USER);
+	current->user_dumpable = (get_dumpable(mm) == TASK_DUMPABLE_OWNER);
 	current->mm = NULL;
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 130043bfc209..07398c9c8fe3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -53,7 +53,7 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 
 	if (!tsk->ptrace ||
 	    (current != tsk->parent) ||
-	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
 	     !ptracer_capable(tsk, mm->user_ns))) {
 		mmput(mm);
 		return 0;
@@ -276,7 +276,7 @@ static bool task_still_dumpable(struct task_struct *task, unsigned int mode)
 {
 	struct mm_struct *mm = task->mm;
 	if (mm) {
-		if (get_dumpable(mm) == SUID_DUMP_USER)
+		if (get_dumpable(mm) == TASK_DUMPABLE_OWNER)
 			return true;
 		return ptrace_has_cap(mm->user_ns, mode);
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..f1189f719db5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2568,7 +2568,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = get_dumpable(me->mm);
 		break;
 	case PR_SET_DUMPABLE:
-		if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
+		if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER) {
 			error = -EINVAL;
 			break;
 		}

-- 
2.47.3




* [PATCH RFC v3 2/4] exec: introduce struct task_exec_state
  2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable Christian Brauner (Amutable)
@ 2026-05-20 21:48 ` Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed() Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
  3 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
  To: Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
	Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Christian Brauner (Amutable)

Introduce struct task_exec_state, a per-task RCU-protected structure
that holds the dumpable mode and a user namespace reference, and stays
attached to the task for its full lifetime.

task_exec_state_rcu() is the canonical reader: it asserts that RCU or
task_lock is held, WARNs on a NULL state, and returns the
rcu_dereference()'d pointer.

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/sched.h            |   2 +
 include/linux/sched/exec_state.h |  31 +++++++++++
 kernel/Makefile                  |   2 +-
 kernel/exec_state.c              | 116 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..6674dbf960b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -962,6 +962,8 @@ struct task_struct {
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
 
+	struct task_exec_state __rcu	*exec_state;
+
 	int				exit_state;
 	int				exit_code;
 	int				exit_signal;
diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
new file mode 100644
index 000000000000..dc5a795cbfe2
--- /dev/null
+++ b/include/linux/sched/exec_state.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#ifndef _LINUX_SCHED_EXEC_STATE_H
+#define _LINUX_SCHED_EXEC_STATE_H
+
+#include <linux/init.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/sched/coredump.h>
+#include <linux/user_namespace.h>
+
+struct task_exec_state {
+	refcount_t		count;
+	enum task_dumpable	dumpable;
+	struct user_namespace	*user_ns;
+	struct rcu_head		rcu;
+};
+
+struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns);
+void put_task_exec_state(struct task_exec_state *exec_state);
+struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
+struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
+						struct task_exec_state *exec_state);
+void task_exec_state_set_dumpable(enum task_dumpable value);
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
+int task_exec_state_copy(struct task_struct *tsk);
+void __init exec_state_init(void);
+
+DEFINE_FREE(put_task_exec_state, struct task_exec_state *, put_task_exec_state(_T))
+
+#endif /* _LINUX_SCHED_EXEC_STATE_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..1e1a31673577 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -3,7 +3,7 @@
 # Makefile for the linux kernel.
 #
 
-obj-y     = fork.o exec_domain.o panic.o \
+obj-y     = fork.o exec_domain.o exec_state.o panic.o \
 	    cpu.o exit.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
diff --git a/kernel/exec_state.c b/kernel/exec_state.c
new file mode 100644
index 000000000000..a0ca5d913900
--- /dev/null
+++ b/kernel/exec_state.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#include <linux/init.h>
+#include <linux/rcupdate.h>
+#include <linux/refcount.h>
+#include <linux/sched.h>
+#include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+
+static struct kmem_cache *task_exec_state_cachep;
+
+static void __free_task_exec_state(struct rcu_head *rcu)
+{
+	struct task_exec_state *exec_state = container_of(rcu, struct task_exec_state, rcu);
+
+	put_user_ns(exec_state->user_ns);
+	kmem_cache_free(task_exec_state_cachep, exec_state);
+}
+
+void put_task_exec_state(struct task_exec_state *exec_state)
+{
+	if (exec_state && refcount_dec_and_test(&exec_state->count))
+		call_rcu(&exec_state->rcu, __free_task_exec_state);
+}
+
+struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns)
+{
+	struct task_exec_state *exec_state;
+
+	exec_state = kmem_cache_alloc(task_exec_state_cachep, GFP_KERNEL);
+	if (!exec_state)
+		return NULL;
+	refcount_set(&exec_state->count, 1);
+	exec_state->dumpable = TASK_DUMPABLE_OFF;
+	exec_state->user_ns = get_user_ns(user_ns);
+	return exec_state;
+}
+
+struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk)
+{
+	struct task_exec_state *exec_state;
+
+	exec_state = rcu_dereference_check(tsk->exec_state,
+					   lockdep_is_held(&tsk->alloc_lock));
+	WARN_ON_ONCE(!exec_state);
+	return exec_state;
+}
+
+struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
+						struct task_exec_state *exec_state)
+{
+	/*
+	 * Updates must hold both locks so callers needing a consistent
+	 * snapshot of mm + dumpability are covered.
+	 */
+	lockdep_assert_held(&tsk->alloc_lock);
+	lockdep_assert_held_write(&tsk->signal->exec_update_lock);
+
+	return rcu_replace_pointer(tsk->exec_state, exec_state, true);
+}
+
+/*
+ * The non-CLONE_VM clone path: allocate a fresh exec_state and
+ * inherit the parent's dumpable mode and user_ns reference.  CLONE_VM
+ * siblings refcount-share via copy_exec_state() in fork.c; only this
+ * path and execve() ever allocate.
+ */
+int task_exec_state_copy(struct task_struct *tsk)
+{
+	struct task_exec_state *src, *dst;
+
+	src = rcu_dereference_protected(current->exec_state, true);
+	dst = alloc_task_exec_state(src->user_ns);
+	if (!dst)
+		return -ENOMEM;
+	dst->dumpable = src->dumpable;
+	rcu_assign_pointer(tsk->exec_state, dst);
+	return 0;
+}
+
+/*
+ * Store TASK_DUMPABLE_* on current->exec_state.  All callers
+ * (commit_creds, begin_new_exec, prctl(PR_SET_DUMPABLE)) act on the
+ * running task, which guarantees ->exec_state is allocated and cannot
+ * be replaced under us.
+ */
+void task_exec_state_set_dumpable(enum task_dumpable value)
+{
+	struct task_exec_state *exec_state;
+
+	if (WARN_ON(value > TASK_DUMPABLE_ROOT))
+		value = TASK_DUMPABLE_OFF;
+
+	exec_state = rcu_dereference_protected(current->exec_state, true);
+	WRITE_ONCE(exec_state->dumpable, value);
+}
+
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task)
+{
+	struct task_exec_state *exec_state;
+
+	guard(rcu)();
+	exec_state = rcu_dereference(task->exec_state);
+	return READ_ONCE(exec_state->dumpable);
+}
+
+void __init exec_state_init(void)
+{
+	task_exec_state_cachep = kmem_cache_create("task_exec_state",
+			sizeof(struct task_exec_state), 0,
+			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT,
+			NULL);
+}

-- 
2.47.3




* [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed()
  2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 1/4] sched/coredump: introduce enum task_dumpable Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 2/4] exec: introduce struct task_exec_state Christian Brauner (Amutable)
@ 2026-05-20 21:48 ` Christian Brauner (Amutable)
  2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
  3 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
  To: Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
	Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Christian Brauner (Amutable)

Add a helper that encapsulates the logic for the per-access ptrace
permission check.  Follow-up patches replace the open-coded versions
with it.
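
The decision the helper makes, as described by its kernel-doc comment
in this patch, can be modeled in userspace roughly as follows. The
pa_* names are hypothetical, and the capability check is reduced to a
precomputed boolean standing in for ptracer_capable(tsk, es->user_ns).

```c
#include <assert.h>
#include <stdbool.h>

enum pa_dumpable {
	PA_DUMPABLE_OFF		= 0,
	PA_DUMPABLE_OWNER	= 1,
	PA_DUMPABLE_ROOT	= 2,
};

struct pa_tracee {
	bool ptraced;			/* models tsk->ptrace != 0 */
	bool parent_is_caller;		/* models current == tsk->parent */
	enum pa_dumpable dumpable;	/* models exec_state->dumpable */
	bool caller_has_cap;		/* models ptracer_capable() result */
};

/* True iff the caller is the registered ptracer and passes the gate. */
static bool pa_access_allowed(const struct pa_tracee *t)
{
	assert(t);
	if (!t->ptraced)
		return false;
	if (!t->parent_is_caller)
		return false;
	/* Owner-dumpable tracees need no capability; all others do. */
	return t->dumpable == PA_DUMPABLE_OWNER || t->caller_has_cap;
}
```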

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/ptrace.h |  1 +
 kernel/ptrace.c        | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 90507d4afcd6..ef314f7a9ecc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -17,6 +17,7 @@ struct syscall_info {
 	struct seccomp_data	data;
 };
 
+bool ptracer_access_allowed(struct task_struct *tsk);
 extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 			    void *buf, int len, unsigned int gup_flags);
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 07398c9c8fe3..0e1f80f73a7f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/task.h>
 #include <linux/errno.h>
 #include <linux/mm.h>
@@ -36,6 +37,32 @@
 
 #include <asm/syscall.h>	/* for syscall_get_* */
 
+/**
+ * ptracer_access_allowed - may current peek/poke @tsk's address space?
+ * @tsk: tracee
+ *
+ * Per-access check used by ptrace_access_vm() and architecture-specific
+ * tag/register accessors.  Returns true iff current is the registered
+ * ptracer of @tsk and either @tsk is owner-dumpable or current holds
+ * CAP_SYS_PTRACE in @tsk's exec namespace.  Lighter than
+ * __ptrace_may_access(): it re-validates only dumpability and
+ * capability on every access, without re-running LSM hooks or
+ * cred_cap_issubset() checks performed at attach time.
+ */
+bool ptracer_access_allowed(struct task_struct *tsk)
+{
+	const struct task_exec_state *es;
+
+	if (!tsk->ptrace)
+		return false;
+	if (current != tsk->parent)
+		return false;
+	guard(rcu)();
+	es = task_exec_state_rcu(tsk);
+	return READ_ONCE(es->dumpable) == TASK_DUMPABLE_OWNER ||
+	       ptracer_capable(tsk, es->user_ns);
+}
+
 /*
  * Access another process' address space via ptrace.
  * Source/target buffer must be kernel space,

-- 
2.47.3




* [PATCH RFC v3 4/4] exec_state: relocate dumpable information
  2026-05-20 21:48 [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata Christian Brauner (Amutable)
                   ` (2 preceding siblings ...)
  2026-05-20 21:48 ` [PATCH RFC v3 3/4] ptrace: add ptracer_access_allowed() Christian Brauner (Amutable)
@ 2026-05-20 21:48 ` Christian Brauner (Amutable)
  2026-05-21 10:05   ` Christian Brauner
  2026-05-21 11:16   ` Jann Horn
  3 siblings, 2 replies; 8+ messages in thread
From: Christian Brauner (Amutable) @ 2026-05-20 21:48 UTC
  To: Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
	Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Christian Brauner (Amutable)

The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks.
It lives on mm_struct today, which exit_mm() clears from the task
long before the task itself is reaped.

exec_state is anchored to the execve() that established the current
privilege domain.  CLONE_VM siblings refcount-share the parent's
exec_state via copy_exec_state(); non-CLONE_VM clones allocate a
fresh exec_state inheriting the parent's dumpable mode and user_ns
reference via task_exec_state_copy().  execve() allocates a fresh
instance (via alloc_task_exec_state() in begin_new_exec()) and
installs it under task_lock + exec_update_lock with
task_exec_state_replace().  init_task uses a static instance.

The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc/<pid>/coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.

coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.
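
A small userspace model of the snapshot idea: the mode is copied once
at dump start, and later checks consult only the copy. The cd_* names
are hypothetical; cd_skip() models only the dumpable part of
coredump_skip(), and cd_force_suid_safe() mirrors
coredump_force_suid_safe() from the diff.

```c
#include <assert.h>
#include <stdbool.h>

enum cd_dumpable {
	CD_DUMPABLE_OFF		= 0,
	CD_DUMPABLE_OWNER	= 1,
	CD_DUMPABLE_ROOT	= 2,
};

struct cd_params {
	enum cd_dumpable dumpable;	/* models cprm->dumpable */
};

/* Copy the live mode once; the dump never re-reads the live value. */
static struct cd_params cd_snapshot(const enum cd_dumpable *live)
{
	assert(live);
	return (struct cd_params){ .dumpable = *live };
}

/* No dump at all when the snapshot says dumping is off. */
static bool cd_skip(const struct cd_params *p)
{
	return p->dumpable == CD_DUMPABLE_OFF;
}

/* Dumping as root requires the extra-careful corefile path handling. */
static bool cd_force_suid_safe(const struct cd_params *p)
{
	return p->dumpable == CD_DUMPABLE_ROOT;
}
```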

The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().

bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.
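
The narrowing step can be pictured with the userspace sketch below. It
assumes the walk matches what would_dump() does today for an unreadable
binary: climb the parent chain until a namespace is privileged over the
file, stopping at the initial namespace at the latest. The ns_* names
are hypothetical, reference counting is omitted, and the
privileged_over_file flag stands in for privileged_wrt_inode_uidgid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ns_model {
	const struct ns_model *parent;	/* NULL for the initial namespace */
	bool privileged_over_file;	/* models privileged_wrt_inode_uidgid() */
};

/* Return the closest namespace, starting at @ns, privileged over the file. */
static const struct ns_model *ns_narrow(const struct ns_model *ns)
{
	assert(ns);
	/* The initial namespace terminates the walk unconditionally. */
	while (ns->parent && !ns->privileged_over_file)
		ns = ns->parent;
	return ns;
}
```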

mm_struct loses ->user_ns entirely.  Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 arch/arm64/kernel/mte.c          |  6 ++----
 drivers/firmware/efi/efi.c       |  1 -
 fs/coredump.c                    | 20 +++++++-------------
 fs/exec.c                        | 39 ++++++++++++++++++++-------------------
 fs/pidfs.c                       | 17 ++++++-----------
 fs/proc/base.c                   | 39 ++++++++++++++++-----------------------
 include/linux/binfmts.h          |  2 ++
 include/linux/coredump.h         |  4 ++++
 include/linux/mm_types.h         |  9 ++++-----
 include/linux/sched.h            |  4 +---
 include/linux/sched/coredump.h   | 36 ++----------------------------------
 include/linux/sched/exec_state.h |  2 --
 init/init_task.c                 | 10 ++++++++++
 kernel/cred.c                    |  3 +--
 kernel/exit.c                    |  1 -
 kernel/fork.c                    | 32 ++++++++++++++++++++++++++------
 kernel/kthread.c                 |  1 -
 kernel/ptrace.c                  | 26 ++++++++------------------
 kernel/sys.c                     |  4 ++--
 mm/init-mm.c                     |  1 -
 20 files changed, 111 insertions(+), 146 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 904ac41f93bc..1a9aad6ef22a 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/prctl.h>
+#include <linux/ptrace.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/string.h>
@@ -537,16 +538,13 @@ static int access_remote_tags(struct task_struct *tsk, unsigned long addr,
 	if (!mm)
 		return -EPERM;
 
-	if (!tsk->ptrace || (current != tsk->parent) ||
-	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
-	     !ptracer_capable(tsk, mm->user_ns))) {
+	if (!ptracer_access_allowed(tsk)) {
 		mmput(mm);
 		return -EPERM;
 	}
 
 	ret = __access_remote_tags(mm, addr, kiov, gup_flags);
 	mmput(mm);
-
 	return ret;
 }
 
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d04be38f1750..ae78bc021b41 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -73,7 +73,6 @@ struct mm_struct efi_mm = {
 	MMAP_LOCK_INITIALIZER(efi_mm)
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
-	.user_ns		= &init_user_ns,
 #ifdef CONFIG_SCHED_MM_CID
 	.mm_cid.lock		= __RAW_SPIN_LOCK_UNLOCKED(efi_mm.mm_cid.lock),
 #endif
diff --git a/fs/coredump.c b/fs/coredump.c
index f5348d5bc441..e943569e9b6d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -395,8 +395,7 @@ static bool coredump_parse(struct core_name *cn, struct coredump_params *cprm,
 							  cred->gid));
 				break;
 			case 'd':
-				err = cn_printf(cn, "%d",
-					__get_dumpable(cprm->mm_flags));
+				err = cn_printf(cn, "%d", cprm->dumpable);
 				break;
 			/* signal that caused the coredump */
 			case 's':
@@ -869,11 +868,11 @@ static inline void coredump_sock_shutdown(struct file *file) { }
 static inline bool coredump_socket(struct core_name *cn, struct coredump_params *cprm) { return false; }
 #endif
 
-/* cprm->mm_flags contains a stable snapshot of dumpability flags. */
+/* cprm->dumpable is the snapshot of task dumpability at dump start. */
 static inline bool coredump_force_suid_safe(const struct coredump_params *cprm)
 {
 	/* Require nonrelative corefile path and be extra careful. */
-	return __get_dumpable(cprm->mm_flags) == TASK_DUMPABLE_ROOT;
+	return cprm->dumpable == TASK_DUMPABLE_ROOT;
 }
 
 static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
@@ -1085,7 +1084,7 @@ static inline bool coredump_skip(const struct coredump_params *cprm,
 		return true;
 	if (!binfmt->core_dump)
 		return true;
-	if (!__get_dumpable(cprm->mm_flags))
+	if (cprm->dumpable == TASK_DUMPABLE_OFF)
 		return true;
 	return false;
 }
@@ -1170,14 +1169,9 @@ void vfs_coredump(const kernel_siginfo_t *siginfo)
 	struct coredump_params cprm = {
 		.siginfo = siginfo,
 		.limit = rlimit(RLIMIT_CORE),
-		/*
-		 * We must use the same mm->flags while dumping core to avoid
-		 * inconsistency of bit flags, since this flag is not protected
-		 * by any locks.
-		 *
-		 * Note that we only care about MMF_DUMP* flags.
-		 */
-		.mm_flags = __mm_flags_get_dumpable(mm),
+		/* Snapshot MMF_DUMP_FILTER_* (unlocked) and dumpable for the dump. */
+		.mm_flags = __mm_flags_get_word(mm),
+		.dumpable = task_exec_state_get_dumpable(current),
 		.vma_meta = NULL,
 		.cpu = raw_smp_processor_id(),
 	};
diff --git a/fs/exec.c b/fs/exec.c
index f5663bb607d3..9e7f25e2cd41 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -35,6 +35,7 @@
 #include <linux/init.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -263,6 +264,9 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	if (!mm)
 		goto err;
 
+	/* Staged for would_dump() narrowing; consumed by begin_new_exec(). */
+	bprm->user_ns = get_user_ns(current_user_ns());
+
 	/* Save current stack limit for all calculations made during exec. */
 	task_lock(current->group_leader);
 	bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
@@ -834,12 +838,17 @@ EXPORT_SYMBOL(read_code);
  * On success, this function returns with exec_update_lock
  * held for writing.
  */
-static int exec_mmap(struct mm_struct *mm)
+static int exec_mmap(struct mm_struct *mm, struct user_namespace *user_ns)
 {
+	struct task_exec_state *exec_state __free(put_task_exec_state) = NULL;
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
 	int ret;
 
+	exec_state = alloc_task_exec_state(user_ns);
+	if (!exec_state)
+		return -ENOMEM;
+
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
@@ -870,6 +879,7 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->active_mm = mm;
 	tsk->mm = mm;
 	mm_init_cid(mm, tsk);
+	exec_state = task_exec_state_replace(tsk, exec_state);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1145,7 +1155,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * Release all of the old mmap stuff
 	 */
 	acct_arg_size(bprm, 0);
-	retval = exec_mmap(bprm->mm);
+	retval = exec_mmap(bprm->mm, bprm->user_ns);
 	if (retval)
 		goto out;
 
@@ -1210,9 +1220,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
 	    !(uid_eq(current_euid(), current_uid()) &&
 	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
+		task_exec_state_set_dumpable(suid_dumpable);
 	else
-		set_dumpable(current->mm, TASK_DUMPABLE_OWNER);
+		task_exec_state_set_dumpable(TASK_DUMPABLE_OWNER);
 
 	perf_event_exec();
 
@@ -1261,7 +1271,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * wait until new credentials are committed
 	 * by commit_creds() above
 	 */
-	if (get_dumpable(me->mm) != TASK_DUMPABLE_OWNER)
+	if (task_exec_state_get_dumpable(me) != TASK_DUMPABLE_OWNER)
 		perf_event_exit_task(me);
 	/*
 	 * cred_guard_mutex must be held at least to this point to prevent
@@ -1298,14 +1308,14 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 		struct user_namespace *old, *user_ns;
 		bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
 
-		/* Ensure mm->user_ns contains the executable */
-		user_ns = old = bprm->mm->user_ns;
+		/* Ensure bprm->user_ns contains the executable. */
+		user_ns = old = bprm->user_ns;
 		while ((user_ns != &init_user_ns) &&
 		       !privileged_wrt_inode_uidgid(user_ns, idmap, inode))
 			user_ns = user_ns->parent;
 
 		if (old != user_ns) {
-			bprm->mm->user_ns = get_user_ns(user_ns);
+			bprm->user_ns = get_user_ns(user_ns);
 			put_user_ns(old);
 		}
 	}
@@ -1375,6 +1385,8 @@ static void free_bprm(struct linux_binprm *bprm)
 		acct_arg_size(bprm, 0);
 		mmput(bprm->mm);
 	}
+	if (bprm->user_ns)
+		put_user_ns(bprm->user_ns);
 	free_arg_pages(bprm);
 	if (bprm->cred) {
 		/* in case exec fails before de_thread() succeeds */
@@ -1905,17 +1917,6 @@ void set_binfmt(struct linux_binfmt *new)
 }
 EXPORT_SYMBOL(set_binfmt);
 
-/*
- * set_dumpable stores three-value TASK_DUMPABLE_* into mm->flags.
- */
-void set_dumpable(struct mm_struct *mm, int value)
-{
-	if (WARN_ON((unsigned)value > TASK_DUMPABLE_ROOT))
-		return;
-
-	__mm_flags_set_mask_dumpable(mm, value);
-}
-
 static inline struct user_arg_ptr native_arg(const char __user *const __user *p)
 {
 	return (struct user_arg_ptr){.ptr.native = p};
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 9cd12f2f004c..b2ff950a096e 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -338,9 +338,9 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 	return false;
 }
 
-static __u32 pidfs_coredump_mask(unsigned long mm_flags)
+static __u32 pidfs_coredump_mask(enum task_dumpable dumpable)
 {
-	switch (__get_dumpable(mm_flags)) {
+	switch (dumpable) {
 	case TASK_DUMPABLE_OWNER:
 		return PIDFD_COREDUMP_USER;
 	case TASK_DUMPABLE_ROOT:
@@ -433,14 +433,9 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 		return -ESRCH;
 
 	if ((mask & PIDFD_INFO_COREDUMP) && !kinfo.coredump_mask) {
-		guard(task_lock)(task);
-		if (task->mm) {
-			unsigned long flags = __mm_flags_get_dumpable(task->mm);
-
-			kinfo.coredump_mask = pidfs_coredump_mask(flags);
-			kinfo.mask |= PIDFD_INFO_COREDUMP;
-			/* No coredump actually took place, so no coredump signal. */
-		}
+		kinfo.coredump_mask = pidfs_coredump_mask(task_exec_state_get_dumpable(task));
+		kinfo.mask |= PIDFD_INFO_COREDUMP;
+		/* No coredump actually took place, so no coredump signal. */
 	}
 
 	/* Unconditionally return identifiers and credentials, the rest only on request */
@@ -779,7 +774,7 @@ void pidfs_coredump(const struct coredump_params *cprm)
 	VFS_WARN_ON_ONCE(attr == PIDFS_PID_DEAD);
 
 	/* Note how we were coredumped and that we coredumped. */
-	attr->coredump_mask = pidfs_coredump_mask(cprm->mm_flags) |
+	attr->coredump_mask = pidfs_coredump_mask(cprm->dumpable) |
 			      PIDFD_COREDUMPED;
 	/* If coredumping is set to skip we should never end up here. */
 	VFS_WARN_ON_ONCE(attr->coredump_mask & PIDFD_COREDUMP_SKIP);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index da0b316befb8..65f56136ec3f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -91,6 +91,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/exec_state.h>
 #include <linux/sched/stat.h>
 #include <linux/posix-timers.h>
 #include <linux/time_namespace.h>
@@ -1893,7 +1894,6 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	cred = __task_cred(task);
 	uid = cred->euid;
 	gid = cred->egid;
-	rcu_read_unlock();
 
 	/*
 	 * Before the /proc/pid/status file was created the only way to read
@@ -1903,29 +1903,22 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	 * made this apply to all per process world readable and executable
 	 * directories.
 	 */
-	if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) {
-		struct mm_struct *mm;
-		task_lock(task);
-		mm = task->mm;
-		/* Make non-dumpable tasks owned by some root */
-		if (mm) {
-			if (get_dumpable(mm) != TASK_DUMPABLE_OWNER) {
-				struct user_namespace *user_ns = mm->user_ns;
-
-				uid = make_kuid(user_ns, 0);
-				if (!uid_valid(uid))
-					uid = GLOBAL_ROOT_UID;
-
-				gid = make_kgid(user_ns, 0);
-				if (!gid_valid(gid))
-					gid = GLOBAL_ROOT_GID;
-			}
-		} else {
-			uid = GLOBAL_ROOT_UID;
-			gid = GLOBAL_ROOT_GID;
+	if (mode != (S_IFDIR | S_IRUGO | S_IXUGO)) {
+		struct task_exec_state *exec_state;
+
+		exec_state = task_exec_state_rcu(task);
+		if (READ_ONCE(exec_state->dumpable) != TASK_DUMPABLE_OWNER) {
+			uid = make_kuid(exec_state->user_ns, 0);
+			if (!uid_valid(uid))
+				uid = GLOBAL_ROOT_UID;
+
+			gid = make_kgid(exec_state->user_ns, 0);
+			if (!gid_valid(gid))
+				gid = GLOBAL_ROOT_GID;
 		}
-		task_unlock(task);
 	}
+	rcu_read_unlock();
+
 	*ruid = uid;
 	*rgid = gid;
 }
@@ -2965,7 +2958,7 @@ static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf,
 	ret = 0;
 	mm = get_task_mm(task);
 	if (mm) {
-		unsigned long flags = __mm_flags_get_dumpable(mm);
+		unsigned long flags = __mm_flags_get_word(mm);
 
 		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
 			       ((flags & MMF_DUMP_FILTER_MASK) >>
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 65abd5ab8836..a8379f4eee61 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -25,6 +25,8 @@ struct linux_binprm {
 	struct page *page[MAX_ARG_PAGES];
 #endif
 	struct mm_struct *mm;
+	/* user_ns published to task->exec_state at execve, narrowed by would_dump(). */
+	struct user_namespace *user_ns;
 	unsigned long p; /* current top of mem */
 	unsigned int
 		/* Should an execfd be passed to userspace? */
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 68861da4cf7c..7b38ee2e7913 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/sched/coredump.h>
 #include <asm/siginfo.h>
 
 #ifdef CONFIG_COREDUMP
@@ -20,7 +21,10 @@ struct coredump_params {
 	const kernel_siginfo_t *siginfo;
 	struct file *file;
 	unsigned long limit;
+	/* MMF_DUMP_FILTER_* bits, snapshot of mm->flags at dump start. */
 	unsigned long mm_flags;
+	/* Snapshot of dumpable at dump start. */
+	enum task_dumpable dumpable;
 	int cpu;
 	loff_t written;
 	loff_t pos;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 51ea37b2a0aa..9588ce3b16df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1342,7 +1342,6 @@ struct mm_struct {
 		 */
 		struct task_struct __rcu *owner;
 #endif
-		struct user_namespace *user_ns;
 
 		/* store ref to file /proc/<pid>/exe symlink points to */
 		struct file __rcu *exe_file;
@@ -1907,11 +1906,11 @@ enum {
 /* mm flags */
 
 /*
- * The first two bits represent core dump modes for set-user-ID,
- * the modes are TASK_DUMPABLE_* defined in linux/sched/coredump.h
+ * Bits 0 and 1 were dumpability; that moved to task->exec_state.  Reserve
+ * the bits so MMF_DUMP_FILTER_* positions stay stable for the
+ * /proc/<pid>/coredump_filter ABI.
  */
 #define MMF_DUMPABLE_BITS 2
-#define MMF_DUMPABLE_MASK (BIT(MMF_DUMPABLE_BITS) - 1)
 /* coredump filter bits */
 #define MMF_DUMP_ANON_PRIVATE	2
 #define MMF_DUMP_ANON_SHARED	3
@@ -1972,7 +1971,7 @@ enum {
 #define MMF_TOPDOWN		31	/* mm searches top down by default */
 #define MMF_TOPDOWN_MASK	BIT(MMF_TOPDOWN)
 
-#define MMF_INIT_LEGACY_MASK	(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
+#define MMF_INIT_LEGACY_MASK	(MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
 				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6674dbf960b5..258cb075478d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -85,6 +85,7 @@ struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
+struct task_exec_state;
 struct task_group;
 struct task_struct;
 struct timespec64;
@@ -1004,9 +1005,6 @@ struct task_struct {
 	unsigned			sched_rt_mutex:1;
 #endif
 
-	/* Save user-dumpable when mm goes away */
-	unsigned			user_dumpable:1;
-
 	/* Bit to tell TOMOYO we're in execve(): */
 	unsigned			in_execve:1;
 	unsigned			in_iowait:1;
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ed6547692b61..20957ccde3b5 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -2,8 +2,6 @@
 #ifndef _LINUX_SCHED_COREDUMP_H
 #define _LINUX_SCHED_COREDUMP_H
 
-#include <linux/mm_types.h>
-
 /*
  * Task dumpability mode.  Gates core dump production and ptrace_attach()
  * authorization.  The numeric values are stable ABI (suid_dumpable
@@ -15,37 +13,7 @@ enum task_dumpable {
 	TASK_DUMPABLE_ROOT	= 2,	/* dump as root; ptrace needs CAP_SYS_PTRACE */
 };
 
-static inline unsigned long __mm_flags_get_dumpable(const struct mm_struct *mm)
-{
-	/*
-	 * By convention, dumpable bits are contained in first 32 bits of the
-	 * bitmap, so we can simply access this first unsigned long directly.
-	 */
-	return __mm_flags_get_word(mm);
-}
-
-static inline void __mm_flags_set_mask_dumpable(struct mm_struct *mm, int value)
-{
-	__mm_flags_set_mask_bits_word(mm, MMF_DUMPABLE_MASK, value);
-}
-
-extern void set_dumpable(struct mm_struct *mm, int value);
-/*
- * This returns the actual value of the suid_dumpable flag. For things
- * that are using this for checking for privilege transitions, it must
- * test against TASK_DUMPABLE_OWNER rather than treating it as a boolean
- * value.
- */
-static inline int __get_dumpable(unsigned long mm_flags)
-{
-	return mm_flags & MMF_DUMPABLE_MASK;
-}
-
-static inline int get_dumpable(struct mm_struct *mm)
-{
-	unsigned long flags = __mm_flags_get_dumpable(mm);
-
-	return __get_dumpable(flags);
-}
+void task_exec_state_set_dumpable(enum task_dumpable value);
+enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
index dc5a795cbfe2..23fe4b55e010 100644
--- a/include/linux/sched/exec_state.h
+++ b/include/linux/sched/exec_state.h
@@ -21,8 +21,6 @@ void put_task_exec_state(struct task_exec_state *exec_state);
 struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
 struct task_exec_state *task_exec_state_replace(struct task_struct *tsk,
 						struct task_exec_state *exec_state);
-void task_exec_state_set_dumpable(enum task_dumpable value);
-enum task_dumpable task_exec_state_get_dumpable(struct task_struct *task);
 int task_exec_state_copy(struct task_struct *tsk);
 void __init exec_state_init(void);
 
diff --git a/init/init_task.c b/init/init_task.c
index b5f48ebdc2b6..47a651b05058 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -7,6 +7,8 @@
 #include <linux/sched/rt.h>
 #include <linux/sched/task.h>
 #include <linux/sched/ext.h>
+#include <linux/sched/exec_state.h>
+#include <linux/user_namespace.h>
 #include <linux/init.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
@@ -56,6 +58,13 @@ static struct sighand_struct init_sighand = {
 	.signalfd_wqh	= __WAIT_QUEUE_HEAD_INITIALIZER(init_sighand.signalfd_wqh),
 };
 
+/* init to 2 - one for init_task, one to ensure it is never freed */
+static struct task_exec_state init_task_exec_state = {
+	.count		= REFCOUNT_INIT(2),
+	.dumpable	= TASK_DUMPABLE_OWNER,
+	.user_ns	= &init_user_ns,
+};
+
 #ifdef CONFIG_SHADOW_CALL_STACK
 unsigned long init_shadow_call_stack[SCS_SIZE / sizeof(long)] = {
 	[(SCS_SIZE / sizeof(long)) - 1] = SCS_END_MAGIC
@@ -113,6 +122,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.exec_state	= &init_task_exec_state,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
diff --git a/kernel/cred.c b/kernel/cred.c
index 12a7b1ce5131..dceb9fa4a4b4 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,8 +384,7 @@ int commit_creds(struct cred *new)
 	    !uid_eq(old->fsuid, new->fsuid) ||
 	    !gid_eq(old->fsgid, new->fsgid) ||
 	    !cred_cap_issubset(old, new)) {
-		if (task->mm)
-			set_dumpable(task->mm, suid_dumpable);
+		task_exec_state_set_dumpable(suid_dumpable);
 		task->pdeath_signal = 0;
 		/*
 		 * If a task drops privileges and becomes nondumpable,
diff --git a/kernel/exit.c b/kernel/exit.c
index 507eda655e8d..9a909993ab1d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -571,7 +571,6 @@ static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
-	current->user_dumpable = (get_dumpable(mm) == TASK_DUMPABLE_OWNER);
 	current->mm = NULL;
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f3fdfdb14c7..b8b651abce8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -23,6 +23,7 @@
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/ext.h>
+#include <linux/sched/exec_state.h>
 #include <linux/seq_file.h>
 #include <linux/rtmutex.h>
 #include <linux/init.h>
@@ -555,6 +556,7 @@ void free_task(struct task_struct *tsk)
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
 	bpf_task_storage_free(tsk);
+	put_task_exec_state(tsk->exec_state);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -731,7 +733,6 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_subscriptions_destroy(mm);
 	check_mm(mm);
-	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
@@ -1072,8 +1073,7 @@ static void mmap_init_lock(struct mm_struct *mm)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
-	struct user_namespace *user_ns)
+static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
@@ -1132,7 +1132,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
 
-	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
@@ -1163,7 +1162,7 @@ struct mm_struct *mm_alloc(void)
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));
-	return mm_init(mm, current, current_user_ns());
+	return mm_init(mm, current);
 }
 EXPORT_SYMBOL_IF_KUNIT(mm_alloc);
 
@@ -1527,7 +1526,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
-	if (!mm_init(mm, tsk, mm->user_ns))
+	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
 	uprobe_start_dup_mmap();
@@ -1593,6 +1592,23 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
+static int copy_exec_state(u64 clone_flags, struct task_struct *tsk)
+{
+	int ret;
+	struct task_exec_state *exec_state;
+
+	exec_state = rcu_access_pointer(tsk->exec_state);
+	if (clone_flags & CLONE_VM) {
+		refcount_inc(&exec_state->count);
+		return 0;
+	}
+
+	ret = task_exec_state_copy(tsk);
+	if (ret)
+		RCU_INIT_POINTER(tsk->exec_state, NULL);
+	return ret;
+}
+
 static int copy_fs(u64 clone_flags, struct task_struct *tsk)
 {
 	struct fs_struct *fs = current->fs;
@@ -2090,6 +2106,9 @@ __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
+	retval = copy_exec_state(clone_flags, p);
+	if (retval)
+		goto bad_fork_free;
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
@@ -3098,6 +3117,7 @@ void __init proc_caches_init(void)
 			sizeof(struct signal_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+	exec_state_init();
 	files_cachep = kmem_cache_create("files_cache",
 			sizeof(struct files_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 791210daf8b4..63beb59b7a3d 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1619,7 +1619,6 @@ void kthread_use_mm(struct mm_struct *mm)
 
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
-	WARN_ON_ONCE(!mm->user_ns);
 
 	/*
 	 * It is possible for mm to be the same as tsk->active_mm, but
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 0e1f80f73a7f..ea8a682e837d 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -72,21 +72,14 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 		     void *buf, int len, unsigned int gup_flags)
 {
 	struct mm_struct *mm;
-	int ret;
+	int ret = 0;
 
 	mm = get_task_mm(tsk);
 	if (!mm)
 		return 0;
 
-	if (!tsk->ptrace ||
-	    (current != tsk->parent) ||
-	    ((get_dumpable(mm) != TASK_DUMPABLE_OWNER) &&
-	     !ptracer_capable(tsk, mm->user_ns))) {
-		mmput(mm);
-		return 0;
-	}
-
-	ret = access_remote_vm(mm, addr, buf, len, gup_flags);
+	if (ptracer_access_allowed(tsk))
+		ret = access_remote_vm(mm, addr, buf, len, gup_flags);
 	mmput(mm);
 
 	return ret;
@@ -301,16 +294,13 @@ static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
 
 static bool task_still_dumpable(struct task_struct *task, unsigned int mode)
 {
-	struct mm_struct *mm = task->mm;
-	if (mm) {
-		if (get_dumpable(mm) == TASK_DUMPABLE_OWNER)
-			return true;
-		return ptrace_has_cap(mm->user_ns, mode);
-	}
+	const struct task_exec_state *exec_state;
 
-	if (task->user_dumpable)
+	guard(rcu)();
+	exec_state = task_exec_state_rcu(task);
+	if (READ_ONCE(exec_state->dumpable) == TASK_DUMPABLE_OWNER)
 		return true;
-	return ptrace_has_cap(&init_user_ns, mode);
+	return ptrace_has_cap(exec_state->user_ns, mode);
 }
 
 /* Returns 0 on success, -errno on denial. */
diff --git a/kernel/sys.c b/kernel/sys.c
index f1189f719db5..df69bd71de03 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2565,14 +2565,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = put_user(me->pdeath_signal, (int __user *)arg2);
 		break;
 	case PR_GET_DUMPABLE:
-		error = get_dumpable(me->mm);
+		error = task_exec_state_get_dumpable(me);
 		break;
 	case PR_SET_DUMPABLE:
 		if (arg2 != TASK_DUMPABLE_OFF && arg2 != TASK_DUMPABLE_OWNER) {
 			error = -EINVAL;
 			break;
 		}
-		set_dumpable(me->mm, arg2);
+		task_exec_state_set_dumpable(arg2);
 		break;
 
 	case PR_SET_UNALIGN:
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c5556bb9d5f0..3e792aad7626 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -43,7 +43,6 @@ struct mm_struct init_mm = {
 	.vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
-	.user_ns	= &init_user_ns,
 #ifdef CONFIG_SCHED_MM_CID
 	.mm_cid.lock = __RAW_SPIN_LOCK_UNLOCKED(init_mm.mm_cid.lock),
 #endif

-- 
2.47.3




* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
  2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
@ 2026-05-21 10:05   ` Christian Brauner
  2026-05-21 11:16   ` Jann Horn
  1 sibling, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2026-05-21 10:05 UTC (permalink / raw)
  To: Jann Horn, Linus Torvalds, Oleg Nesterov
  Cc: David Hildenbrand (Arm), Andrew Morton, Qualys Security Advisory,
	Kees Cook, Minchan Kim, linux-mm, Suren Baghdasaryan,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko

> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -384,8 +384,7 @@ int commit_creds(struct cred *new)
>  	    !uid_eq(old->fsuid, new->fsuid) ||
>  	    !gid_eq(old->fsgid, new->fsgid) ||
>  	    !cred_cap_issubset(old, new)) {
> -		if (task->mm)
> -			set_dumpable(task->mm, suid_dumpable);
> +		task_exec_state_set_dumpable(suid_dumpable);

When looking at this I wondered how the hell I ended up removing the mm
check. The removal came from one of the prior versions. So this check
should stay and I want to leave an explanation why.

So the check is obviously needed for two cases:

(1) kthreads

    Afaict, we don't have any kthreads that do commit_creds(). I think
    that is system call path only.

    (1.1) But kthreads are created with CLONE_VM and thus all start out
	  with kthread->mm == NULL and with task->exec_state shared as
	  well. So having them end up in commit_creds() with the
	  task->mm check is fine as we won't do anything.

    (1.2) kthreads that make use of kthread_use_mm() may _not_ call
	  commit_creds() in any form: while they have assumed a new mm,
	  they have not assumed a new exec_state, so they would alter
	  dumpability for all other kernel threads.

(2) user mode helpers

    User mode helpers are created with CLONE_VM as children of a
    kernel thread but aren't actual kernel threads (in the sense that
    they aren't marked as such, +/- a few other details irrelevant to
    this).

    So at fork() time their umh->mm == NULL and the exec_state is shared
    with all other kthreads as well.

    User mode helpers _do_ call commit_creds(), but before they go
    through exec, so umh->mm is still NULL and the exec_state shared
    with other kthreads is left unchanged.

    All umh's go through exec and afterwards they will have both a
    separate mm and a separate exec state and so it's all fine.

So I'm going to fold in the following diff, which asserts the invariant
that altering the global exec_state is not supported:

diff --git a/include/linux/sched/exec_state.h b/include/linux/sched/exec_state.h
index 23fe4b55e010..9b61782510b8 100644
--- a/include/linux/sched/exec_state.h
+++ b/include/linux/sched/exec_state.h
@@ -16,6 +16,8 @@ struct task_exec_state {
        struct rcu_head         rcu;
 };

+extern struct task_exec_state init_task_exec_state;
+
 struct task_exec_state *alloc_task_exec_state(struct user_namespace *user_ns);
 void put_task_exec_state(struct task_exec_state *exec_state);
 struct task_exec_state *task_exec_state_rcu(const struct task_struct *tsk);
diff --git a/init/init_task.c b/init/init_task.c
index 47a651b05058..8cad78da469c 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -59,7 +59,7 @@ static struct sighand_struct init_sighand = {
 };

 /* init to 2 - one for init_task, one to ensure it is never freed */
-static struct task_exec_state init_task_exec_state = {
+struct task_exec_state init_task_exec_state = {
        .count          = REFCOUNT_INIT(2),
        .dumpable       = TASK_DUMPABLE_OWNER,
        .user_ns        = &init_user_ns,
diff --git a/kernel/cred.c b/kernel/cred.c
index dceb9fa4a4b4..3df4e15bd67f 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,7 +384,9 @@ int commit_creds(struct cred *new)
            !uid_eq(old->fsuid, new->fsuid) ||
            !gid_eq(old->fsgid, new->fsgid) ||
            !cred_cap_issubset(old, new)) {
-               task_exec_state_set_dumpable(suid_dumpable);
+               /* mm-less tasks share init_task's exec_state */
+               if (task->mm)
+                       task_exec_state_set_dumpable(suid_dumpable);
                task->pdeath_signal = 0;
                /*
                 * If a task drops privileges and becomes nondumpable,
diff --git a/kernel/exec_state.c b/kernel/exec_state.c
index 814a475fc786..2b7d0262d0f4 100644
--- a/kernel/exec_state.c
+++ b/kernel/exec_state.c
@@ -95,6 +95,9 @@ void task_exec_state_set_dumpable(enum task_dumpable value)
                value = TASK_DUMPABLE_OFF;

        exec_state = rcu_dereference_protected(current->exec_state, true);
+       /* mm-less tasks share init_task's exec_state; never mutate it */
+       if (WARN_ON_ONCE(exec_state == &init_task_exec_state))
+               return;
        WRITE_ONCE(exec_state->dumpable, value);
 }
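
For readers following along, the two layers of protection in the diff above can be modelled in plain userspace C. This is only a sketch under stated assumptions: the *_model types and functions are hypothetical stand-ins and not kernel code, the has_mm flag stands in for task->mm != NULL, and the boolean return value stands in for "the write happened" (the kernel version would WARN_ON_ONCE() and return instead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; the values mirror enum task_dumpable from the thread. */
enum task_dumpable {
	TASK_DUMPABLE_OFF	= 0,
	TASK_DUMPABLE_OWNER	= 1,
	TASK_DUMPABLE_ROOT	= 2,
};

/* Hypothetical stand-in for struct task_exec_state. */
struct exec_state_model {
	enum task_dumpable dumpable;
};

/* Hypothetical stand-in for struct task_struct. */
struct task_model {
	bool has_mm;				/* models task->mm != NULL */
	struct exec_state_model *exec_state;
};

/* Shared by every mm-less task in this model (kthreads, pre-exec helpers). */
static struct exec_state_model init_exec_state_model = { TASK_DUMPABLE_OWNER };

/*
 * Inner guard: refuse to mutate the shared initial state.
 * Returns true if the write happened, false if it was refused.
 */
static bool model_set_dumpable(struct task_model *t, enum task_dumpable value)
{
	if (t->exec_state == &init_exec_state_model)
		return false;
	t->exec_state->dumpable = value;
	return true;
}

/* Outer guard, as in commit_creds(): mm-less tasks are skipped entirely. */
static bool model_commit_creds(struct task_model *t,
			       enum task_dumpable suid_dumpable)
{
	if (!t->has_mm)
		return false;
	return model_set_dumpable(t, suid_dumpable);
}
```

In this model a pre-exec user mode helper (has_mm == false) is stopped by the outer check, while a kthread that borrowed an mm but still points at the shared state is stopped by the inner one, which is the case (1.2) describes.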




* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
  2026-05-20 21:48 ` [PATCH RFC v3 4/4] exec_state: relocate dumpable information Christian Brauner (Amutable)
  2026-05-21 10:05   ` Christian Brauner
@ 2026-05-21 11:16   ` Jann Horn
  2026-05-21 13:08     ` Christian Brauner
  1 sibling, 1 reply; 8+ messages in thread
From: Jann Horn @ 2026-05-21 11:16 UTC (permalink / raw)
  To: Christian Brauner (Amutable)
  Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
	Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
	linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Michal Hocko

On Wed, May 20, 2026 at 11:49 PM Christian Brauner (Amutable)
<brauner@kernel.org> wrote:
> @@ -2090,6 +2106,9 @@ __latent_entropy struct task_struct *copy_process(
>         p = dup_task_struct(current, node);
>         if (!p)
>                 goto fork_out;
> +       retval = copy_exec_state(clone_flags, p);
> +       if (retval)
> +               goto bad_fork_free;

AFAICS for state like this that is torn down in free_task(), normally
dup_task_struct() NULLs out pointers that require refcounting, and
then copy_process() initializes them properly, so that in
copy_process() we can bail out in the middle and have the task_struct
in a sufficiently clean state to go through more or less the normal
free_task() path.

In particular, I'm thinking of the handling of tsk->seccomp.filter -
dup_task_struct() sets `tsk->seccomp.filter = NULL`, and later
copy_process() calls copy_seccomp().

With your implementation, the error handling would break if anyone
tried to add another bailout between dup_task_struct() and
copy_exec_state().
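
The pattern described above (clear the refcounted pointer in the dup step, take the reference in the copy step, and let teardown tolerate NULL) can be sketched in userspace C. This is a minimal model under stated assumptions: every *_sketch name and ref_model are hypothetical, plain ints replace refcount_t, and malloc()/free() replace the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object, standing in for task_exec_state. */
struct ref_model {
	int count;
};

/* Hypothetical task with one refcounted pointer. */
struct task_sketch {
	struct ref_model *exec_state;
};

/* Drop one reference; NULL is accepted so early bailouts stay safe. */
static void put_ref_sketch(struct ref_model *r)
{
	if (r && --r->count == 0)
		free(r);
}

/* Dup step: byte-copy the parent, then clear the refcounted pointer. */
static struct task_sketch *dup_task_sketch(const struct task_sketch *parent)
{
	struct task_sketch *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	*t = *parent;
	t->exec_state = NULL;	/* no reference has been taken yet */
	return t;
}

/* Copy step: take the reference and publish the pointer together. */
static void copy_ref_sketch(struct task_sketch *t,
			    const struct task_sketch *parent)
{
	parent->exec_state->count++;
	t->exec_state = parent->exec_state;
}

/* Teardown: correct whether or not the copy step has run. */
static void free_task_sketch(struct task_sketch *t)
{
	put_ref_sketch(t->exec_state);
	free(t);
}
```

Because the pointer is NULL until the copy step runs, a bailout placed anywhere between the two steps cannot drop a reference that was never taken.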

(Sidenote: Ugh, the way dup_task_struct() just copies the entire
task_struct is so ugly...)

>         p->flags &= ~PF_KTHREAD;
>         if (args->kthread)
>                 p->flags |= PF_KTHREAD;



* Re: [PATCH RFC v3 4/4] exec_state: relocate dumpable information
  2026-05-21 11:16   ` Jann Horn
@ 2026-05-21 13:08     ` Christian Brauner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2026-05-21 13:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linus Torvalds, Oleg Nesterov, David Hildenbrand (Arm),
	Andrew Morton, Qualys Security Advisory, Kees Cook, Minchan Kim,
	linux-mm, Suren Baghdasaryan, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Michal Hocko

On Thu, May 21, 2026 at 01:16:23PM +0200, Jann Horn wrote:
> On Wed, May 20, 2026 at 11:49 PM Christian Brauner (Amutable)
> <brauner@kernel.org> wrote:
> > @@ -2090,6 +2106,9 @@ __latent_entropy struct task_struct *copy_process(
> >         p = dup_task_struct(current, node);
> >         if (!p)
> >                 goto fork_out;
> > +       retval = copy_exec_state(clone_flags, p);
> > +       if (retval)
> > +               goto bad_fork_free;
> 
> AFAICS for state like this that is torn down in free_task(), normally
> dup_task_struct() NULLs out pointers that require refcounting, and
> then copy_process() initializes them properly, so that in
> copy_process() we can bail out in the middle and have the task_struct
> in a sufficiently clean state to go through more or less the normal
> free_task() path.
> 
> In particular, I'm thinking of the handling of tsk->seccomp.filter -
> dup_task_struct() sets `tsk->seccomp.filter = NULL`, and later
> copy_process() calls copy_seccomp().
> 
> With your implementation, the error handling would break if anyone
> tried to add another bailout between dup_task_struct() and
> copy_exec_state().

I folded:

diff --git a/kernel/fork.c b/kernel/fork.c
index 61a44b33da28..91545ed6463f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -947,6 +947,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
        tsk->seccomp.filter = NULL;
 #endif

+       RCU_INIT_POINTER(tsk->exec_state, NULL);
+
        setup_thread_stack(tsk, orig);
        clear_user_return_notifier(tsk);
        clear_tsk_need_resched(tsk);
@@ -1594,19 +1596,18 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)

 static int copy_exec_state(u64 clone_flags, struct task_struct *tsk)
 {
-       int ret;
        struct task_exec_state *exec_state;

-       exec_state = rcu_access_pointer(tsk->exec_state);
+       /* CLONE_VM siblings refcount-share the parent's exec_state. */
        if (clone_flags & CLONE_VM) {
+               exec_state = rcu_dereference_protected(current->exec_state, true);
                refcount_inc(&exec_state->count);
+               rcu_assign_pointer(tsk->exec_state, exec_state);
                return 0;
        }

-       ret = task_exec_state_copy(tsk);
-       if (ret)
-               RCU_INIT_POINTER(tsk->exec_state, NULL);
-       return ret;
+       /* Everyone else inherits a fresh copy. */
+       return task_exec_state_copy(tsk);
 }
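
As a rough illustration of the share-versus-copy split in the folded copy_exec_state(), here is a userspace sketch. The assumptions are loud ones: CLONE_VM_SKETCH and all *_sketch names are hypothetical, the fresh copy is assumed to inherit the parent's dumpable value (the body of task_exec_state_copy() is not shown in this message), and the user_ns reference is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag for this model; not the kernel's CLONE_VM definition. */
#define CLONE_VM_SKETCH 0x100UL

/* Hypothetical stand-in for struct task_exec_state (user_ns omitted). */
struct exec_state_sketch {
	int count;	/* plain int instead of refcount_t */
	int dumpable;
};

/*
 * Returns the exec_state the child should use:
 *  - with CLONE_VM_SKETCH, the parent's object with one more reference;
 *  - otherwise a fresh object that starts from the parent's values
 *    (assumed behaviour of the copy path).
 * Returns NULL only if the fresh allocation fails.
 */
static struct exec_state_sketch *
copy_exec_state_sketch(unsigned long clone_flags,
		       struct exec_state_sketch *parent)
{
	struct exec_state_sketch *es;

	if (clone_flags & CLONE_VM_SKETCH) {
		parent->count++;
		return parent;
	}

	es = malloc(sizeof(*es));
	if (!es)
		return NULL;
	es->count = 1;
	es->dumpable = parent->dumpable;
	return es;
}
```

In the shared case a later dumpable change is visible to every holder of the pointer; in the copy case the child's changes stay private to the child.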

> 
> (Sidenote: Ugh, the way dup_task_struct() just copies the entire
> task_struct is so ugly...)

Yes, I agree.


