[PATCH 0/2] proc: protect ptrace_may_access() with exec_update

* [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
@ 2026-05-18 16:35 Jann Horn
  2026-05-18 16:35 ` [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1) Jann Horn
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-18 16:35 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: Jan Kara, Arjan van de Ven, Eric W. Biederman, Jake Edge,
	linux-kernel, linux-fsdevel, Jann Horn, stable

My understanding is that procfs is effectively maintained by the VFS
maintainers (though scripts/get_maintainer.pl claims that there are
no maintainers for procfs because the VFS entry only claims files
directly in fs/, and the procfs entry has no maintainers listed on
it).

In procfs, most uses of ptrace_may_access() should use
exec_update_lock to avoid TOCTOU issues with concurrent privileged
execve() (like setuid binary execution).

This series doesn't fix all the remaining issues in procfs, but it fixes
the easy cases for now; I will probably follow up with fixes for the
gnarlier cases later unless someone else wants to do that.

I have checked that procfs files still work with these changes and that
CONFIG_PROVE_LOCKING=y doesn't generate any warnings.

(checkpatch complains about missing argument names in
proc_op::proc_get_link, but that was already the case before my patch.)

Signed-off-by: Jann Horn <jannh@google.com>
---
Jann Horn (2):
      proc: protect ptrace_may_access() with exec_update_lock (part 1)
      proc: protect ptrace_may_access() with exec_update_lock (FD links)

 fs/proc/array.c      |   6 ++
 fs/proc/base.c       | 159 ++++++++++++++++++++++-----------------------------
 fs/proc/fd.c         |  27 ++++-----
 fs/proc/internal.h   |   2 +-
 fs/proc/namespaces.c |  12 ++++
 5 files changed, 97 insertions(+), 109 deletions(-)
---
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
change-id: 20260518-procfs-lockfix-part1-5dca2d95bc12

--  
Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-18 16:35 [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Jann Horn
@ 2026-05-18 16:35 ` Jann Horn
  2026-05-26  8:48   ` Oleg Nesterov
  2026-06-05 14:36   ` Mark Brown
  2026-05-18 16:35 ` [PATCH 2/2] proc: protect ptrace_may_access() with exec_update_lock (FD links) Jann Horn
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-18 16:35 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: Jan Kara, Arjan van de Ven, Eric W. Biederman, Jake Edge,
	linux-kernel, linux-fsdevel, Jann Horn, stable

Fix the easy cases where procfs currently calls ptrace_may_access() without
exec_update_lock protection, where the fix is to simply add the extra lock
or use mm_access():

 - do_task_stat(): grab exec_update_lock
 - proc_pid_wchan(): grab exec_update_lock
 - proc_map_files_lookup(): use mm_access() instead of get_task_mm()
 - proc_map_files_readdir(): use mm_access() instead of get_task_mm()
 - proc_ns_get_link(): grab exec_update_lock
 - proc_ns_readlink(): grab exec_update_lock

Fixes: f83ce3e6b02d ("proc: avoid information leaks to non-privileged processes")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/proc/array.c      |  6 ++++++
 fs/proc/base.c       | 40 ++++++++++++++++++++--------------------
 fs/proc/namespaces.c | 12 ++++++++++++
 3 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 90fb0c6b5f99..479ea8cb4ef4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -482,6 +482,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long flags;
 	int exit_code = task->exit_code;
 	struct signal_struct *sig = task->signal;
+	int ret;
+
+	ret = down_read_killable(&task->signal->exec_update_lock);
+	if (ret)
+		return ret;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -657,6 +662,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 		seq_puts(m, " 0");
 
 	seq_putc(m, '\n');
+	up_read(&task->signal->exec_update_lock);
 	if (mm)
 		mmput(mm);
 	return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c894..09b02d1621e5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -423,18 +423,24 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 {
 	unsigned long wchan;
 	char symname[KSYM_NAME_LEN];
+	int err;
 
+	err = down_read_killable(&task->signal->exec_update_lock);
+	if (err)
+		return err;
 	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
 		goto print0;
 
 	wchan = get_wchan(task);
 	if (wchan && !lookup_symbol_name(wchan, symname)) {
 		seq_puts(m, symname);
+		up_read(&task->signal->exec_update_lock);
 		return 0;
 	}
 
 print0:
 	seq_putc(m, '0');
+	up_read(&task->signal->exec_update_lock);
 	return 0;
 }
 #endif /* CONFIG_KALLSYMS */
@@ -2360,17 +2366,15 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	if (!task)
 		goto out;
 
-	result = ERR_PTR(-EACCES);
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
-		goto out_put_task;
-
 	result = ERR_PTR(-ENOENT);
 	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
 		goto out_put_task;
 
-	mm = get_task_mm(task);
-	if (!mm)
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR(mm)) {
+		result = ERR_CAST(mm);
 		goto out_put_task;
+	}
 
 	result = ERR_PTR(-EINTR);
 	if (mmap_read_lock_killable(mm))
@@ -2420,23 +2424,19 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	if (!task)
 		goto out;
 
-	ret = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR(mm)) {
+		ret = PTR_ERR(mm);
 		goto out_put_task;
+	}
 
 	ret = 0;
 	if (!dir_emit_dots(file, ctx))
-		goto out_put_task;
-
-	mm = get_task_mm(task);
-	if (!mm)
-		goto out_put_task;
+		goto out_put_mm;
 
 	ret = mmap_read_lock_killable(mm);
-	if (ret) {
-		mmput(mm);
-		goto out_put_task;
-	}
+	if (ret)
+		goto out_put_mm;
 
 	nr_files = 0;
 
@@ -2462,8 +2462,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 		if (!p) {
 			ret = -ENOMEM;
 			mmap_read_unlock(mm);
-			mmput(mm);
-			goto out_put_task;
+			goto out_put_mm;
 		}
 
 		p->start = vma->vm_start;
@@ -2471,7 +2470,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 		p->mode = vma->vm_file->f_mode;
 	}
 	mmap_read_unlock(mm);
-	mmput(mm);
 
 	for (i = 0; i < nr_files; i++) {
 		char buf[4 * sizeof(long) + 2];	/* max: %lx-%lx\0 */
@@ -2488,6 +2486,8 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 		ctx->pos++;
 	}
 
+out_put_mm:
+	mmput(mm);
 out_put_task:
 	put_task_struct(task);
 out:
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 39f4169f669f..2f46f1396744 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -55,6 +55,10 @@ static const char *proc_ns_get_link(struct dentry *dentry,
 	if (!task)
 		return ERR_PTR(-EACCES);
 
+	error = down_read_killable(&task->signal->exec_update_lock);
+	if (error)
+		goto out_put_task;
+
 	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
 		goto out;
 
@@ -64,6 +68,8 @@ static const char *proc_ns_get_link(struct dentry *dentry,
 
 	error = nd_jump_link(&ns_path);
 out:
+	up_read(&task->signal->exec_update_lock);
+out_put_task:
 	put_task_struct(task);
 	return ERR_PTR(error);
 }
@@ -80,11 +86,17 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
 	if (!task)
 		return res;
 
+	res = down_read_killable(&task->signal->exec_update_lock);
+	if (res)
+		goto out_put_task;
+
 	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
 		res = ns_get_name(name, sizeof(name), task, ns_ops);
 		if (res >= 0)
 			res = readlink_copy(buffer, buflen, name, strlen(name));
 	}
+	up_read(&task->signal->exec_update_lock);
+out_put_task:
 	put_task_struct(task);
 	return res;
 }

-- 
2.54.0.563.g4f69b47b94-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] proc: protect ptrace_may_access() with exec_update_lock (FD links)
  2026-05-18 16:35 [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Jann Horn
  2026-05-18 16:35 ` [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1) Jann Horn
@ 2026-05-18 16:35 ` Jann Horn
  2026-05-22 11:47 ` [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Christian Brauner
  2026-05-25 19:56 ` Eric W. Biederman
  3 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-18 16:35 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: Jan Kara, Arjan van de Ven, Eric W. Biederman, Jake Edge,
	linux-kernel, linux-fsdevel, Jann Horn, stable

proc_pid_get_link() and proc_pid_readlink() currently look up the task from
the pid once, then do the ptrace access check on that task, then look up
the task from the pid a second time to do the actual access.
That's racy in several ways.

To fix it, pass the task to the ->proc_get_link() handler, and instead of
proc_fd_access_allowed(), introduce a new helper call_proc_get_link() that
looks up and locks the task, does the access check, and calls
->proc_get_link().

Fixes: 778c1144771f ("[PATCH] proc: Use sane permission checks on the /proc/<pid>/fd/ symlinks")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/proc/base.c     | 119 +++++++++++++++++++++--------------------------------
 fs/proc/fd.c       |  27 +++++-------
 fs/proc/internal.h |   2 +-
 3 files changed, 59 insertions(+), 89 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 09b02d1621e5..ef2f59461374 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -218,33 +218,24 @@ static int get_task_root(struct task_struct *task, struct path *root)
 	return result;
 }
 
-static int proc_cwd_link(struct dentry *dentry, struct path *path)
+static int proc_cwd_link(struct dentry *dentry, struct path *path,
+			 struct task_struct *task)
 {
-	struct task_struct *task = get_proc_task(d_inode(dentry));
 	int result = -ENOENT;
 
-	if (task) {
-		task_lock(task);
-		if (task->fs) {
-			get_fs_pwd(task->fs, path);
-			result = 0;
-		}
-		task_unlock(task);
-		put_task_struct(task);
+	task_lock(task);
+	if (task->fs) {
+		get_fs_pwd(task->fs, path);
+		result = 0;
 	}
+	task_unlock(task);
 	return result;
 }
 
-static int proc_root_link(struct dentry *dentry, struct path *path)
+static int proc_root_link(struct dentry *dentry, struct path *path,
+			  struct task_struct *task)
 {
-	struct task_struct *task = get_proc_task(d_inode(dentry));
-	int result = -ENOENT;
-
-	if (task) {
-		result = get_task_root(task, path);
-		put_task_struct(task);
-	}
-	return result;
+	return get_task_root(task, path);
 }
 
 /*
@@ -710,23 +701,6 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 /*                       Here the fs part begins                        */
 /************************************************************************/
 
-/* permission checks */
-static bool proc_fd_access_allowed(struct inode *inode)
-{
-	struct task_struct *task;
-	bool allowed = false;
-	/* Allow access to a task's file descriptors if it is us or we
-	 * may use ptrace attach to the process and find out that
-	 * information.
-	 */
-	task = get_proc_task(inode);
-	if (task) {
-		allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
-		put_task_struct(task);
-	}
-	return allowed;
-}
-
 int proc_nochmod_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		 struct iattr *attr)
 {
@@ -1783,16 +1757,12 @@ static const struct file_operations proc_pid_set_comm_operations = {
 	.release	= single_release,
 };
 
-static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
+static int proc_exe_link(struct dentry *dentry, struct path *exe_path,
+			 struct task_struct *task)
 {
-	struct task_struct *task;
 	struct file *exe_file;
 
-	task = get_proc_task(d_inode(dentry));
-	if (!task)
-		return -ENOENT;
 	exe_file = get_task_exe_file(task);
-	put_task_struct(task);
 	if (exe_file) {
 		*exe_path = exe_file->f_path;
 		path_get(&exe_file->f_path);
@@ -1802,26 +1772,42 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 		return -ENOENT;
 }
 
+static int call_proc_get_link(struct dentry *dentry, struct inode *inode, struct path *path_out)
+{
+	struct task_struct *task;
+	int ret;
+
+	task = get_proc_task(inode);
+	if (!task)
+		return -ENOENT;
+	ret = down_read_killable(&task->signal->exec_update_lock);
+	if (ret)
+		goto out_put_task;
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
+		ret = -EACCES;
+		goto out;
+	}
+	ret = PROC_I(inode)->op.proc_get_link(dentry, path_out, task);
+
+out:
+	up_read(&task->signal->exec_update_lock);
+out_put_task:
+	put_task_struct(task);
+	return ret;
+}
+
 static const char *proc_pid_get_link(struct dentry *dentry,
 				     struct inode *inode,
 				     struct delayed_call *done)
 {
 	struct path path;
-	int error = -EACCES;
+	int error;
 
 	if (!dentry)
 		return ERR_PTR(-ECHILD);
-
-	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
-		goto out;
-
-	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
-	if (error)
-		goto out;
-
-	error = nd_jump_link(&path);
-out:
+	error = call_proc_get_link(dentry, inode, &path);
+	if (!error)
+		error = nd_jump_link(&path);
 	return ERR_PTR(error);
 }
 
@@ -1855,17 +1841,11 @@ static int proc_pid_readlink(struct dentry * dentry, char __user * buffer, int b
 	struct inode *inode = d_inode(dentry);
 	struct path path;
 
-	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
-		goto out;
-
-	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
-	if (error)
-		goto out;
-
-	error = do_proc_readlink(&path, buffer, buflen);
-	path_put(&path);
-out:
+	error = call_proc_get_link(dentry, inode, &path);
+	if (!error) {
+		error = do_proc_readlink(&path, buffer, buflen);
+		path_put(&path);
+	}
 	return error;
 }
 
@@ -2256,21 +2236,16 @@ static const struct dentry_operations tid_map_files_dentry_operations = {
 	.d_delete	= pid_delete_dentry,
 };
 
-static int map_files_get_link(struct dentry *dentry, struct path *path)
+static int map_files_get_link(struct dentry *dentry, struct path *path,
+			      struct task_struct *task)
 {
 	unsigned long vm_start, vm_end;
 	struct vm_area_struct *vma;
-	struct task_struct *task;
 	struct mm_struct *mm;
 	int rc;
 
 	rc = -ENOENT;
-	task = get_proc_task(d_inode(dentry));
-	if (!task)
-		goto out;
-
 	mm = get_task_mm(task);
-	put_task_struct(task);
 	if (!mm)
 		goto out;
 
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 05c7513e77c7..0f9a1556f2a3 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -171,24 +171,19 @@ static const struct dentry_operations tid_fd_dentry_operations = {
 	.d_delete	= pid_delete_dentry,
 };
 
-static int proc_fd_link(struct dentry *dentry, struct path *path)
+static int proc_fd_link(struct dentry *dentry, struct path *path,
+			struct task_struct *task)
 {
-	struct task_struct *task;
 	int ret = -ENOENT;
-
-	task = get_proc_task(d_inode(dentry));
-	if (task) {
-		unsigned int fd = proc_fd(d_inode(dentry));
-		struct file *fd_file;
-
-		fd_file = fget_task(task, fd);
-		if (fd_file) {
-			*path = fd_file->f_path;
-			path_get(&fd_file->f_path);
-			ret = 0;
-			fput(fd_file);
-		}
-		put_task_struct(task);
+	unsigned int fd = proc_fd(d_inode(dentry));
+	struct file *fd_file;
+
+	fd_file = fget_task(task, fd);
+	if (fd_file) {
+		*path = fd_file->f_path;
+		path_get(&fd_file->f_path);
+		ret = 0;
+		fput(fd_file);
 	}
 
 	return ret;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 64dc44832808..d31984c3c797 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -107,7 +107,7 @@ extern struct kmem_cache *proc_dir_entry_cache;
 void pde_free(struct proc_dir_entry *pde);
 
 union proc_op {
-	int (*proc_get_link)(struct dentry *, struct path *);
+	int (*proc_get_link)(struct dentry *, struct path *, struct task_struct *);
 	int (*proc_show)(struct seq_file *m,
 		struct pid_namespace *ns, struct pid *pid,
 		struct task_struct *task);

-- 
2.54.0.563.g4f69b47b94-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-18 16:35 [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Jann Horn
  2026-05-18 16:35 ` [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1) Jann Horn
  2026-05-18 16:35 ` [PATCH 2/2] proc: protect ptrace_may_access() with exec_update_lock (FD links) Jann Horn
@ 2026-05-22 11:47 ` Christian Brauner
  2026-05-25 19:56 ` Eric W. Biederman
  3 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2026-05-22 11:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable

On Mon, 18 May 2026 18:35:14 +0200, Jann Horn <jannh@google.com> wrote:
> [...]
> 
> (checkpatch complains about missing argument names in
> proc_op::proc_get_link, but that was already the case before my patch.)
> 
> Signed-off-by: Jann Horn <jannh@google.com>
> ---

Hm, not super nice as this may cause performance regressions but I think
you're right otherwise. While mostly info leaks - as you mentioned
elsewhere - it would still be nice to try and fix them. So if we can do
it without anyone noticing perf regressions it's probably worth it.

-- 
Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-18 16:35 [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Jann Horn
                   ` (2 preceding siblings ...)
  2026-05-22 11:47 ` [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Christian Brauner
@ 2026-05-25 19:56 ` Eric W. Biederman
  2026-05-26 11:10   ` Oleg Nesterov
  2026-05-26 18:22   ` Jann Horn
  3 siblings, 2 replies; 22+ messages in thread
From: Eric W. Biederman @ 2026-05-25 19:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

I have added a couple more people who might be interested.

Kees Cook because as you have structured this it is an exec problem.

Oleg Nesterov as he is knowledgeable about ptrace.

Jann Horn <jannh@google.com> writes:

> My understanding is that procfs is effectively maintained by the VFS
> maintainers (though scripts/get_maintainer.pl claims that there are
> no maintainers for procfs because the VFS entry only claims files
> directly in fs/, and the procfs entry has no maintainers listed on
> it).
>
> In procfs, most uses of ptrace_may_access() should use
> exec_update_lock to avoid TOCTOU issues with concurrent privileged
> execve() (like setuid binary execution).
>
> This series doesn't fix all the remaining issues in procfs, but it fixes
> the easy cases for now; I will probably follow up with fixes for the
> gnarlier cases later unless someone else wants to do that.
>
> I have checked that procfs files still work with these changes and that
> CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
>
> (checkpatch complains about missing argument names in
> proc_op::proc_get_link, but that was already the case before my
> patch.)

I think I finally have my context paged back in so I can intelligently
say something about this series.

The scenario you are worried about is when exec gains privileges,
and we read through proc and authenticate with the old credentials
instead of the new credentials.

Question 1.

Assuming the executable is world readable (which they generally are)
is there anything that becomes accessible in that race that was
not already accessible?

Question 2.

How does this race compare to racing with setresuid?
Do we need to fix the setresuid case as well?

Question 3.
Do we care about the case when a privileged process calls a setuid
process and drops privileges?

Question 4.
Is it possible to use a seqlock instead of a reader-writer semaphore?
Or is that only for non-sleeping readers?

There have been a number of nasty cases lurking in the background
involving seccomp filters, PTRACE_EVENT_EXIT, de_thread and the like.

Blocking locks, especially ones that get widely used, just scare me in
this area.  Being able to see that something happened between start and
finish and say -EAGAIN or retrying internally seems like it would be
much less prone to weirdness.
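
To make the "notice that something happened and say -EAGAIN" idea
concrete, here is a minimal userspace sketch using C11 atomics. It is an
assumption-laden model, not a proposal for the kernel: struct fake_task,
exec_seq and the fake_* helpers are hypothetical, and the sketch ignores
the questions of sleeping readers and of taking references to objects
that exec may free.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical model: "exec" bumps a sequence counter around its update. */
struct fake_task {
	atomic_uint exec_seq; /* odd while an "exec" is in progress */
	atomic_uint uid;      /* models credentials */
	atomic_int secret;    /* models state replaced by exec */
};

static void fake_exec_begin(struct fake_task *t)
{
	atomic_fetch_add(&t->exec_seq, 1);
}

static void fake_exec_end(struct fake_task *t)
{
	atomic_fetch_add(&t->exec_seq, 1);
}

/* Returns 0, -EACCES, or -EAGAIN if an exec overlapped the read. */
static int read_secret_seq(struct fake_task *t, unsigned int caller_uid,
			   int *out)
{
	unsigned int start = atomic_load(&t->exec_seq);
	int allowed, val;

	if (start & 1)
		return -EAGAIN; /* exec in progress right now */

	allowed = (caller_uid == 0 || caller_uid == atomic_load(&t->uid));
	val = atomic_load(&t->secret);

	if (atomic_load(&t->exec_seq) != start)
		return -EAGAIN; /* something happened in between */
	if (!allowed)
		return -EACCES;

	*out = val; /* only published after the sequence check passed */
	return 0;
}
```

The reader never blocks; it either gets a consistent (check, value) pair
or reports -EAGAIN and leaves the retry policy to the caller.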

The ugly part with PTRACE_EVENT_EXIT, as I recall, is that if ptrace stops
one of the threads (not the one calling exec) at PTRACE_EVENT_EXIT, it can
block de_thread, which blocks the rest of exec.  But there is something
in there where the ptracer hangs waiting for the exec to complete.  So
everything just stalls: the ptracer is waiting for the exec, and the exec
is waiting for the ptracer.  SIGKILL can get you out of that mess, last I
looked.  Still, it is an ugly mess.

Getting everything away from that mess is why we have exec_update_lock
instead of just cred_guard_mutex.

I would really appreciate hearing the scenarios you are aiming to fix
and how this fixes them.  There are enough races and special cases
I don't feel comfortable reading that we just need exec_update_lock
around ptrace_may_access.  It is not clear to me that is sufficient
to close the small races we are worried about here.

If I could trace through someone else's logic, I could be convinced, and
the next people to deal with the code could look at it, when it has to be
fixed again, and say: ah, that is the detail that was missed.

Eric

> Signed-off-by: Jann Horn <jannh@google.com>
> ---
> Jann Horn (2):
>       proc: protect ptrace_may_access() with exec_update_lock (part 1)
>       proc: protect ptrace_may_access() with exec_update_lock (FD links)
>
>  fs/proc/array.c      |   6 ++
>  fs/proc/base.c       | 159 ++++++++++++++++++++++-----------------------------
>  fs/proc/fd.c         |  27 ++++-----
>  fs/proc/internal.h   |   2 +-
>  fs/proc/namespaces.c |  12 ++++
>  5 files changed, 97 insertions(+), 109 deletions(-)
> ---
> base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
> change-id: 20260518-procfs-lockfix-part1-5dca2d95bc12
>
> --  
> Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-18 16:35 ` [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1) Jann Horn
@ 2026-05-26  8:48   ` Oleg Nesterov
  2026-05-26  9:44     ` Oleg Nesterov
  2026-05-26 14:16     ` Jann Horn
  2026-06-05 14:36   ` Mark Brown
  1 sibling, 2 replies; 22+ messages in thread
From: Oleg Nesterov @ 2026-05-26  8:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

On 05/18, Jann Horn wrote:
>
> Fix the easy cases where procfs currently calls ptrace_may_access() without
> exec_update_lock protection, where the fix is to simply add the extra lock
> or use mm_access():

I thought about this too, but I do not know if it is fine performance wise...

And what about proc_coredump_filter_write() which doesn't use ptrace_may_access() ?

AFAICS, we can't rely on the open-time checks. /proc/$pid/coredump_filter can
be opened for writing, the task can do suid exec after that, the file remains
writable.

Not a big deal, but still.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-26  8:48   ` Oleg Nesterov
@ 2026-05-26  9:44     ` Oleg Nesterov
  2026-05-26 14:19       ` Jann Horn
  2026-05-26 14:16     ` Jann Horn
  1 sibling, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2026-05-26  9:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

Perhaps proc_pid_make_inode() can record task->self_exec_id in
proc_inode ? At least this can help to fix the
"if (ptrace_may_access(task)) mm = get_task_mm(task)" pattern...

On 05/26, Oleg Nesterov wrote:
>
> On 05/18, Jann Horn wrote:
> >
> > Fix the easy cases where procfs currently calls ptrace_may_access() without
> > exec_update_lock protection, where the fix is to simply add the extra lock
> > or use mm_access():
>
> I thought about this too, but I do not know if it is fine performance wise...
>
> And what about proc_coredump_filter_write() which doesn't use ptrace_may_access() ?
>
> AFAICS, we can't rely on the open-time checks. /proc/$pid/coredump_filter can
> be opened for writing, the task can do suid exec after that, the file remains
> writable.
>
> Not a big deal, but still.
>
> Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-25 19:56 ` Eric W. Biederman
@ 2026-05-26 11:10   ` Oleg Nesterov
  2026-05-26 18:22   ` Jann Horn
  1 sibling, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2026-05-26 11:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, Alexander Viro, Christian Brauner, Jan Kara,
	Arjan van de Ven, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

On 05/25, Eric W. Biederman wrote:
>
> The ugly part with PTRACE_EVENT_EXIT, as I recall, is that if ptrace stops
> one of the threads (not the one calling exec) at PTRACE_EVENT_EXIT, it can
> block de_thread, which blocks the rest of exec.  But there is something
> in there where the ptracer hangs waiting for the exec to complete.  So
> everything just stalls: the ptracer is waiting for the exec, and the exec
> is waiting for the ptracer.  SIGKILL can get you out of that mess, last I
> looked.  Still, it is an ugly mess.

Yes... note that even without PTRACE_EVENT_EXIT a traced sub-thread won't
autoreap, so de_thread which waits for --sig->notify_count in __exit_signal()
will block anyway.

Perhaps we can change ptrace_attach() to detect this case somehow and return
-EWOULDBLOCK... Yes this can confuse strace/gdb, but this is better than
the deadlock, even if it is killable.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-26  8:48   ` Oleg Nesterov
  2026-05-26  9:44     ` Oleg Nesterov
@ 2026-05-26 14:16     ` Jann Horn
  2026-05-26 18:22       ` Oleg Nesterov
  1 sibling, 1 reply; 22+ messages in thread
From: Jann Horn @ 2026-05-26 14:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

On Tue, May 26, 2026 at 10:48 AM Oleg Nesterov <oleg@redhat.com> wrote:
> On 05/18, Jann Horn wrote:
> >
> > Fix the easy cases where procfs currently calls ptrace_may_access() without
> > exec_update_lock protection, where the fix is to simply add the extra lock
> > or use mm_access():
>
> I thought about this too, but I do not know if it is fine performance wise...
>
> And what about proc_coredump_filter_write() which doesn't use ptrace_may_access() ?

Yeah, this series doesn't fix everything, but I figured it would be
better to at least start fixing some of this stuff rather than leaving
this code as-is...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-26  9:44     ` Oleg Nesterov
@ 2026-05-26 14:19       ` Jann Horn
  0 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-26 14:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

On Tue, May 26, 2026 at 11:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
> Perhaps proc_pid_make_inode() can record task->self_exec_id in
> proc_inode ? At least this can help to fix the
> "if (ptrace_may_access(task)) mm = get_task_mm(task)" pattern...

Yes, I think something like that might be a good idea for files that
access the process in read/write handlers, though I think recording it
somewhere in file->private_data would be better than putting it in the
proc_inode.
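
A minimal userspace sketch of that idea, under stated assumptions: the
exec ID is snapshotted when the file is opened and compared again in the
write handler. All names here (struct fake_task, struct fake_open_file,
the fake_* functions) are hypothetical, the -EPERM return value is an
arbitrary choice for the illustration, and a real implementation would
also have to order the comparison against a concurrent exec.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the target task. */
struct fake_task {
	uint64_t self_exec_id;        /* bumped on every "exec" */
	unsigned int coredump_filter; /* state the write handler changes */
};

/* Hypothetical per-open state, i.e. what file->private_data would hold. */
struct fake_open_file {
	struct fake_task *task;
	uint64_t exec_id_at_open;
};

/* Open-time: remember which "incarnation" of the task was checked. */
static void fake_open(struct fake_open_file *f, struct fake_task *t)
{
	f->task = t;
	f->exec_id_at_open = t->self_exec_id;
}

/* Models an exec that changes the security context of the task. */
static void fake_exec(struct fake_task *t)
{
	t->self_exec_id++;
}

/* Write-time: refuse if the task has exec'd since the file was opened. */
static int fake_write_filter(struct fake_open_file *f, unsigned int val)
{
	if (f->task->self_exec_id != f->exec_id_at_open)
		return -EPERM; /* error code chosen for illustration only */
	f->task->coredump_filter = val;
	return 0;
}
```

This covers the coredump_filter scenario raised earlier in the thread: a
file opened for writing before a setuid exec stops being usable after
it, without taking a blocking lock in the write path.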

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-26 14:16     ` Jann Horn
@ 2026-05-26 18:22       ` Oleg Nesterov
  2026-05-26 18:30         ` Jann Horn
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2026-05-26 18:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

On 05/26, Jann Horn wrote:
>
> On Tue, May 26, 2026 at 10:48 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > On 05/18, Jann Horn wrote:
> > >
> > > Fix the easy cases where procfs currently calls ptrace_may_access() without
> > > exec_update_lock protection, where the fix is to simply add the extra lock
> > > or use mm_access():
> >
> > I thought about this too, but I do not know if it is fine performance wise...
> >
> > And what about proc_coredump_filter_write() which doesn't use ptrace_may_access() ?
>
> Yeah, this series doesn't fix everything,

Aah... Of course, I understand. I wasn't clear. Sorry if it read as
"you missed proc_coredump_filter_write" on my part.

What I actually tried to ask:

	- Do you think it makes sense to fix proc_coredump_filter_write()
	  as well?

	- If yes, do you think we should add another down_read(exec_update_lock) +
	  ptrace_may_access() to proc_coredump_filter_write()? Or perhaps we
	  should discuss other approaches (exec_id/seqcount/etc) from the very
	  beginning?

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-25 19:56 ` Eric W. Biederman
  2026-05-26 11:10   ` Oleg Nesterov
@ 2026-05-26 18:22   ` Jann Horn
  2026-05-27 12:01     ` Christian Brauner
       [not found]     ` <87wlwny905.fsf@email.froward.int.ebiederm.org>
  1 sibling, 2 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-26 18:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> I have added a couple more people who might be interested.
>
> Kees Cook because as you have structured this it is an exec problem.
>
> Oleg Nesterov as he is knowledgeable about ptrace.
>
> Jann Horn <jannh@google.com> writes:
>
> > My understanding is that procfs is effectively maintained by the VFS
> > maintainers (though scripts/get_maintainer.pl claims that there are
> > no maintainers for procfs because the VFS entry only claims files
> > directly in fs/, and the procfs entry has no maintainers listed on
> > it).
> >
> > In procfs, most uses of ptrace_may_access() should use
> > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> > execve() (like setuid binary execution).
> >
> > This series doesn't fix all the remaining issues in procfs, but it fixes
> > the easy cases for now; I will probably follow up with fixes for the
> > gnarlier cases later unless someone else wants to do that.
> >
> > I have checked that procfs files still work with these changes and that
> > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> >
> > (checkpatch complains about missing argument names in
> > proc_op::proc_get_link, but that was already the case before my
> > patch.)
>
>
> I think I finally have my context paged back in so I can intelligently
> say something about this series.
>
> The scenario you are worried about is when exec gains privileges,
> and we read through proc and authenticate with the old credentials
> instead of the new credentials.
>
> Question 1.
>
> Assuming the executable is world readable (which they generally are)
> is there anything that becomes accessible in that race that was
> not already accessible?

I believe so - the gnarliest example I am thinking of is:
Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
sets S_IRWXUGO), so their access control is purely based on being able
to pathwalk to the memfd's inode. If you can race
open(/proc/$pid/fd/$n) with the process $pid going through setuid
execution and calling memfd_create(), you should be able to get
read+write access to the memfd created by the setuid binary that was
supposed to be private.

(But I have not tested that and don't know if there are actually any
setuid binaries that happen to use memfds.)
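
To make the point about memfd permissions concrete, here is a minimal
userspace sketch (not part of the series; the helper name
memfd_permission_bits() is made up for illustration). It creates a memfd
through the raw memfd_create syscall and reports the permission bits of
its inode; it returns -1 where the syscall is unavailable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous memfd and return the
 * permission bits of its inode (st_mode & 0777), or -1 if the
 * memfd_create syscall or fstat() fails (for example in a sandbox).
 */
static int memfd_permission_bits(unsigned int flags)
{
	struct stat st;
	int fd, mode;

	fd = (int)syscall(SYS_memfd_create, "demo", flags);
	if (fd < 0)
		return -1;

	/* The mode lives on the inode, so fstat() on the fd reports it. */
	mode = (fstat(fd, &st) == 0) ? (int)(st.st_mode & 0777) : -1;
	close(fd);
	return mode;
}
```

Calling memfd_permission_bits(0) should report one of the two modes
mentioned above; which one may depend on kernel version and
configuration. Since the mode bits do not restrict anyone, the only
thing keeping other users out is whether they can reach the inode, for
example through /proc/$pid/fd/$n.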

> Question 2.
>
> How does this race compare to racing with setresuid?
> Do we need to fix the setresuid case as well?

Which setresuid case? setresuid clears the dumpable flag and has a
memory barrier that is supposed to make that properly ordered against
ptrace_may_access(); so setresuid() should normally not cause a task
to become traceable, though that could maybe happen in weird
scenarios.

I think another case we should probably care about is what happens if
a process which is only protected against ptrace by being non-dumpable
goes through execve() - it shouldn't be possible to access resources
associated with the pre-execve state while checking against the
post-execve dumpability. It might be important for this that the
do_close_on_exec() logic currently happens after committing the
dumpable state in exec_mmap()...

> Question 3.
> Do we care about the case when a privileged process calls a setuid
> process and drops privileges?

I don't understand the question. Hmm - do you mean a case where a
process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
1000 binary? I think we probably don't specifically care about that...

I think another scenario that we ideally might want to care about is
what happens if a process which runs with a normal user's UIDs, but is
non-dumpable, goes through execve() of a normal binary while another
process tries to inspect its FDs or address space layout - it probably
shouldn't be possible to get information about the pre-execve MM and
O_CLOEXEC file descriptors.

> Question 4.
> Is it possible to use a seq_lock instead of reader writer semaphore?
> Or is that only for non-sleeping readers?

Linux seqcounts are 32-bit, which means they are always kind of dodgy,
but they are particularly dodgy if a reader can be forced to sleep for
an extended amount of time. I don't see a reason why we couldn't, in
general, use a 64-bit sequence count for readers that may need to
sleep while reading.
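
As a small userspace illustration of the wraparound concern (the helper
names are hypothetical and this is not kernel code): a reader that only
compares its snapshot against the current counter cannot notice a
32-bit counter that has advanced by exactly 2^32 while the reader was
asleep, whereas a 64-bit counter still differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reader-side check: has the counter moved since the snapshot was taken? */
static bool seq32_changed(uint32_t snapshot, uint32_t now)
{
	return snapshot != now;
}

static bool seq64_changed(uint64_t snapshot, uint64_t now)
{
	return snapshot != now;
}

/* Apply n writer-side increments; the 32-bit variant wraps modulo 2^32. */
static uint32_t seq32_advance(uint32_t seq, uint64_t n)
{
	return (uint32_t)(seq + n);
}

static uint64_t seq64_advance(uint64_t seq, uint64_t n)
{
	return seq + n;
}
```

With a snapshot of 4 and 2^32 increments, seq32_changed() reports no
change while seq64_changed() does. If each write section bumps the
counter twice (odd on entry, even on exit), 2^31 write sections are
enough for the 32-bit case.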

> There have been a number of nasty cases lurking in the background
> involving seccomp filters, PTRACE_EVENT_EXIT, de_thread and the like.
>
> Blocking locks, especially ones that get widely used, just scare me in
> this area.  Being able to see that something happened between start and
> finish and say -EAGAIN or retrying internally seems like it would be
> much less prone to weirdness.

I guess for do_task_stat() we could just switch to down_read_trylock()
instead of down_read_killable(), and proceed with "permitted = 0" if
the trylock fails - almost all the values shown are related to the MM,
and are therefore not stable across execve() anyway.
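
Roughly, the control flow I have in mind looks like the following
userspace sketch. Everything here is a stand-in: toy_rwlock is a
made-up C11-atomics lock rather than an rwsem, and the may_access
argument takes the place of the ptrace_may_access() result.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reader/writer lock: -1 = held for writing, n >= 0 = n readers. */
struct toy_rwlock {
	atomic_int state;
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
	int cur = atomic_load(&l->state);

	/* Give up immediately if a writer holds the lock. */
	while (cur >= 0) {
		if (atomic_compare_exchange_weak(&l->state, &cur, cur + 1))
			return true;
	}
	return false;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub(&l->state, 1);
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	atomic_store(&l->state, 0);
}

/*
 * Model of the idea: if the trylock fails (an "execve" is in flight),
 * do not wait; just report the unprivileged view (permitted = 0).
 */
static int stat_permitted(struct toy_rwlock *l, bool may_access)
{
	int permitted = 0;

	if (toy_read_trylock(l)) {
		permitted = may_access ? 1 : 0;
		/* Sensitive fields would be read here, under the lock. */
		toy_read_unlock(l);
	}
	return permitted;
}
```

The reader never sleeps on the lock in this variant; the cost is that a
caller who would have been permitted sees the restricted output if it
happens to race with a writer.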

I think using seqlocks with a retry loop wouldn't work with the code
as-is, because in the middle of execve, there are points where the
file descriptor table still contains entries that we don't want to be
accessible with the task's current dumpability, or where we have
already switched to a new MM without having updated the credentials
yet.
I think we could make it work - we could add another set of creds to
the task, and let ptrace_may_access() check against both the
pre-execve and post-execve credentials and dumpability, but that feels
overengineered.

> The ugly with PTRACE_EVENT_EXIT as I recall is that if ptrace stops one
> of the threads (not the one calling exec) at PTRACE_EVENT_EXIT it can
> block de_thread, which blocks the rest of exec.  But there is something
> in there where the ptracer hangs waiting for the exec to complete.  So
> everything just stalls.  The ptracer waiting for exec, the exec waiting
> for the ptracer.  SIGKILL can get you out of that mess last I looked.
> Still it is an ugly mess.
>
> Getting everything away from that mess is why we have exec_update_lock
> instead of just cred_guard_mutex.

And the exec_update_lock avoids that because it is not held in
de_thread(), only across the following part of execve, where not much
stuff happens that could block for a long time, right?

load_elf_binary
  begin_new_exec
    exec_mmap
      down_write_killable(&tsk->signal->exec_update_lock)
      mmput [brauner@ has a patch to move this]
    flush_thread
    do_close_on_exec [notably this can lead to filp->f_op->flush()
                      calls, which AFAIK can block forever on FUSE/NFS]
    commit_creds
  setup_new_exec
    up_write(&me->signal->exec_update_lock)

I think we might want to do something about the do_close_on_exec()
stuff, like deferring the filp_flush() to a later time, but I don't
really see deadlock potential here.

> I would really appreciate hearing the scenarios you are aiming to fix
> and how this fixes them.  There are enough races and special cases
> I don't feel comfortable reading that we just need exec_update_lock
> around ptrace_may_access.  It is not clear to me that is sufficient
> to close the small races we are worried about here.

The main things I'm trying to address here are scenarios of the shape
"process A accesses process B through procfs while process B goes
through a privileged execution (in particular by executing a setuid
binary)". /proc/$pid/fd/$fd (part of patch 2) seems particularly
egregious because it can likely be used to gain access to memfds of
setuid binaries; other files are less egregious, but might lead to
things like userspace ASLR/pointer leaks (in particular do_task_stat()
and proc_map_files_readdir()).

A second scenario I have in mind is "process A accesses process B
through procfs while process B goes through a normal execution that
makes it dumpable".

The overarching logic I have at the back of my mind here is: If an
"incarnation" is the combination of a process and an mm_struct, then
holding exec_update_lock ensures that the credentials/dumpability we
have observed are associated with the same incarnation as the MM and
the file descriptor table whose properties we read afterwards.

> If I could trace through someone else's logic I could be convinced
> and the next people to deal with the code could look at it and see
> ah.  That is the detail that was missed when it has to be fixed again.


* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-26 18:22       ` Oleg Nesterov
@ 2026-05-26 18:30         ` Jann Horn
  0 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-26 18:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable,
	Kees Cook

On Tue, May 26, 2026 at 8:22 PM Oleg Nesterov <oleg@redhat.com> wrote:
> On 05/26, Jann Horn wrote:
> >
> > On Tue, May 26, 2026 at 10:48 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > > On 05/18, Jann Horn wrote:
> > > >
> > > > Fix the easy cases where procfs currently calls ptrace_may_access() without
> > > > exec_update_lock protection, where the fix is to simply add the extra lock
> > > > or use mm_access():
> > >
> > > I thought about this too, but I do not know if it is fine performance wise...
> > >
> > > And what about proc_coredump_filter_write() which doesn't use ptrace_may_access() ?
> >
> > Yeah, this series doesn't fix everything,
>
> Aah... Of course, I understand. I wasn't clear. Sorry if it looked like
> "you missed proc_coredump_filter_write" from my side.
>
> What I actually tried to ask:
>
>         - Do you think it makes sense to fix proc_coredump_filter_write()
>           as well?

Yes. Another example I've seen that should probably also be fixed is
seq_show() in fs/proc/fd.c, the handler for /proc/$pid/fdinfo/$fd,
which also currently has no checks at read() time.

>         - If yes, do you think we should add another down_read(exec_update_lock) +
>           ptrace_may_access() into proc_coredump_filter_write() ? Or perhaps we
>           should discuss other approaches (exec_id/seqcount/etc) from the very
>           beginning?

I had thought that this series would be the easy, uncontroversial
improvements, and that we could then think about the harder aspect
with read handlers afterwards. I guess I was wrong about this being
the uncontroversial part.


* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-26 18:22   ` Jann Horn
@ 2026-05-27 12:01     ` Christian Brauner
  2026-05-27 12:31       ` Christian Brauner
  2026-05-27 13:44       ` Jann Horn
       [not found]     ` <87wlwny905.fsf@email.froward.int.ebiederm.org>
  1 sibling, 2 replies; 22+ messages in thread
From: Christian Brauner @ 2026-05-27 12:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric W. Biederman, Alexander Viro, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 4821 bytes --]

On Tue, May 26, 2026 at 08:22:38PM +0200, Jann Horn wrote:
> On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > I have added a couple more people who might be interested.
> >
> > Kees Cook because as you have structured this it is an exec problem.
> >
> > Oleg Nesterov as he is knowledgeable about ptrace.
> >
> > Jann Horn <jannh@google.com> writes:
> >
> > > My understanding is that procfs is effectively maintained by the VFS
> > > maintainers (though scripts/get_maintainer.pl claims that there are
> > > no maintainers for procfs because the VFS entry only claims files
> > > directly in fs/, and the procfs entry has no maintainers listed on
> > > it).
> > >
> > > In procfs, most uses of ptrace_may_access() should use
> > > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> > > execve() (like setuid binary execution).
> > >
> > > This series doesn't fix all the remaining issues in procfs, but it fixes
> > > the easy cases for now; I will probably follow up with fixes for the
> > > gnarlier cases later unless someone else wants to do that.
> > >
> > > I have checked that procfs files still work with these changes and that
> > > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> > >
> > > (checkpatch complains about missing argument names in
> > > proc_op::proc_get_link, but that was already the case before my
> > > patch.)
> >
> >
> > I think I finally have my context paged back in so I can intelligently
> > say something about this series.
> >
> > The scenario you are worried about is when exec gains privileges,
> > and we read through proc and authenticate with the old credentials
> > instead of the new credentials.
> >
> > Question 1.
> >
> > Assuming the executable is world readable (which they generally are)
> > is there anything that becomes accessible in that race that was
> > not already accessible?
> 
> I believe so - the gnarliest example I am thinking of is:
> Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
> sets S_IRWXUGO), so their access control is purely based on being able
> to pathwalk to the memfd's inode. If you can race
> open(/proc/$pid/fd/$n) with the process $pid going through setuid
> execution and calling memfd_create(), you should be able to get
> read+write access to the memfd created by the setuid binary that was
> supposed to be private.
> 
> (But I have not tested that and don't know if there are actually any
> setuid binaries that happen to use memfds.)
> 
> > Question 2.
> >
> > How does this race compare to racing with setresuid?
> > Do we need to fix the setresuid case as well?
> 
> Which setresuid case? setresuid clears the dumpable flag and has a
> memory barrier that is supposed to make that properly ordered against
> ptrace_may_access(); so setresuid() should normally not cause a task
> to become traceable, though that could maybe happen in weird
> scenarios.
> 
> I think another case we should probably care about is what happens if
> a process which is only protected against ptrace by being non-dumpable
> goes through execve() - it shouldn't be possible to access resources
> associated with the pre-execve state while checking against the
> post-execve dumpability. It might be important for this that the
> do_close_on_exec() logic currently happens after committing the
> dumpable state in exec_mmap()...
> 
> > Question 3.
> > Do we care about the case when a privileged process calls a setuid
> > process and drops privileges?
> 
> I don't understand the question. Hmm - do you mean a case where a
> process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> 1000 binary? I think we probably don't specifically care about that...
> 
> I think another scenario that we ideally might want to care about is
> what happens if a process which runs with a normal user's UIDs, but is
> non-dumpable, goes through execve() of a normal binary while another
> process tries to inspect its FDs or address space layout - it probably
> shouldn't be possible to get information about the pre-execve MM and
> O_CLOEXEC file descriptors.
> 
> > Question 4.
> > Is it possible to use a seq_lock instead of reader writer semaphore?
> > Or is that only for non-sleeping readers?
> 
> Linux seqcounts are 32-bit, which means they are always kind of dodgy,
> but they are particularly dodgy if a reader can be forced to sleep for
> an extended amount of time. I don't see a reason why we couldn't, in
> general, use a 64-bit sequence count for readers that may need to
> sleep while reading.

I have a patch series for this that I started working on after merging
your series, for precisely this reason: performance. It's a few days old
now. I've tried various approaches, and I started with a simple 32-bit
counter as the POC. See the appended (untested) patches.
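
For readers who want the general shape without the kernel plumbing, here
is a single-threaded userspace model of a speculate-then-fall-back
reader. It is only a sketch under stated assumptions: toy_task,
toy_exec(), toy_read() and toy_setuid_exec() are invented names, a plain
counter stands in for holding the lock for read, and a hook simulates an
exec landing between the permission check and the use. The locked pass
does not invoke the hook, modeling a writer that would be excluded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
	atomic_uint seq;	/* odd while a writer is mid-update */
	int lock_readers;	/* stand-in for the lock held for read */
	int cred;		/* 0 = unprivileged, nonzero = privileged */
	int secret;		/* stand-in for MM / fd table contents */
};

/* Writer: change credentials and protected state in one write section. */
static void toy_exec(struct toy_task *t, int cred, int secret)
{
	atomic_fetch_add(&t->seq, 1);	/* now odd */
	t->cred = cred;
	t->secret = secret;
	atomic_fetch_add(&t->seq, 1);	/* even again */
}

/* Example hook: the target "executes a setuid binary" mid-read. */
static void toy_setuid_exec(struct toy_task *t)
{
	toy_exec(t, 1, 99);
}

/* Reader: return the secret if the target is unprivileged, else -1. */
static int toy_read(struct toy_task *t, void (*race_hook)(struct toy_task *))
{
	unsigned int seq = 0;
	bool allowed;
	int result;

retry:
	if (!(seq & 1)) {
		seq = atomic_load(&t->seq);
		if (seq & 1)
			seq = 1;	/* writer in flight: use the lock */
	}
	if (seq & 1)
		t->lock_readers++;	/* real code would block here */

	allowed = (t->cred == 0);	/* permission check */
	if (race_hook) {		/* simulated concurrent exec */
		race_hook(t);
		race_hook = NULL;
	}
	result = allowed ? t->secret : -1;

	/* Lockless pass raced with a writer: discard and redo under lock. */
	if (!(seq & 1) && atomic_load(&t->seq) != seq) {
		seq = 1;
		goto retry;
	}
	if (seq & 1)
		t->lock_readers--;
	return result;
}
```

Without the revalidation step, the raced read would pair the old
permission check with the new secret; with it, the second pass runs
under the lock and sees the new credentials.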

[-- Attachment #2: 0001-exec-bump-exec_update_seq-across-the-exec_update_loc.patch --]
[-- Type: text/x-diff, Size: 3190 bytes --]

From 6e3972c2f8d33216f6fa500618807fc75c6c1355 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:16:13 +0200
Subject: [PATCH 1/8] exec: bump exec_update_seq across the exec_update_lock
 write side

execve() (exec_mmap() -> begin_new_exec() -> setup_new_exec()) and
Landlock TSYNC are the only writers of exec_update_lock. Bump
exec_update_seq to odd for the duration of each write-held section, the
way mmap_lock maintains mm->mm_lock_seq:

 - exec: exec_update_seq_begin() right after down_write_killable() in
   exec_mmap(); exec_update_seq_end() before each matching up_write() (the
   exec_mmap() error path, the begin_new_exec() error path, and the normal
   release in setup_new_exec()). Every acquire reaches exactly one of
   those releases, so the seqcount is even whenever the lock is not held
   for writing.
 - Landlock: begin after down_write_trylock(), end before up_write().

The bump uses the non-preempt-disabling do_raw_write_seqcount_*() helpers,
so the sleeping exec and TSYNC write sections are unaffected.

No reader consults exec_update_seq yet: no functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/exec.c                 | 4 ++++
 security/landlock/tsync.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 824b46c069ae..1915acb0b44d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -859,6 +859,7 @@ static int exec_mmap(struct linux_binprm *bprm)
 	ret = down_write_killable(&tsk->signal->exec_update_lock);
 	if (ret)
 		return ret;
+	exec_update_seq_begin(tsk->signal);
 
 	if (old_mm) {
 		/*
@@ -868,6 +869,7 @@ static int exec_mmap(struct linux_binprm *bprm)
 		 */
 		ret = mmap_read_lock_killable(old_mm);
 		if (ret) {
+			exec_update_seq_end(tsk->signal);
 			up_write(&tsk->signal->exec_update_lock);
 			return ret;
 		}
@@ -1300,6 +1302,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	return 0;
 
 out_unlock:
+	exec_update_seq_end(me->signal);
 	up_write(&me->signal->exec_update_lock);
 	if (!bprm->cred)
 		mutex_unlock(&me->signal->cred_guard_mutex);
@@ -1345,6 +1348,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 	 * some architectures like powerpc
 	 */
 	me->mm->task_size = TASK_SIZE;
+	exec_update_seq_end(me->signal);
 	up_write(&me->signal->exec_update_lock);
 	mutex_unlock(&me->signal->cred_guard_mutex);
 
diff --git a/security/landlock/tsync.c b/security/landlock/tsync.c
index c5730bbd9ed3..472c02cf71e9 100644
--- a/security/landlock/tsync.c
+++ b/security/landlock/tsync.c
@@ -492,6 +492,7 @@ int landlock_restrict_sibling_threads(const struct cred *old_cred,
 	 */
 	if (!down_write_trylock(&current->signal->exec_update_lock))
 		return restart_syscall();
+	exec_update_seq_begin(current->signal);
 
 	/*
 	 * We schedule a pseudo-signal task_work for each of the calling task's
@@ -614,6 +615,7 @@ int landlock_restrict_sibling_threads(const struct cred *old_cred,
 		wait_for_completion(&shared_ctx.all_finished);
 
 	tsync_works_release(&works);
+	exec_update_seq_end(current->signal);
 	up_write(&current->signal->exec_update_lock);
 	return atomic_read(&shared_ctx.preparation_error);
 }
-- 
2.47.3


[-- Attachment #3: 0002-exec-add-a-speculate-or-lock-helper-for-exec_update-.patch --]
[-- Type: text/x-diff, Size: 2469 bytes --]

From 8c4daca0cc09a90e7a4acce4d650b3ea9b06a80b Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:26:54 +0200
Subject: [PATCH 2/8] exec: add a speculate-or-lock helper for exec_update
 readers

The exec_update_seq readers share one control flow: try a lockless
seqcount read and fall back to exec_update_lock if a writer (exec or
Landlock TSYNC) is in flight. Rather than open-code it in every reader,
add a trio modeled on read_seqbegin_or_lock() / need_seqretry() /
done_seqretry() (seqlock.h), adapted for exec_update_seq paired with the
killable exec_update_lock:

  exec_update_read_begin_or_killable() - lockless first pass (seq even); on a
    racing writer escalate to down_read_killable() (seq odd); -EINTR if
    killed while waiting.
  exec_update_read_needs_retry()   - true if the lockless pass raced, in
    which case the caller drops any ref taken, sets seq = 1, and retries.
  exec_update_read_done()          - releases exec_update_lock if taken.

No users yet; no functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/sched/signal.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 952f0368f201..0c1bb3b530e4 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -311,6 +311,36 @@ static inline bool exec_update_speculate_retry(struct signal_struct *sig,
 	return read_seqcount_retry(&sig->exec_update_seq, seq);
 }
 
+/* Speculate-or-lock exec_update reader, mirroring read_seqbegin_or_lock(). */
+static inline int exec_update_read_begin_or_killable(struct signal_struct *sig,
+						 unsigned int *seq)
+{
+	int ret;
+
+	if (!(*seq & 1)) {
+		if (exec_update_speculate_try_begin(sig, seq))
+			return 0;
+		*seq = 1;
+	}
+	ret = down_read_killable(&sig->exec_update_lock);
+	if (ret)
+		*seq = 0;
+	return ret;
+}
+
+static inline bool exec_update_read_needs_retry(struct signal_struct *sig,
+						unsigned int seq)
+{
+	return !(seq & 1) && exec_update_speculate_retry(sig, seq);
+}
+
+static inline void exec_update_read_done(struct signal_struct *sig,
+					 unsigned int seq)
+{
+	if (seq & 1)
+		up_read(&sig->exec_update_lock);
+}
+
 extern void flush_signals(struct task_struct *);
 extern void ignore_signals(struct task_struct *);
 extern void flush_signal_handlers(struct task_struct *, int force_default);
-- 
2.47.3


[-- Attachment #4: 0003-mm-take-a-lock-free-exec_update_seq-fast-path-in-mm_.patch --]
[-- Type: text/x-diff, Size: 2113 bytes --]

From 300838eeb13d5d96067d8475af29753016e83728 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:27:57 +0200
Subject: [PATCH 3/8] mm: take a lock-free exec_update_seq fast path in
 mm_access()

mm_access() takes exec_update_lock for read only to check
ptrace_may_access() against stable credentials before grabbing the target
mm. Convert it to exec_update_read_begin_or_killable(): the common case
resolves and access-checks the mm with no lock; a concurrent exec()/TSYNC
falls back to exec_update_lock. The shared resolve/check logic moves to
__mm_access(), and the speculative mm reference is dropped before retry.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 kernel/fork.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 377125eff8a9..250a7e1125e6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1394,23 +1394,38 @@ static bool may_access_mm(struct mm_struct *mm, struct task_struct *task, unsign
 	return false;
 }
 
+static struct mm_struct *__mm_access(struct task_struct *task, unsigned int mode)
+{
+	struct mm_struct *mm = get_task_mm(task);
+
+	if (!mm)
+		return ERR_PTR(-ESRCH);
+	if (!may_access_mm(mm, task, mode)) {
+		mmput(mm);
+		return ERR_PTR(-EACCES);
+	}
+	return mm;
+}
+
 struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 {
+	struct signal_struct *sig = task->signal;
 	struct mm_struct *mm;
+	unsigned int seq = 0;
 	int err;
 
-	err =  down_read_killable(&task->signal->exec_update_lock);
+retry:
+	err = exec_update_read_begin_or_killable(sig, &seq);
 	if (err)
 		return ERR_PTR(err);
-
-	mm = get_task_mm(task);
-	if (!mm) {
-		mm = ERR_PTR(-ESRCH);
-	} else if (!may_access_mm(mm, task, mode)) {
-		mmput(mm);
-		mm = ERR_PTR(-EACCES);
-	}
-	up_read(&task->signal->exec_update_lock);
+	mm = __mm_access(task, mode);
+	if (exec_update_read_needs_retry(sig, seq)) {
+		if (!IS_ERR(mm))
+			mmput(mm);
+		seq = 1;
+		goto retry;
+	}
+	exec_update_read_done(sig, seq);
 
 	return mm;
 }
-- 
2.47.3


[-- Attachment #5: 0004-pidfd-take-a-lock-free-exec_update_seq-fast-path-in-.patch --]
[-- Type: text/x-diff, Size: 1952 bytes --]

From 4a005bdabfb5647992f776751dcb5221b6d0da21 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:27:57 +0200
Subject: [PATCH 4/8] pidfd: take a lock-free exec_update_seq fast path in
 __pidfd_fget()

Convert __pidfd_fget() (pidfd_getfd(2)) to
exec_update_read_begin_or_killable(): check ptrace_may_access() and
fget_task() lock-free, falling back to exec_update_lock on a racing
exec()/TSYNC. The check + fget moves to pidfd_fget_access(), and a
speculative file reference is dropped before retry. The exiting-task
fixup (PF_EXITING -> ESRCH/EBADF) is unchanged.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 kernel/pid.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index fd5c2d4aa349..7ac85f417485 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -876,21 +876,32 @@ static __init int pid_namespace_sysctl_init(void)
 }
 subsys_initcall(pid_namespace_sysctl_init);
 
+static struct file *pidfd_fget_access(struct task_struct *task, int fd)
+{
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
+		return ERR_PTR(-EPERM);
+	return fget_task(task, fd);
+}
+
 static struct file *__pidfd_fget(struct task_struct *task, int fd)
 {
+	struct signal_struct *sig = task->signal;
 	struct file *file;
+	unsigned int seq = 0;
 	int ret;
 
-	ret = down_read_killable(&task->signal->exec_update_lock);
+retry:
+	ret = exec_update_read_begin_or_killable(sig, &seq);
 	if (ret)
 		return ERR_PTR(ret);
-
-	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
-		file = fget_task(task, fd);
-	else
-		file = ERR_PTR(-EPERM);
-
-	up_read(&task->signal->exec_update_lock);
+	file = pidfd_fget_access(task, fd);
+	if (exec_update_read_needs_retry(sig, seq)) {
+		if (!IS_ERR_OR_NULL(file))
+			fput(file);
+		seq = 1;
+		goto retry;
+	}
+	exec_update_read_done(sig, seq);
 
 	if (!file) {
 		/*
-- 
2.47.3


[-- Attachment #6: 0005-futex-take-a-lock-free-exec_update_seq-fast-path-in-.patch --]
[-- Type: text/x-diff, Size: 2737 bytes --]

From 1e1d0adfe664adc4ba60f7209911f2eae90605ee Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:27:58 +0200
Subject: [PATCH 5/8] futex: take a lock-free exec_update_seq fast path in
 get_robust_list

Convert futex_get_robust_list_common() to
exec_update_read_begin_or_killable(): check ptrace_may_access() and read the
target's robust_list head lock-free, falling back to exec_update_lock on
a racing exec()/TSYNC. Nothing is referenced in the protected section, so
there is nothing to undo on retry. Read the head with READ_ONCE() since
the owner may update it concurrently.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 kernel/futex/syscalls.c | 45 ++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index 77ad9691f6a6..ec7a174f2682 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -43,16 +43,17 @@ static inline void __user *futex_task_robust_list(struct task_struct *p, bool co
 {
 #ifdef CONFIG_COMPAT
 	if (compat)
-		return p->compat_robust_list;
+		return READ_ONCE(p->compat_robust_list);
 #endif
-	return p->robust_list;
+	return READ_ONCE(p->robust_list);
 }
 
 static void __user *futex_get_robust_list_common(int pid, bool compat)
 {
 	struct task_struct *p = current;
 	void __user *head;
-	int ret;
+	unsigned int seq = 0;
+	int err;
 
 	scoped_guard(rcu) {
 		if (pid) {
@@ -64,29 +65,27 @@ static void __user *futex_get_robust_list_common(int pid, bool compat)
 	}
 
 	/*
-	 * Hold exec_update_lock to serialize with concurrent exec()
-	 * so ptrace_may_access() is checked against stable credentials
+	 * Serialize ptrace_may_access() against a concurrent exec() credential
+	 * change; lock-free fast path with an exec_update_lock fallback.
 	 */
-	ret = down_read_killable(&p->signal->exec_update_lock);
-	if (ret)
-		goto err_put;
-
-	ret = -EPERM;
-	if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
-		goto err_unlock;
-
-	head = futex_task_robust_list(p, compat);
-
-	up_read(&p->signal->exec_update_lock);
+retry:
+	err = exec_update_read_begin_or_killable(p->signal, &seq);
+	if (err) {
+		head = (void __user *)ERR_PTR(err);
+		goto out;
+	}
+	if (ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
+		head = futex_task_robust_list(p, compat);
+	else
+		head = (void __user *)ERR_PTR(-EPERM);
+	if (exec_update_read_needs_retry(p->signal, seq)) {
+		seq = 1;
+		goto retry;
+	}
+	exec_update_read_done(p->signal, seq);
+out:
 	put_task_struct(p);
-
 	return head;
-
-err_unlock:
-	up_read(&p->signal->exec_update_lock);
-err_put:
-	put_task_struct(p);
-	return (void __user *)ERR_PTR(ret);
 }
 
 /**
-- 
2.47.3


[-- Attachment #7: 0006-kcmp-take-a-lock-free-exec_update_seq-fast-path.patch --]
[-- Type: text/x-diff, Size: 5216 bytes --]

From c3d5427c1ef28537a5a50ef571087b04fde518ce Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:29:17 +0200
Subject: [PATCH 6/8] kcmp: take a lock-free exec_update_seq fast path

kcmp() compares two tasks' resources after ptrace_may_access() checks on
both, today under both tasks' exec_update_locks (taken in pointer order to
avoid ABBA). Add a two-task seqcount fast path: snapshot both tasks'
exec_update_seq, run the checks and comparison, then revalidate both; on a
racing exec()/TSYNC of either task fall back to the existing ordered
double-lock (kcmp_lock). The fast path takes no lock, so it needs no
ordering. The comparison logic moves to kcmp_access() and pointer reads
use READ_ONCE(); get_file_raw_ptr() takes and drops its own reference, so
a retried comparison leaks nothing.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 kernel/kcmp.c | 106 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 66 insertions(+), 40 deletions(-)

diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 7c1a65bd5f8d..8fee7a5752d4 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -132,39 +132,15 @@ static int kcmp_epoll_target(struct task_struct *task1,
 }
 #endif
 
-SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
-		unsigned long, idx1, unsigned long, idx2)
+/* Compare two tasks' resources by obfuscated pointer; caller serializes. */
+static int kcmp_access(struct task_struct *task1, struct task_struct *task2,
+		       int type, unsigned long idx1, unsigned long idx2)
 {
-	struct task_struct *task1, *task2;
 	int ret;
 
-	rcu_read_lock();
-
-	/*
-	 * Tasks are looked up in caller's PID namespace only.
-	 */
-	task1 = find_task_by_vpid(pid1);
-	task2 = find_task_by_vpid(pid2);
-	if (unlikely(!task1 || !task2))
-		goto err_no_task;
-
-	get_task_struct(task1);
-	get_task_struct(task2);
-
-	rcu_read_unlock();
-
-	/*
-	 * One should have enough rights to inspect task details.
-	 */
-	ret = kcmp_lock(&task1->signal->exec_update_lock,
-			&task2->signal->exec_update_lock);
-	if (ret)
-		goto err;
 	if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
-	    !ptrace_may_access(task2, PTRACE_MODE_READ_REALCREDS)) {
-		ret = -EPERM;
-		goto err_unlock;
-	}
+	    !ptrace_may_access(task2, PTRACE_MODE_READ_REALCREDS))
+		return -EPERM;
 
 	switch (type) {
 	case KCMP_FILE: {
@@ -180,24 +156,29 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
 		break;
 	}
 	case KCMP_VM:
-		ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
+		ret = kcmp_ptr(READ_ONCE(task1->mm), READ_ONCE(task2->mm),
+			       KCMP_VM);
 		break;
 	case KCMP_FILES:
-		ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
+		ret = kcmp_ptr(READ_ONCE(task1->files), READ_ONCE(task2->files),
+			       KCMP_FILES);
 		break;
 	case KCMP_FS:
-		ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
+		ret = kcmp_ptr(READ_ONCE(task1->fs), READ_ONCE(task2->fs),
+			       KCMP_FS);
 		break;
 	case KCMP_SIGHAND:
-		ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
+		ret = kcmp_ptr(READ_ONCE(task1->sighand),
+			       READ_ONCE(task2->sighand), KCMP_SIGHAND);
 		break;
 	case KCMP_IO:
-		ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
+		ret = kcmp_ptr(READ_ONCE(task1->io_context),
+			       READ_ONCE(task2->io_context), KCMP_IO);
 		break;
 	case KCMP_SYSVSEM:
 #ifdef CONFIG_SYSVIPC
-		ret = kcmp_ptr(task1->sysvsem.undo_list,
-			       task2->sysvsem.undo_list,
+		ret = kcmp_ptr(READ_ONCE(task1->sysvsem.undo_list),
+			       READ_ONCE(task2->sysvsem.undo_list),
 			       KCMP_SYSVSEM);
 #else
 		ret = -EOPNOTSUPP;
@@ -211,10 +192,55 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
 		break;
 	}
 
-err_unlock:
-	kcmp_unlock(&task1->signal->exec_update_lock,
-		    &task2->signal->exec_update_lock);
-err:
+	return ret;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1, *task2;
+	struct signal_struct *sig1, *sig2;
+	unsigned int seq1, seq2;
+	int ret;
+
+	rcu_read_lock();
+
+	/*
+	 * Tasks are looked up in caller's PID namespace only.
+	 */
+	task1 = find_task_by_vpid(pid1);
+	task2 = find_task_by_vpid(pid2);
+	if (unlikely(!task1 || !task2))
+		goto err_no_task;
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	sig1 = task1->signal;
+	sig2 = task2->signal;
+
+	/*
+	 * Lock-free fast path: snapshot both tasks' exec_update_seq, compare,
+	 * then revalidate both.  Falls back to taking both exec_update_locks in
+	 * a deadlock-safe order if either task is mid-exec.
+	 */
+	if (exec_update_speculate_try_begin(sig1, &seq1) &&
+	    exec_update_speculate_try_begin(sig2, &seq2)) {
+		ret = kcmp_access(task1, task2, type, idx1, idx2);
+		if (!exec_update_speculate_retry(sig1, seq1) &&
+		    !exec_update_speculate_retry(sig2, seq2))
+			goto out;
+	}
+
+	ret = kcmp_lock(&sig1->exec_update_lock, &sig2->exec_update_lock);
+	if (ret)
+		goto out;
+	ret = kcmp_access(task1, task2, type, idx1, idx2);
+	kcmp_unlock(&sig1->exec_update_lock, &sig2->exec_update_lock);
+
+out:
 	put_task_struct(task1);
 	put_task_struct(task2);
 
-- 
2.47.3


[-- Attachment #8: 0007-proc-lock-free-exec_update_seq-fast-path-for-stack-s.patch --]
[-- Type: text/x-diff, Size: 5397 bytes --]

From 66149a0d2a4a816d9ddda938c59a4ca823e4999c Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:31:26 +0200
Subject: [PATCH 7/8] proc: lock-free exec_update_seq fast path for
 stack/syscall/personality

/proc/<pid>/{stack,syscall,personality} took exec_update_lock for read
(via lock_trace()) to check ptrace_may_access() and then read task state.
Convert all three to exec_update_read_begin_or_killable(): snapshot the
permission decision and the task state (stack trace, syscall info,
personality) in the speculative section, then emit after validation;
fall back to exec_update_lock on a racing exec()/TSYNC. With all three
callers converted, lock_trace() and unlock_trace() are removed.

Note: the stack unwind and task_current_syscall() now run inside the
speculative section and may re-run if a concurrent exec() of the target
is detected. They are idempotent (they fill a local buffer/struct and
output is emitted only after the section validates), and exec of a given
task is rare, so the bounded re-run is acceptable. /proc/<pid>/stack
stays root-only.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/proc/base.c | 93 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 59 insertions(+), 34 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 65f56136ec3f..83b851b7f9d9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -440,23 +440,6 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_KALLSYMS */
 
-static int lock_trace(struct task_struct *task)
-{
-	int err = down_read_killable(&task->signal->exec_update_lock);
-	if (err)
-		return err;
-	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
-		up_read(&task->signal->exec_update_lock);
-		return -EPERM;
-	}
-	return 0;
-}
-
-static void unlock_trace(struct task_struct *task)
-{
-	up_read(&task->signal->exec_update_lock);
-}
-
 #ifdef CONFIG_STACKTRACE
 
 #define MAX_STACK_TRACE_DEPTH	64
@@ -464,7 +447,10 @@ static void unlock_trace(struct task_struct *task)
 static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
 			  struct pid *pid, struct task_struct *task)
 {
+	struct signal_struct *sig = task->signal;
 	unsigned long *entries;
+	unsigned int seq = 0, i, nr_entries = 0;
+	bool allowed = false;
 	int err;
 
 	/*
@@ -486,19 +472,27 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
 	if (!entries)
 		return -ENOMEM;
 
-	err = lock_trace(task);
-	if (!err) {
-		unsigned int i, nr_entries;
-
+retry:
+	err = exec_update_read_begin_or_killable(sig, &seq);
+	if (err)
+		goto out;
+	allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+	if (allowed)
 		nr_entries = stack_trace_save_tsk(task, entries,
 						  MAX_STACK_TRACE_DEPTH, 0);
+	if (exec_update_read_needs_retry(sig, seq)) {
+		seq = 1;
+		goto retry;
+	}
+	exec_update_read_done(sig, seq);
 
-		for (i = 0; i < nr_entries; i++) {
-			seq_printf(m, "[<0>] %pB\n", (void *)entries[i]);
-		}
-
-		unlock_trace(task);
+	if (!allowed) {
+		err = -EPERM;
+		goto out;
 	}
+	for (i = 0; i < nr_entries; i++)
+		seq_printf(m, "[<0>] %pB\n", (void *)entries[i]);
+out:
 	kfree(entries);
 
 	return err;
@@ -676,15 +670,31 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
 static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 			    struct pid *pid, struct task_struct *task)
 {
+	struct signal_struct *sig = task->signal;
 	struct syscall_info info;
 	u64 *args = &info.data.args[0];
+	unsigned int seq = 0;
+	bool allowed = false;
+	int running = 0;
 	int res;
 
-	res = lock_trace(task);
+retry:
+	res = exec_update_read_begin_or_killable(sig, &seq);
 	if (res)
 		return res;
+	allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+	if (allowed)
+		running = task_current_syscall(task, &info);
+	if (exec_update_read_needs_retry(sig, seq)) {
+		seq = 1;
+		goto retry;
+	}
+	exec_update_read_done(sig, seq);
+
+	if (!allowed)
+		return -EPERM;
 
-	if (task_current_syscall(task, &info))
+	if (running)
 		seq_puts(m, "running\n");
 	else if (info.data.nr < 0)
 		seq_printf(m, "%d 0x%llx 0x%llx\n",
@@ -695,7 +705,6 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 		       info.data.nr,
 		       args[0], args[1], args[2], args[3], args[4], args[5],
 		       info.sp, info.data.instruction_pointer);
-	unlock_trace(task);
 
 	return 0;
 }
@@ -3221,12 +3230,28 @@ static const struct file_operations proc_setgroups_operations = {
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
 {
-	int err = lock_trace(task);
-	if (!err) {
-		seq_printf(m, "%08x\n", task->personality);
-		unlock_trace(task);
+	struct signal_struct *sig = task->signal;
+	unsigned int seq = 0, personality = 0;
+	bool allowed = false;
+	int err;
+
+retry:
+	err = exec_update_read_begin_or_killable(sig, &seq);
+	if (err)
+		return err;
+	allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+	if (allowed)
+		personality = READ_ONCE(task->personality);
+	if (exec_update_read_needs_retry(sig, seq)) {
+		seq = 1;
+		goto retry;
 	}
-	return err;
+	exec_update_read_done(sig, seq);
+
+	if (!allowed)
+		return -EPERM;
+	seq_printf(m, "%08x\n", personality);
+	return 0;
 }
 
 #ifdef CONFIG_LIVEPATCH
-- 
2.47.3


[-- Attachment #9: 0008-proc-take-a-lock-free-exec_update_seq-fast-path-in-d.patch --]
[-- Type: text/x-diff, Size: 2629 bytes --]

From c981e64e0be97c0f074dc970aea7076522367640 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 25 May 2026 09:32:23 +0200
Subject: [PATCH 8/8] proc: take a lock-free exec_update_seq fast path in
 do_io_accounting()

/proc/<pid>/io took exec_update_lock for read to check
ptrace_may_access() before sampling the task's (or thread group's) IO
accounting. Convert it to exec_update_read_begin_or_killable(): snapshot the
accounting into a local under the speculative section (the whole-process
variant keeps its inner rcu + stats_lock), then emit after validation;
fall back to exec_update_lock on a racing exec()/TSYNC.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/proc/base.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 83b851b7f9d9..3706c5167df0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3033,20 +3033,18 @@ static const struct file_operations proc_coredump_filter_operations = {
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 static int do_io_accounting(struct task_struct *task, struct seq_file *m, int whole)
 {
+	struct signal_struct *sig = task->signal;
 	struct task_io_accounting acct;
+	unsigned int seq = 0;
+	bool allowed = false;
 	int result;
 
-	result = down_read_killable(&task->signal->exec_update_lock);
+retry:
+	result = exec_update_read_begin_or_killable(sig, &seq);
 	if (result)
 		return result;
-
-	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
-		result = -EACCES;
-		goto out_unlock;
-	}
-
-	if (whole) {
-		struct signal_struct *sig = task->signal;
+	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (allowed && whole) {
 		struct task_struct *t;
 
 		guard(rcu)();
@@ -3056,9 +3054,17 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 				task_io_accounting_add(&acct, &t->ioac);
 
 		}
-	} else {
+	} else if (allowed) {
 		acct = task->ioac;
 	}
+	if (exec_update_read_needs_retry(sig, seq)) {
+		seq = 1;
+		goto retry;
+	}
+	exec_update_read_done(sig, seq);
+
+	if (!allowed)
+		return -EACCES;
 
 	seq_printf(m,
 		   "rchar: %llu\n"
@@ -3075,11 +3081,8 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 		   (unsigned long long)acct.read_bytes,
 		   (unsigned long long)acct.write_bytes,
 		   (unsigned long long)acct.cancelled_write_bytes);
-	result = 0;
 
-out_unlock:
-	up_read(&task->signal->exec_update_lock);
-	return result;
+	return 0;
 }
 
 static int proc_tid_io_accounting(struct seq_file *m, struct pid_namespace *ns,
-- 
2.47.3



* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-27 12:01     ` Christian Brauner
@ 2026-05-27 12:31       ` Christian Brauner
  2026-05-27 13:49         ` Jann Horn
  2026-05-27 13:44       ` Jann Horn
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2026-05-27 12:31 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric W. Biederman, Alexander Viro, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Wed, May 27, 2026 at 02:01:51PM +0200, Christian Brauner wrote:
> On Tue, May 26, 2026 at 08:22:38PM +0200, Jann Horn wrote:
> > On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > I have added a couple more people who might be interested.
> > >
> > > Kees Cook because as you have structured this it is an exec problem.
> > >
> > > Oleg Nesterov as he is knowledgeable about ptrace.
> > >
> > > Jann Horn <jannh@google.com> writes:
> > >
> > > > My understanding is that procfs is effectively maintained by the VFS
> > > > maintainers (though scripts/get_maintainer.pl claims that there are
> > > > no maintainers for procfs because the VFS entry only claims files
> > > > directly in fs/, and the procfs entry has no maintainers listed on
> > > > it).
> > > >
> > > > In procfs, most uses of ptrace_may_access() should use
> > > > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> > > > execve() (like setuid binary execution).
> > > >
> > > > This series doesn't fix all the remaining issues in procfs, but it fixes
> > > > the easy cases for now; I will probably follow up with fixes for the
> > > > gnarlier cases later unless someone else wants to do that.
> > > >
> > > > I have checked that procfs files still work with these changes and that
> > > > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> > > >
> > > > (checkpatch complains about missing argument names in
> > > > proc_op::proc_get_link, but that was already the case before my
> > > > patch.)
> > >
> > >
> > > I think I finally have my context paged back in so I can intelligently
> > > say something about this series.
> > >
> > > The scenario you are worried about is when exec gains privileges,
> > > and we read through proc and authenticate with the old credentials
> > > instead of the new credentials.
> > >
> > > Question 1.
> > >
> > > Assuming the executable is world readable (which they generally are)
> > > is there anything that becomes accessible in that race that was
> > > not already accessible?
> > 
> > I believe so - the gnarliest example I am thinking of is:
> > Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
> > sets S_IRWXUGO), so their access control is purely based on being able
> > to pathwalk to the memfd's inode. If you can race
> > open(/proc/$pid/fd/$n) with the process $pid going through setuid
> > execution and calling memfd_create(), you should be able to get
> > read+write access to the memfd created by the setuid binary that was
> > supposed to be private.
> > 
> > (But I have not tested that and don't know if there are actually any
> > setuid binaries that happen to use memfds.)
> > 
> > > Question 2.
> > >
> > > How does this race compare to racing with setresuid?
> > > Do we need to fix the setresuid case as well?
> > 
> > Which setresuid case? setresuid clears the dumpable flag and has a
> > memory barrier that is supposed to make that properly ordered against
> > ptrace_may_access(); so setresuid() should normally not cause a task
> > to become traceable, though that could maybe happen in weird
> > scenarios.
> > 
> > I think another case we should probably care about is what happens if
> > a process which is only protected against ptrace by being non-dumpable
> > goes through execve() - it shouldn't be possible to access resources
> > associated with the pre-execve state while checking against the
> > post-execve dumpability. It might be important for this that the
> > do_close_on_exec() logic currently happens after committing the
> > dumpable state in exec_mmap()...
> > 
> > > Question 3.
> > > Do we care about the case when a privileged process calls a setuid
> > > process and drops privileges?
> > 
> > I don't understand the question. Hmm - do you mean a case where a
> > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> > 1000 binary? I think we probably don't specifically care about that...
> > 
> > I think another scenario that we ideally might want to care about is
> > what happens if a process which runs with a normal user's UIDs, but is
> > non-dumpable, goes through execve() of a normal binary while another
> > process tries to inspect its FDs or address space layout - it probably
> > shouldn't be possible to get information about the pre-execve MM and
> > O_CLOEXEC file descriptors.
> > 
> > > Question 4.
> > > Is it possible to use a seq_lock instead of reader writer semaphore?
> > > Or is that only for non-sleeping readers?
> > 
> > Linux seqcounts are 32-bit, which means they are always kind of dodgy,
> > but they are particularly dodgy if a reader can be forced to sleep for
> > an extended amount of time. I don't see a reason why we couldn't, in
> > general, use a 64-bit sequence count for readers that may need to
> > sleep while reading.
> 
> I have a patch series for this that I started working on after merging your
> series for precisely this reason: performance. It's a few days old now.
> I've tried various approaches and I started with a simple 32-bit counter
> as the POC. See appended (untested) patches.

In a bunch of cases we know that the critical section the caller cares
about is just very small: creds + mm. So in that case it is easy to
switch the credential computation into a prepare stage and a commit
stage and then the targeted critical section just becomes:
task->signal->seq_mm++ + task->cred = new_cred + task->mm = mm +
task->active_mm = mm + task->signal->seq_mm--. And then the reader
doesn't need to sleep at all and can just spin on the seqcount for the
small window they need.

I wasn't convinced it was valuable to use a fine-grained/multi-seqcount
approach though.
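
A rough user-space sketch of the narrow creds + mm seqcount idea
described above, assuming C11 atomics as a stand-in for the kernel's
seqcount helpers. The names toy_task, toy_commit() and toy_snapshot()
are made up for illustration, and plain ints stand in for the cred and
mm pointers. The sketch bumps the counter a second time on exit rather
than decrementing it, as conventional seqcounts do, so that a reader
which sampled the counter before the update sees a changed value
afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cred + mm pair guarded by one counter. */
struct toy_task {
	atomic_uint seq;	/* even: stable, odd: update in flight */
	atomic_int cred;	/* stand-in for task->cred */
	atomic_int mm;		/* stand-in for task->mm */
};

static void toy_task_init(struct toy_task *t, int cred, int mm)
{
	atomic_init(&t->seq, 0);
	atomic_init(&t->cred, cred);
	atomic_init(&t->mm, mm);
}

/* Writer side (a single writer is assumed): the commit stage only. */
static void toy_commit(struct toy_task *t, int new_cred, int new_mm)
{
	unsigned int seq = atomic_load_explicit(&t->seq, memory_order_relaxed);

	assert(!(seq & 1));	/* no nested or concurrent writers */
	atomic_store_explicit(&t->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&t->cred, new_cred, memory_order_relaxed);
	atomic_store_explicit(&t->mm, new_mm, memory_order_relaxed);
	atomic_store_explicit(&t->seq, seq + 2, memory_order_release);
}

/* Returns false if a writer is mid-update (odd count). */
static bool toy_read_begin(struct toy_task *t, unsigned int *seq)
{
	*seq = atomic_load_explicit(&t->seq, memory_order_acquire);
	return !(*seq & 1);
}

/* Returns true if the count moved since toy_read_begin(). */
static bool toy_read_retry(struct toy_task *t, unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&t->seq, memory_order_relaxed) != seq;
}

/* Reader: spin until a consistent cred + mm pair has been observed. */
static void toy_snapshot(struct toy_task *t, int *cred, int *mm)
{
	unsigned int seq;

	do {
		while (!toy_read_begin(t, &seq))
			;	/* writer in flight; the window is short */
		*cred = atomic_load_explicit(&t->cred, memory_order_relaxed);
		*mm = atomic_load_explicit(&t->mm, memory_order_relaxed);
	} while (toy_read_retry(t, seq));
}
```

A reader that called toy_read_begin() before a toy_commit() gets true
from toy_read_retry() afterwards and has to sample again; that is the
only cost on the read side.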


* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-27 12:01     ` Christian Brauner
  2026-05-27 12:31       ` Christian Brauner
@ 2026-05-27 13:44       ` Jann Horn
  2026-05-27 13:57         ` Christian Brauner
  1 sibling, 1 reply; 22+ messages in thread
From: Jann Horn @ 2026-05-27 13:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, Alexander Viro, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Wed, May 27, 2026 at 2:01 PM Christian Brauner <brauner@kernel.org> wrote:
> On Tue, May 26, 2026 at 08:22:38PM +0200, Jann Horn wrote:
> > On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > Question 4.
> > > Is it possible to use a seq_lock instead of reader writer semaphore?
> > > Or is that only for non-sleeping readers?
> >
> > Linux seqcounts are 32-bit, which means they are always kind of dodgy,
> > but they are particularly dodgy if a reader can be forced to sleep for
> > an extended amount of time. I don't see a reason why we couldn't, in
> > general, use a 64-bit sequence count for readers that may need to
> > sleep while reading.
>
> I have a patch series for this that I started working on after merging your
> series for precisely this reason: performance. It's a few days old now.
> I've tried various approaches and I started with a simple 32-bit counter
> as the POC. See appended (untested) patches.

It looks like there is a missing patch at the start of the series,
patch 1 uses exec_update_seq_begin without defining it.

I think performance improvements from seqcount use like this
(accelerating the fastpath) are different from what Eric was worried
about?


* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-27 12:31       ` Christian Brauner
@ 2026-05-27 13:49         ` Jann Horn
  0 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2026-05-27 13:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, Alexander Viro, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Wed, May 27, 2026 at 2:31 PM Christian Brauner <brauner@kernel.org> wrote:
> On Wed, May 27, 2026 at 02:01:51PM +0200, Christian Brauner wrote:
> > On Tue, May 26, 2026 at 08:22:38PM +0200, Jann Horn wrote:
> > > On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > > I have added a couple more people who might be interested.
> > > >
> > > > Kees Cook because as you have structured this it is an exec problem.
> > > >
> > > > Oleg Nesterov as he is knowledgeable about ptrace.
> > > >
> > > > Jann Horn <jannh@google.com> writes:
> > > >
> > > > > My understanding is that procfs is effectively maintained by the VFS
> > > > > maintainers (though scripts/get_maintainer.pl claims that there are
> > > > > no maintainers for procfs because the VFS entry only claims files
> > > > > directly in fs/, and the procfs entry has no maintainers listed on
> > > > > it).
> > > > >
> > > > > In procfs, most uses of ptrace_may_access() should use
> > > > > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> > > > > execve() (like setuid binary execution).
> > > > >
> > > > > This series doesn't fix all the remaining issues in procfs, but it fixes
> > > > > the easy cases for now; I will probably follow up with fixes for the
> > > > > gnarlier cases later unless someone else wants to do that.
> > > > >
> > > > > I have checked that procfs files still work with these changes and that
> > > > > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> > > > >
> > > > > (checkpatch complains about missing argument names in
> > > > > proc_op::proc_get_link, but that was already the case before my
> > > > > patch.)
> > > >
> > > >
> > > > I think I finally have my context paged back in so I can intelligently
> > > > say something about this series.
> > > >
> > > > The scenario you are worried about is when exec gains privileges,
> > > > and we read through proc and authenticate with the old credentials
> > > > instead of the new credentials.
> > > >
> > > > Question 1.
> > > >
> > > > Assuming the executable is world readable (which they generally are)
> > > > is there anything that becomes accessible in that race that was
> > > > not already accessible?
> > >
> > > I believe so - the gnarliest example I am thinking of is:
> > > Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
> > > sets S_IRWXUGO), so their access control is purely based on being able
> > > to pathwalk to the memfd's inode. If you can race
> > > open(/proc/$pid/fd/$n) with the process $pid going through setuid
> > > execution and calling memfd_create(), you should be able to get
> > > read+write access to the memfd created by the setuid binary that was
> > > supposed to be private.
> > >
> > > (But I have not tested that and don't know if there are actually any
> > > setuid binaries that happen to use memfds.)
> > >
> > > > Question 2.
> > > >
> > > > How does this race compare to racing with setresuid?
> > > > Do we need to fix the setresuid case as well?
> > >
> > > Which setresuid case? setresuid clears the dumpable flag and has a
> > > memory barrier that is supposed to make that properly ordered against
> > > ptrace_may_access(); so setresuid() should normally not cause a task
> > > to become traceable, though that could maybe happen in weird
> > > scenarios.
> > >
> > > I think another case we should probably care about is what happens if
> > > a process which is only protected against ptrace by being non-dumpable
> > > goes through execve() - it shouldn't be possible to access resources
> > > associated with the pre-execve state while checking against the
> > > post-execve dumpability. It might be important for this that the
> > > do_close_on_exec() logic currently happens after committing the
> > > dumpable state in exec_mmap()...
> > >
> > > > Question 3.
> > > > Do we care about the case when a privileged process calls a setuid
> > > > process and drops privileges?
> > >
> > > I don't understand the question. Hmm - do you mean a case where a
> > > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> > > 1000 binary? I think we probably don't specifically care about that...
> > >
> > > I think another scenario that we ideally might want to care about is
> > > what happens if a process which runs with a normal user's UIDs, but is
> > > non-dumpable, goes through execve() of a normal binary while another
> > > process tries to inspect its FDs or address space layout - it probably
> > > shouldn't be possible to get information about the pre-execve MM and
> > > O_CLOEXEC file descriptors.
> > >
> > > > Question 4.
> > > > Is it possible to use a seq_lock instead of reader writer semaphore?
> > > > Or is that only for non-sleeping readers?
> > >
> > > Linux seqcounts are 32-bit, which means they are always kind of dodgy,
> > > but they are particularly dodgy if a reader can be forced to sleep for
> > > an extended amount of time. I don't see a reason why we couldn't, in
> > > general, use a 64-bit sequence count for readers that may need to
> > > sleep while reading.
> >
> > I have a patch series for this that I started working on after merging your
> > series for precisely this reason: performance. It's a few days old now.
> > I've tried various approaches and I started with a simple 32-bit counter
> > as the POC. See appended (untested) patches.
>
> In a bunch of cases we know that the critical section the caller cares
> about is just very small: creds + mm. So in that case it is easy to
> switch the credential computation into a prepare stage and a commit
> stage and then the targeted critical section just becomes:
> task->signal->seq_mm++ + task->cred = new_cred + task->mm = mm +
> task->active_mm = mm + task->signal->seq_mm--. And then the reader
> doesn't need to sleep at all and can just spin on the seqcount for the
> small window they need.

I think it's probably good to avoid creating custom spinning
primitives if possible.
We'd have to also disable preemption around the writer side to ensure
that you can't get latency spikes when such a writer happens to be
preempted by a spinning reader at a bad time, and then we'd still not
have the proper paravirt spinning to deal with vCPU preemption that
qspinlocks provide, which AFAIU could theoretically also cause latency
spikes in virtualized scenarios...
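
One way to avoid a spinning reader altogether, and roughly the shape of
the fast paths in the appended patches, is a single speculative attempt
that falls back to a sleeping lock. A minimal user-space sketch,
assuming a pthread mutex as a stand-in for exec_update_lock; toy_sig,
toy_exec() and toy_read_cred() are hypothetical names, not kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_sig {
	pthread_mutex_t lock;	/* stand-in for exec_update_lock */
	atomic_uint seq;	/* odd while toy_exec() is updating */
	atomic_int cred;	/* stand-in for the task's credentials */
};

static void toy_sig_init(struct toy_sig *s, int cred)
{
	pthread_mutex_init(&s->lock, NULL);
	atomic_init(&s->seq, 0);
	atomic_init(&s->cred, cred);
}

/* Writer: serialized by the lock, brackets the update with seq bumps. */
static void toy_exec(struct toy_sig *s, int new_cred)
{
	pthread_mutex_lock(&s->lock);
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->cred, new_cred, memory_order_relaxed);
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
	pthread_mutex_unlock(&s->lock);
}

/*
 * Reader: one lockless attempt, then sleep on the lock instead of
 * spinning. Returns true if the lockless fast path succeeded.
 */
static bool toy_read_cred(struct toy_sig *s, int *cred)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_acquire);

	if (!(seq & 1)) {
		*cred = atomic_load_explicit(&s->cred, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&s->seq, memory_order_relaxed) == seq)
			return true;
	}

	/* A writer was seen: block until it is done, read under the lock. */
	pthread_mutex_lock(&s->lock);
	*cred = atomic_load_explicit(&s->cred, memory_order_relaxed);
	pthread_mutex_unlock(&s->lock);
	return false;
}
```

The reader never loops: a preempted writer costs it one sleep on the
mutex rather than an unbounded spin.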


* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
  2026-05-27 13:44       ` Jann Horn
@ 2026-05-27 13:57         ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2026-05-27 13:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric W. Biederman, Alexander Viro, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Wed, May 27, 2026 at 03:44:17PM +0200, Jann Horn wrote:
> On Wed, May 27, 2026 at 2:01 PM Christian Brauner <brauner@kernel.org> wrote:
> > On Tue, May 26, 2026 at 08:22:38PM +0200, Jann Horn wrote:
> > > On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > > Question 4.
> > > > Is it possible to use a seq_lock instead of reader writer semaphore?
> > > > Or is that only for non-sleeping readers?
> > >
> > > Linux seqcounts are 32-bit, which means they are always kind of dodgy,
> > > but they are particularly dodgy if a reader can be forced to sleep for
> > > an extended amount of time. I don't see a reason why we couldn't, in
> > > general, use a 64-bit sequence count for readers that may need to
> > > sleep while reading.
> >
> > I have a patch series for this that I started working on after merging your
> > series for precisely this reason: performance. It's a few days old now.
> > I've tried various approaches and I started with a simple 32-bit counter
> > as the POC. See appended (untested) patches.
> 
> It looks like there is a missing patch at the start of the series,
> patch 1 uses exec_update_seq_begin without defining it.

Yes. I was mainly worried about changing performance characteristics if
we start using the lock in more cases.


* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
       [not found]     ` <87wlwny905.fsf@email.froward.int.ebiederm.org>
@ 2026-05-28 14:20       ` Jann Horn
       [not found]         ` <87mrx9f8q2.fsf@email.froward.int.ebiederm.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Jann Horn @ 2026-05-28 14:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Thu, May 28, 2026 at 3:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Jann Horn <jannh@google.com> writes:
>
> > On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> I have added a couple more people who might be interested.
> >>
> >> Kees Cook because as you have structured this it is an exec problem.
> >>
> >> Oleg Nesterov as he is knowledgeable about ptrace.
> >>
> >> Jann Horn <jannh@google.com> writes:
> >>
> >> > My understanding is that procfs is effectively maintained by the VFS
> >> > maintainers (though scripts/get_maintainer.pl claims that there are
> >> > no maintainers for procfs because the VFS entry only claims files
> >> > directly in fs/, and the procfs entry has no maintainers listed on
> >> > it).
> >> >
> >> > In procfs, most uses of ptrace_may_access() should use
> >> > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> >> > execve() (like setuid binary execution).
> >> >
> >> > This series doesn't fix all the remaining issues in procfs, but it fixes
> >> > the easy cases for now; I will probably follow up with fixes for the
> >> > gnarlier cases later unless someone else wants to do that.
> >> >
> >> > I have checked that procfs files still work with these changes and that
> >> > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> >> >
> >> > (checkpatch complains about missing argument names in
> >> > proc_op::proc_get_link, but that was already the case before my
> >> > patch.)
> >>
> >>
> >> I think I finally have my context paged back in so I can intelligently
> >> say something about this series.
> >>
> >> The scenario you are worried about is when exec gains privileges,
> >> and we read through proc and authenticate with the old credentials
> >> instead of the new credentials.
> >>
> >> Question 1.
> >>
> >> Assuming the executable is world readable (which they generally are)
> >> is there anything that becomes accessible in that race that was
> >> not already accessible?
> >
> > I believe so - the gnarliest example I am thinking of is:
> > Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
> > sets S_IRWXUGO), so their access control is purely based on being able
> > to pathwalk to the memfd's inode. If you can race
> > open(/proc/$pid/fd/$n) with the process $pid going through setuid
> > execution and calling memfd_create(), you should be able to get
> > read+write access to the memfd created by the setuid binary that was
> > supposed to be private.
> >
> > (But I have not tested that and don't know if there are actually any
> > setuid binaries that happen to use memfds.)
>
> I don't know about memfds.  I do know it has been a concern in the past
> about opening proc using credentials before exec and then using the
> credentials after exec.
>
> We certainly closed some of those races, if there are still some
> of those races present we should definitely close them.
>
> In my thinking there is the set of races that exist because we fail to
> present exec to userspace as an atomic operation. Then there is the larger

Yes, and exec_update_lock is how we make sure we can present the core
of exec to userspace as an atomic operation.

> set of races that exist simply because exec happened.
>
>
> Looking at it, proc fds currently rely on the permission checks
> of the file descriptors themselves and don't make any guarantees
> about the path.

What do you mean by "the permission checks of the file descriptors themselves"?

> The issue you point out with memfd's definitely needs to be fixed.
> It should be separated out from the rest of the races simply because
> it is a completely different kind of issue.
>
> I wonder if anyone even anticipated you could open another file handle
> to memfd's through proc.  If so leaving everything to path based
> permissions assumed a feature of proc that doesn't exist.

I don't think memfds are particularly special here, they are just a
nice, clear example of a case where an inode is protected based on
which processes can path-walk to it.

As another example: Making a directory mode 0700 is also supposed to
prevent other users from accessing things inside it.
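
The reopen mechanism itself (not the race) can be seen from a single
process. A minimal sketch, assuming Linux with /proc mounted;
toy_memfd_reopen() is a made-up helper, and the raw syscall() is used
only so that the sketch also builds against C libraries without a
memfd_create() wrapper:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Create a memfd, write to it, then obtain a second, independent file
 * description for the same inode by path-walking through /proc/self/fd.
 * Returns 0 if the reopened descriptor sees the data, -1 otherwise.
 */
static int toy_memfd_reopen(void)
{
	char path[64], buf[8] = "";
	int fd, fd2, ret = -1;

	fd = (int)syscall(SYS_memfd_create, "toy", 0U);
	if (fd < 0)
		return -1;
	if (write(fd, "secret", 6) != 6)
		goto out;

	/* The only gate here is being allowed to walk this path. */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	fd2 = open(path, O_RDWR);
	if (fd2 < 0)
		goto out;

	/* New file description: its offset starts at 0, unlike fd's. */
	if (read(fd2, buf, 6) == 6 && !memcmp(buf, "secret", 6))
		ret = 0;
	close(fd2);
out:
	close(fd);
	return ret;
}
```

Here the process reopens its own descriptor, so nothing is bypassed.
The point is only that the second open() goes through a procfs path
rather than through the original file description.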

> My gut says the best fix for the entire memfd issue is to simply change
> memfd's and probably everything that calls shmem_file_setup to not have
> an open method.  That eliminates any chance anyone will do anything
> clever with proc.  But I can't see why it makes any sense to be able to
> open another file handle into memfd's, or anything else that calls
> shmem_file_setup for that matter.
>
> We can first try to remove the open method of memfd's set by
> shmem_file_setup, and if that doesn't work we can look at fixing proc to
> provide the guarantees that were assumed (as a security fix).
>
>
> As a quality of implementation issue I can see fixing the small race
> where when looking up a file descriptor through proc, exec does not
> appear to be an atomic operation.  I keep wondering if that is something
> that should be done in get_link or d_revalidate.

I don't see how d_revalidate would help, that still wouldn't be atomic.

> I suspect the answer for proc_pid_get_link is to either cache something
> like a seqcount, or simply to repeat the permission and existence checks
> just before calling nd_jump_link.

That seems like it results in complicated semantics, while a mutex
would provide clear semantics. Which is already what we use in places
like __pidfd_fget() and /proc/<pid>/syscall.
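
A minimal userspace sketch of the "check and use under one lock" semantics argued for above, assuming a pthread rwlock as a stand-in for exec_update_lock. The struct and the model_* helpers are hypothetical names made up for illustration; this is not kernel code and not how procfs is implemented. A model "exec" swaps the credential and the resource together under the write lock, and a model procfs reader does its permission check and its resource access under the same read lock, so it can never pair a check against the old credentials with the new resource (or the other way around).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a task; these names do not exist in the kernel. */
struct model_task {
	pthread_rwlock_t exec_update_lock; /* plays the role of exec_update_lock */
	int owner_uid;                     /* who is allowed to inspect the task */
	int resource;                      /* stands for e.g. the mm or an open file */
};

int model_task_init(struct model_task *t, int uid, int resource)
{
	t->owner_uid = uid;
	t->resource = resource;
	return pthread_rwlock_init(&t->exec_update_lock, NULL);
}

/* Models a privileged execve(): credential and resource change together. */
void model_exec(struct model_task *t, int new_uid, int new_resource)
{
	pthread_rwlock_wrlock(&t->exec_update_lock);
	t->owner_uid = new_uid;
	t->resource = new_resource;
	pthread_rwlock_unlock(&t->exec_update_lock);
}

/*
 * Models a procfs handler: the permission check and the access happen under
 * the same read lock. Returns 0 and stores the resource in *out, or -1 if
 * the caller is denied (in which case *out is left untouched).
 */
int model_proc_read(struct model_task *t, int caller_uid, int *out)
{
	int ret = -1;

	pthread_rwlock_rdlock(&t->exec_update_lock);
	if (caller_uid == t->owner_uid) {
		*out = t->resource;
		ret = 0;
	}
	pthread_rwlock_unlock(&t->exec_update_lock);
	return ret;
}
```

In this model a reader either sees the pre-exec owner with the pre-exec resource or the post-exec owner with the post-exec resource, never a mix; the cost is that readers can briefly block the model exec, which is the tradeoff debated in this thread.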

> >> Question 2.
> >>
> >> How does this race compare to racing with setresuid?
> >> Do we need to fix the setresuid case as well?
> >
> > Which setresuid case? setresuid clears the dumpable flag and has a
> > memory barrier that is supposed to make that properly ordered against
> > ptrace_may_access(); so setresuid() should normally not cause a task
> > to become traceable, though that could maybe happen in weird
> > scenarios.
>
> The cases where the dumpable flag gets set are all part of exec.
>
> I was thinking of cases where we have a daemon that is started by
> root and then it changes its uid to do something.

(Normally such a daemon would only change its EUID, which is mainly
considered when the daemon acts as a subject, unless it intends to
permanently drop privileges.)

> Looking at ptrace_may_access the uid based checks won't allow
> accessing of such a task unless it changes all of its uids.
>
> At which point arguably it is on the process that calls setuid to make
> certain ptracing it won't be a problem.  I am not certain that ever
> actually works in practice but that does seem to be what the current
> code is saying.

When a daemon changes its EUID for some reason, commit_creds() will
change that daemon's dumpability to suid_dumpable, which will prevent
ptracing it even if the UIDs match.

The ptrace.2 manpage guarantees that a process which is not
SUID_DUMP_USER can't be accessed without CAP_SYS_PTRACE.

> Now I am wondering if dumpable should get set if setresuid changes
> a uid like I described above.

What do you mean by "get set"?

> > I think another case we should probably care about is what happens if
> > a process which is only protected against ptrace by being non-dumpable
> > goes through execve() - it shouldn't be possible to access resources
> > associated with the pre-execve state while checking against the
> > post-execve dumpability. It might be important for this that the
> > do_close_on_exec() logic currently happens after committing the
> > dumpable state in exec_mmap()...
> >
> >> Question 3.
> >> Do we care about the case when a privileged process calls a setuid
> >> process and drops privileges?
> >
> > I don't understand the question. Hmm - do you mean a case where a
> > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> > 1000 binary? I think we probably don't specifically care about that...
> >
>
> The general case would be a daemon running as root forks and exec's a
> binary running as some unprivileged user fred.
>
> Mostly I bring it up because it is easy to forget suid exec can drop
> privileges as well as raise them.

I do not understand the scenario you're describing. Can you give a
specific example you're thinking of - what the ruid/euid/suid would be
before execve(), and which mode the binary would be?

* Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
       [not found]         ` <87mrx9f8q2.fsf@email.froward.int.ebiederm.org>
@ 2026-06-05 14:34           ` Jann Horn
  0 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2026-06-05 14:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Jake Edge, linux-kernel, linux-fsdevel, stable, Kees Cook,
	Oleg Nesterov

On Fri, Jun 5, 2026 at 2:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > On Thu, May 28, 2026 at 3:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Jann Horn <jannh@google.com> writes:
> >> The issue you point out with memfd's definitely needs to be fixed.
> >> It should be separated out from the rest of the races simply because
> >> it is a completely different kind of issue.
> >>
> >> I wonder if anyone even anticipated you could open another file handle
> >> to memfd's through proc.  If so leaving everything to path based
> >> permissions assumed a feature of proc that doesn't exist.
> >
> > I don't think memfds are particularly special here, they are just a
> > nice, clear example of a case where an inode is protected based on
> > which processes can path-walk to it.
> >
> > As another example: Making a directory mode 0700 is also supposed to
> > prevent other users from accessing things inside it.
>
> The simple counter example is that Linux has an open-by-inode
> facility, which is exposed to NFS and as syscalls.
>
> Strictly speaking, the syscalls are name_to_handle_at and
> open_by_handle_at.
>
> Most filesystems, including shmemfs, support those operations.
> See shmem_export_ops.
>
> Which is a long way of saying that if someone can guess
> the inode and generation number of a memfd inode it can
> be opened with open_by_handle_at.  The usual permission checks
> are performed but unless I am misreading something the only
> permission checks that are relevant are the permissions on
> the inode.

No. open_by_handle_at() is not supposed to let you bypass
non-executable directories.

Filesystems without a special export_operations::permission handler go through:

do_handle_open -> handle_to_path -> may_decode_fh

which requires that the caller has global CAP_DAC_READ_SEARCH, or is
capable over the superblock, or is capable over the containing mount.
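
A small userspace sketch for trying this out, assuming glibc's wrappers for name_to_handle_at() and open_by_handle_at(). The helper name try_reopen_by_handle is made up for illustration. It encodes a handle for a path the caller can already open and then attempts to reopen the file from the handle alone. Going by the may_decode_fh() requirement described above, an unprivileged caller would be expected to fail at the open_by_handle_at() step, but the exact error depends on the kernel version, the filesystem, and any sandboxing, so the sketch just reports whatever errno it gets.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Encode a file handle for `path`, then try to reopen the file from that
 * handle. Returns 0 if the reopen succeeded, or a negative errno value from
 * the first step that failed.
 */
int try_reopen_by_handle(const char *path)
{
	struct file_handle *fh;
	int mount_id, mount_fd, fd;
	int ret = 0;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -ENOMEM;
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* Step 1: path -> opaque handle (needs no special privilege). */
	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
		ret = -errno;
		goto out_free;
	}

	/* Any fd on the same filesystem can serve as mount_fd; reuse the path. */
	mount_fd = open(path, O_RDONLY);
	if (mount_fd < 0) {
		ret = -errno;
		goto out_free;
	}

	/* Step 2: handle -> fd. This is where the capability check applies. */
	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
	if (fd < 0)
		ret = -errno;
	else
		close(fd);

	close(mount_fd);
out_free:
	free(fh);
	return ret;
}
```

Calling try_reopen_by_handle("/") as an ordinary user and again as root is one way to compare the two outcomes on a given system.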

> >> My gut says the best fix for the entire memfd issue is to simply change
> >> memfd's and probably everything that calls shmem_file_setup to not have
> >> an open method.  That eliminates any chance anyone will do anything
> >> clever with proc.  But I can't see why it makes any sense to be able to
> >> open another file handle into memfd's, or anything else that calls
> >> shmem_file_setup for that matter.
> >>
> >> We can first try to remove the open method of memfd's set by
> >> shmem_file_setup, and if that doesn't work we can look at fixing proc to
> >> provide the guarantees that were assumed (as a security fix).
> >>
> >>
> >> As a quality of implementation issue I can see fixing the small race
> >> where when looking up a file descriptor through proc, exec does not
> >> appear to be an atomic operation.  I keep wondering if that is something
> >> that should be done in get_link or d_revalidate.
> >
> > I don't see how d_revalidate would help, that still wouldn't be
> > atomic.
>
> You have to pick the correct one, but in general it is the job of the
> revalidate methods to find something that is stale and see if it works
> in the current context.  AKA make something that wasn't done
> atomically behave semantically as if it were atomic.

d_revalidate refreshes dentries, but it doesn't make anything about
the underlying inode atomic; and my understanding is that procfs wants
to avoid tying inodes to things like task_struct or mm_struct to avoid
keeping those objects alive unnecessarily.

I think d_revalidate would make sense if, for example, we wanted the
/proc/$pid/maps inode to hold a reference to the corresponding
mm_struct.

> >> I suspect the answer for proc_pid_get_link is to either cache something
> >> like a seqcount, or simply to repeat the permission and existence checks
> >> just before calling nd_jump_link.
> >
> > That seems like it results in complicated semantics, while a mutex
> > would provide clear semantics. Which is already what we use in places
> > like __pidfd_fget() and /proc/<pid>/syscall.
>
> An important point to remember when dealing with proc is that it is for
> implementation purposes a distributed filesystem.  Thinking of it like
> a syscall is wrong.
>
> There are lots of places that for correctness reasons (and to not burden
> the rest of the kernel) proc does something and then validates what
> it has done is correct.  AKA just like a distributed filesystem.
>
> As for semantics I am not proposing anything that will have complicated
> semantics to userspace.  It may have a slightly more complicated
> implementation in proc, but that can save complications in the
> rest of the kernel.
>
> Anything that needs to block exec in order to operate correctly is a
> real can of worms review-wise, and something we have repeatedly gotten
> wrong in the past.  Something where exec can just do its thing and
> proc can still give a point-in-time correct answer makes the analysis
> that things won't break much simpler.

Hmm... the reason we got it wrong in the past is that we were using
cred_guard_mutex, which was held across stuff that can (indirectly)
wait for ptrace, right? Whereas nowadays we basically only have to
ensure that we don't block on userspace actions while holding the
exec_update_lock? Which still requires some care but should be less
problematic.

> >> >> Question 2.
> >> >>
> >> >> How does this race compare to racing with setresuid?
> >> >> Do we need to fix the setresuid case as well?
> >> >
> >> > Which setresuid case? setresuid clears the dumpable flag and has a
> >> > memory barrier that is supposed to make that properly ordered against
> >> > ptrace_may_access(); so setresuid() should normally not cause a task
> >> > to become traceable, though that could maybe happen in weird
> >> > scenarios.
> >>
> >> The cases where the dumpable flag gets set are all part of exec.
> >>
> >> I was thinking of cases where we have a daemon that is started by
> >> root and then it changes its uid to do something.
> >
> > (Normally such a daemon would only change its EUID, which is mainly
> > considered when the daemon acts as a subject, unless it intends to
> > permanently drop privileges.)
> >
> >> Looking at ptrace_may_access the uid based checks won't allow
> >> accessing of such a task unless it changes all of its uids.
> >>
> >> At which point arguably it is on the process that calls setuid to make
> >> certain ptracing it won't be a problem.  I am not certain that ever
> >> actually works in practice but that does seem to be what the current
> >> code is saying.
> >
> > When a daemon changes its EUID for some reason, commit_creds() will
> > change that daemon's dumpability to suid_dumpable, which will prevent
> > ptracing it even if the UIDs match.
> >
> > The ptrace.2 manpage guarantees that a process which is not
> > SUID_DUMP_USER can't be accessed without CAP_SYS_PTRACE.
> >
> >> Now I am wondering if dumpable should get set if setresuid changes
> >> a uid like I described above.
> >
> > What do you mean by "get set"?
>
> I was and still am wondering if dumpable should be set if setresuid
> completely drops all of its uids.  AKA the case where dumpable is not
> set today.

You mean the case where the EUID is different from RUID and/or SUID,
and userspace calls setresuid(current_euid, current_euid,
current_euid)?
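
For reference, a userspace sketch of that call pattern, assuming glibc's setresuid() and prctl(PR_GET_DUMPABLE). The struct and function names are made up for illustration. It reads the dumpable flag, collapses RUID and SUID onto the current EUID, and reads the flag again, so the effect (or lack of effect) on dumpability can be observed for whatever credential state the calling process starts in.

```c
#define _GNU_SOURCE
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result record: dumpable flag before/after, plus the syscall result. */
struct dumpable_probe {
	int before; /* PR_GET_DUMPABLE before the call */
	int rc;     /* return value of setresuid() */
	int after;  /* PR_GET_DUMPABLE after the call */
};

/* Equivalent of setresuid(current_euid, current_euid, current_euid). */
struct dumpable_probe collapse_uids_to_euid(void)
{
	struct dumpable_probe p;
	uid_t euid = geteuid();

	p.before = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
	/* Always permitted: every new ID equals one of the current IDs. */
	p.rc = setresuid(euid, euid, euid);
	p.after = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
	return p;
}
```

Note that the call is irreversible for a process without CAP_SETUID, since the old RUID and SUID are gone afterwards. In a process whose three UIDs already coincide, the call changes nothing; the interesting case from this thread is a process that starts with EUID different from RUID or SUID.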

> Unless someone wants to do a bunch of work surveying such code,
> we should probably wait until a motivating example presents itself.
>
>
> >> > I think another case we should probably care about is what happens if
> >> > a process which is only protected against ptrace by being non-dumpable
> >> > goes through execve() - it shouldn't be possible to access resources
> >> > associated with the pre-execve state while checking against the
> >> > post-execve dumpability. It might be important for this that the
> >> > do_close_on_exec() logic currently happens after committing the
> >> > dumpable state in exec_mmap()...
> >> >
> >> >> Question 3.
> >> >> Do we care about the case when a privileged process calls a setuid
> >> >> process and drops privileges?
> >> >
> >> > I don't understand the question. Hmm - do you mean a case where a
> >> > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> >> > 1000 binary? I think we probably don't specifically care about that...
> >> >
> >>
> >> The general case would be a daemon running as root forks and exec's a
> >> binary running as some unprivileged user fred.
> >>
> >> Mostly I bring it up because it is easy to forget suid exec can drop
> >> privileges as well as raise them.
> >
> > I do not understand the scenario you're describing. Can you give a
> > specific example you're thinking of - what the ruid/euid/suid would be
> > before execve(), and which mode the binary would be?
>
>
> A process P1 with uid=euid=ruid=fsuid=0.
>
> A user fred with uid=1000.
>
> A binary B with uid=1000 (aka fred) gid=1000 (aka fred) that is -r-sr-xr-x.
> AKA everyone can read and execute the binary, and the setuid bit is set.
>
> Then P1 execs B.

I think I see your point - in this case, if P1 calls
setresuid(current_euid, current_euid, current_euid), fred would
afterwards be allowed to ptrace P1.
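
The UID bookkeeping in this scenario can be written down as a toy model. This is only a sketch: the struct and the model_* functions are hypothetical, and it covers nothing but the UID comparison that ptrace_may_access() performs for callers without CAP_SYS_PTRACE. Dumpability, GIDs, capabilities and LSM checks are deliberately left out, even though they also decide the real outcome.

```c
#include <stdbool.h>

/* Hypothetical credential triple; not the kernel's struct cred. */
struct model_cred {
	unsigned int ruid;
	unsigned int euid;
	unsigned int suid;
};

/* execve() of a setuid binary: EUID and SUID become the file owner, RUID is kept. */
struct model_cred model_setuid_exec(struct model_cred old, unsigned int file_owner)
{
	struct model_cred c = old;

	c.euid = file_owner;
	c.suid = file_owner;
	return c;
}

/* setresuid(current_euid, current_euid, current_euid) */
struct model_cred model_setresuid_all_to_euid(struct model_cred c)
{
	c.ruid = c.euid;
	c.suid = c.euid;
	return c;
}

/* UID part of the access check: the caller must match all three target UIDs. */
bool model_uid_match(unsigned int caller_uid, struct model_cred target)
{
	return caller_uid == target.ruid &&
	       caller_uid == target.euid &&
	       caller_uid == target.suid;
}
```

With P1 starting at {0, 0, 0} and B owned by UID 1000, the model gives {0, 1000, 1000} after the exec, which fred (UID 1000) does not match because of the leftover RUID, and {1000, 1000, 1000} after the setresuid() call, which fred matches.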

That feels to me like it is a bit weird, but probably not very
problematic because fred owns the file; P1 is already running code
owned by fred. Though fred can't necessarily actually change the file
contents, if the file is immutable or the mount is readonly or such.

I... think it is also rare for setuid binaries to call
setresuid(current_euid, current_euid, current_euid), since normally it
is desired that they keep the original RUID for stuff like
signal-sending permissions? But I might be wrong about that.

> p.s.  My personal opinion is changing permissions upon exec is a dumb
> idea.  It might be worth doing the work to drop that support and to
> update userspace to just connect to more privileged daemons when more
> permissions are needed.  AKA ssh instead of su.

I agree that it would be nicer to get rid of that... I remember amluto
saying the same thing.

Lennart Poettering thinks that, too, and has added the sudo
replacement "run0" in systemd, which works by talking to systemd and
authenticating via polkit:
https://mastodon.social/@pid_eins/112353324518585654

* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-05-18 16:35 ` [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1) Jann Horn
  2026-05-26  8:48   ` Oleg Nesterov
@ 2026-06-05 14:36   ` Mark Brown
  2026-06-05 14:39     ` Jann Horn
  1 sibling, 1 reply; 22+ messages in thread
From: Mark Brown @ 2026-06-05 14:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable

On Mon, May 18, 2026 at 06:35:15PM +0200, Jann Horn wrote:
> Fix the easy cases where procfs currently calls ptrace_may_access() without
> exec_update_lock protection, where the fix is to simply add the extra lock
> or use mm_access():

>  - do_task_stat(): grab exec_update_lock
>  - proc_pid_wchan(): grab exec_update_lock
>  - proc_map_files_lookup(): use mm_access() instead of get_task_mm()
>  - proc_map_files_readdir(): use mm_access() instead of get_task_mm()
>  - proc_ns_get_link(): grab exec_update_lock
>  - proc_ns_readlink(): grab exec_update_lock

It seems that this patch is triggering a failure in the proc selftests
read test:

# selftests: proc: read
[  259.127414] ICMPv6: process `read' is using deprecated sysctl (syscall) net.ipv6.neigh.default.base_reachable_time - use net.ipv6.neigh.default.base_reachable_time_ms instead
[  259.158773] /proc/cgroups lists only v1 controllers, use cgroup.controllers of root cgroup for v2 info
[  259.177155] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(v) show-blocked-tasks(w) replay-kernel-logs(R) 
# read: proc.h:49: xreaddir: Assertion `de || errno == 0' failed.
# Aborted
not ok 19 selftests: proc: read # exit=134

Full log:

   https://lava.sirena.org.uk/scheduler/job/2835194#L12433

Everything except the assertion appears in a successful test:

   https://lava.sirena.org.uk/scheduler/job/2834737#L12287

bisect log:

# bad: [6e845bcb78c95af935094040bd4edc3c2b6dd784] Add linux-next specific files for 20260605
# good: [f9b5aeed37bc9023d700c9c8ff186f1e98692bc8] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [9582485a65eacfd7245ec7f0a9d7e2c34749d669] device property: fix fwnode reference leak in fwnode_graph_get_endpoint_by_id()
# good: [a9c12b783cc711de3ac7f188bed07d529bb818af] device core: make struct device_driver groups members constant arrays
# good: [34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc] regmap-i2c: fix sparse warning in regmap_smbus_word_write_reg16
# good: [25025253476a64c186592d952c27f24bc3490e42] leds: Adjust documentation of brightness sysfs node
# good: [a76640171b29fc91b9777a8e1bdc7e08db697275] Merge patch series "proc: subset=pid: Relax check of mount visibility"
# good: [78d797520f6a74ed402cb98c6bf74d96b4937965] sysfs: remove trivial sysfs_get_tree() wrapper
# good: [c5dffafb426f927db1630140552dc11d6f76e1a6] docs: proc: add documentation about mount restrictions
git bisect start '6e845bcb78c95af935094040bd4edc3c2b6dd784' 'f9b5aeed37bc9023d700c9c8ff186f1e98692bc8' '9582485a65eacfd7245ec7f0a9d7e2c34749d669' 'a9c12b783cc711de3ac7f188bed07d529bb818af' '34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc' '25025253476a64c186592d952c27f24bc3490e42' 'a76640171b29fc91b9777a8e1bdc7e08db697275' '78d797520f6a74ed402cb98c6bf74d96b4937965' 'c5dffafb426f927db1630140552dc11d6f76e1a6'
# test job: [9582485a65eacfd7245ec7f0a9d7e2c34749d669] https://lava.sirena.org.uk/scheduler/job/2804101
# test job: [a9c12b783cc711de3ac7f188bed07d529bb818af] https://lava.sirena.org.uk/scheduler/job/2803377
# test job: [34808ac8ddafc3e2c2a59e84eaab0a410e7a0fdc] https://lava.sirena.org.uk/scheduler/job/2783496
# test job: [25025253476a64c186592d952c27f24bc3490e42] https://lava.sirena.org.uk/scheduler/job/2803433
# test job: [a76640171b29fc91b9777a8e1bdc7e08db697275] https://lava.sirena.org.uk/scheduler/job/2827647
# test job: [78d797520f6a74ed402cb98c6bf74d96b4937965] https://lava.sirena.org.uk/scheduler/job/2827487
# test job: [c5dffafb426f927db1630140552dc11d6f76e1a6] https://lava.sirena.org.uk/scheduler/job/2827551
# test job: [6e845bcb78c95af935094040bd4edc3c2b6dd784] https://lava.sirena.org.uk/scheduler/job/2835194
# bad: [6e845bcb78c95af935094040bd4edc3c2b6dd784] Add linux-next specific files for 20260605
git bisect bad 6e845bcb78c95af935094040bd4edc3c2b6dd784
# test job: [0ec6945730e17fb8a44283114493b1a54caabf09] https://lava.sirena.org.uk/scheduler/job/2827595
# bad: [0ec6945730e17fb8a44283114493b1a54caabf09] proc: protect ptrace_may_access() with exec_update_lock (part 1)
git bisect bad 0ec6945730e17fb8a44283114493b1a54caabf09
# first bad commit: [0ec6945730e17fb8a44283114493b1a54caabf09] proc: protect ptrace_may_access() with exec_update_lock (part 1)
# test job: [f8823fb0641190098790d060a27b89bad4ddd73d] https://lava.sirena.org.uk/scheduler/job/2829222
# bad: [f8823fb0641190098790d060a27b89bad4ddd73d] proc: protect ptrace_may_access() with exec_update_lock (FD links)
git bisect bad f8823fb0641190098790d060a27b89bad4ddd73d
# test job: [abadd84dab07b3f9e79455b467d9ff60d12940b2] https://lava.sirena.org.uk/scheduler/job/2827425
# bad: [abadd84dab07b3f9e79455b467d9ff60d12940b2] Merge patch series "proc: protect ptrace_may_access() with exec_update_lock"
git bisect bad abadd84dab07b3f9e79455b467d9ff60d12940b2
# test job: [f8823fb0641190098790d060a27b89bad4ddd73d] https://lava.sirena.org.uk/scheduler/job/2829222
# bad: [f8823fb0641190098790d060a27b89bad4ddd73d] proc: protect ptrace_may_access() with exec_update_lock (FD links)
git bisect bad f8823fb0641190098790d060a27b89bad4ddd73d
# test job: [0ec6945730e17fb8a44283114493b1a54caabf09] https://lava.sirena.org.uk/scheduler/job/2827595
# bad: [0ec6945730e17fb8a44283114493b1a54caabf09] proc: protect ptrace_may_access() with exec_update_lock (part 1)
git bisect bad 0ec6945730e17fb8a44283114493b1a54caabf09
# first bad commit: [0ec6945730e17fb8a44283114493b1a54caabf09] proc: protect ptrace_may_access() with exec_update_lock (part 1)


* Re: [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1)
  2026-06-05 14:36   ` Mark Brown
@ 2026-06-05 14:39     ` Jann Horn
  0 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2026-06-05 14:39 UTC (permalink / raw)
  To: Mark Brown
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Arjan van de Ven,
	Eric W. Biederman, Jake Edge, linux-kernel, linux-fsdevel, stable

On Fri, Jun 5, 2026 at 4:36 PM Mark Brown <broonie@kernel.org> wrote:
> On Mon, May 18, 2026 at 06:35:15PM +0200, Jann Horn wrote:
> > Fix the easy cases where procfs currently calls ptrace_may_access() without
> > exec_update_lock protection, where the fix is to simply add the extra lock
> > or use mm_access():
>
> >  - do_task_stat(): grab exec_update_lock
> >  - proc_pid_wchan(): grab exec_update_lock
> >  - proc_map_files_lookup(): use mm_access() instead of get_task_mm()
> >  - proc_map_files_readdir(): use mm_access() instead of get_task_mm()
> >  - proc_ns_get_link(): grab exec_update_lock
> >  - proc_ns_readlink(): grab exec_update_lock
>
> It seems that this patch is triggering a failure in the proc selftests
> read test:
>
> # selftests: proc: read
> [  259.127414] ICMPv6: process `read' is using deprecated sysctl (syscall) net.ipv6.neigh.default.base_reachable_time - use net.ipv6.neigh.default.base_reachable_time_ms instead
> [  259.158773] /proc/cgroups lists only v1 controllers, use cgroup.controllers of root cgroup for v2 info
> [  259.177155] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(v) show-blocked-tasks(w) replay-kernel-logs(R)
> # read: proc.h:49: xreaddir: Assertion `de || errno == 0' failed.
> # Aborted
> not ok 19 selftests: proc: read # exit=134

Thanks for the report!

Yup, https://lore.kernel.org/oe-lkp/202606021924.b6d8a0c2-lkp@intel.com
reported this too, it should be fixed with
https://lore.kernel.org/all/20260604155806.1402880-1-jannh@google.com/
, which has been squashed into the current version of the VFS tree.

end of thread, other threads:[~2026-06-05 14:40 UTC | newest]

Thread overview: 22+ messages
2026-05-18 16:35 [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Jann Horn
2026-05-18 16:35 ` [PATCH 1/2] proc: protect ptrace_may_access() with exec_update_lock (part 1) Jann Horn
2026-05-26  8:48   ` Oleg Nesterov
2026-05-26  9:44     ` Oleg Nesterov
2026-05-26 14:19       ` Jann Horn
2026-05-26 14:16     ` Jann Horn
2026-05-26 18:22       ` Oleg Nesterov
2026-05-26 18:30         ` Jann Horn
2026-06-05 14:36   ` Mark Brown
2026-06-05 14:39     ` Jann Horn
2026-05-18 16:35 ` [PATCH 2/2] proc: protect ptrace_may_access() with exec_update_lock (FD links) Jann Horn
2026-05-22 11:47 ` [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock Christian Brauner
2026-05-25 19:56 ` Eric W. Biederman
2026-05-26 11:10   ` Oleg Nesterov
2026-05-26 18:22   ` Jann Horn
2026-05-27 12:01     ` Christian Brauner
2026-05-27 12:31       ` Christian Brauner
2026-05-27 13:49         ` Jann Horn
2026-05-27 13:44       ` Jann Horn
2026-05-27 13:57         ` Christian Brauner
     [not found]     ` <87wlwny905.fsf@email.froward.int.ebiederm.org>
2026-05-28 14:20       ` Jann Horn
     [not found]         ` <87mrx9f8q2.fsf@email.froward.int.ebiederm.org>
2026-06-05 14:34           ` Jann Horn