[PATCH v12] exec: Fix dead-lock in de_thread with ptrace

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach
       [not found] ` <AM8PR10MB470875B22B4C08BEAEC3F77FE4169@AM8PR10MB4708.EURPRD10.PROD.OUTLOOK.COM>
@ 2023-10-30  5:20   ` Bernd Edlinger
  2023-10-30  9:00     ` kernel test robot
       [not found]     ` <AS8P193MB12851AC1F862B97FCE9B3F4FE4AAA@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
  0 siblings, 2 replies; 39+ messages in thread
From: Bernd Edlinger @ 2023-10-30  5:20 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, YiFei Zhu, Yafang Shao, Helge Deller,
	Eric W. Biederman, Adrian Reber, Thomas Gleixner, Jens Axboe,
	Alexei Starovoitov, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest, linux-mm, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Zheng Yejian

This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.

The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.

When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes.  If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.

Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending.  In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.

This means there is no API change, unlike the previous
version of this patch which was discussed here:

https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/

See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.

Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/exec.c                                 | 69 ++++++++++++++++-------
 fs/proc/base.c                            |  6 ++
 include/linux/cred.h                      |  1 +
 include/linux/sched/signal.h              | 18 ++++++
 kernel/cred.c                             | 28 +++++++--
 kernel/ptrace.c                           | 32 +++++++++++
 kernel/seccomp.c                          | 12 +++-
 tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
 8 files changed, 155 insertions(+), 34 deletions(-)

v10: Changes to previous version, make the PTRACE_ATTACH
retun -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessions learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.

Note: I got actually one response from an automatic checker to the v11 patch,

https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/

which is complaining about:

>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct cred const *old_cred @@     got struct cred const [noderef] __rcu *real_cred @@

   417			struct linux_binprm *bprm = task->signal->exec_bprm;
   418			const struct cred *old_cred;
   419			struct mm_struct *old_mm;
   420	
   421			retval = down_write_killable(&task->signal->exec_update_lock);
   422			if (retval)
   423				goto unlock_creds;
   424			task_lock(task);
 > 425			old_cred = task->real_cred;

v12: Essentially identical to v11.

- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.

- re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.


Thanks
Bernd.

diff --git a/fs/exec.c b/fs/exec.c
index 2f2b0acec4f0..902d3b230485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t = tsk;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	while_each_thread(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		sig->exec_bprm = bprm;
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		return retval;
 
+	/* If the binary is not readable then enforce mm->dumpable=0 */
+	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
+
+	/*
+	 * Figure out dumpability. Note that this checking only of current
+	 * is wrong, but userspace depends on it. This should be testing
+	 * bprm->secureexec instead.
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+	    is_dumpability_changed(current_cred(), bprm->cred) ||
+	    !(uid_eq(current_euid(), current_uid()) &&
+	      gid_eq(current_egid(), current_gid())))
+		set_dumpable(bprm->mm, suid_dumpable);
+	else
+		set_dumpable(bprm->mm, SUID_DUMP_USER);
+
 	/*
 	 * Ensure all future errors are fatal.
 	 */
@@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/*
 	 * Make this the only thread in the thread group.
 	 */
-	retval = de_thread(me);
+	retval = de_thread(me, bprm);
 	if (retval)
 		goto out;
 
@@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
@@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	me->sas_ss_sp = me->sas_ss_size = 0;
 
-	/*
-	 * Figure out dumpability. Note that this checking only of current
-	 * is wrong, but userspace depends on it. This should be testing
-	 * bprm->secureexec instead.
-	 */
-	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
-	    !(uid_eq(current_euid(), current_uid()) &&
-	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
-	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
-
 	perf_event_exec();
 	__set_task_comm(me, kbasename(bprm->filename), true);
 
@@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..0da9adfadb48 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsm,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f923528d5cc4..b01e309f5686 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
 extern int commit_creds(struct cred *);
 extern void abort_creds(struct cred *);
 extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..14df7073a0a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
+	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
+					 * against new credentials while
+					 * de_thread is waiting for other
+					 * traced threads to terminate.
+					 * Set while de_thread is executing.
+					 * The cred_guard_mutex is released
+					 * after de_thread() has called
+					 * zap_other_threads(), therefore
+					 * a fatal signal is guaranteed to be
+					 * already pending in the unlikely
+					 * event, that
+					 * current->signal->exec_bprm happens
+					 * to be non-zero after the
+					 * cred_guard_mutex was acquired.
+					 */
+
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace)
+					 * Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 * Deprecated do not use in new code.
 					 * Use exec_update_lock instead.
 					 */
diff --git a/kernel/cred.c b/kernel/cred.c
index 98cb4eca23fb..586cb6c7cf6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
 	return false;
 }
 
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ * Return: true  - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+	if (!uid_eq(old->euid, new->euid) ||
+	    !gid_eq(old->egid, new->egid) ||
+	    !uid_eq(old->fsuid, new->fsuid) ||
+	    !gid_eq(old->fsgid, new->fsgid) ||
+	    !cred_cap_issubset(old, new))
+		return true;
+
+	return false;
+}
+
 /**
  * commit_creds - Install new credentials upon the current task
  * @new: The credentials to be assigned
@@ -467,11 +489,7 @@ int commit_creds(struct cred *new)
 	get_cred(new); /* we will require a ref for the subj creds too */
 
 	/* dumpability changes */
-	if (!uid_eq(old->euid, new->euid) ||
-	    !gid_eq(old->egid, new->egid) ||
-	    !uid_eq(old->fsuid, new->fsuid) ||
-	    !gid_eq(old->fsgid, new->fsgid) ||
-	    !cred_cap_issubset(old, new)) {
+	if (is_dumpability_changed(old, new)) {
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..eb1c450bb7d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
+#include <linux/binfmts.h>
 #include <linux/signal.h>
 #include <linux/uio.h>
 #include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (retval)
 		goto unlock_creds;
 
+	if (unlikely(task->in_execve)) {
+		struct linux_binprm *bprm = task->signal->exec_bprm;
+		const struct cred __rcu *old_cred;
+		struct mm_struct *old_mm;
+
+		retval = down_write_killable(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+		task_lock(task);
+		old_cred = task->real_cred;
+		old_mm = task->mm;
+		rcu_assign_pointer(task->real_cred, bprm->cred);
+		task->mm = bprm->mm;
+		retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+		rcu_assign_pointer(task->real_cred, old_cred);
+		task->mm = old_mm;
+		task_unlock(task);
+		up_write(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	retval = -EPERM;
 	if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->exec_bprm)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
 TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
 
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
-	ASSERT_EQ(k, -1);
+	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
-	sleep(1);
-	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach
  2023-10-30  5:20   ` [PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
@ 2023-10-30  9:00     ` kernel test robot
       [not found]     ` <AS8P193MB12851AC1F862B97FCE9B3F4FE4AAA@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2023-10-30  9:00 UTC (permalink / raw)
  To: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
	Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
	Randy Dunlap, Suren Baghdasaryan, YiFei Zhu, Yafang Shao,
	Helge Deller, Eric W. Biederman, Adrian Reber, Thomas Gleixner,
	Jens Axboe, Alexei Starovoitov, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker
  Cc: oe-kbuild-all, Linux Memory Management List

Hi Bernd,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kees/for-next/execve]
[also build test WARNING on kees/for-next/seccomp shuah-kselftest/next shuah-kselftest/fixes linus/master v6.6]
[cannot apply to next-20231030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bernd-Edlinger/exec-Fix-dead-lock-in-de_thread-with-ptrace_attach/20231030-133021
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/execve
patch link:    https://lore.kernel.org/r/AS8P193MB1285DF698D7524EDE22ABFA1E4A1A%40AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
patch subject: [PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach
config: loongarch-randconfig-002-20231030 (https://download.01.org/0day-ci/archive/20231030/202310301604.K866zRJ8-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231030/202310301604.K866zRJ8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310301604.K866zRJ8-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/cred.c:443: warning: duplicate section name 'Return'


vim +/Return +443 kernel/cred.c

   435	
   436	/**
   437	 * is_dumpability_changed - Will changing creds from old to new
   438	 * affect the dumpability in commit_creds?
   439	 *
   440	 * Return: false - dumpability will not be changed in commit_creds.
   441	 * Return: true  - dumpability will be changed to non-dumpable.
   442	 *
 > 443	 * @old: The old credentials
   444	 * @new: The new credentials
   445	 */
   446	bool is_dumpability_changed(const struct cred *old, const struct cred *new)
   447	{
   448		if (!uid_eq(old->euid, new->euid) ||
   449		    !gid_eq(old->egid, new->egid) ||
   450		    !uid_eq(old->fsuid, new->fsuid) ||
   451		    !gid_eq(old->fsgid, new->fsgid) ||
   452		    !cred_cap_issubset(old, new))
   453			return true;
   454	
   455		return false;
   456	}
   457	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
       [not found]     ` <AS8P193MB12851AC1F862B97FCE9B3F4FE4AAA@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
@ 2024-01-15 19:22       ` Bernd Edlinger
  2024-01-15 19:37         ` Matthew Wilcox
                           ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Bernd Edlinger @ 2024-01-15 19:22 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Alexey Dobriyan, Jens Axboe, Paul Moore,
	Elena Reshetova, David Windsor, Mateusz Guzik, YueHaibing,
	Ard Biesheuvel, Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, tiozhang

This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.

The problem happens when a tracer tries to ptrace_attach
to a multi-threaded process, that does an execve in one of
the threads at the same time, without doing that in a forked
sub-process.  That means: There is a race condition, when one
or more of the threads are already ptraced, but the thread
that invoked the execve is not yet traced.  Now in this
case the execve locks the cred_guard_mutex and waits for
de_thread to complete.  But that waits for the traced
sibling threads to exit, and those have to wait for the
tracer to receive the exit signal, but the tracer cannot
call wait right now, because it is waiting for the ptrace
call to complete, and this never does not happen.
The traced process and the tracer are now in a deadlock
situation, and can only be killed by a fatal signal.

The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.

When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes.  If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.

Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending.  In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.

This means there is no API change, unlike the previous
version of this patch which was discussed here:

https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/

See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.

Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/exec.c                                 | 69 ++++++++++++++++-------
 fs/proc/base.c                            |  6 ++
 include/linux/cred.h                      |  1 +
 include/linux/sched/signal.h              | 18 ++++++
 kernel/cred.c                             | 28 +++++++--
 kernel/ptrace.c                           | 32 +++++++++++
 kernel/seccomp.c                          | 12 +++-
 tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
 8 files changed, 155 insertions(+), 34 deletions(-)

v10: Changes to previous version, make the PTRACE_ATTACH
return -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessions learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.

Note: I got actually one response from an automatic checker to the v11 patch,

https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/

which is complaining about:

>> >> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct cred const *old_cred @@     got struct cred const [noderef] __rcu *real_cred @@

   417			struct linux_binprm *bprm = task->signal->exec_bprm;
   418			const struct cred *old_cred;
   419			struct mm_struct *old_mm;
   420	
   421			retval = down_write_killable(&task->signal->exec_update_lock);
   422			if (retval)
   423				goto unlock_creds;
   424			task_lock(task);
 > 425			old_cred = task->real_cred;

v12: Essentially identical to v11.

- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.

- re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.

v13: Fixed duplicated Return section in function header of
is_dumpability_changed which was reported by the kernel test robot

v14: rebased to v6.7, refreshed and retested.
And added a more detailed description of the actual bug.


Thanks
Bernd.

diff --git a/fs/exec.c b/fs/exec.c
index 6d9ed2d765ef..f2cf7c58fe16 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1039,11 +1039,13 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t = tsk;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1066,6 +1068,19 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	while_each_thread(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		sig->exec_bprm = bprm;
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1156,6 +1171,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1167,6 +1187,11 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1251,6 +1276,24 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		return retval;
 
+	/* If the binary is not readable then enforce mm->dumpable=0 */
+	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
+
+	/*
+	 * Figure out dumpability. Note that this checking only of current
+	 * is wrong, but userspace depends on it. This should be testing
+	 * bprm->secureexec instead.
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+	    is_dumpability_changed(current_cred(), bprm->cred) ||
+	    !(uid_eq(current_euid(), current_uid()) &&
+	      gid_eq(current_egid(), current_gid())))
+		set_dumpable(bprm->mm, suid_dumpable);
+	else
+		set_dumpable(bprm->mm, SUID_DUMP_USER);
+
 	/*
 	 * Ensure all future errors are fatal.
 	 */
@@ -1259,7 +1302,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/*
 	 * Make this the only thread in the thread group.
 	 */
-	retval = de_thread(me);
+	retval = de_thread(me, bprm);
 	if (retval)
 		goto out;
 
@@ -1282,11 +1325,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
@@ -1348,18 +1386,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	me->sas_ss_sp = me->sas_ss_size = 0;
 
-	/*
-	 * Figure out dumpability. Note that this checking only of current
-	 * is wrong, but userspace depends on it. This should be testing
-	 * bprm->secureexec instead.
-	 */
-	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
-	    !(uid_eq(current_euid(), current_uid()) &&
-	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
-	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
-
 	perf_event_exec();
 	__set_task_comm(me, kbasename(bprm->filename), true);
 
@@ -1478,6 +1504,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index dd31e3b6bf77..99ff3420138b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2784,6 +2784,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsm,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 2976f534a7a3..a1a1ac38f749 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
 extern int commit_creds(struct cred *);
 extern void abort_creds(struct cred *);
 extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3499c1a8b929..85d8f8f2f44f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
+	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
+					 * against new credentials while
+					 * de_thread is waiting for other
+					 * traced threads to terminate.
+					 * Set while de_thread is executing.
+					 * The cred_guard_mutex is released
+					 * after de_thread() has called
+					 * zap_other_threads(), therefore
+					 * a fatal signal is guaranteed to be
+					 * already pending in the unlikely
+					 * event, that
+					 * current->signal->exec_bprm happens
+					 * to be non-zero after the
+					 * cred_guard_mutex was acquired.
+					 */
+
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace)
+					 * Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 * Deprecated do not use in new code.
 					 * Use exec_update_lock instead.
 					 */
diff --git a/kernel/cred.c b/kernel/cred.c
index c033a201c808..72aadde3f10c 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -375,6 +375,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
 	return false;
 }
 
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ *         true  - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+	if (!uid_eq(old->euid, new->euid) ||
+	    !gid_eq(old->egid, new->egid) ||
+	    !uid_eq(old->fsuid, new->fsuid) ||
+	    !gid_eq(old->fsgid, new->fsgid) ||
+	    !cred_cap_issubset(old, new))
+		return true;
+
+	return false;
+}
+
 /**
  * commit_creds - Install new credentials upon the current task
  * @new: The credentials to be assigned
@@ -403,11 +425,7 @@ int commit_creds(struct cred *new)
 	get_cred(new); /* we will require a ref for the subj creds too */
 
 	/* dumpability changes */
-	if (!uid_eq(old->euid, new->euid) ||
-	    !gid_eq(old->egid, new->egid) ||
-	    !uid_eq(old->fsuid, new->fsuid) ||
-	    !gid_eq(old->fsgid, new->fsgid) ||
-	    !cred_cap_issubset(old, new)) {
+	if (is_dumpability_changed(old, new)) {
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index d8b5e13a2229..578bc02eea27 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
+#include <linux/binfmts.h>
 #include <linux/signal.h>
 #include <linux/uio.h>
 #include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (retval)
 		goto unlock_creds;
 
+	if (unlikely(task->in_execve)) {
+		struct linux_binprm *bprm = task->signal->exec_bprm;
+		const struct cred __rcu *old_cred;
+		struct mm_struct *old_mm;
+
+		retval = down_write_killable(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+		task_lock(task);
+		old_cred = task->real_cred;
+		old_mm = task->mm;
+		rcu_assign_pointer(task->real_cred, bprm->cred);
+		task->mm = bprm->mm;
+		retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+		rcu_assign_pointer(task->real_cred, old_cred);
+		task->mm = old_mm;
+		task_unlock(task);
+		up_write(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	retval = -EPERM;
 	if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->exec_bprm)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
 TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
 
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
-	ASSERT_EQ(k, -1);
+	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
-	sleep(1);
-	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-15 19:22       ` [PATCH v14] " Bernd Edlinger
@ 2024-01-15 19:37         ` Matthew Wilcox
  2024-01-17  9:51           ` Bernd Edlinger
  2024-01-16 15:22         ` Oleg Nesterov
       [not found]         ` <AS8P193MB1285937F9831CECAF2A9EEE2E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
  2 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2024-01-15 19:37 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Hans Liljestrand

On Mon, Jan 15, 2024 at 08:22:19PM +0100, Bernd Edlinger wrote:
> This introduces signal->exec_bprm, which is used to
> fix the case when at least one of the sibling threads
> is traced, and therefore the trace process may dead-lock
> in ptrace_attach, but de_thread will need to wait for the
> tracer to continue execution.

Not entirely sure why I've been added to the cc; this doesn't seem
like it's even remotely within my realm of expertise.

> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
>  extern struct cred *cred_alloc_blank(void);
>  extern struct cred *prepare_creds(void);
>  extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);

Using 'extern' for function declarations is deprecated.  More
importantly, you have two arguments of the same type, and how do I know
which one is which if you don't name them?

> +++ b/kernel/cred.c
> @@ -375,6 +375,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>  	return false;
>  }
>  
> +/**
> + * is_dumpability_changed - Will changing creds from old to new
> + * affect the dumpability in commit_creds?
> + *
> + * Return: false - dumpability will not be changed in commit_creds.
> + *         true  - dumpability will be changed to non-dumpable.
> + *
> + * @old: The old credentials
> + * @new: The new credentials
> + */

Does kernel-doc really parse this correctly?  Normal style would be:

/**
 * is_dumpability_changed - Will changing creds affect dumpability?
 * @old: The old credentials.
 * @new: The new credentials.
 *
 * If the @new credentials have no elevated privileges compared to the
 * @old credentials, the task may remain dumpable.  Otherwise we have
 * to mark the task as undumpable to avoid information leaks from higher
 * to lower privilege domains.
 *
 * Return: True if the task will become undumpable.
 */

> @@ -508,6 +531,14 @@ static int ptrace_traceme(void)
>  {
>  	int ret = -EPERM;
>  
> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> +		return -ERESTARTNOINTR;

Do you really want this to be interruptible by a timer signal or a
window resize event?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-15 19:22       ` [PATCH v14] " Bernd Edlinger
  2024-01-15 19:37         ` Matthew Wilcox
@ 2024-01-16 15:22         ` Oleg Nesterov
  2024-01-17 15:07           ` Bernd Edlinger
       [not found]         ` <AS8P193MB1285937F9831CECAF2A9EEE2E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
  2 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2024-01-16 15:22 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

I'll try to recall this problem and actually read the patch tommorrow...

Hmm. but it doesn't apply to Linus's tree, you need to rebase it.
In particular, please note the recent commit 5431fdd2c181dd2eac2
("ptrace: Convert ptrace_attach() to use lock guards")

On 01/15, Bernd Edlinger wrote:
>
> The problem happens when a tracer tries to ptrace_attach
> to a multi-threaded process, that does an execve in one of
> the threads at the same time, without doing that in a forked
> sub-process.  That means: There is a race condition, when one
> or more of the threads are already ptraced, but the thread
> that invoked the execve is not yet traced.  Now in this
> case the execve locks the cred_guard_mutex and waits for
> de_thread to complete.  But that waits for the traced
> sibling threads to exit, and those have to wait for the
> tracer to receive the exit signal, but the tracer cannot
> call wait right now, because it is waiting for the ptrace
> call to complete, and this never does not happen.
> The traced process and the tracer are now in a deadlock
> situation, and can only be killed by a fatal signal.

This looks very confusing to me. And even misleading.

So IIRC the problem is "simple".

de_thread() sleeps with cred_guard_mutex waiting for other threads to
exit and pass release_task/__exit_signal.

If one of the sub-threads is traced, debugger should do ptrace_detach()
or wait() to release this tracee, the killed tracee won't autoreap.

Now. If debugger tries to take the same cred_guard_mutex before
detach/wait we have a deadlock. This is not specific to ptrace_attach(),
proc_pid_attr_write() takes this lock too.

Right? Or are there other issues?

> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct task_struct *t = tsk;
> +	bool unsafe_execve_in_progress = false;
>
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -1066,6 +1068,19 @@ static int de_thread(struct task_struct *tsk)
>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
>
> +	while_each_thread(tsk, t) {

for_other_threads()

> +		if (unlikely(t->ptrace)
> +		    && (t != tsk->group_leader || !t->exit_state))
> +			unsafe_execve_in_progress = true;

The !t->exit_state is not right... This sub-thread can already be a zombie
with ->exit_state != 0 but see above, it won't be reaped until the debugger
does wait().

> +	if (unlikely(unsafe_execve_in_progress)) {
> +		spin_unlock_irq(lock);
> +		sig->exec_bprm = bprm;
> +		mutex_unlock(&sig->cred_guard_mutex);
> +		spin_lock_irq(lock);

I don't understand why do we need to unlock and lock siglock here...

But my main question is why do we need the unsafe_execve_in_progress boolean.
If this patch is correct and de_thread() can drop and re-acquire cread_guard_mutex
when one of the threads is traced, then why can't we do this unconditionally ?

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-15 19:37         ` Matthew Wilcox
@ 2024-01-17  9:51           ` Bernd Edlinger
  0 siblings, 0 replies; 39+ messages in thread
From: Bernd Edlinger @ 2024-01-17  9:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Hans Liljestrand

On 1/15/24 20:37, Matthew Wilcox wrote:
> On Mon, Jan 15, 2024 at 08:22:19PM +0100, Bernd Edlinger wrote:
>> This introduces signal->exec_bprm, which is used to
>> fix the case when at least one of the sibling threads
>> is traced, and therefore the trace process may dead-lock
>> in ptrace_attach, but de_thread will need to wait for the
>> tracer to continue execution.
> 
> Not entirely sure why I've been added to the cc; this doesn't seem
> like it's even remotely within my realm of expertise.
> 

Ah, okay, never mind.
A couple new email addresses were found this time when I used
./scripts/get_maintainer.pl

>> +++ b/include/linux/cred.h
>> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
>>  extern struct cred *cred_alloc_blank(void);
>>  extern struct cred *prepare_creds(void);
>>  extern struct cred *prepare_exec_creds(void);
>> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
> 
> Using 'extern' for function declarations is deprecated.  More
> importantly, you have two arguments of the same type, and how do I know
> which one is which if you don't name them?
> 
>> +++ b/kernel/cred.c
>> @@ -375,6 +375,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>>  	return false;
>>  }
>>  
>> +/**
>> + * is_dumpability_changed - Will changing creds from old to new
>> + * affect the dumpability in commit_creds?
>> + *
>> + * Return: false - dumpability will not be changed in commit_creds.
>> + *         true  - dumpability will be changed to non-dumpable.
>> + *
>> + * @old: The old credentials
>> + * @new: The new credentials
>> + */
> 
> Does kernel-doc really parse this correctly?  Normal style would be:

Apparently yes, but I think I only added those lines to silence
some automatic checking bots.

> 
> /**
>  * is_dumpability_changed - Will changing creds affect dumpability?
>  * @old: The old credentials.
>  * @new: The new credentials.
>  *
>  * If the @new credentials have no elevated privileges compared to the
>  * @old credentials, the task may remain dumpable.  Otherwise we have
>  * to mark the task as undumpable to avoid information leaks from higher
>  * to lower privilege domains.
>  *
>  * Return: True if the task will become undumpable.
>  */
> 

Thanks a lot, that looks much better. I will use your suggestion as is,
when I re-send the patch next time.

>> @@ -508,6 +531,14 @@ static int ptrace_traceme(void)
>>  {
>>  	int ret = -EPERM;
>>  
>> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> +		return -ERESTARTNOINTR;
> 
> Do you really want this to be interruptible by a timer signal or a
> window resize event?
> 

I think that is kind of okay, as most of the existing users lock the mutex
also interruptible, so I just wanted to follow those examples.


Thanks
Bernd.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-16 15:22         ` Oleg Nesterov
@ 2024-01-17 15:07           ` Bernd Edlinger
  2024-01-17 16:38             ` Oleg Nesterov
  0 siblings, 1 reply; 39+ messages in thread
From: Bernd Edlinger @ 2024-01-17 15:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On 1/16/24 16:22, Oleg Nesterov wrote:
> I'll try to recall this problem and actually read the patch tommorrow...
> 
> Hmm. but it doesn't apply to Linus's tree, you need to rebase it.
> In particular, please note the recent commit 5431fdd2c181dd2eac2
> ("ptrace: Convert ptrace_attach() to use lock guards")
> 

Oh, how ugly...
Will this new C++-like "feature" ever make it into a stable branch?

> On 01/15, Bernd Edlinger wrote:
>>
>> The problem happens when a tracer tries to ptrace_attach
>> to a multi-threaded process, that does an execve in one of
>> the threads at the same time, without doing that in a forked
>> sub-process.  That means: There is a race condition, when one
>> or more of the threads are already ptraced, but the thread
>> that invoked the execve is not yet traced.  Now in this
>> case the execve locks the cred_guard_mutex and waits for
>> de_thread to complete.  But that waits for the traced
>> sibling threads to exit, and those have to wait for the
>> tracer to receive the exit signal, but the tracer cannot
>> call wait right now, because it is waiting for the ptrace
>> call to complete, and this never does not happen.
>> The traced process and the tracer are now in a deadlock
>> situation, and can only be killed by a fatal signal.
> 
> This looks very confusing to me. And even misleading.
> 
> So IIRC the problem is "simple".
> 
> de_thread() sleeps with cred_guard_mutex waiting for other threads to
> exit and pass release_task/__exit_signal.
> 
> If one of the sub-threads is traced, debugger should do ptrace_detach()
> or wait() to release this tracee, the killed tracee won't autoreap.
> 

Yes. but the tracer has to do its job, and that is ptrace_attach the
remaining treads, it does not know that it would avoid a dead-lock
when it calls wait(), instead of ptrace_attach.  It does not know
that the tracee has just called execve in one of the not yet traced
threads.

> Now. If debugger tries to take the same cred_guard_mutex before
> detach/wait we have a deadlock. This is not specific to ptrace_attach(),
> proc_pid_attr_write() takes this lock too.
> 
> Right? Or are there other issues?
> 

No, proc_pid_attr_write has no problem if it waits for cred_guard_mutex,
because it is only called from one of the sibling threads, and
zap_other_threads sends a SIGKILL to each of them, thus the
mutex_lock_interruptible will stop waiting, and the thread will 
exit normally.
It is only problematic when another process wants to lock the cred_guard_mutex,
because it is not receiving a signal, when de_thread is waiting.
The only other place where I am aware of this happening is ptrace_attach.

>> -static int de_thread(struct task_struct *tsk)
>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>  {
>>  	struct signal_struct *sig = tsk->signal;
>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>  	spinlock_t *lock = &oldsighand->siglock;
>> +	struct task_struct *t = tsk;
>> +	bool unsafe_execve_in_progress = false;
>>
>>  	if (thread_group_empty(tsk))
>>  		goto no_thread_group;
>> @@ -1066,6 +1068,19 @@ static int de_thread(struct task_struct *tsk)
>>  	if (!thread_group_leader(tsk))
>>  		sig->notify_count--;
>>
>> +	while_each_thread(tsk, t) {
> 
> for_other_threads()
> 

Ah, okay.

>> +		if (unlikely(t->ptrace)
>> +		    && (t != tsk->group_leader || !t->exit_state))
>> +			unsafe_execve_in_progress = true;
> 
> The !t->exit_state is not right... This sub-thread can already be a zombie
> with ->exit_state != 0 but see above, it won't be reaped until the debugger
> does wait().
> 

I dont think so.
de_thread() handles the group_leader different than normal threads.
That means normal threads have to wait for being released from the zombie
state by the tracer:
sig->notify_count > 0, and de_thread is woken up by __exit_signal
Once those are gone, de_thread waits for the group leader to reach
exit_state = ZOMBIE, but again only if the group_leader is not the
current thread:
signal->notify_count < 0, and de_thread is woken up by exit_notify.
So his reflects exactly what condition has to be met, see:

                        sig->notify_count = -1;
                        if (likely(leader->exit_state))
                                break;
                        __set_current_state(TASK_KILLABLE);
                        write_unlock_irq(&tasklist_lock);
                        cgroup_threadgroup_change_end(tsk);
                        schedule();
                        if (__fatal_signal_pending(tsk))
                                goto killed;

so when the group_leader's exit_state is already != 0 then the
second wait state will not be entered.
  
>> +	if (unlikely(unsafe_execve_in_progress)) {
>> +		spin_unlock_irq(lock);
>> +		sig->exec_bprm = bprm;
>> +		mutex_unlock(&sig->cred_guard_mutex);
>> +		spin_lock_irq(lock);
> 
> I don't understand why do we need to unlock and lock siglock here...
> 

That is just a precaution because I did want to release the
mutexes exactly in the reverse order as they were acquired.

> But my main question is why do we need the unsafe_execve_in_progress boolean.
> If this patch is correct and de_thread() can drop and re-acquire cread_guard_mutex
> when one of the threads is traced, then why can't we do this unconditionally ?
> 

I just wanted to keep the impact of the change as small as possible, including
possible performance degradation due to double checking of credentials.
Worst thing that could happen with this approach, is that a situation where today
a dead-lock is imminentm does still not work correctly, but when no tracer is attached,
nothing will change or be less performant than before.


Bernd.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-17 15:07           ` Bernd Edlinger
@ 2024-01-17 16:38             ` Oleg Nesterov
  2024-01-22 13:24               ` Bernd Edlinger
  0 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2024-01-17 16:38 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On 01/17, Bernd Edlinger wrote:
>
> >>
> >> The problem happens when a tracer tries to ptrace_attach
> >> to a multi-threaded process, that does an execve in one of
> >> the threads at the same time, without doing that in a forked
> >> sub-process.  That means: There is a race condition, when one
> >> or more of the threads are already ptraced, but the thread
> >> that invoked the execve is not yet traced.  Now in this
> >> case the execve locks the cred_guard_mutex and waits for
> >> de_thread to complete.  But that waits for the traced
> >> sibling threads to exit, and those have to wait for the
> >> tracer to receive the exit signal, but the tracer cannot
> >> call wait right now, because it is waiting for the ptrace
> >> call to complete, and this never does not happen.
> >> The traced process and the tracer are now in a deadlock
> >> situation, and can only be killed by a fatal signal.
> >
> > This looks very confusing to me. And even misleading.
> >
> > So IIRC the problem is "simple".
> >
> > de_thread() sleeps with cred_guard_mutex waiting for other threads to
> > exit and pass release_task/__exit_signal.
> >
> > If one of the sub-threads is traced, debugger should do ptrace_detach()
> > or wait() to release this tracee, the killed tracee won't autoreap.
> >
>
> Yes. but the tracer has to do its job, and that is ptrace_attach the
> remaining treads, it does not know that it would avoid a dead-lock
> when it calls wait(), instead of ptrace_attach.  It does not know
> that the tracee has just called execve in one of the not yet traced
> threads.

Hmm. I don't understand you.

I agree we have a problem which should be fixed. Just the changelog
looks confusing to me, imo it doesn't explain the race/problem clearly.

> > Now. If debugger tries to take the same cred_guard_mutex before
> > detach/wait we have a deadlock. This is not specific to ptrace_attach(),
> > proc_pid_attr_write() takes this lock too.
> >
> > Right? Or are there other issues?
> >
>
> No, proc_pid_attr_write has no problem if it waits for cred_guard_mutex,
> because it is only called from one of the sibling threads,

OK, thanks, I was wrong. I forgot about "A task may only write its own attributes".
So yes, ptrace_attach() is the only source of problematic mutex_lock() today.
There were more in the past.

> >> +		if (unlikely(t->ptrace)
> >> +		    && (t != tsk->group_leader || !t->exit_state))
> >> +			unsafe_execve_in_progress = true;
> >
> > The !t->exit_state is not right... This sub-thread can already be a zombie
> > with ->exit_state != 0 but see above, it won't be reaped until the debugger
> > does wait().
> >
>
> I dont think so.
> de_thread() handles the group_leader different than normal threads.

I don't follow...

I didn't say that t is a group leader. I said it can be a zombie sub-thread
with ->exit_state != 0.

> That means normal threads have to wait for being released from the zombie
> state by the tracer:
> sig->notify_count > 0, and de_thread is woken up by __exit_signal

That is what I said before. Debugger should release a zombie sub-thread,
it won't do __exit_signal() on its own.

> >> +	if (unlikely(unsafe_execve_in_progress)) {
> >> +		spin_unlock_irq(lock);
> >> +		sig->exec_bprm = bprm;
> >> +		mutex_unlock(&sig->cred_guard_mutex);
> >> +		spin_lock_irq(lock);
> >
> > I don't understand why do we need to unlock and lock siglock here...
>
> That is just a precaution because I did want to release the
> mutexes exactly in the reverse order as they were acquired.

To me this adds the unnecessary complication.

> > But my main question is why do we need the unsafe_execve_in_progress boolean.
> > If this patch is correct and de_thread() can drop and re-acquire cread_guard_mutex
> > when one of the threads is traced, then why can't we do this unconditionally ?
> >
>
> I just wanted to keep the impact of the change as small as possible,

But the unsafe_execve_in_progress logic increases the impact and complicates
the patch.

I think the fix should be as simple as possible. (to be honest, right now
I don't think this is a right approach).

> including
> possible performance degradation due to double checking of credentials.

Not sure I understand, but you can add the performance improvements later.
Not to mention that this should be justified, and the for_other_threads()
loop added by this patch into de_thread() is not nice performance-wise.

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-17 16:38             ` Oleg Nesterov
@ 2024-01-22 13:24               ` Bernd Edlinger
  2024-01-22 13:44                 ` Oleg Nesterov
  2024-01-22 21:30                 ` Kees Cook
  0 siblings, 2 replies; 39+ messages in thread
From: Bernd Edlinger @ 2024-01-22 13:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On 1/17/24 17:38, Oleg Nesterov wrote:
> On 01/17, Bernd Edlinger wrote:
>> Yes. but the tracer has to do its job, and that is ptrace_attach the
>> remaining treads, it does not know that it would avoid a dead-lock
>> when it calls wait(), instead of ptrace_attach.  It does not know
>> that the tracee has just called execve in one of the not yet traced
>> threads.
> 
> Hmm. I don't understand you.

Certainly I am willing to rephrase this until it is understandable.
Probably I have just not yet found the proper way to describe the issue
here, and your help in resolving that documentation issue is very important
to me.

> 
> I agree we have a problem which should be fixed. Just the changelog
> looks confusing to me, imo it doesn't explain the race/problem clearly.
> 

I am trying here to summarize what the test case "attach" in
./tools/testing/selftests/ptrace/vmaccess.c does.

I think it models the use case of a tracer that is trying to attach
to a multi-threaded process that is executing execve in a not-yet
traced thread while a different sub-thread is already traced,
it is not relevant that the test case uses PTRACE_TRACEME, to make
the sub-thead traced, the same would happen if the tracer uses
some out-of-band mechanism like /proc/pid/task to learn the thread_id
of the sub-threads and uses ptrace_attach to each of them.

The test case hits the dead-lock because there is a race condition
between before the PTRACE_ATTACH, and it cannot know that the
exit event from the sub-thread is already pending before the
PTRACE_ATTACH.  Of course a real tracer will not sleep a whole
second before a PTRACE_ATTACH, but even if it does a
waitpid(-1, &s, WNOHANG) immediately before the PTRACE_ATTACH
there is a tiny chance that the execve is entered just immediately
after waitpid has indicated that there is currently not
event pending.

>>>> +		if (unlikely(t->ptrace)
>>>> +		    && (t != tsk->group_leader || !t->exit_state))
>>>> +			unsafe_execve_in_progress = true;
>>>
>>> The !t->exit_state is not right... This sub-thread can already be a zombie
>>> with ->exit_state != 0 but see above, it won't be reaped until the debugger
>>> does wait().
>>>
>>
>> I dont think so.
>> de_thread() handles the group_leader different than normal threads.
> 
> I don't follow...
> 
> I didn't say that t is a group leader. I said it can be a zombie sub-thread
> with ->exit_state != 0.
> 

the condition here is 

(t != tsk->group_leader || !t->exit_state)

so in other words, if t is a sub-thread, i.e. t != tsk->group_leader
then the t->exit_state does not count, and the deadlock is possible.

But if t it is a group leader, then t == tsk->group_leader, but a
deadlock is only possible when t->exit_state == 0 at this time.
The most likely reason for this is PTRACE_O_TRACEEXIT.

I will add a new test case that demonstrates this in the next iteration
of this patch.  Here is a preview of what I have right now:

/*
 * Same test as previous, except that
 * the group leader is ptraced first,
 * but this time with PTRACE_O_TRACEEXIT,
 * and the thread that does execve is
 * not yet ptraced.  This exercises the
 * code block in de_thread where the
 * if (!thread_group_leader(tsk)) {
 * is executed and enters a wait state.
 */
static long thread2_tid;
static void *thread2(void *arg)
{
	thread2_tid = syscall(__NR_gettid);
	sleep(2);
	execlp("false", "false", NULL);
	return NULL;
}

TEST(attach2)
{
	int s, k, pid = fork();

	if (!pid) {
		pthread_t pt;

		pthread_create(&pt, NULL, thread2, NULL);
		pthread_join(pt, NULL);
		return;
	}

	sleep(1);
	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
	ASSERT_EQ(k, 0);
	k = waitpid(-1, &s, 0);
	ASSERT_EQ(k, pid);
	ASSERT_EQ(WIFSTOPPED(s), 1);
	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
	ASSERT_EQ(k, 0);
	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
	ASSERT_NE(thread2_tid, -1);
	ASSERT_NE(thread2_tid, 0);
	ASSERT_NE(thread2_tid, pid);
	k = waitpid(-1, &s, WNOHANG);
	ASSERT_EQ(k, 0);
	sleep(2);
	/* deadlock may happen here */
	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
	ASSERT_EQ(k, 0);
	k = waitpid(-1, &s, WNOHANG);
	ASSERT_EQ(k, pid);
	ASSERT_EQ(WIFSTOPPED(s), 1);
	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
	k = waitpid(-1, &s, WNOHANG);
	ASSERT_EQ(k, 0);
	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
	ASSERT_EQ(k, 0);
	k = waitpid(-1, &s, 0);
	ASSERT_EQ(k, pid);
	ASSERT_EQ(WIFSTOPPED(s), 1);
	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
	k = waitpid(-1, &s, WNOHANG);
	ASSERT_EQ(k, 0);
	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
	ASSERT_EQ(k, 0);
	k = waitpid(-1, &s, 0);
	ASSERT_EQ(k, pid);
	ASSERT_EQ(WIFSTOPPED(s), 1);
	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
	k = waitpid(-1, &s, WNOHANG);
	ASSERT_EQ(k, 0);
	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
	ASSERT_EQ(k, 0);
	k = waitpid(-1, &s, 0);
	ASSERT_EQ(k, pid);
	ASSERT_EQ(WIFEXITED(s), 1);
	ASSERT_EQ(WEXITSTATUS(s), 1);
	k = waitpid(-1, NULL, 0);
	ASSERT_EQ(k, -1);
	ASSERT_EQ(errno, ECHILD);
}

So the traced process does the execve in the sub-thread,
and the tracer attaches the thread leader first, and when
the next PTRACE_ATTACH happens, the thread leader is stopped
because of the PTRACE_O_TACEEXIT.  So at that time,
the t->exit_state == 0, and we receive the following:

	k = waitpid(-1, &s, WNOHANG);
	ASSERT_EQ(k, pid);
	ASSERT_EQ(WIFSTOPPED(s), 1);
	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);

yet the de_thread is not finished now, but only when

	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
	ASSERT_EQ(k, 0);

happens.

The only thing, that is admittedly pretty confusing here, is
the fact that the thread2_tid morphs into group leader's pid,
at this time, and is therefore never heard of again.

So pid refers to the former thread2_tid from now on, and the
former group leader does not enter the usual zombie state here,
because the sub-thread takes over it's role.

>>>> +	if (unlikely(unsafe_execve_in_progress)) {
>>>> +		spin_unlock_irq(lock);
>>>> +		sig->exec_bprm = bprm;
>>>> +		mutex_unlock(&sig->cred_guard_mutex);
>>>> +		spin_lock_irq(lock);
>>>
>>> I don't understand why do we need to unlock and lock siglock here...
>>
>> That is just a precaution because I did want to release the
>> mutexes exactly in the reverse order as they were acquired.
> 
> To me this adds the unnecessary complication.
> 

Well, the proposed change creates this sequence

mutex_lock(&sig_cred_guard_mutex);
spin_lock_irq(lock);
mutex_unlock(&sig_cred_guard_mutex);
spin_unlock_irq(lock);

I wanted to avoid that, because in a usual real-time os,
I'd expect the mutex_unlock to schedule another waiting
task, regardless of the spin lock state.

Are you saying that doing this is safe to do in linux,
because the scheduling does not happen except when
explicitly asked for e.g. by calling schedule() ?
And would that also be safe for real time linux port ?

>>> But my main question is why do we need the unsafe_execve_in_progress boolean.
>>> If this patch is correct and de_thread() can drop and re-acquire cread_guard_mutex
>>> when one of the threads is traced, then why can't we do this unconditionally ?
>>>
>>
>> I just wanted to keep the impact of the change as small as possible,
> 
> But the unsafe_execve_in_progress logic increases the impact and complicates
> the patch.
> 
> I think the fix should be as simple as possible. (to be honest, right now
> I don't think this is a right approach).
> 

The main concern was when a set-suid program is executed by execve.
Then it makes a difference if the current thread is traced before the
execve or not.  That means if the current thread is already traced,
the decision, which credentials will be used is different than otherwise.

So currently there are two possbilities, either the trace happens
before the execve, and the suid-bit will be ignored, or the trace
happens after the execve, but it is checked that the now potentially
more privileged credentials allow the tracer to proceed.

With this patch we will have a third prossibility, that is in order
to avoid the possible dead-lock we allow the suid-bit to take effect,
but only if the tracer's privileges allow both to attach the current
credentials and the new credentials.  But I would only do that as
a last resort, to avoid the possible dead-lock, and not unless a dead-lock
is really expected to happen.

Thanks
Bernd.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-22 13:24               ` Bernd Edlinger
@ 2024-01-22 13:44                 ` Oleg Nesterov
  2024-01-22 21:30                 ` Kees Cook
  1 sibling, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2024-01-22 13:44 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

I'll try to read your email later, just one note for now...

On 01/22, Bernd Edlinger wrote:
>
> > I didn't say that t is a group leader. I said it can be a zombie sub-thread
> > with ->exit_state != 0.
>
> the condition here is
>
> (t != tsk->group_leader || !t->exit_state)
>
> so in other words, if t is a sub-thread, i.e. t != tsk->group_leader
> then the t->exit_state does not count,

Ah indeed, somehow I misread this check as if you skip the sub-threads
with ->exit_state != 0.

Sorry for noise.

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-22 13:24               ` Bernd Edlinger
  2024-01-22 13:44                 ` Oleg Nesterov
@ 2024-01-22 21:30                 ` Kees Cook
  2024-01-23 18:30                   ` Bernd Edlinger
  1 sibling, 1 reply; 39+ messages in thread
From: Kees Cook @ 2024-01-22 21:30 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Oleg Nesterov, Alexander Viro, Alexey Dobriyan, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On Mon, Jan 22, 2024 at 02:24:37PM +0100, Bernd Edlinger wrote:
> The main concern was when a set-suid program is executed by execve.
> Then it makes a difference if the current thread is traced before the
> execve or not.  That means if the current thread is already traced,
> the decision, which credentials will be used is different than otherwise.
> 
> So currently there are two possbilities, either the trace happens
> before the execve, and the suid-bit will be ignored, or the trace
> happens after the execve, but it is checked that the now potentially
> more privileged credentials allow the tracer to proceed.
> 
> With this patch we will have a third prossibility, that is in order
> to avoid the possible dead-lock we allow the suid-bit to take effect,
> but only if the tracer's privileges allow both to attach the current
> credentials and the new credentials.  But I would only do that as
> a last resort, to avoid the possible dead-lock, and not unless a dead-lock
> is really expected to happen.

Instead of doing this special cred check (which I am worried could
become fragile -- I'd prefer all privilege checks happen in the same
place and in the same way...), could we just fail the ptrace_attach of
the execve?

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-22 21:30                 ` Kees Cook
@ 2024-01-23 18:30                   ` Bernd Edlinger
  2024-01-24  0:09                     ` Kees Cook
  0 siblings, 1 reply; 39+ messages in thread
From: Bernd Edlinger @ 2024-01-23 18:30 UTC (permalink / raw)
  To: Kees Cook
  Cc: Oleg Nesterov, Alexander Viro, Alexey Dobriyan, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On 1/22/24 22:30, Kees Cook wrote:
> On Mon, Jan 22, 2024 at 02:24:37PM +0100, Bernd Edlinger wrote:
>> The main concern was when a set-suid program is executed by execve.
>> Then it makes a difference if the current thread is traced before the
>> execve or not.  That means if the current thread is already traced,
>> the decision, which credentials will be used is different than otherwise.
>>
>> So currently there are two possbilities, either the trace happens
>> before the execve, and the suid-bit will be ignored, or the trace
>> happens after the execve, but it is checked that the now potentially
>> more privileged credentials allow the tracer to proceed.
>>
>> With this patch we will have a third prossibility, that is in order
>> to avoid the possible dead-lock we allow the suid-bit to take effect,
>> but only if the tracer's privileges allow both to attach the current
>> credentials and the new credentials.  But I would only do that as
>> a last resort, to avoid the possible dead-lock, and not unless a dead-lock
>> is really expected to happen.
> 
> Instead of doing this special cred check (which I am worried could
> become fragile -- I'd prefer all privilege checks happen in the same
> place and in the same way...), could we just fail the ptrace_attach of
> the execve?
> 

Hmm, yes.  That is also possible, and that was actually my first approach,
but I think the current patch is superior.  I have nevertheless tried it
again, to get a better picture of the differences between those two approaches.

See below for how that alternative approach would look like:
+ the advantage of that would be simplicity.
+ it avoids the dead-lock in the tracer.
- it is an API change, which we normally try to avoid.
- the adjusted test case(s) show that the tracer cannot successfully
attach to the resulting process before the execve'd process starts up.
So although there is no suid process involved in my test cases,
the traced program simply escapes out of the tracer's control.

The advantage of the current approach would be:
+ it avoids the dead-lock in the tracer
+ it avoids a potentially breaking API change.
+ the tracer is normally able to successfully attach to the
resulting process after the execve completes, before it starts
to execute.
+ the debug experience is just better.
- worst case that can happen, is that the security policy denies the
tracer the access to the new process after the execve.  In that case
the PTRACE_ATTACH will fail each time it is attempted, in a similar way
as the the alternate approach.  But the overall result is still correct.
The privileged process escapes, and that is okay in that case.
- it is theoretically possible that the security engine gets confused by
the additional call to security_ptrace_access_check, but that will be
something that can be fixed, when it happens.

However my main motivation, why I started this work was the security implication.

I assume the tracer is a privileged process, like an anti virus program,
that supervises all processes and if it detects some anomaly it can
ptrace attach to the target, check what it does and prevent it from doing
bad things.

- Currently a non-privileged program can potentially send such a privileged
tracer into a deadlock.
- With the alternative patch below that non-privileged can no longer send the
tracer into a deadlock, but it can still quickly escape out of the tracer's
control.
- But with my latest patch a sufficiently privileged tracer can neither be
sent into a deadlock nor can the attached process escape.  Mission completed.


Thanks
Bernd.


Here is the alternative patch for reference:
diff --git a/fs/exec.c b/fs/exec.c
index e88249a1ce07..0a948f5821b7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1045,6 +1045,8 @@ static int de_thread(struct task_struct *tsk)
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1067,6 +1069,18 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	for_other_threads(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1157,6 +1171,9 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress))
+		mutex_lock(&sig->cred_guard_mutex);
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1168,6 +1185,9 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress))
+		mutex_lock(&sig->cred_guard_mutex);
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1479,6 +1499,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->group_exec_task)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 98a031ac2648..55816320c103 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2785,6 +2785,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->group_exec_task)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 2fabd497d659..162e4c8f7b08 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -444,6 +444,9 @@ static int ptrace_attach(struct task_struct *task, long request,
 	scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
 			   &task->signal->cred_guard_mutex) {
 
+		if (unlikely(task->signal->group_exec_task))
+			return -EAGAIN;
+
 		scoped_guard (task_lock, task) {
 			retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 			if (retval)
@@ -491,6 +494,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->group_exec_task)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -506,6 +517,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index aca7b437882e..6a136d6ddf7c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->group_exec_task)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..7a51a350a068 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -14,6 +14,7 @@
 #include <signal.h>
 #include <unistd.h>
 #include <sys/ptrace.h>
+#include <sys/syscall.h>
 
 static void *thread(void *arg)
 {
@@ -23,7 +24,7 @@ static void *thread(void *arg)
 
 TEST(vmaccess)
 {
-	int f, pid = fork();
+	int s, f, pid = fork();
 	char mm[64];
 
 	if (!pid) {
@@ -31,19 +32,42 @@ TEST(vmaccess)
 
 		pthread_create(&pt, NULL, thread, NULL);
 		pthread_join(pt, NULL);
-		execlp("true", "true", NULL);
+		execlp("false", "false", NULL);
+		return;
 	}
 
 	sleep(1);
 	sprintf(mm, "/proc/%d/mem", pid);
+	/* deadlock did happen here */
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, &s, WNOHANG);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	f = waitpid(-1, &s, 0);
+	ASSERT_EQ(f, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
-TEST(attach)
+/*
+ * Same test as previous, except that
+ * we try to ptrace the group leader,
+ * which is about to call execve,
+ * when the other thread is already ptraced.
+ * This exercises the code in de_thread
+ * where it is waiting inside the
+ * while (sig->notify_count) {
+ * loop.
+ */
+TEST(attach1)
 {
 	int s, k, pid = fork();
 
@@ -52,19 +76,67 @@ TEST(attach)
 
 		pthread_create(&pt, NULL, thread, NULL);
 		pthread_join(pt, NULL);
-		execlp("sleep", "sleep", "2", NULL);
+		execlp("false", "false", NULL);
+		return;
 	}
 
 	sleep(1);
+	/* deadlock may happen here */
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
 	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, EAGAIN);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, EAGAIN);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ESRCH);
+	k = waitpid(-1, NULL, 0);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Same test as previous, except that
+ * the group leader is ptraced first,
+ * but this time with PTRACE_O_TRACEEXIT,
+ * and the thread that does execve is
+ * not yet ptraced.  This exercises the
+ * code block in de_thread where the
+ * if (!thread_group_leader(tsk)) {
+ * is executed and enters a wait state.
+ */
+static long thread2_tid;
+static void *thread2(void *arg)
+{
+	thread2_tid = syscall(__NR_gettid);
+	sleep(2);
+	execlp("false", "false", NULL);
+	return NULL;
+}
+
+TEST(attach2)
+{
+	int s, k, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread2, NULL);
+		pthread_join(pt, NULL);
+		return;
+	}
+
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
@@ -72,12 +144,40 @@ TEST(attach)
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
+	ASSERT_EQ(k, 0);
+	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
+	ASSERT_NE(thread2_tid, -1);
+	ASSERT_NE(thread2_tid, 0);
+	ASSERT_NE(thread2_tid, pid);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	sleep(2);
+	/* deadlock may happen here */
+	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, EAGAIN);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, EAGAIN);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, EAGAIN);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
-	ASSERT_EQ(WEXITSTATUS(s), 0);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ESRCH);
 	k = waitpid(-1, NULL, 0);
 	ASSERT_EQ(k, -1);
 	ASSERT_EQ(errno, ECHILD);



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach
  2024-01-23 18:30                   ` Bernd Edlinger
@ 2024-01-24  0:09                     ` Kees Cook
  0 siblings, 0 replies; 39+ messages in thread
From: Kees Cook @ 2024-01-24  0:09 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Oleg Nesterov, Alexander Viro, Alexey Dobriyan, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On Tue, Jan 23, 2024 at 07:30:52PM +0100, Bernd Edlinger wrote:
> - Currently a non-privileged program can potentially send such a privileged
> tracer into a deadlock.
> - With the alternative patch below that non-privileged can no longer send the
> tracer into a deadlock, but it can still quickly escape out of the tracer's
> control.
> - But with my latest patch a sufficiently privileged tracer can neither be
> sent into a deadlock nor can the attached process escape.  Mission completed.

Thanks for the details. And it would be pretty unfriendly to fail the execve()
too (or, rather, it makes the execve failure unpredictable). I'll keep
reading your patch...

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v15] exec: Fix dead-lock in de_thread with ptrace_attach
       [not found]         ` <AS8P193MB1285937F9831CECAF2A9EEE2E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
@ 2025-08-18  6:04           ` Jain, Ayush
  2025-08-18 20:53           ` [PATCH v16] " Bernd Edlinger
  1 sibling, 0 replies; 39+ messages in thread
From: Jain, Ayush @ 2025-08-18  6:04 UTC (permalink / raw)
  To: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
	Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
	Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
	Eric W. Biederman, Adrian Reber, Thomas Gleixner, Jens Axboe,
	Alexei Starovoitov, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest, linux-mm, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Zheng Yejian,
	Elena Reshetova, David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Naveen N Rao

Hello Bernd,
As it's been a long standing issue, and test still fails.

Will you be working on fixing it again in future?
or it might be better to remove the testcase?

  $ ./vmaccess
  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           global.vmaccess ...
  #            OK  global.vmaccess
  ok 1 global.vmaccess
  #  RUN           global.attach ...
  # attach: Test terminated by timeout
  #          FAIL  global.attach
  not ok 2 global.attach
  # FAILED: 1 / 2 tests passed.
  # Totals: pass:1 fail:1 xfail:0 xpass:0 skip:0 error:0

Thanks,
Ayush

On 1/23/2024 12:01 AM, Bernd Edlinger wrote:
> This introduces signal->exec_bprm, which is used to
> fix the case when at least one of the sibling threads
> is traced, and therefore the trace process may dead-lock
> in ptrace_attach, but de_thread will need to wait for the
> tracer to continue execution.
> 
> The problem happens when a tracer tries to ptrace_attach
> to a multi-threaded process, that does an execve in one of
> the threads at the same time, without doing that in a forked
> sub-process.  That means: There is a race condition, when one
> or more of the threads are already ptraced, but the thread
> that invoked the execve is not yet traced.  Now in this
> case the execve locks the cred_guard_mutex and waits for
> de_thread to complete.  But that waits for the traced
> sibling threads to exit, and those have to wait for the
> tracer to receive the exit signal, but the tracer cannot
> call wait right now, because it is waiting for the ptrace
> call to complete, and this never does not happen.
> The traced process and the tracer are now in a deadlock
> situation, and can only be killed by a fatal signal.
> 
> The solution is to detect this situation and allow
> ptrace_attach to continue by temporarily releasing the
> cred_guard_mutex, while de_thread() is still waiting for
> traced zombies to be eventually released by the tracer.
> In the case of the thread group leader we only have to wait
> for the thread to become a zombie, which may also need
> co-operation from the tracer due to PTRACE_O_TRACEEXIT.
> 
> When a tracer wants to ptrace_attach a task that already
> is in execve, we simply retry the ptrace_may_access
> check while temporarily installing the new credentials
> and dumpability which are about to be used after execve
> completes.  If the ptrace_attach happens on a thread that
> is a sibling-thread of the thread doing execve, it is
> sufficient to check against the old credentials, as this
> thread will be waited for, before the new credentials are
> installed.
> 
> Other threads die quickly since the cred_guard_mutex is
> released, but a deadly signal is already pending.  In case
> the mutex_lock_killable misses the signal, the non-zero
> current->signal->exec_bprm makes sure they release the
> mutex immediately and return with -ERESTARTNOINTR.
> 
> This means there is no API change, unlike the previous
> version of this patch which was discussed here:
> 
> https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/
> 
> See tools/testing/selftests/ptrace/vmaccess.c
> for a test case that gets fixed by this change.
> 
> Note that since the test case was originally designed to
> test the ptrace_attach returning an error in this situation,
> the test expectation needed to be adjusted, to allow the
> API to succeed at the first attempt.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  fs/exec.c                                 |  69 ++++++++---
>  fs/proc/base.c                            |   6 +
>  include/linux/cred.h                      |   1 +
>  include/linux/sched/signal.h              |  18 +++
>  kernel/cred.c                             |  30 ++++-
>  kernel/ptrace.c                           |  31 +++++
>  kernel/seccomp.c                          |  12 +-
>  tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
>  8 files changed, 265 insertions(+), 37 deletions(-)
> 
> v10: Changes to previous version, make the PTRACE_ATTACH
> return -EAGAIN, instead of execve return -ERESTARTSYS.
> Added some lessions learned to the description.
> 
> v11: Check old and new credentials in PTRACE_ATTACH again without
> changing the API.
> 
> Note: I got actually one response from an automatic checker to the v11 patch,
> 
> https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
> 
> which is complaining about:
> 
>>>>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct cred const *old_cred @@     got struct cred const [noderef] __rcu *real_cred @@
> 
>    417			struct linux_binprm *bprm = task->signal->exec_bprm;
>    418			const struct cred *old_cred;
>    419			struct mm_struct *old_mm;
>    420	
>    421			retval = down_write_killable(&task->signal->exec_update_lock);
>    422			if (retval)
>    423				goto unlock_creds;
>    424			task_lock(task);
>  > 425			old_cred = task->real_cred;
> 
> v12: Essentially identical to v11.
> 
> - Fixed a minor merge conflict in linux v5.17, and fixed the
> above mentioned nit by adding __rcu to the declaration.
> 
> - re-tested the patch with all linux versions from v5.11 to v6.6
> 
> v10 was an alternative approach which did imply an API change.
> But I would prefer to avoid such an API change.
> 
> The difficult part is getting the right dumpability flags assigned
> before de_thread starts, hope you like this version.
> If not, the v10 is of course also acceptable.
> 
> v13: Fixed duplicated Return section in function header of
> is_dumpability_changed which was reported by the kernel test robot
> 
> v14: rebased to v6.7, refreshed and retested.
> And added a more detailed description of the actual bug.
> 
> v15: rebased to v6.8-rc1, addressed some review comments.
> Split the test case vmaccess into vmaccess1 and vmaccess2
> to improve overall test coverage.
> 
> 
> Thanks
> Bernd.
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index e88249a1ce07..499380d74899 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1040,11 +1040,13 @@ static int exec_mmap(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct task_struct *t;
> +	bool unsafe_execve_in_progress = false;
>  
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -1067,6 +1069,19 @@ static int de_thread(struct task_struct *tsk)
>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
>  
> +	for_other_threads(tsk, t) {
> +		if (unlikely(t->ptrace)
> +		    && (t != tsk->group_leader || !t->exit_state))
> +			unsafe_execve_in_progress = true;
> +	}
> +
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		spin_unlock_irq(lock);
> +		sig->exec_bprm = bprm;
> +		mutex_unlock(&sig->cred_guard_mutex);
> +		spin_lock_irq(lock);
> +	}
> +
>  	while (sig->notify_count) {
>  		__set_current_state(TASK_KILLABLE);
>  		spin_unlock_irq(lock);
> @@ -1157,6 +1172,11 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	sig->group_exec_task = NULL;
>  	sig->notify_count = 0;
>  
> @@ -1168,6 +1188,11 @@ static int de_thread(struct task_struct *tsk)
>  	return 0;
>  
>  killed:
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	/* protects against exit_notify() and __exit_signal() */
>  	read_lock(&tasklist_lock);
>  	sig->group_exec_task = NULL;
> @@ -1252,6 +1277,24 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		return retval;
>  
> +	/* If the binary is not readable then enforce mm->dumpable=0 */
> +	would_dump(bprm, bprm->file);
> +	if (bprm->have_execfd)
> +		would_dump(bprm, bprm->executable);
> +
> +	/*
> +	 * Figure out dumpability. Note that this checking only of current
> +	 * is wrong, but userspace depends on it. This should be testing
> +	 * bprm->secureexec instead.
> +	 */
> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> +	    !(uid_eq(current_euid(), current_uid()) &&
> +	      gid_eq(current_egid(), current_gid())))
> +		set_dumpable(bprm->mm, suid_dumpable);
> +	else
> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> +
>  	/*
>  	 * Ensure all future errors are fatal.
>  	 */
> @@ -1260,7 +1303,7 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	/*
>  	 * Make this the only thread in the thread group.
>  	 */
> -	retval = de_thread(me);
> +	retval = de_thread(me, bprm);
>  	if (retval)
>  		goto out;
>  
> @@ -1283,11 +1326,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> -	/* If the binary is not readable then enforce mm->dumpable=0 */
> -	would_dump(bprm, bprm->file);
> -	if (bprm->have_execfd)
> -		would_dump(bprm, bprm->executable);
> -
>  	/*
>  	 * Release all of the old mmap stuff
>  	 */
> @@ -1349,18 +1387,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  
>  	me->sas_ss_sp = me->sas_ss_size = 0;
>  
> -	/*
> -	 * Figure out dumpability. Note that this checking only of current
> -	 * is wrong, but userspace depends on it. This should be testing
> -	 * bprm->secureexec instead.
> -	 */
> -	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> -	    !(uid_eq(current_euid(), current_uid()) &&
> -	      gid_eq(current_egid(), current_gid())))
> -		set_dumpable(current->mm, suid_dumpable);
> -	else
> -		set_dumpable(current->mm, SUID_DUMP_USER);
> -
>  	perf_event_exec();
>  	__set_task_comm(me, kbasename(bprm->filename), true);
>  
> @@ -1479,6 +1505,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>  		return -ERESTARTNOINTR;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}
> +
>  	bprm->cred = prepare_exec_creds();
>  	if (likely(bprm->cred))
>  		return 0;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 98a031ac2648..eab3461e4da7 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2785,6 +2785,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
>  	if (rv < 0)
>  		goto out_free;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		rv = -ERESTARTNOINTR;
> +		goto out_free;
> +	}
> +
>  	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
>  				  file->f_path.dentry->d_name.name, page,
>  				  count);
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..a1a1ac38f749 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
>  extern struct cred *cred_alloc_blank(void);
>  extern struct cred *prepare_creds(void);
>  extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
>  extern int commit_creds(struct cred *);
>  extern void abort_creds(struct cred *);
>  extern const struct cred *override_creds(const struct cred *);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 4b7664c56208..6364e115e9e9 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -235,9 +235,27 @@ struct signal_struct {
>  	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
>  					 * killed by the oom killer */
>  
> +	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
> +					 * against new credentials while
> +					 * de_thread is waiting for other
> +					 * traced threads to terminate.
> +					 * Set while de_thread is executing.
> +					 * The cred_guard_mutex is released
> +					 * after de_thread() has called
> +					 * zap_other_threads(), therefore
> +					 * a fatal signal is guaranteed to be
> +					 * already pending in the unlikely
> +					 * event, that
> +					 * current->signal->exec_bprm happens
> +					 * to be non-zero after the
> +					 * cred_guard_mutex was acquired.
> +					 */
> +
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
>  					 * (notably. ptrace)
> +					 * Held while execve runs, except when
> +					 * a sibling thread is being traced.
>  					 * Deprecated do not use in new code.
>  					 * Use exec_update_lock instead.
>  					 */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index c033a201c808..0066b5b0f052 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>  	return false;
>  }
>  
> +/**
> + * is_dumpability_changed - Will changing creds affect dumpability?
> + * @old: The old credentials.
> + * @new: The new credentials.
> + *
> + * If the @new credentials have no elevated privileges compared to the
> + * @old credentials, the task may remain dumpable.  Otherwise we have
> + * to mark the task as undumpable to avoid information leaks from higher
> + * to lower privilege domains.
> + *
> + * Return: True if the task will become undumpable.
> + */
> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)
> +{
> +	if (!uid_eq(old->euid, new->euid) ||
> +	    !gid_eq(old->egid, new->egid) ||
> +	    !uid_eq(old->fsuid, new->fsuid) ||
> +	    !gid_eq(old->fsgid, new->fsgid) ||
> +	    !cred_cap_issubset(old, new))
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * commit_creds - Install new credentials upon the current task
>   * @new: The credentials to be assigned
> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
>  	get_cred(new); /* we will require a ref for the subj creds too */
>  
>  	/* dumpability changes */
> -	if (!uid_eq(old->euid, new->euid) ||
> -	    !gid_eq(old->egid, new->egid) ||
> -	    !uid_eq(old->fsuid, new->fsuid) ||
> -	    !gid_eq(old->fsgid, new->fsgid) ||
> -	    !cred_cap_issubset(old, new)) {
> +	if (is_dumpability_changed(old, new)) {
>  		if (task->mm)
>  			set_dumpable(task->mm, suid_dumpable);
>  		task->pdeath_signal = 0;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 2fabd497d659..4b9a951b38f1 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -20,6 +20,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
> +#include <linux/binfmts.h>
>  #include <linux/signal.h>
>  #include <linux/uio.h>
>  #include <linux/audit.h>
> @@ -450,6 +451,27 @@ static int ptrace_attach(struct task_struct *task, long request,
>  				return retval;
>  		}
>  
> +		if (unlikely(task->in_execve)) {
> +			retval = down_write_killable(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +
> +			scoped_guard (task_lock, task) {
> +				struct linux_binprm *bprm = task->signal->exec_bprm;
> +				const struct cred __rcu *old_cred = task->real_cred;
> +				struct mm_struct *old_mm = task->mm;
> +				rcu_assign_pointer(task->real_cred, bprm->cred);
> +				task->mm = bprm->mm;
> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +				rcu_assign_pointer(task->real_cred, old_cred);
> +				task->mm = old_mm;
> +			}
> +
> +			up_write(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +		}
> +
>  		scoped_guard (write_lock_irq, &tasklist_lock) {
>  			if (unlikely(task->exit_state))
>  				return -EPERM;
> @@ -491,6 +513,14 @@ static int ptrace_traceme(void)
>  {
>  	int ret = -EPERM;
>  
> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> +		return -ERESTARTNOINTR;
> +
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}
> +
>  	write_lock_irq(&tasklist_lock);
>  	/* Are we already being traced? */
>  	if (!current->ptrace) {
> @@ -506,6 +536,7 @@ static int ptrace_traceme(void)
>  		}
>  	}
>  	write_unlock_irq(&tasklist_lock);
> +	mutex_unlock(&current->signal->cred_guard_mutex);
>  
>  	return ret;
>  }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index aca7b437882e..32ed0da5939a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
> -	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> -	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_put_fd;
> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> +		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
> +			goto out_put_fd;
> +
> +		if (unlikely(current->signal->exec_bprm)) {
> +			mutex_unlock(&current->signal->cred_guard_mutex);
> +			goto out_put_fd;
> +		}
> +	}
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> index 4db327b44586..5d4a65eb5a8d 100644
> --- a/tools/testing/selftests/ptrace/vmaccess.c
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -14,6 +14,7 @@
>  #include <signal.h>
>  #include <unistd.h>
>  #include <sys/ptrace.h>
> +#include <sys/syscall.h>
>  
>  static void *thread(void *arg)
>  {
> @@ -23,7 +24,7 @@ static void *thread(void *arg)
>  
>  TEST(vmaccess)
>  {
> -	int f, pid = fork();
> +	int s, f, pid = fork();
>  	char mm[64];
>  
>  	if (!pid) {
> @@ -31,19 +32,42 @@ TEST(vmaccess)
>  
>  		pthread_create(&pt, NULL, thread, NULL);
>  		pthread_join(pt, NULL);
> -		execlp("true", "true", NULL);
> +		execlp("false", "false", NULL);
> +		return;
>  	}
>  
>  	sleep(1);
>  	sprintf(mm, "/proc/%d/mem", pid);
> +	/* deadlock did happen here */
>  	f = open(mm, O_RDONLY);
>  	ASSERT_GE(f, 0);
>  	close(f);
> -	f = kill(pid, SIGCONT);
> -	ASSERT_EQ(f, 0);
> +	f = waitpid(-1, &s, WNOHANG);
> +	ASSERT_NE(f, -1);
> +	ASSERT_NE(f, 0);
> +	ASSERT_NE(f, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	f = waitpid(-1, &s, 0);
> +	ASSERT_EQ(f, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
> +	f = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(f, -1);
> +	ASSERT_EQ(errno, ECHILD);
>  }
>  
> -TEST(attach)
> +/*
> + * Same test as previous, except that
> + * we try to ptrace the group leader,
> + * which is about to call execve,
> + * when the other thread is already ptraced.
> + * This exercises the code in de_thread
> + * where it is waiting inside the
> + * while (sig->notify_count) {
> + * loop.
> + */
> +TEST(attach1)
>  {
>  	int s, k, pid = fork();
>  
> @@ -52,19 +76,76 @@ TEST(attach)
>  
>  		pthread_create(&pt, NULL, thread, NULL);
>  		pthread_join(pt, NULL);
> -		execlp("sleep", "sleep", "2", NULL);
> +		execlp("false", "false", NULL);
> +		return;
>  	}
>  
>  	sleep(1);
> +	/* deadlock may happen here */
>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> -	ASSERT_EQ(errno, EAGAIN);
> -	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(k, 0);
>  	k = waitpid(-1, &s, WNOHANG);
>  	ASSERT_NE(k, -1);
>  	ASSERT_NE(k, 0);
>  	ASSERT_NE(k, pid);
>  	ASSERT_EQ(WIFEXITED(s), 1);
>  	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
> +	k = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(errno, ECHILD);
> +}
> +
> +/*
> + * Same test as previous, except that
> + * the group leader is ptraced first,
> + * but this time with PTRACE_O_TRACEEXIT,
> + * and the thread that does execve is
> + * not yet ptraced.  This exercises the
> + * code block in de_thread where the
> + * if (!thread_group_leader(tsk)) {
> + * is executed and enters a wait state.
> + */
> +static long thread2_tid;
> +static void *thread2(void *arg)
> +{
> +	thread2_tid = syscall(__NR_gettid);
> +	sleep(2);
> +	execlp("false", "false", NULL);
> +	return NULL;
> +}
> +
> +TEST(attach2)
> +{
> +	int s, k, pid = fork();
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread2, NULL);
> +		pthread_join(pt, NULL);
> +		return;
> +	}
> +
>  	sleep(1);
>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>  	ASSERT_EQ(k, 0);
> @@ -72,12 +153,46 @@ TEST(attach)
>  	ASSERT_EQ(k, pid);
>  	ASSERT_EQ(WIFSTOPPED(s), 1);
>  	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> -	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
> +	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
> +	ASSERT_EQ(k, 0);
> +	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
> +	ASSERT_NE(thread2_tid, -1);
> +	ASSERT_NE(thread2_tid, 0);
> +	ASSERT_NE(thread2_tid, pid);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	sleep(2);
> +	/* deadlock may happen here */
> +	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>  	ASSERT_EQ(k, 0);
>  	k = waitpid(-1, &s, 0);
>  	ASSERT_EQ(k, pid);
>  	ASSERT_EQ(WIFEXITED(s), 1);
> -	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
>  	k = waitpid(-1, NULL, 0);
>  	ASSERT_EQ(k, -1);
>  	ASSERT_EQ(errno, ECHILD);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v16] exec: Fix dead-lock in de_thread with ptrace_attach
       [not found]         ` <AS8P193MB1285937F9831CECAF2A9EEE2E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
  2025-08-18  6:04           ` [PATCH v15] " Jain, Ayush
@ 2025-08-18 20:53           ` Bernd Edlinger
  2025-08-19  4:36             ` Kees Cook
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
  1 sibling, 2 replies; 39+ messages in thread
From: Bernd Edlinger @ 2025-08-18 20:53 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Alexey Dobriyan, Jens Axboe, Paul Moore,
	Elena Reshetova, David Windsor, Mateusz Guzik, YueHaibing,
	Ard Biesheuvel, Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, tiozhang

This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.

The problem happens when a tracer tries to ptrace_attach
to a multi-threaded process, that does an execve in one of
the threads at the same time, without doing that in a forked
sub-process.  That means: There is a race condition, when one
or more of the threads are already ptraced, but the thread
that invoked the execve is not yet traced.  Now in this
case the execve locks the cred_guard_mutex and waits for
de_thread to complete.  But that waits for the traced
sibling threads to exit, and those have to wait for the
tracer to receive the exit signal, but the tracer cannot
call wait right now, because it is waiting for the ptrace
call to complete, and this never does not happen.
The traced process and the tracer are now in a deadlock
situation, and can only be killed by a fatal signal.

The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.

When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes.  If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.

Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending.  In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.

This means there is no API change, unlike the previous
version of this patch which was discussed here:

https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/

See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.

Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/exec.c                                 |  69 ++++++++---
 fs/proc/base.c                            |   6 +
 include/linux/cred.h                      |   1 +
 include/linux/sched/signal.h              |  18 +++
 kernel/cred.c                             |  30 ++++-
 kernel/ptrace.c                           |  31 +++++
 kernel/seccomp.c                          |  12 +-
 tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
 8 files changed, 265 insertions(+), 37 deletions(-)

v10: Changes to previous version, make the PTRACE_ATTACH
return -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessions learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.

Note: I got actually one response from an automatic checker to the v11 patch,

https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/

which is complaining about:

>> >> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct cred const *old_cred @@     got struct cred const [noderef] __rcu *real_cred @@

   417			struct linux_binprm *bprm = task->signal->exec_bprm;
   418			const struct cred *old_cred;
   419			struct mm_struct *old_mm;
   420	
   421			retval = down_write_killable(&task->signal->exec_update_lock);
   422			if (retval)
   423				goto unlock_creds;
   424			task_lock(task);
 > 425			old_cred = task->real_cred;

v12: Essentially identical to v11.

- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.

- re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.

v13: Fixed duplicated Return section in function header of
is_dumpability_changed which was reported by the kernel test robot

v14: rebased to v6.7, refreshed and retested.
And added a more detailed description of the actual bug.

v15: rebased to v6.8-rc1, addressed some review comments.
Split the test case vmaccess into vmaccess1 and vmaccess2
to improve overall test coverage.

v16: rebased to 6.17-rc2, fixed some minor merge conflicts.


Thanks
Bernd.

diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..31c6ceaa5f69 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	for_other_threads(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		sig->exec_bprm = bprm;
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 */
 	trace_sched_prepare_exec(current, bprm);
 
+	/* If the binary is not readable then enforce mm->dumpable=0 */
+	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
+
+	/*
+	 * Figure out dumpability. Note that this checking only of current
+	 * is wrong, but userspace depends on it. This should be testing
+	 * bprm->secureexec instead.
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+	    is_dumpability_changed(current_cred(), bprm->cred) ||
+	    !(uid_eq(current_euid(), current_uid()) &&
+	      gid_eq(current_egid(), current_gid())))
+		set_dumpable(bprm->mm, suid_dumpable);
+	else
+		set_dumpable(bprm->mm, SUID_DUMP_USER);
+
 	/*
 	 * Ensure all future errors are fatal.
 	 */
 	bprm->point_of_no_return = true;
 
 	/* Make this the only thread in the thread group */
-	retval = de_thread(me);
+	retval = de_thread(me, bprm);
 	if (retval)
 		goto out;
 	/* see the comment in check_unsafe_exec() */
@@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
@@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	me->sas_ss_sp = me->sas_ss_size = 0;
 
-	/*
-	 * Figure out dumpability. Note that this checking only of current
-	 * is wrong, but userspace depends on it. This should be testing
-	 * bprm->secureexec instead.
-	 */
-	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
-	    !(uid_eq(current_euid(), current_uid()) &&
-	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
-	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
-
 	perf_event_exec();
 
 	/*
@@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 62d35631ba8c..e5bcf812cee0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index a102a10f833f..fb0361911489 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
 extern int commit_creds(struct cred *);
 extern void abort_creds(struct cred *);
 extern struct cred *prepare_kernel_cred(struct task_struct *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..3c47d8b55863 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -237,9 +237,27 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
+	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
+					 * against new credentials while
+					 * de_thread is waiting for other
+					 * traced threads to terminate.
+					 * Set while de_thread is executing.
+					 * The cred_guard_mutex is released
+					 * after de_thread() has called
+					 * zap_other_threads(), therefore
+					 * a fatal signal is guaranteed to be
+					 * already pending in the unlikely
+					 * event, that
+					 * current->signal->exec_bprm happens
+					 * to be non-zero after the
+					 * cred_guard_mutex was acquired.
+					 */
+
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace)
+					 * Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 * Deprecated do not use in new code.
 					 * Use exec_update_lock instead.
 					 */
diff --git a/kernel/cred.c b/kernel/cred.c
index 9676965c0981..0b2822c762df 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
 	return false;
 }
 
+/**
+ * is_dumpability_changed - Will changing creds affect dumpability?
+ * @old: The old credentials.
+ * @new: The new credentials.
+ *
+ * If the @new credentials have no elevated privileges compared to the
+ * @old credentials, the task may remain dumpable.  Otherwise we have
+ * to mark the task as undumpable to avoid information leaks from higher
+ * to lower privilege domains.
+ *
+ * Return: True if the task will become undumpable.
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+	if (!uid_eq(old->euid, new->euid) ||
+	    !gid_eq(old->egid, new->egid) ||
+	    !uid_eq(old->fsuid, new->fsuid) ||
+	    !gid_eq(old->fsgid, new->fsgid) ||
+	    !cred_cap_issubset(old, new))
+		return true;
+
+	return false;
+}
+
 /**
  * commit_creds - Install new credentials upon the current task
  * @new: The credentials to be assigned
@@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
 	get_cred(new); /* we will require a ref for the subj creds too */
 
 	/* dumpability changes */
-	if (!uid_eq(old->euid, new->euid) ||
-	    !gid_eq(old->egid, new->egid) ||
-	    !uid_eq(old->fsuid, new->fsuid) ||
-	    !gid_eq(old->fsgid, new->fsgid) ||
-	    !cred_cap_issubset(old, new)) {
+	if (is_dumpability_changed(old, new)) {
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..deacdf133f8b 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
+#include <linux/binfmts.h>
 #include <linux/signal.h>
 #include <linux/uio.h>
 #include <linux/audit.h>
@@ -453,6 +454,27 @@ static int ptrace_attach(struct task_struct *task, long request,
 				return retval;
 		}
 
+		if (unlikely(task->in_execve)) {
+			retval = down_write_killable(&task->signal->exec_update_lock);
+			if (retval)
+				return retval;
+
+			scoped_guard (task_lock, task) {
+				struct linux_binprm *bprm = task->signal->exec_bprm;
+				const struct cred __rcu *old_cred = task->real_cred;
+				struct mm_struct *old_mm = task->mm;
+				rcu_assign_pointer(task->real_cred, bprm->cred);
+				task->mm = bprm->mm;
+				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+				rcu_assign_pointer(task->real_cred, old_cred);
+				task->mm = old_mm;
+			}
+
+			up_write(&task->signal->exec_update_lock);
+			if (retval)
+				return retval;
+		}
+
 		scoped_guard (write_lock_irq, &tasklist_lock) {
 			if (unlikely(task->exit_state))
 				return -EPERM;
@@ -488,6 +510,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -503,6 +533,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 41aa761c7738..d61fc275235a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->exec_bprm)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..5d4a65eb5a8d 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -14,6 +14,7 @@
 #include <signal.h>
 #include <unistd.h>
 #include <sys/ptrace.h>
+#include <sys/syscall.h>
 
 static void *thread(void *arg)
 {
@@ -23,7 +24,7 @@ static void *thread(void *arg)
 
 TEST(vmaccess)
 {
-	int f, pid = fork();
+	int s, f, pid = fork();
 	char mm[64];
 
 	if (!pid) {
@@ -31,19 +32,42 @@ TEST(vmaccess)
 
 		pthread_create(&pt, NULL, thread, NULL);
 		pthread_join(pt, NULL);
-		execlp("true", "true", NULL);
+		execlp("false", "false", NULL);
+		return;
 	}
 
 	sleep(1);
 	sprintf(mm, "/proc/%d/mem", pid);
+	/* deadlock did happen here */
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, &s, WNOHANG);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	f = waitpid(-1, &s, 0);
+	ASSERT_EQ(f, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
-TEST(attach)
+/*
+ * Same test as previous, except that
+ * we try to ptrace the group leader,
+ * which is about to call execve,
+ * when the other thread is already ptraced.
+ * This exercises the code in de_thread
+ * where it is waiting inside the
+ * while (sig->notify_count) {
+ * loop.
+ */
+TEST(attach1)
 {
 	int s, k, pid = fork();
 
@@ -52,19 +76,76 @@ TEST(attach)
 
 		pthread_create(&pt, NULL, thread, NULL);
 		pthread_join(pt, NULL);
-		execlp("sleep", "sleep", "2", NULL);
+		execlp("false", "false", NULL);
+		return;
 	}
 
 	sleep(1);
+	/* deadlock may happen here */
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
-	ASSERT_EQ(k, -1);
+	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	k = waitpid(-1, NULL, 0);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Same test as previous, except that
+ * the group leader is ptraced first,
+ * but this time with PTRACE_O_TRACEEXIT,
+ * and the thread that does execve is
+ * not yet ptraced.  This exercises the
+ * code block in de_thread where the
+ * if (!thread_group_leader(tsk)) {
+ * is executed and enters a wait state.
+ */
+static long thread2_tid;
+static void *thread2(void *arg)
+{
+	thread2_tid = syscall(__NR_gettid);
+	sleep(2);
+	execlp("false", "false", NULL);
+	return NULL;
+}
+
+TEST(attach2)
+{
+	int s, k, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread2, NULL);
+		pthread_join(pt, NULL);
+		return;
+	}
+
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
@@ -72,12 +153,46 @@ TEST(attach)
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
+	ASSERT_EQ(k, 0);
+	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
+	ASSERT_NE(thread2_tid, -1);
+	ASSERT_NE(thread2_tid, 0);
+	ASSERT_NE(thread2_tid, pid);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	sleep(2);
+	/* deadlock may happen here */
+	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
-	ASSERT_EQ(WEXITSTATUS(s), 0);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
 	k = waitpid(-1, NULL, 0);
 	ASSERT_EQ(k, -1);
 	ASSERT_EQ(errno, ECHILD);
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v16] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-18 20:53           ` [PATCH v16] " Bernd Edlinger
@ 2025-08-19  4:36             ` Kees Cook
  2025-08-19 18:53               ` Bernd Edlinger
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
  1 sibling, 1 reply; 39+ messages in thread
From: Kees Cook @ 2025-08-19  4:36 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Zheng Yejian, Elena Reshetova, David Windsor,
	Mateusz Guzik, Ard Biesheuvel, Joel Fernandes (Google),
	Matthew Wilcox (Oracle), Hans Liljestrand

On Mon, Aug 18, 2025 at 10:53:43PM +0200, Bernd Edlinger wrote:
> This introduces signal->exec_bprm, which is used to
> fix the case when at least one of the sibling threads
> is traced, and therefore the trace process may dead-lock
> in ptrace_attach, but de_thread will need to wait for the
> tracer to continue execution.
> 
> The problem happens when a tracer tries to ptrace_attach
> to a multi-threaded process, that does an execve in one of
> the threads at the same time, without doing that in a forked
> sub-process.  That means: There is a race condition, when one
> or more of the threads are already ptraced, but the thread
> that invoked the execve is not yet traced.  Now in this
> case the execve locks the cred_guard_mutex and waits for
> de_thread to complete.  But that waits for the traced
> sibling threads to exit, and those have to wait for the
> tracer to receive the exit signal, but the tracer cannot
> call wait right now, because it is waiting for the ptrace
> call to complete, and this never does not happen.
> The traced process and the tracer are now in a deadlock
> situation, and can only be killed by a fatal signal.
> 
> The solution is to detect this situation and allow
> ptrace_attach to continue by temporarily releasing the
> cred_guard_mutex, while de_thread() is still waiting for
> traced zombies to be eventually released by the tracer.
> In the case of the thread group leader we only have to wait
> for the thread to become a zombie, which may also need
> co-operation from the tracer due to PTRACE_O_TRACEEXIT.
> 
> When a tracer wants to ptrace_attach a task that already
> is in execve, we simply retry the ptrace_may_access
> check while temporarily installing the new credentials
> and dumpability which are about to be used after execve
> completes.  If the ptrace_attach happens on a thread that
> is a sibling-thread of the thread doing execve, it is
> sufficient to check against the old credentials, as this
> thread will be waited for, before the new credentials are
> installed.
> 
> Other threads die quickly since the cred_guard_mutex is
> released, but a deadly signal is already pending.  In case
> the mutex_lock_killable misses the signal, the non-zero
> current->signal->exec_bprm makes sure they release the
> mutex immediately and return with -ERESTARTNOINTR.
> 
> This means there is no API change, unlike the previous
> version of this patch which was discussed here:
> 
> https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/
> 
> See tools/testing/selftests/ptrace/vmaccess.c
> for a test case that gets fixed by this change.
> 
> Note that since the test case was originally designed to
> test the ptrace_attach returning an error in this situation,
> the test expectation needed to be adjusted, to allow the
> API to succeed at the first attempt.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  fs/exec.c                                 |  69 ++++++++---
>  fs/proc/base.c                            |   6 +
>  include/linux/cred.h                      |   1 +
>  include/linux/sched/signal.h              |  18 +++
>  kernel/cred.c                             |  30 ++++-
>  kernel/ptrace.c                           |  31 +++++
>  kernel/seccomp.c                          |  12 +-
>  tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
>  8 files changed, 265 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2a1e5e4042a1..31c6ceaa5f69 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct task_struct *t;
> +	bool unsafe_execve_in_progress = false;
>  
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
>  
> +	for_other_threads(tsk, t) {
> +		if (unlikely(t->ptrace)
> +		    && (t != tsk->group_leader || !t->exit_state))
> +			unsafe_execve_in_progress = true;
> +	}
> +
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		spin_unlock_irq(lock);
> +		sig->exec_bprm = bprm;
> +		mutex_unlock(&sig->cred_guard_mutex);
> +		spin_lock_irq(lock);
> +	}
> +

cred_guard_mutex has a comment about it being deprecated and shouldn't
be used in "new code"... Regardless, what cred_guard_mutex is trying to
protect is changing to credentials.

But what we want to stop is having new threads appear, which
spin_lock_irq(lock) should stop, yes?

>  	while (sig->notify_count) {
>  		__set_current_state(TASK_KILLABLE);
>  		spin_unlock_irq(lock);
> @@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	sig->group_exec_task = NULL;
>  	sig->notify_count = 0;
>  
> @@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
>  	return 0;
>  
>  killed:
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}

I think we need to document that cred_guard_mutex now protects
sig->exec_bprm.

> +
>  	/* protects against exit_notify() and __exit_signal() */
>  	read_lock(&tasklist_lock);
>  	sig->group_exec_task = NULL;
> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	 */
>  	trace_sched_prepare_exec(current, bprm);
>  
> +	/* If the binary is not readable then enforce mm->dumpable=0 */
> +	would_dump(bprm, bprm->file);
> +	if (bprm->have_execfd)
> +		would_dump(bprm, bprm->executable);
> +
> +	/*
> +	 * Figure out dumpability. Note that this checking only of current
> +	 * is wrong, but userspace depends on it. This should be testing
> +	 * bprm->secureexec instead.
> +	 */
> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> +	    !(uid_eq(current_euid(), current_uid()) &&
> +	      gid_eq(current_egid(), current_gid())))
> +		set_dumpable(bprm->mm, suid_dumpable);
> +	else
> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> +

Why is this move needed? While it's writing to bprm, I see it's reading
from "current". Is this safe to do before de_thread() has happened?
Can't a sibling thread manipulate things here? What lock am I missing?

>  	/*
>  	 * Ensure all future errors are fatal.
>  	 */
>  	bprm->point_of_no_return = true;
>  
>  	/* Make this the only thread in the thread group */
> -	retval = de_thread(me);
> +	retval = de_thread(me, bprm);
>  	if (retval)
>  		goto out;
>  	/* see the comment in check_unsafe_exec() */
> @@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> -	/* If the binary is not readable then enforce mm->dumpable=0 */
> -	would_dump(bprm, bprm->file);
> -	if (bprm->have_execfd)
> -		would_dump(bprm, bprm->executable);
> -
>  	/*
>  	 * Release all of the old mmap stuff
>  	 */
> @@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  
>  	me->sas_ss_sp = me->sas_ss_size = 0;
>  
> -	/*
> -	 * Figure out dumpability. Note that this checking only of current
> -	 * is wrong, but userspace depends on it. This should be testing
> -	 * bprm->secureexec instead.
> -	 */
> -	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> -	    !(uid_eq(current_euid(), current_uid()) &&
> -	      gid_eq(current_egid(), current_gid())))
> -		set_dumpable(current->mm, suid_dumpable);
> -	else
> -		set_dumpable(current->mm, SUID_DUMP_USER);
> -
>  	perf_event_exec();
>  
>  	/*
> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>  		return -ERESTARTNOINTR;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}
> +
>  	bprm->cred = prepare_exec_creds();
>  	if (likely(bprm->cred))
>  		return 0;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 62d35631ba8c..e5bcf812cee0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
>  	if (rv < 0)
>  		goto out_free;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		rv = -ERESTARTNOINTR;
> +		goto out_free;
> +	}
> +
>  	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
>  				  file->f_path.dentry->d_name.name, page,
>  				  count);
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index a102a10f833f..fb0361911489 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
>  extern struct cred *cred_alloc_blank(void);
>  extern struct cred *prepare_creds(void);
>  extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
>  extern int commit_creds(struct cred *);
>  extern void abort_creds(struct cred *);
>  extern struct cred *prepare_kernel_cred(struct task_struct *);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..3c47d8b55863 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -237,9 +237,27 @@ struct signal_struct {
>  	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
>  					 * killed by the oom killer */
>  
> +	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
> +					 * against new credentials while
> +					 * de_thread is waiting for other
> +					 * traced threads to terminate.
> +					 * Set while de_thread is executing.
> +					 * The cred_guard_mutex is released
> +					 * after de_thread() has called
> +					 * zap_other_threads(), therefore
> +					 * a fatal signal is guaranteed to be
> +					 * already pending in the unlikely
> +					 * event, that
> +					 * current->signal->exec_bprm happens
> +					 * to be non-zero after the
> +					 * cred_guard_mutex was acquired.
> +					 */
> +
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
>  					 * (notably. ptrace)
> +					 * Held while execve runs, except when
> +					 * a sibling thread is being traced.
>  					 * Deprecated do not use in new code.
>  					 * Use exec_update_lock instead.
>  					 */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 9676965c0981..0b2822c762df 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>  	return false;
>  }
>  
> +/**
> + * is_dumpability_changed - Will changing creds affect dumpability?
> + * @old: The old credentials.
> + * @new: The new credentials.
> + *
> + * If the @new credentials have no elevated privileges compared to the
> + * @old credentials, the task may remain dumpable.  Otherwise we have
> + * to mark the task as undumpable to avoid information leaks from higher
> + * to lower privilege domains.
> + *
> + * Return: True if the task will become undumpable.
> + */
> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)

This should be "static", I think?

> +{
> +	if (!uid_eq(old->euid, new->euid) ||
> +	    !gid_eq(old->egid, new->egid) ||
> +	    !uid_eq(old->fsuid, new->fsuid) ||
> +	    !gid_eq(old->fsgid, new->fsgid) ||
> +	    !cred_cap_issubset(old, new))
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * commit_creds - Install new credentials upon the current task
>   * @new: The credentials to be assigned
> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
>  	get_cred(new); /* we will require a ref for the subj creds too */
>  
>  	/* dumpability changes */
> -	if (!uid_eq(old->euid, new->euid) ||
> -	    !gid_eq(old->egid, new->egid) ||
> -	    !uid_eq(old->fsuid, new->fsuid) ||
> -	    !gid_eq(old->fsgid, new->fsgid) ||
> -	    !cred_cap_issubset(old, new)) {
> +	if (is_dumpability_changed(old, new)) {
>  		if (task->mm)
>  			set_dumpable(task->mm, suid_dumpable);
>  		task->pdeath_signal = 0;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..deacdf133f8b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -20,6 +20,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
> +#include <linux/binfmts.h>
>  #include <linux/signal.h>
>  #include <linux/uio.h>
>  #include <linux/audit.h>
> @@ -453,6 +454,27 @@ static int ptrace_attach(struct task_struct *task, long request,
>  				return retval;
>  		}
>  
> +		if (unlikely(task->in_execve)) {

Urgh, we're trying to get rid of this bit too.
https://lore.kernel.org/all/72da7003-a115-4162-b235-53cd3da8a90e@I-love.SAKURA.ne.jp/

Can we find a better indicator?

> +			retval = down_write_killable(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +
> +			scoped_guard (task_lock, task) {
> +				struct linux_binprm *bprm = task->signal->exec_bprm;
> +				const struct cred __rcu *old_cred = task->real_cred;
> +				struct mm_struct *old_mm = task->mm;
> +				rcu_assign_pointer(task->real_cred, bprm->cred);
> +				task->mm = bprm->mm;
> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +				rcu_assign_pointer(task->real_cred, old_cred);
> +				task->mm = old_mm;
> +			}
> +
> +			up_write(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +		}
> +
>  		scoped_guard (write_lock_irq, &tasklist_lock) {
>  			if (unlikely(task->exit_state))
>  				return -EPERM;
> @@ -488,6 +510,14 @@ static int ptrace_traceme(void)
>  {
>  	int ret = -EPERM;
>  
> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> +		return -ERESTARTNOINTR;
> +
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}

I mention this hunk below...

> +
>  	write_lock_irq(&tasklist_lock);
>  	/* Are we already being traced? */
>  	if (!current->ptrace) {
> @@ -503,6 +533,7 @@ static int ptrace_traceme(void)
>  		}
>  	}
>  	write_unlock_irq(&tasklist_lock);
> +	mutex_unlock(&current->signal->cred_guard_mutex);
>  
>  	return ret;
>  }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 41aa761c7738..d61fc275235a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
> -	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> -	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_put_fd;
> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> +		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
> +			goto out_put_fd;
> +
> +		if (unlikely(current->signal->exec_bprm)) {
> +			mutex_unlock(&current->signal->cred_guard_mutex);
> +			goto out_put_fd;
> +		}

This updated test and the hunk noted above are _almost_ identical
(interruptible vs killable). Could a helper with a descriptive name be
used here instead? (And does the former hunk need interruptible, or
could it use killable?) I'd just like to avoid having repeated dependent
logic created ("we have to get the lock AND check for exec_bprm") when
something better named than "lock_if_not_racing_exec(...)" could be
used.

> +	}
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> index 4db327b44586..5d4a65eb5a8d 100644
> --- a/tools/testing/selftests/ptrace/vmaccess.c
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -14,6 +14,7 @@
>  #include <signal.h>
>  #include <unistd.h>
>  #include <sys/ptrace.h>
> +#include <sys/syscall.h>
>  
>  static void *thread(void *arg)
>  {
> @@ -23,7 +24,7 @@ static void *thread(void *arg)
>  
>  TEST(vmaccess)
>  {
> -	int f, pid = fork();
> +	int s, f, pid = fork();
>  	char mm[64];
>  
>  	if (!pid) {
> @@ -31,19 +32,42 @@ TEST(vmaccess)
>  
>  		pthread_create(&pt, NULL, thread, NULL);
>  		pthread_join(pt, NULL);
> -		execlp("true", "true", NULL);
> +		execlp("false", "false", NULL);
> +		return;
>  	}
>  
>  	sleep(1);
>  	sprintf(mm, "/proc/%d/mem", pid);
> +	/* deadlock did happen here */
>  	f = open(mm, O_RDONLY);
>  	ASSERT_GE(f, 0);
>  	close(f);
> -	f = kill(pid, SIGCONT);
> -	ASSERT_EQ(f, 0);
> +	f = waitpid(-1, &s, WNOHANG);
> +	ASSERT_NE(f, -1);
> +	ASSERT_NE(f, 0);
> +	ASSERT_NE(f, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	f = waitpid(-1, &s, 0);
> +	ASSERT_EQ(f, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
> +	f = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(f, -1);
> +	ASSERT_EQ(errno, ECHILD);
>  }
>  
> -TEST(attach)
> +/*
> + * Same test as previous, except that
> + * we try to ptrace the group leader,
> + * which is about to call execve,
> + * when the other thread is already ptraced.
> + * This exercises the code in de_thread
> + * where it is waiting inside the
> + * while (sig->notify_count) {
> + * loop.
> + */
> +TEST(attach1)
>  {
>  	int s, k, pid = fork();
>  
> @@ -52,19 +76,76 @@ TEST(attach)
>  
>  		pthread_create(&pt, NULL, thread, NULL);
>  		pthread_join(pt, NULL);
> -		execlp("sleep", "sleep", "2", NULL);
> +		execlp("false", "false", NULL);
> +		return;
>  	}
>  
>  	sleep(1);
> +	/* deadlock may happen here */
>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> -	ASSERT_EQ(errno, EAGAIN);
> -	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(k, 0);
>  	k = waitpid(-1, &s, WNOHANG);
>  	ASSERT_NE(k, -1);
>  	ASSERT_NE(k, 0);
>  	ASSERT_NE(k, pid);
>  	ASSERT_EQ(WIFEXITED(s), 1);
>  	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
> +	k = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(errno, ECHILD);
> +}
> +
> +/*
> + * Same test as previous, except that
> + * the group leader is ptraced first,
> + * but this time with PTRACE_O_TRACEEXIT,
> + * and the thread that does execve is
> + * not yet ptraced.  This exercises the
> + * code block in de_thread where the
> + * if (!thread_group_leader(tsk)) {
> + * is executed and enters a wait state.
> + */
> +static long thread2_tid;
> +static void *thread2(void *arg)
> +{
> +	thread2_tid = syscall(__NR_gettid);
> +	sleep(2);
> +	execlp("false", "false", NULL);
> +	return NULL;
> +}
> +
> +TEST(attach2)
> +{
> +	int s, k, pid = fork();
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread2, NULL);
> +		pthread_join(pt, NULL);
> +		return;
> +	}
> +
>  	sleep(1);
>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>  	ASSERT_EQ(k, 0);
> @@ -72,12 +153,46 @@ TEST(attach)
>  	ASSERT_EQ(k, pid);
>  	ASSERT_EQ(WIFSTOPPED(s), 1);
>  	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> -	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
> +	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
> +	ASSERT_EQ(k, 0);
> +	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
> +	ASSERT_NE(thread2_tid, -1);
> +	ASSERT_NE(thread2_tid, 0);
> +	ASSERT_NE(thread2_tid, pid);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	sleep(2);
> +	/* deadlock may happen here */
> +	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>  	ASSERT_EQ(k, 0);
>  	k = waitpid(-1, &s, 0);
>  	ASSERT_EQ(k, pid);
>  	ASSERT_EQ(WIFEXITED(s), 1);
> -	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
>  	k = waitpid(-1, NULL, 0);
>  	ASSERT_EQ(k, -1);
>  	ASSERT_EQ(errno, ECHILD);

Thank you for adding tests! This will be a nice deadlock to get fixed.

-Kees

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-19  4:36             ` Kees Cook
@ 2025-08-19 18:53               ` Bernd Edlinger
  0 siblings, 0 replies; 39+ messages in thread
From: Bernd Edlinger @ 2025-08-19 18:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Elena Reshetova, David Windsor, Mateusz Guzik,
	Ard Biesheuvel, Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand

Hi Kees,

thanks for your response.

On 8/19/25 06:36, Kees Cook wrote:
> On Mon, Aug 18, 2025 at 10:53:43PM +0200, Bernd Edlinger wrote:
>> This introduces signal->exec_bprm, which is used to
>> fix the case when at least one of the sibling threads
>> is traced, and therefore the trace process may dead-lock
>> in ptrace_attach, but de_thread will need to wait for the
>> tracer to continue execution.
>>
[...]
>>  
>> -static int de_thread(struct task_struct *tsk)
>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>  {
>>  	struct signal_struct *sig = tsk->signal;
>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>  	spinlock_t *lock = &oldsighand->siglock;
>> +	struct task_struct *t;
>> +	bool unsafe_execve_in_progress = false;
>>  
>>  	if (thread_group_empty(tsk))
>>  		goto no_thread_group;
>> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>>  	if (!thread_group_leader(tsk))
>>  		sig->notify_count--;
>>  
>> +	for_other_threads(tsk, t) {
>> +		if (unlikely(t->ptrace)
>> +		    && (t != tsk->group_leader || !t->exit_state))
>> +			unsafe_execve_in_progress = true;
>> +	}
>> +
>> +	if (unlikely(unsafe_execve_in_progress)) {
>> +		spin_unlock_irq(lock);
>> +		sig->exec_bprm = bprm;
>> +		mutex_unlock(&sig->cred_guard_mutex);
>> +		spin_lock_irq(lock);
>> +	}
>> +
> 
> cred_guard_mutex has a comment about it being deprecated and shouldn't
> be used in "new code"... Regardless, what cred_guard_mutex is trying to
> protect is changing to credentials.
> 
> But what we want to stop is having new threads appear, which
> spin_lock_irq(lock) should stop, yes?
> 

Yes, but after the fatal signal is sent to all threads, releasing the
sighand->siglock is okay, since no new threads can be cloned any more,
because kernel/fork.c checks for pending fatal signals after acquiring the
mutex and before adding another thread to the list:

        spin_lock(&current->sighand->siglock);

        rv_task_fork(p);

        rseq_fork(p, clone_flags);

        /* Don't start children in a dying pid namespace */
        if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
                retval = -ENOMEM;
                goto bad_fork_core_free;
        }

        /* Let kill terminate clone/fork in the middle */
        if (fatal_signal_pending(current)) {
                retval = -EINTR;
                goto bad_fork_core_free;
        }

        /* No more failure paths after this point. */


>>  	while (sig->notify_count) {
>>  		__set_current_state(TASK_KILLABLE);
>>  		spin_unlock_irq(lock);
>> @@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
>>  		release_task(leader);
>>  	}
>>  
>> +	if (unlikely(unsafe_execve_in_progress)) {
>> +		mutex_lock(&sig->cred_guard_mutex);
>> +		sig->exec_bprm = NULL;
>> +	}
>> +
>>  	sig->group_exec_task = NULL;
>>  	sig->notify_count = 0;
>>  
>> @@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
>>  	return 0;
>>  
>>  killed:
>> +	if (unlikely(unsafe_execve_in_progress)) {
>> +		mutex_lock(&sig->cred_guard_mutex);
>> +		sig->exec_bprm = NULL;
>> +	}
> 
> I think we need to document that cred_guard_mutex now protects
> sig->exec_bprm.
> 

Maybe, I thought that it would be clear by this description:

--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -237,9 +237,27 @@ struct signal_struct {
        struct mm_struct *oom_mm;       /* recorded mm when the thread group got
                                         * killed by the oom killer */
 
+       struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
+                                        * against new credentials while
+                                        * de_thread is waiting for other
+                                        * traced threads to terminate.
+                                        * Set while de_thread is executing.
+                                        * The cred_guard_mutex is released
+                                        * after de_thread() has called
+                                        * zap_other_threads(), therefore
+                                        * a fatal signal is guaranteed to be
+                                        * already pending in the unlikely
+                                        * event, that
+                                        * current->signal->exec_bprm happens
+                                        * to be non-zero after the
+                                        * cred_guard_mutex was acquired.
+                                        */
+
        struct mutex cred_guard_mutex;  /* guard against foreign influences on
                                         * credential calculations
                                         * (notably. ptrace)
+                                        * Held while execve runs, except when
+                                        * a sibling thread is being traced.
                                         * Deprecated do not use in new code.
                                         * Use exec_update_lock instead.
                                         */


>> +
>>  	/* protects against exit_notify() and __exit_signal() */
>>  	read_lock(&tasklist_lock);
>>  	sig->group_exec_task = NULL;
>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>  	 */
>>  	trace_sched_prepare_exec(current, bprm);
>>  
>> +	/* If the binary is not readable then enforce mm->dumpable=0 */
>> +	would_dump(bprm, bprm->file);
>> +	if (bprm->have_execfd)
>> +		would_dump(bprm, bprm->executable);
>> +
>> +	/*
>> +	 * Figure out dumpability. Note that this checking only of current
>> +	 * is wrong, but userspace depends on it. This should be testing
>> +	 * bprm->secureexec instead.
>> +	 */
>> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
>> +	    !(uid_eq(current_euid(), current_uid()) &&
>> +	      gid_eq(current_egid(), current_gid())))
>> +		set_dumpable(bprm->mm, suid_dumpable);
>> +	else
>> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
>> +
> 
> Why is this move needed? While it's writing to bprm, I see it's reading
> from "current". Is this safe to do before de_thread() has happened?
> Can't a sibling thread manipulate things here? What lock am I missing?
> 

The credentials should be protected by cred_guard_mutex here, and/or
current->signal->exec_bprm in case there are ptraced threads.
 
I need this earlier because the outcome of ptrace_may_access
for the new credentials depends on the dumpability.

So when de_thread has to release the mutex, a ptrace_attach
is possible without deadlock, but the tracer's access rights must
be sufficient to access the current and the new credentials because,
the tracer will receive the stop event immediately after the
execve succeeds and the new credentials are in effect, or
if the execve fails the stop event will see the old process
using the old credentials dying of a fatal signal.

>>  	/*
>>  	 * Ensure all future errors are fatal.
>>  	 */
>>  	bprm->point_of_no_return = true;
>>  
>>  	/* Make this the only thread in the thread group */
>> -	retval = de_thread(me);
>> +	retval = de_thread(me, bprm);
>>  	if (retval)
>>  		goto out;
>>  	/* see the comment in check_unsafe_exec() */
>> @@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>>  	if (retval)
>>  		goto out;
>>  
>> -	/* If the binary is not readable then enforce mm->dumpable=0 */
>> -	would_dump(bprm, bprm->file);
>> -	if (bprm->have_execfd)
>> -		would_dump(bprm, bprm->executable);
>> -
>>  	/*
>>  	 * Release all of the old mmap stuff
>>  	 */
>> @@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>>  
>>  	me->sas_ss_sp = me->sas_ss_size = 0;
>>  
>> -	/*
>> -	 * Figure out dumpability. Note that this checking only of current
>> -	 * is wrong, but userspace depends on it. This should be testing
>> -	 * bprm->secureexec instead.
>> -	 */
>> -	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>> -	    !(uid_eq(current_euid(), current_uid()) &&
>> -	      gid_eq(current_egid(), current_gid())))
>> -		set_dumpable(current->mm, suid_dumpable);
>> -	else
>> -		set_dumpable(current->mm, SUID_DUMP_USER);
>> -
>>  	perf_event_exec();
>>  
>>  	/*
>> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>  		return -ERESTARTNOINTR;
>>  
>> +	if (unlikely(current->signal->exec_bprm)) {
>> +		mutex_unlock(&current->signal->cred_guard_mutex);
>> +		return -ERESTARTNOINTR;
>> +	}
>> +
>>  	bprm->cred = prepare_exec_creds();
>>  	if (likely(bprm->cred))
>>  		return 0;
[...]
>> diff --git a/kernel/cred.c b/kernel/cred.c
>> index 9676965c0981..0b2822c762df 100644
>> --- a/kernel/cred.c
>> +++ b/kernel/cred.c
>> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>>  	return false;
>>  }
>>  
>> +/**
>> + * is_dumpability_changed - Will changing creds affect dumpability?
>> + * @old: The old credentials.
>> + * @new: The new credentials.
>> + *
>> + * If the @new credentials have no elevated privileges compared to the
>> + * @old credentials, the task may remain dumpable.  Otherwise we have
>> + * to mark the task as undumpable to avoid information leaks from higher
>> + * to lower privilege domains.
>> + *
>> + * Return: True if the task will become undumpable.
>> + */
>> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)
> 
> This should be "static", I think?
> 

No, because I want to use this function also in begin_new_exec in the moved code,
to include the expected outcome of commit_creds before de_thread, to make sure the
result of ptrace_may_access is as if the new credentials were already committed.


>> +{
>> +	if (!uid_eq(old->euid, new->euid) ||
>> +	    !gid_eq(old->egid, new->egid) ||
>> +	    !uid_eq(old->fsuid, new->fsuid) ||
>> +	    !gid_eq(old->fsgid, new->fsgid) ||
>> +	    !cred_cap_issubset(old, new))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  /**
>>   * commit_creds - Install new credentials upon the current task
>>   * @new: The credentials to be assigned
>> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
>>  	get_cred(new); /* we will require a ref for the subj creds too */
>>  
>>  	/* dumpability changes */
>> -	if (!uid_eq(old->euid, new->euid) ||
>> -	    !gid_eq(old->egid, new->egid) ||
>> -	    !uid_eq(old->fsuid, new->fsuid) ||
>> -	    !gid_eq(old->fsgid, new->fsgid) ||
>> -	    !cred_cap_issubset(old, new)) {
>> +	if (is_dumpability_changed(old, new)) {
>>  		if (task->mm)
>>  			set_dumpable(task->mm, suid_dumpable);
>>  		task->pdeath_signal = 0;
>> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
>> index 75a84efad40f..deacdf133f8b 100644
>> --- a/kernel/ptrace.c
>> +++ b/kernel/ptrace.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/pagemap.h>
>>  #include <linux/ptrace.h>
>>  #include <linux/security.h>
>> +#include <linux/binfmts.h>
>>  #include <linux/signal.h>
>>  #include <linux/uio.h>
>>  #include <linux/audit.h>
>> @@ -453,6 +454,27 @@ static int ptrace_attach(struct task_struct *task, long request,
>>  				return retval;
>>  		}
>>  
>> +		if (unlikely(task->in_execve)) {
> 
> Urgh, we're trying to get rid of this bit too.
> https://lore.kernel.org/all/72da7003-a115-4162-b235-53cd3da8a90e@I-love.SAKURA.ne.jp/
> 
> Can we find a better indicator?
> 

Hmm, Yes, I agree.
I start to think that task->signal->exec_bprm would indeed be a better indicator:

--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -454,7 +454,7 @@ static int ptrace_attach(struct task_struct *task, long request,
                                return retval;
                }
 
-               if (unlikely(task->in_execve)) {
+               if (unlikely(task->signal->exec_bprm)) {
                        retval = down_write_killable(&task->signal->exec_update_lock);
                        if (retval)
                                return retval;

So I will add this to the next patch version.

>> +			retval = down_write_killable(&task->signal->exec_update_lock);
>> +			if (retval)
>> +				return retval;
>> +
>> +			scoped_guard (task_lock, task) {
>> +				struct linux_binprm *bprm = task->signal->exec_bprm;
>> +				const struct cred __rcu *old_cred = task->real_cred;
>> +				struct mm_struct *old_mm = task->mm;
>> +				rcu_assign_pointer(task->real_cred, bprm->cred);
>> +				task->mm = bprm->mm;
>> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>> +				rcu_assign_pointer(task->real_cred, old_cred);
>> +				task->mm = old_mm;
>> +			}
>> +
>> +			up_write(&task->signal->exec_update_lock);
>> +			if (retval)
>> +				return retval;
>> +		}
>> +
>>  		scoped_guard (write_lock_irq, &tasklist_lock) {
>>  			if (unlikely(task->exit_state))
>>  				return -EPERM;
>> @@ -488,6 +510,14 @@ static int ptrace_traceme(void)
>>  {
>>  	int ret = -EPERM;
>>  
>> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> +		return -ERESTARTNOINTR;
>> +
>> +	if (unlikely(current->signal->exec_bprm)) {
>> +		mutex_unlock(&current->signal->cred_guard_mutex);
>> +		return -ERESTARTNOINTR;
>> +	}
> 
> I mention this hunk below...
> 
>> +
>>  	write_lock_irq(&tasklist_lock);
>>  	/* Are we already being traced? */
>>  	if (!current->ptrace) {
>> @@ -503,6 +533,7 @@ static int ptrace_traceme(void)
>>  		}
>>  	}
>>  	write_unlock_irq(&tasklist_lock);
>> +	mutex_unlock(&current->signal->cred_guard_mutex);
>>  
>>  	return ret;
>>  }
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 41aa761c7738..d61fc275235a 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>>  	 * while another thread is in the middle of calling exec.
>>  	 */
>> -	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>> -	    mutex_lock_killable(&current->signal->cred_guard_mutex))
>> -		goto out_put_fd;
>> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
>> +		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
>> +			goto out_put_fd;
>> +
>> +		if (unlikely(current->signal->exec_bprm)) {
>> +			mutex_unlock(&current->signal->cred_guard_mutex);
>> +			goto out_put_fd;
>> +		}
> 
> This updated test and the hunk noted above are _almost_ identical
> (interruptible vs killable). Could a helper with a descriptive name be
> used here instead? (And does the former hunk need interruptible, or
> could it use killable?) I'd just like to avoid having repeated dependent
> logic created ("we have to get the lock AND check for exec_bprm") when
> something better named than "lock_if_not_racing_exec(...)" could be
> used.
> 

My problem here is that we have a pre-exisiting code using
mutex_lock_killable(&current->signal->cred_guard_mutex))
and other pre-existiong code using mutex_lock_interruptible I would not want
to make any changes here sine there are probably good reasons why one is
ignoring interrupts while the others do not ignore interrupts.

So currently I would not think that a helper function for the two interruptible
mutex locking instances would really make the code more easy to understand...


>> +	}
>>  
>>  	spin_lock_irq(&current->sighand->siglock);
>>  
>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>> index 4db327b44586..5d4a65eb5a8d 100644
>> --- a/tools/testing/selftests/ptrace/vmaccess.c
>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>> @@ -14,6 +14,7 @@
>>  #include <signal.h>
>>  #include <unistd.h>
>>  #include <sys/ptrace.h>
>> +#include <sys/syscall.h>
>>  
>>  static void *thread(void *arg)
>>  {
>> @@ -23,7 +24,7 @@ static void *thread(void *arg)
>>  
>>  TEST(vmaccess)
>>  {
>> -	int f, pid = fork();
>> +	int s, f, pid = fork();
>>  	char mm[64];
>>  
>>  	if (!pid) {
>> @@ -31,19 +32,42 @@ TEST(vmaccess)
>>  
>>  		pthread_create(&pt, NULL, thread, NULL);
>>  		pthread_join(pt, NULL);
>> -		execlp("true", "true", NULL);
>> +		execlp("false", "false", NULL);
>> +		return;
>>  	}
>>  
>>  	sleep(1);
>>  	sprintf(mm, "/proc/%d/mem", pid);
>> +	/* deadlock did happen here */
>>  	f = open(mm, O_RDONLY);
>>  	ASSERT_GE(f, 0);
>>  	close(f);
>> -	f = kill(pid, SIGCONT);
>> -	ASSERT_EQ(f, 0);
>> +	f = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_NE(f, -1);
>> +	ASSERT_NE(f, 0);
>> +	ASSERT_NE(f, pid);
>> +	ASSERT_EQ(WIFEXITED(s), 1);
>> +	ASSERT_EQ(WEXITSTATUS(s), 0);
>> +	f = waitpid(-1, &s, 0);
>> +	ASSERT_EQ(f, pid);
>> +	ASSERT_EQ(WIFEXITED(s), 1);
>> +	ASSERT_EQ(WEXITSTATUS(s), 1);
>> +	f = waitpid(-1, NULL, 0);
>> +	ASSERT_EQ(f, -1);
>> +	ASSERT_EQ(errno, ECHILD);
>>  }
>>  
>> -TEST(attach)
>> +/*
>> + * Same test as previous, except that
>> + * we try to ptrace the group leader,
>> + * which is about to call execve,
>> + * when the other thread is already ptraced.
>> + * This exercises the code in de_thread
>> + * where it is waiting inside the
>> + * while (sig->notify_count) {
>> + * loop.
>> + */
>> +TEST(attach1)
>>  {
>>  	int s, k, pid = fork();
>>  
>> @@ -52,19 +76,76 @@ TEST(attach)
>>  
>>  		pthread_create(&pt, NULL, thread, NULL);
>>  		pthread_join(pt, NULL);
>> -		execlp("sleep", "sleep", "2", NULL);
>> +		execlp("false", "false", NULL);
>> +		return;
>>  	}
>>  
>>  	sleep(1);
>> +	/* deadlock may happen here */
>>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>> -	ASSERT_EQ(errno, EAGAIN);
>> -	ASSERT_EQ(k, -1);
>> +	ASSERT_EQ(k, 0);
>>  	k = waitpid(-1, &s, WNOHANG);
>>  	ASSERT_NE(k, -1);
>>  	ASSERT_NE(k, 0);
>>  	ASSERT_NE(k, pid);
>>  	ASSERT_EQ(WIFEXITED(s), 1);
>>  	ASSERT_EQ(WEXITSTATUS(s), 0);
>> +	k = waitpid(-1, &s, 0);
>> +	ASSERT_EQ(k, pid);
>> +	ASSERT_EQ(WIFSTOPPED(s), 1);
>> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, 0);
>> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>> +	ASSERT_EQ(k, 0);
>> +	k = waitpid(-1, &s, 0);
>> +	ASSERT_EQ(k, pid);
>> +	ASSERT_EQ(WIFSTOPPED(s), 1);
>> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, 0);
>> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>> +	ASSERT_EQ(k, 0);
>> +	k = waitpid(-1, &s, 0);
>> +	ASSERT_EQ(k, pid);
>> +	ASSERT_EQ(WIFEXITED(s), 1);
>> +	ASSERT_EQ(WEXITSTATUS(s), 1);
>> +	k = waitpid(-1, NULL, 0);
>> +	ASSERT_EQ(k, -1);
>> +	ASSERT_EQ(errno, ECHILD);
>> +}
>> +
>> +/*
>> + * Same test as previous, except that
>> + * the group leader is ptraced first,
>> + * but this time with PTRACE_O_TRACEEXIT,
>> + * and the thread that does execve is
>> + * not yet ptraced.  This exercises the
>> + * code block in de_thread where the
>> + * if (!thread_group_leader(tsk)) {
>> + * is executed and enters a wait state.
>> + */
>> +static long thread2_tid;
>> +static void *thread2(void *arg)
>> +{
>> +	thread2_tid = syscall(__NR_gettid);
>> +	sleep(2);
>> +	execlp("false", "false", NULL);
>> +	return NULL;
>> +}
>> +
>> +TEST(attach2)
>> +{
>> +	int s, k, pid = fork();
>> +
>> +	if (!pid) {
>> +		pthread_t pt;
>> +
>> +		pthread_create(&pt, NULL, thread2, NULL);
>> +		pthread_join(pt, NULL);
>> +		return;
>> +	}
>> +
>>  	sleep(1);
>>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>  	ASSERT_EQ(k, 0);
>> @@ -72,12 +153,46 @@ TEST(attach)
>>  	ASSERT_EQ(k, pid);
>>  	ASSERT_EQ(WIFSTOPPED(s), 1);
>>  	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
>> -	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
>> +	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
>> +	ASSERT_EQ(k, 0);
>> +	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
>> +	ASSERT_NE(thread2_tid, -1);
>> +	ASSERT_NE(thread2_tid, 0);
>> +	ASSERT_NE(thread2_tid, pid);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, 0);
>> +	sleep(2);
>> +	/* deadlock may happen here */
>> +	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
>> +	ASSERT_EQ(k, 0);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, pid);
>> +	ASSERT_EQ(WIFSTOPPED(s), 1);
>> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, 0);
>> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>> +	ASSERT_EQ(k, 0);
>> +	k = waitpid(-1, &s, 0);
>> +	ASSERT_EQ(k, pid);
>> +	ASSERT_EQ(WIFSTOPPED(s), 1);
>> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, 0);
>> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>> +	ASSERT_EQ(k, 0);
>> +	k = waitpid(-1, &s, 0);
>> +	ASSERT_EQ(k, pid);
>> +	ASSERT_EQ(WIFSTOPPED(s), 1);
>> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
>> +	k = waitpid(-1, &s, WNOHANG);
>> +	ASSERT_EQ(k, 0);
>> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>>  	ASSERT_EQ(k, 0);
>>  	k = waitpid(-1, &s, 0);
>>  	ASSERT_EQ(k, pid);
>>  	ASSERT_EQ(WIFEXITED(s), 1);
>> -	ASSERT_EQ(WEXITSTATUS(s), 0);
>> +	ASSERT_EQ(WEXITSTATUS(s), 1);
>>  	k = waitpid(-1, NULL, 0);
>>  	ASSERT_EQ(k, -1);
>>  	ASSERT_EQ(errno, ECHILD);
> 
> Thank you for adding tests! This will be a nice deadlock to get fixed.
> 
> -Kees
> 

Thank you for your review, if you would like me to add some more commments and
maybe have any suggestions what to write and where, then I would be happy to add that
in the next patch version.


Thanks
Bernd.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-18 20:53           ` [PATCH v16] " Bernd Edlinger
  2025-08-19  4:36             ` Kees Cook
@ 2025-08-21 17:34             ` Bernd Edlinger
  2025-10-27  6:26               ` Bernd Edlinger
                                 ` (4 more replies)
  1 sibling, 5 replies; 39+ messages in thread
From: Bernd Edlinger @ 2025-08-21 17:34 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Alexey Dobriyan,
	Jens Axboe, Paul Moore, Elena Reshetova, David Windsor,
	Mateusz Guzik, YueHaibing, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, tiozhang, Penglei Jiang, Lorenzo Stoakes,
	Adrian Ratiu, Ingo Molnar, Peter Zijlstra (Intel),
	Cyrill Gorcunov, Eric Dumazet

This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.

The problem happens when a tracer tries to ptrace_attach
to a multi-threaded process, that does an execve in one of
the threads at the same time, without doing that in a forked
sub-process.  That means: There is a race condition, when one
or more of the threads are already ptraced, but the thread
that invoked the execve is not yet traced.  Now in this
case the execve locks the cred_guard_mutex and waits for
de_thread to complete.  But that waits for the traced
sibling threads to exit, and those have to wait for the
tracer to receive the exit signal, but the tracer cannot
call wait right now, because it is waiting for the ptrace
call to complete, and this never does not happen.
The traced process and the tracer are now in a deadlock
situation, and can only be killed by a fatal signal.

The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.

When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes.  If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.

Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending.  In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.

This means there is no API change, unlike the previous
version of this patch which was discussed here:

https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/

See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.

Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 fs/exec.c                                 |  69 ++++++++---
 fs/proc/base.c                            |   6 +
 include/linux/cred.h                      |   1 +
 include/linux/sched/signal.h              |  18 +++
 kernel/cred.c                             |  30 ++++-
 kernel/ptrace.c                           |  32 +++++
 kernel/seccomp.c                          |  12 +-
 tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
 8 files changed, 266 insertions(+), 37 deletions(-)

v10: Changes to previous version, make the PTRACE_ATTACH
return -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessions learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.

Note: I got actually one response from an automatic checker to the v11 patch,

https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/

which is complaining about:

>> >> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct cred const *old_cred @@     got struct cred const [noderef] __rcu *real_cred @@

   417			struct linux_binprm *bprm = task->signal->exec_bprm;
   418			const struct cred *old_cred;
   419			struct mm_struct *old_mm;
   420	
   421			retval = down_write_killable(&task->signal->exec_update_lock);
   422			if (retval)
   423				goto unlock_creds;
   424			task_lock(task);
 > 425			old_cred = task->real_cred;

v12: Essentially identical to v11.

- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.

- re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.

v13: Fixed duplicated Return section in function header of
is_dumpability_changed which was reported by the kernel test robot

v14: rebased to v6.7, refreshed and retested.
And added a more detailed description of the actual bug.

v15: rebased to v6.8-rc1, addressed some review comments.
Split the test case vmaccess into vmaccess1 and vmaccess2
to improve overall test coverage.

v16: rebased to 6.17-rc2, fixed some minor merge conflicts.

v17: avoid use of task->in_execve in ptrace_attach.


Thanks
Bernd.

diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..31c6ceaa5f69 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	for_other_threads(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		sig->exec_bprm = bprm;
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 */
 	trace_sched_prepare_exec(current, bprm);
 
+	/* If the binary is not readable then enforce mm->dumpable=0 */
+	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
+
+	/*
+	 * Figure out dumpability. Note that this checking only of current
+	 * is wrong, but userspace depends on it. This should be testing
+	 * bprm->secureexec instead.
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+	    is_dumpability_changed(current_cred(), bprm->cred) ||
+	    !(uid_eq(current_euid(), current_uid()) &&
+	      gid_eq(current_egid(), current_gid())))
+		set_dumpable(bprm->mm, suid_dumpable);
+	else
+		set_dumpable(bprm->mm, SUID_DUMP_USER);
+
 	/*
 	 * Ensure all future errors are fatal.
 	 */
 	bprm->point_of_no_return = true;
 
 	/* Make this the only thread in the thread group */
-	retval = de_thread(me);
+	retval = de_thread(me, bprm);
 	if (retval)
 		goto out;
 	/* see the comment in check_unsafe_exec() */
@@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
@@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	me->sas_ss_sp = me->sas_ss_size = 0;
 
-	/*
-	 * Figure out dumpability. Note that this checking only of current
-	 * is wrong, but userspace depends on it. This should be testing
-	 * bprm->secureexec instead.
-	 */
-	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
-	    !(uid_eq(current_euid(), current_uid()) &&
-	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
-	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
-
 	perf_event_exec();
 
 	/*
@@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 62d35631ba8c..e5bcf812cee0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index a102a10f833f..fb0361911489 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
 extern int commit_creds(struct cred *);
 extern void abort_creds(struct cred *);
 extern struct cred *prepare_kernel_cred(struct task_struct *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..3c47d8b55863 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -237,9 +237,27 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
+	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
+					 * against new credentials while
+					 * de_thread is waiting for other
+					 * traced threads to terminate.
+					 * Set while de_thread is executing.
+					 * The cred_guard_mutex is released
+					 * after de_thread() has called
+					 * zap_other_threads(), therefore
+					 * a fatal signal is guaranteed to be
+					 * already pending in the unlikely
+					 * event, that
+					 * current->signal->exec_bprm happens
+					 * to be non-zero after the
+					 * cred_guard_mutex was acquired.
+					 */
+
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace)
+					 * Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 * Deprecated do not use in new code.
 					 * Use exec_update_lock instead.
 					 */
diff --git a/kernel/cred.c b/kernel/cred.c
index 9676965c0981..0b2822c762df 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
 	return false;
 }
 
+/**
+ * is_dumpability_changed - Will changing creds affect dumpability?
+ * @old: The old credentials.
+ * @new: The new credentials.
+ *
+ * If the @new credentials have no elevated privileges compared to the
+ * @old credentials, the task may remain dumpable.  Otherwise we have
+ * to mark the task as undumpable to avoid information leaks from higher
+ * to lower privilege domains.
+ *
+ * Return: True if the task will become undumpable.
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+	if (!uid_eq(old->euid, new->euid) ||
+	    !gid_eq(old->egid, new->egid) ||
+	    !uid_eq(old->fsuid, new->fsuid) ||
+	    !gid_eq(old->fsgid, new->fsgid) ||
+	    !cred_cap_issubset(old, new))
+		return true;
+
+	return false;
+}
+
 /**
  * commit_creds - Install new credentials upon the current task
  * @new: The credentials to be assigned
@@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
 	get_cred(new); /* we will require a ref for the subj creds too */
 
 	/* dumpability changes */
-	if (!uid_eq(old->euid, new->euid) ||
-	    !gid_eq(old->egid, new->egid) ||
-	    !uid_eq(old->fsuid, new->fsuid) ||
-	    !gid_eq(old->fsgid, new->fsgid) ||
-	    !cred_cap_issubset(old, new)) {
+	if (is_dumpability_changed(old, new)) {
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..230298817dbf 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
+#include <linux/binfmts.h>
 #include <linux/signal.h>
 #include <linux/uio.h>
 #include <linux/audit.h>
@@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
 				return retval;
 		}
 
+		if (unlikely(task == task->signal->group_exec_task)) {
+			retval = down_write_killable(&task->signal->exec_update_lock);
+			if (retval)
+				return retval;
+
+			scoped_guard (task_lock, task) {
+				struct linux_binprm *bprm = task->signal->exec_bprm;
+				const struct cred __rcu *old_cred = task->real_cred;
+				struct mm_struct *old_mm = task->mm;
+
+				rcu_assign_pointer(task->real_cred, bprm->cred);
+				task->mm = bprm->mm;
+				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+				rcu_assign_pointer(task->real_cred, old_cred);
+				task->mm = old_mm;
+			}
+
+			up_write(&task->signal->exec_update_lock);
+			if (retval)
+				return retval;
+		}
+
 		scoped_guard (write_lock_irq, &tasklist_lock) {
 			if (unlikely(task->exit_state))
 				return -EPERM;
@@ -488,6 +511,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -503,6 +534,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 41aa761c7738..d61fc275235a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->exec_bprm)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..5d4a65eb5a8d 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -14,6 +14,7 @@
 #include <signal.h>
 #include <unistd.h>
 #include <sys/ptrace.h>
+#include <sys/syscall.h>
 
 static void *thread(void *arg)
 {
@@ -23,7 +24,7 @@ static void *thread(void *arg)
 
 TEST(vmaccess)
 {
-	int f, pid = fork();
+	int s, f, pid = fork();
 	char mm[64];
 
 	if (!pid) {
@@ -31,19 +32,42 @@ TEST(vmaccess)
 
 		pthread_create(&pt, NULL, thread, NULL);
 		pthread_join(pt, NULL);
-		execlp("true", "true", NULL);
+		execlp("false", "false", NULL);
+		return;
 	}
 
 	sleep(1);
 	sprintf(mm, "/proc/%d/mem", pid);
+	/* deadlock did happen here */
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, &s, WNOHANG);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	f = waitpid(-1, &s, 0);
+	ASSERT_EQ(f, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
-TEST(attach)
+/*
+ * Same test as previous, except that
+ * we try to ptrace the group leader,
+ * which is about to call execve,
+ * when the other thread is already ptraced.
+ * This exercises the code in de_thread
+ * where it is waiting inside the
+ * while (sig->notify_count) {
+ * loop.
+ */
+TEST(attach1)
 {
 	int s, k, pid = fork();
 
@@ -52,19 +76,76 @@ TEST(attach)
 
 		pthread_create(&pt, NULL, thread, NULL);
 		pthread_join(pt, NULL);
-		execlp("sleep", "sleep", "2", NULL);
+		execlp("false", "false", NULL);
+		return;
 	}
 
 	sleep(1);
+	/* deadlock may happen here */
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
-	ASSERT_EQ(k, -1);
+	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
+	k = waitpid(-1, NULL, 0);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Same test as previous, except that
+ * the group leader is ptraced first,
+ * but this time with PTRACE_O_TRACEEXIT,
+ * and the thread that does execve is
+ * not yet ptraced.  This exercises the
+ * code block in de_thread where the
+ * if (!thread_group_leader(tsk)) {
+ * is executed and enters a wait state.
+ */
+static long thread2_tid;
+static void *thread2(void *arg)
+{
+	thread2_tid = syscall(__NR_gettid);
+	sleep(2);
+	execlp("false", "false", NULL);
+	return NULL;
+}
+
+TEST(attach2)
+{
+	int s, k, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread2, NULL);
+		pthread_join(pt, NULL);
+		return;
+	}
+
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
@@ -72,12 +153,46 @@ TEST(attach)
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
+	ASSERT_EQ(k, 0);
+	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
+	ASSERT_NE(thread2_tid, -1);
+	ASSERT_NE(thread2_tid, 0);
+	ASSERT_NE(thread2_tid, pid);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	sleep(2);
+	/* deadlock may happen here */
+	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_EQ(k, 0);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
-	ASSERT_EQ(WEXITSTATUS(s), 0);
+	ASSERT_EQ(WEXITSTATUS(s), 1);
 	k = waitpid(-1, NULL, 0);
 	ASSERT_EQ(k, -1);
 	ASSERT_EQ(errno, ECHILD);
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
@ 2025-10-27  6:26               ` Bernd Edlinger
  2025-10-27 12:06               ` Peter Zijlstra
                                 ` (3 subsequent siblings)
  4 siblings, 0 replies; 39+ messages in thread
From: Bernd Edlinger @ 2025-10-27  6:26 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Alexey Dobriyan,
	Jens Axboe, Paul Moore, Elena Reshetova, David Windsor,
	Mateusz Guzik, YueHaibing, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, tiozhang, Penglei Jiang, Lorenzo Stoakes,
	Adrian Ratiu, Ingo Molnar, Peter Zijlstra (Intel),
	Cyrill Gorcunov, Eric Dumazet

Hi all,

This is a friendly ping, just a gentle reminder since this series has been around a while.
FYI the patch still applies cleanly to current kernel sources, compiles correctly and
tests are still passed.


Thanks
Bernd.

On 8/21/25 19:34, Bernd Edlinger wrote:
> This introduces signal->exec_bprm, which is used to
> fix the case when at least one of the sibling threads
> is traced, and therefore the trace process may dead-lock
> in ptrace_attach, but de_thread will need to wait for the
> tracer to continue execution.
> 
> The problem happens when a tracer tries to ptrace_attach
> to a multi-threaded process, that does an execve in one of
> the threads at the same time, without doing that in a forked
> sub-process.  That means: There is a race condition, when one
> or more of the threads are already ptraced, but the thread
> that invoked the execve is not yet traced.  Now in this
> case the execve locks the cred_guard_mutex and waits for
> de_thread to complete.  But that waits for the traced
> sibling threads to exit, and those have to wait for the
> tracer to receive the exit signal, but the tracer cannot
> call wait right now, because it is waiting for the ptrace
> call to complete, and this never does not happen.
> The traced process and the tracer are now in a deadlock
> situation, and can only be killed by a fatal signal.
> 
> The solution is to detect this situation and allow
> ptrace_attach to continue by temporarily releasing the
> cred_guard_mutex, while de_thread() is still waiting for
> traced zombies to be eventually released by the tracer.
> In the case of the thread group leader we only have to wait
> for the thread to become a zombie, which may also need
> co-operation from the tracer due to PTRACE_O_TRACEEXIT.
> 
> When a tracer wants to ptrace_attach a task that already
> is in execve, we simply retry the ptrace_may_access
> check while temporarily installing the new credentials
> and dumpability which are about to be used after execve
> completes.  If the ptrace_attach happens on a thread that
> is a sibling-thread of the thread doing execve, it is
> sufficient to check against the old credentials, as this
> thread will be waited for, before the new credentials are
> installed.
> 
> Other threads die quickly since the cred_guard_mutex is
> released, but a deadly signal is already pending.  In case
> the mutex_lock_killable misses the signal, the non-zero
> current->signal->exec_bprm makes sure they release the
> mutex immediately and return with -ERESTARTNOINTR.
> 
> This means there is no API change, unlike the previous
> version of this patch which was discussed here:
> 
> https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de/
> 
> See tools/testing/selftests/ptrace/vmaccess.c
> for a test case that gets fixed by this change.
> 
> Note that since the test case was originally designed to
> test the ptrace_attach returning an error in this situation,
> the test expectation needed to be adjusted, to allow the
> API to succeed at the first attempt.
> 
> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
> ---
>  fs/exec.c                                 |  69 ++++++++---
>  fs/proc/base.c                            |   6 +
>  include/linux/cred.h                      |   1 +
>  include/linux/sched/signal.h              |  18 +++
>  kernel/cred.c                             |  30 ++++-
>  kernel/ptrace.c                           |  32 +++++
>  kernel/seccomp.c                          |  12 +-
>  tools/testing/selftests/ptrace/vmaccess.c | 135 ++++++++++++++++++++--
>  8 files changed, 266 insertions(+), 37 deletions(-)
> 
> v10: Changes to previous version, make the PTRACE_ATTACH
> return -EAGAIN, instead of execve return -ERESTARTSYS.
> Added some lessions learned to the description.
> 
> v11: Check old and new credentials in PTRACE_ATTACH again without
> changing the API.
> 
> Note: I got actually one response from an automatic checker to the v11 patch,
> 
> https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
> 
> which is complaining about:
> 
>>>>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct cred const *old_cred @@     got struct cred const [noderef] __rcu *real_cred @@
> 
>    417			struct linux_binprm *bprm = task->signal->exec_bprm;
>    418			const struct cred *old_cred;
>    419			struct mm_struct *old_mm;
>    420	
>    421			retval = down_write_killable(&task->signal->exec_update_lock);
>    422			if (retval)
>    423				goto unlock_creds;
>    424			task_lock(task);
>  > 425			old_cred = task->real_cred;
> 
> v12: Essentially identical to v11.
> 
> - Fixed a minor merge conflict in linux v5.17, and fixed the
> above mentioned nit by adding __rcu to the declaration.
> 
> - re-tested the patch with all linux versions from v5.11 to v6.6
> 
> v10 was an alternative approach which did imply an API change.
> But I would prefer to avoid such an API change.
> 
> The difficult part is getting the right dumpability flags assigned
> before de_thread starts, hope you like this version.
> If not, the v10 is of course also acceptable.
> 
> v13: Fixed duplicated Return section in function header of
> is_dumpability_changed which was reported by the kernel test robot
> 
> v14: rebased to v6.7, refreshed and retested.
> And added a more detailed description of the actual bug.
> 
> v15: rebased to v6.8-rc1, addressed some review comments.
> Split the test case vmaccess into vmaccess1 and vmaccess2
> to improve overall test coverage.
> 
> v16: rebased to 6.17-rc2, fixed some minor merge conflicts.
> 
> v17: avoid use of task->in_execve in ptrace_attach.
> 
> 
> Thanks
> Bernd.
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2a1e5e4042a1..31c6ceaa5f69 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct task_struct *t;
> +	bool unsafe_execve_in_progress = false;
>  
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
>  
> +	for_other_threads(tsk, t) {
> +		if (unlikely(t->ptrace)
> +		    && (t != tsk->group_leader || !t->exit_state))
> +			unsafe_execve_in_progress = true;
> +	}
> +
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		spin_unlock_irq(lock);
> +		sig->exec_bprm = bprm;
> +		mutex_unlock(&sig->cred_guard_mutex);
> +		spin_lock_irq(lock);
> +	}
> +
>  	while (sig->notify_count) {
>  		__set_current_state(TASK_KILLABLE);
>  		spin_unlock_irq(lock);
> @@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	sig->group_exec_task = NULL;
>  	sig->notify_count = 0;
>  
> @@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
>  	return 0;
>  
>  killed:
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	/* protects against exit_notify() and __exit_signal() */
>  	read_lock(&tasklist_lock);
>  	sig->group_exec_task = NULL;
> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	 */
>  	trace_sched_prepare_exec(current, bprm);
>  
> +	/* If the binary is not readable then enforce mm->dumpable=0 */
> +	would_dump(bprm, bprm->file);
> +	if (bprm->have_execfd)
> +		would_dump(bprm, bprm->executable);
> +
> +	/*
> +	 * Figure out dumpability. Note that this checking only of current
> +	 * is wrong, but userspace depends on it. This should be testing
> +	 * bprm->secureexec instead.
> +	 */
> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> +	    !(uid_eq(current_euid(), current_uid()) &&
> +	      gid_eq(current_egid(), current_gid())))
> +		set_dumpable(bprm->mm, suid_dumpable);
> +	else
> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> +
>  	/*
>  	 * Ensure all future errors are fatal.
>  	 */
>  	bprm->point_of_no_return = true;
>  
>  	/* Make this the only thread in the thread group */
> -	retval = de_thread(me);
> +	retval = de_thread(me, bprm);
>  	if (retval)
>  		goto out;
>  	/* see the comment in check_unsafe_exec() */
> @@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> -	/* If the binary is not readable then enforce mm->dumpable=0 */
> -	would_dump(bprm, bprm->file);
> -	if (bprm->have_execfd)
> -		would_dump(bprm, bprm->executable);
> -
>  	/*
>  	 * Release all of the old mmap stuff
>  	 */
> @@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  
>  	me->sas_ss_sp = me->sas_ss_size = 0;
>  
> -	/*
> -	 * Figure out dumpability. Note that this checking only of current
> -	 * is wrong, but userspace depends on it. This should be testing
> -	 * bprm->secureexec instead.
> -	 */
> -	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> -	    !(uid_eq(current_euid(), current_uid()) &&
> -	      gid_eq(current_egid(), current_gid())))
> -		set_dumpable(current->mm, suid_dumpable);
> -	else
> -		set_dumpable(current->mm, SUID_DUMP_USER);
> -
>  	perf_event_exec();
>  
>  	/*
> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>  		return -ERESTARTNOINTR;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}
> +
>  	bprm->cred = prepare_exec_creds();
>  	if (likely(bprm->cred))
>  		return 0;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 62d35631ba8c..e5bcf812cee0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
>  	if (rv < 0)
>  		goto out_free;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		rv = -ERESTARTNOINTR;
> +		goto out_free;
> +	}
> +
>  	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
>  				  file->f_path.dentry->d_name.name, page,
>  				  count);
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index a102a10f833f..fb0361911489 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
>  extern struct cred *cred_alloc_blank(void);
>  extern struct cred *prepare_creds(void);
>  extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
>  extern int commit_creds(struct cred *);
>  extern void abort_creds(struct cred *);
>  extern struct cred *prepare_kernel_cred(struct task_struct *);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..3c47d8b55863 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -237,9 +237,27 @@ struct signal_struct {
>  	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
>  					 * killed by the oom killer */
>  
> +	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
> +					 * against new credentials while
> +					 * de_thread is waiting for other
> +					 * traced threads to terminate.
> +					 * Set while de_thread is executing.
> +					 * The cred_guard_mutex is released
> +					 * after de_thread() has called
> +					 * zap_other_threads(), therefore
> +					 * a fatal signal is guaranteed to be
> +					 * already pending in the unlikely
> +					 * event, that
> +					 * current->signal->exec_bprm happens
> +					 * to be non-zero after the
> +					 * cred_guard_mutex was acquired.
> +					 */
> +
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
>  					 * (notably. ptrace)
> +					 * Held while execve runs, except when
> +					 * a sibling thread is being traced.
>  					 * Deprecated do not use in new code.
>  					 * Use exec_update_lock instead.
>  					 */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 9676965c0981..0b2822c762df 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>  	return false;
>  }
>  
> +/**
> + * is_dumpability_changed - Will changing creds affect dumpability?
> + * @old: The old credentials.
> + * @new: The new credentials.
> + *
> + * If the @new credentials have no elevated privileges compared to the
> + * @old credentials, the task may remain dumpable.  Otherwise we have
> + * to mark the task as undumpable to avoid information leaks from higher
> + * to lower privilege domains.
> + *
> + * Return: True if the task will become undumpable.
> + */
> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)
> +{
> +	if (!uid_eq(old->euid, new->euid) ||
> +	    !gid_eq(old->egid, new->egid) ||
> +	    !uid_eq(old->fsuid, new->fsuid) ||
> +	    !gid_eq(old->fsgid, new->fsgid) ||
> +	    !cred_cap_issubset(old, new))
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * commit_creds - Install new credentials upon the current task
>   * @new: The credentials to be assigned
> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
>  	get_cred(new); /* we will require a ref for the subj creds too */
>  
>  	/* dumpability changes */
> -	if (!uid_eq(old->euid, new->euid) ||
> -	    !gid_eq(old->egid, new->egid) ||
> -	    !uid_eq(old->fsuid, new->fsuid) ||
> -	    !gid_eq(old->fsgid, new->fsgid) ||
> -	    !cred_cap_issubset(old, new)) {
> +	if (is_dumpability_changed(old, new)) {
>  		if (task->mm)
>  			set_dumpable(task->mm, suid_dumpable);
>  		task->pdeath_signal = 0;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..230298817dbf 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -20,6 +20,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
> +#include <linux/binfmts.h>
>  #include <linux/signal.h>
>  #include <linux/uio.h>
>  #include <linux/audit.h>
> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>  				return retval;
>  		}
>  
> +		if (unlikely(task == task->signal->group_exec_task)) {
> +			retval = down_write_killable(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +
> +			scoped_guard (task_lock, task) {
> +				struct linux_binprm *bprm = task->signal->exec_bprm;
> +				const struct cred __rcu *old_cred = task->real_cred;
> +				struct mm_struct *old_mm = task->mm;
> +
> +				rcu_assign_pointer(task->real_cred, bprm->cred);
> +				task->mm = bprm->mm;
> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +				rcu_assign_pointer(task->real_cred, old_cred);
> +				task->mm = old_mm;
> +			}
> +
> +			up_write(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +		}
> +
>  		scoped_guard (write_lock_irq, &tasklist_lock) {
>  			if (unlikely(task->exit_state))
>  				return -EPERM;
> @@ -488,6 +511,14 @@ static int ptrace_traceme(void)
>  {
>  	int ret = -EPERM;
>  
> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> +		return -ERESTARTNOINTR;
> +
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}
> +
>  	write_lock_irq(&tasklist_lock);
>  	/* Are we already being traced? */
>  	if (!current->ptrace) {
> @@ -503,6 +534,7 @@ static int ptrace_traceme(void)
>  		}
>  	}
>  	write_unlock_irq(&tasklist_lock);
> +	mutex_unlock(&current->signal->cred_guard_mutex);
>  
>  	return ret;
>  }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 41aa761c7738..d61fc275235a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
> -	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> -	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_put_fd;
> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> +		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
> +			goto out_put_fd;
> +
> +		if (unlikely(current->signal->exec_bprm)) {
> +			mutex_unlock(&current->signal->cred_guard_mutex);
> +			goto out_put_fd;
> +		}
> +	}
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> index 4db327b44586..5d4a65eb5a8d 100644
> --- a/tools/testing/selftests/ptrace/vmaccess.c
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -14,6 +14,7 @@
>  #include <signal.h>
>  #include <unistd.h>
>  #include <sys/ptrace.h>
> +#include <sys/syscall.h>
>  
>  static void *thread(void *arg)
>  {
> @@ -23,7 +24,7 @@ static void *thread(void *arg)
>  
>  TEST(vmaccess)
>  {
> -	int f, pid = fork();
> +	int s, f, pid = fork();
>  	char mm[64];
>  
>  	if (!pid) {
> @@ -31,19 +32,42 @@ TEST(vmaccess)
>  
>  		pthread_create(&pt, NULL, thread, NULL);
>  		pthread_join(pt, NULL);
> -		execlp("true", "true", NULL);
> +		execlp("false", "false", NULL);
> +		return;
>  	}
>  
>  	sleep(1);
>  	sprintf(mm, "/proc/%d/mem", pid);
> +	/* deadlock did happen here */
>  	f = open(mm, O_RDONLY);
>  	ASSERT_GE(f, 0);
>  	close(f);
> -	f = kill(pid, SIGCONT);
> -	ASSERT_EQ(f, 0);
> +	f = waitpid(-1, &s, WNOHANG);
> +	ASSERT_NE(f, -1);
> +	ASSERT_NE(f, 0);
> +	ASSERT_NE(f, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	f = waitpid(-1, &s, 0);
> +	ASSERT_EQ(f, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
> +	f = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(f, -1);
> +	ASSERT_EQ(errno, ECHILD);
>  }
>  
> -TEST(attach)
> +/*
> + * Same test as previous, except that
> + * we try to ptrace the group leader,
> + * which is about to call execve,
> + * when the other thread is already ptraced.
> + * This exercises the code in de_thread
> + * where it is waiting inside the
> + * while (sig->notify_count) {
> + * loop.
> + */
> +TEST(attach1)
>  {
>  	int s, k, pid = fork();
>  
> @@ -52,19 +76,76 @@ TEST(attach)
>  
>  		pthread_create(&pt, NULL, thread, NULL);
>  		pthread_join(pt, NULL);
> -		execlp("sleep", "sleep", "2", NULL);
> +		execlp("false", "false", NULL);
> +		return;
>  	}
>  
>  	sleep(1);
> +	/* deadlock may happen here */
>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> -	ASSERT_EQ(errno, EAGAIN);
> -	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(k, 0);
>  	k = waitpid(-1, &s, WNOHANG);
>  	ASSERT_NE(k, -1);
>  	ASSERT_NE(k, 0);
>  	ASSERT_NE(k, pid);
>  	ASSERT_EQ(WIFEXITED(s), 1);
>  	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFEXITED(s), 1);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
> +	k = waitpid(-1, NULL, 0);
> +	ASSERT_EQ(k, -1);
> +	ASSERT_EQ(errno, ECHILD);
> +}
> +
> +/*
> + * Same test as previous, except that
> + * the group leader is ptraced first,
> + * but this time with PTRACE_O_TRACEEXIT,
> + * and the thread that does execve is
> + * not yet ptraced.  This exercises the
> + * code block in de_thread where the
> + * if (!thread_group_leader(tsk)) {
> + * is executed and enters a wait state.
> + */
> +static long thread2_tid;
> +static void *thread2(void *arg)
> +{
> +	thread2_tid = syscall(__NR_gettid);
> +	sleep(2);
> +	execlp("false", "false", NULL);
> +	return NULL;
> +}
> +
> +TEST(attach2)
> +{
> +	int s, k, pid = fork();
> +
> +	if (!pid) {
> +		pthread_t pt;
> +
> +		pthread_create(&pt, NULL, thread2, NULL);
> +		pthread_join(pt, NULL);
> +		return;
> +	}
> +
>  	sleep(1);
>  	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>  	ASSERT_EQ(k, 0);
> @@ -72,12 +153,46 @@ TEST(attach)
>  	ASSERT_EQ(k, pid);
>  	ASSERT_EQ(WIFSTOPPED(s), 1);
>  	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> -	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
> +	k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
> +	ASSERT_EQ(k, 0);
> +	thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
> +	ASSERT_NE(thread2_tid, -1);
> +	ASSERT_NE(thread2_tid, 0);
> +	ASSERT_NE(thread2_tid, pid);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	sleep(2);
> +	/* deadlock may happen here */
> +	k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
> +	ASSERT_EQ(k, 0);
> +	k = waitpid(-1, &s, 0);
> +	ASSERT_EQ(k, pid);
> +	ASSERT_EQ(WIFSTOPPED(s), 1);
> +	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> +	k = waitpid(-1, &s, WNOHANG);
> +	ASSERT_EQ(k, 0);
> +	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
>  	ASSERT_EQ(k, 0);
>  	k = waitpid(-1, &s, 0);
>  	ASSERT_EQ(k, pid);
>  	ASSERT_EQ(WIFEXITED(s), 1);
> -	ASSERT_EQ(WEXITSTATUS(s), 0);
> +	ASSERT_EQ(WEXITSTATUS(s), 1);
>  	k = waitpid(-1, NULL, 0);
>  	ASSERT_EQ(k, -1);
>  	ASSERT_EQ(errno, ECHILD);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
  2025-10-27  6:26               ` Bernd Edlinger
@ 2025-10-27 12:06               ` Peter Zijlstra
  2025-11-02 16:17               ` Oleg Nesterov
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2025-10-27 12:06 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Christian Brauner, Andrew Morton,
	Michal Hocko, Serge Hallyn, James Morris, Randy Dunlap,
	Suren Baghdasaryan, Yafang Shao, Helge Deller, Eric W. Biederman,
	Adrian Reber, Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Cyrill Gorcunov, Eric Dumazet

On Thu, Aug 21, 2025 at 07:34:58PM +0200, Bernd Edlinger wrote:

> The solution is to detect this situation and allow
> ptrace_attach to continue by temporarily releasing the
> cred_guard_mutex, while de_thread() is still waiting for
> traced zombies to be eventually released by the tracer.
> In the case of the thread group leader we only have to wait
> for the thread to become a zombie, which may also need
> co-operation from the tracer due to PTRACE_O_TRACEEXIT.
> 
> When a tracer wants to ptrace_attach a task that already
> is in execve, we simply retry the ptrace_may_access
> check while temporarily installing the new credentials
> and dumpability which are about to be used after execve
> completes.  If the ptrace_attach happens on a thread that
> is a sibling-thread of the thread doing execve, it is
> sufficient to check against the old credentials, as this
> thread will be waited for, before the new credentials are
> installed.
> 
> Other threads die quickly since the cred_guard_mutex is
> released, but a deadly signal is already pending.  In case
> the mutex_lock_killable misses the signal, the non-zero
> current->signal->exec_bprm makes sure they release the
> mutex immediately and return with -ERESTARTNOINTR.



> diff --git a/fs/exec.c b/fs/exec.c
> index 2a1e5e4042a1..31c6ceaa5f69 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -905,11 +905,13 @@ static int exec_mmap(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct task_struct *t;
> +	bool unsafe_execve_in_progress = false;
>  
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
>  
> +	for_other_threads(tsk, t) {
> +		if (unlikely(t->ptrace)
> +		    && (t != tsk->group_leader || !t->exit_state))

&& goes at the end of the previous line

> +			unsafe_execve_in_progress = true;
> +	}
> +
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		spin_unlock_irq(lock);
> +		sig->exec_bprm = bprm;
> +		mutex_unlock(&sig->cred_guard_mutex);
> +		spin_lock_irq(lock);

I'm not clear why we need to drop and re-acquire siglock here.

And I would like a very large comment here explaining why it is safe to
drop cred_guard_mutex here.

> +	}
> +
>  	while (sig->notify_count) {
>  		__set_current_state(TASK_KILLABLE);
>  		spin_unlock_irq(lock);
> @@ -1021,6 +1036,11 @@ static int de_thread(struct task_struct *tsk)
>  		release_task(leader);
>  	}
>  
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	sig->group_exec_task = NULL;
>  	sig->notify_count = 0;
>  
> @@ -1032,6 +1052,11 @@ static int de_thread(struct task_struct *tsk)
>  	return 0;
>  
>  killed:
> +	if (unlikely(unsafe_execve_in_progress)) {
> +		mutex_lock(&sig->cred_guard_mutex);
> +		sig->exec_bprm = NULL;
> +	}
> +
>  	/* protects against exit_notify() and __exit_signal() */
>  	read_lock(&tasklist_lock);
>  	sig->group_exec_task = NULL;
> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	 */
>  	trace_sched_prepare_exec(current, bprm);
>  
> +	/* If the binary is not readable then enforce mm->dumpable=0 */
> +	would_dump(bprm, bprm->file);
> +	if (bprm->have_execfd)
> +		would_dump(bprm, bprm->executable);
> +
> +	/*
> +	 * Figure out dumpability. Note that this checking only of current
> +	 * is wrong, but userspace depends on it. This should be testing
> +	 * bprm->secureexec instead.
> +	 */
> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> +	    !(uid_eq(current_euid(), current_uid()) &&
> +	      gid_eq(current_egid(), current_gid())))
> +		set_dumpable(bprm->mm, suid_dumpable);
> +	else
> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> +

I feel like moving this dumpable stuff around could be a separate patch.
Which can explain how that is correct and why it is needed and all that.

>  	/*
>  	 * Ensure all future errors are fatal.
>  	 */
>  	bprm->point_of_no_return = true;
>  
>  	/* Make this the only thread in the thread group */
> -	retval = de_thread(me);
> +	retval = de_thread(me, bprm);
>  	if (retval)
>  		goto out;
>  	/* see the comment in check_unsafe_exec() */
> @@ -1144,11 +1187,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> -	/* If the binary is not readable then enforce mm->dumpable=0 */
> -	would_dump(bprm, bprm->file);
> -	if (bprm->have_execfd)
> -		would_dump(bprm, bprm->executable);
> -
>  	/*
>  	 * Release all of the old mmap stuff
>  	 */
> @@ -1210,18 +1248,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  
>  	me->sas_ss_sp = me->sas_ss_size = 0;
>  
> -	/*
> -	 * Figure out dumpability. Note that this checking only of current
> -	 * is wrong, but userspace depends on it. This should be testing
> -	 * bprm->secureexec instead.
> -	 */
> -	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> -	    !(uid_eq(current_euid(), current_uid()) &&
> -	      gid_eq(current_egid(), current_gid())))
> -		set_dumpable(current->mm, suid_dumpable);
> -	else
> -		set_dumpable(current->mm, SUID_DUMP_USER);
> -
>  	perf_event_exec();
>  
>  	/*
> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>  		return -ERESTARTNOINTR;
>  
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}

#1

> +
>  	bprm->cred = prepare_exec_creds();
>  	if (likely(bprm->cred))
>  		return 0;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 62d35631ba8c..e5bcf812cee0 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2838,6 +2838,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
>  	if (rv < 0)
>  		goto out_free;
>  

Comment explaining why this needs checking goes here.

> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		rv = -ERESTARTNOINTR;
> +		goto out_free;
> +	}
> +
>  	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
>  				  file->f_path.dentry->d_name.name, page,
>  				  count);
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index a102a10f833f..fb0361911489 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -153,6 +153,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
>  extern struct cred *cred_alloc_blank(void);
>  extern struct cred *prepare_creds(void);
>  extern struct cred *prepare_exec_creds(void);
> +extern bool is_dumpability_changed(const struct cred *, const struct cred *);
>  extern int commit_creds(struct cred *);
>  extern void abort_creds(struct cred *);
>  extern struct cred *prepare_kernel_cred(struct task_struct *);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..3c47d8b55863 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -237,9 +237,27 @@ struct signal_struct {
>  	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
>  					 * killed by the oom killer */
>  
> +	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
> +					 * against new credentials while
> +					 * de_thread is waiting for other
> +					 * traced threads to terminate.
> +					 * Set while de_thread is executing.
> +					 * The cred_guard_mutex is released
> +					 * after de_thread() has called
> +					 * zap_other_threads(), therefore
> +					 * a fatal signal is guaranteed to be
> +					 * already pending in the unlikely
> +					 * event, that
> +					 * current->signal->exec_bprm happens
> +					 * to be non-zero after the
> +					 * cred_guard_mutex was acquired.
> +					 */
> +
>  	struct mutex cred_guard_mutex;	/* guard against foreign influences on
>  					 * credential calculations
>  					 * (notably. ptrace)
> +					 * Held while execve runs, except when
> +					 * a sibling thread is being traced.
>  					 * Deprecated do not use in new code.
>  					 * Use exec_update_lock instead.
>  					 */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 9676965c0981..0b2822c762df 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -375,6 +375,30 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
>  	return false;
>  }
>  
> +/**
> + * is_dumpability_changed - Will changing creds affect dumpability?
> + * @old: The old credentials.
> + * @new: The new credentials.
> + *
> + * If the @new credentials have no elevated privileges compared to the
> + * @old credentials, the task may remain dumpable.  Otherwise we have
> + * to mark the task as undumpable to avoid information leaks from higher
> + * to lower privilege domains.
> + *
> + * Return: True if the task will become undumpable.
> + */
> +bool is_dumpability_changed(const struct cred *old, const struct cred *new)
> +{
> +	if (!uid_eq(old->euid, new->euid) ||
> +	    !gid_eq(old->egid, new->egid) ||
> +	    !uid_eq(old->fsuid, new->fsuid) ||
> +	    !gid_eq(old->fsgid, new->fsgid) ||
> +	    !cred_cap_issubset(old, new))
> +		return true;
> +
> +	return false;
> +}
> +
>  /**
>   * commit_creds - Install new credentials upon the current task
>   * @new: The credentials to be assigned
> @@ -403,11 +427,7 @@ int commit_creds(struct cred *new)
>  	get_cred(new); /* we will require a ref for the subj creds too */
>  
>  	/* dumpability changes */
> -	if (!uid_eq(old->euid, new->euid) ||
> -	    !gid_eq(old->egid, new->egid) ||
> -	    !uid_eq(old->fsuid, new->fsuid) ||
> -	    !gid_eq(old->fsgid, new->fsgid) ||
> -	    !cred_cap_issubset(old, new)) {
> +	if (is_dumpability_changed(old, new)) {
>  		if (task->mm)
>  			set_dumpable(task->mm, suid_dumpable);
>  		task->pdeath_signal = 0;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..230298817dbf 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -20,6 +20,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
> +#include <linux/binfmts.h>
>  #include <linux/signal.h>
>  #include <linux/uio.h>
>  #include <linux/audit.h>
> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>  				return retval;
>  		}
>  
> +		if (unlikely(task == task->signal->group_exec_task)) {
> +			retval = down_write_killable(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;

This could be written like:

			ACQUIRE(rwsem_write_kill, guard)(&task->signal->exec_update_lock);
			retval = ACQUIRE_ERR(rwsem_write_kill, guard);
			if (retval)
				return retval;

> +
> +			scoped_guard (task_lock, task) {
> +				struct linux_binprm *bprm = task->signal->exec_bprm;
> +				const struct cred __rcu *old_cred = task->real_cred;
> +				struct mm_struct *old_mm = task->mm;
> +
> +				rcu_assign_pointer(task->real_cred, bprm->cred);
> +				task->mm = bprm->mm;
> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +				rcu_assign_pointer(task->real_cred, old_cred);
> +				task->mm = old_mm;
> +			}
> +
> +			up_write(&task->signal->exec_update_lock);

And then this goes away ^

> +			if (retval)
> +				return retval;
> +		}
> +
>  		scoped_guard (write_lock_irq, &tasklist_lock) {
>  			if (unlikely(task->exit_state))
>  				return -EPERM;
> @@ -488,6 +511,14 @@ static int ptrace_traceme(void)
>  {
>  	int ret = -EPERM;
>  

This needs comments.

> +	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> +		return -ERESTARTNOINTR;
> +
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}

#2

> +
>  	write_lock_irq(&tasklist_lock);
>  	/* Are we already being traced? */
>  	if (!current->ptrace) {
> @@ -503,6 +534,7 @@ static int ptrace_traceme(void)
>  		}
>  	}
>  	write_unlock_irq(&tasklist_lock);
> +	mutex_unlock(&current->signal->cred_guard_mutex);
>  
>  	return ret;
>  }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 41aa761c7738..d61fc275235a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1994,9 +1994,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
> -	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> -	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_put_fd;
> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> +		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
> +			goto out_put_fd;
> +
> +		if (unlikely(current->signal->exec_bprm)) {
> +			mutex_unlock(&current->signal->cred_guard_mutex);
> +			goto out_put_fd;
> +		}

#3, and after typing this same pattern 3 times, you didn't think it
needed a helper function ?

> +	}
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
  2025-10-27  6:26               ` Bernd Edlinger
  2025-10-27 12:06               ` Peter Zijlstra
@ 2025-11-02 16:17               ` Oleg Nesterov
  2025-11-05 14:32               ` Oleg Nesterov
  2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
  4 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-02 16:17 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 08/21, Bernd Edlinger wrote:
>
> v16: rebased to 6.17-rc2, fixed some minor merge conflicts.
>
> v17: avoid use of task->in_execve in ptrace_attach.

So I guess this version doesn't really differ from v14 I tried to review...
(yes, iirc my review wasn't really good, sorry).

Perhaps I am wrong, but I still think we need another approach. de_thread()
should not wait until all sub-threads are reaped, it should drop cred_guard_mutex
earlier.

I mean, something like

	[PATCH V2 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
	https://lore.kernel.org/lkml/20170213180454.GA2858@redhat.com/

which is hopelessly outdated now.

Again, perhaps I am wrong. I'll try to take another look next week.

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
                                 ` (2 preceding siblings ...)
  2025-11-02 16:17               ` Oleg Nesterov
@ 2025-11-05 14:32               ` Oleg Nesterov
  2025-11-11  9:21                 ` Christian Brauner
  2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
  4 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-05 14:32 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

I am still thinking about another approach, will write another email.
But let me take a closer look at your patch.

First of all, can you split it? See below.

On 08/21, Bernd Edlinger wrote:
>
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>  {
>  	struct signal_struct *sig = tsk->signal;
>  	struct sighand_struct *oldsighand = tsk->sighand;
>  	spinlock_t *lock = &oldsighand->siglock;
> +	struct task_struct *t;
> +	bool unsafe_execve_in_progress = false;
>
>  	if (thread_group_empty(tsk))
>  		goto no_thread_group;
> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>  	if (!thread_group_leader(tsk))
>  		sig->notify_count--;
>
> +	for_other_threads(tsk, t) {
> +		if (unlikely(t->ptrace)
> +		    && (t != tsk->group_leader || !t->exit_state))
> +			unsafe_execve_in_progress = true;

you can add "break" into the "if ()" block...

But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
unconditionally.

If you really think it makes sense, please make another patch with the
changelog.

I'd certainly prefer to avoid this boolean at least for the start. If nothing
else to catch the potential problems earlier.

> +	if (unlikely(unsafe_execve_in_progress)) {
> +		spin_unlock_irq(lock);
> +		sig->exec_bprm = bprm;
> +		mutex_unlock(&sig->cred_guard_mutex);
> +		spin_lock_irq(lock);

I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...

> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	 */
>  	trace_sched_prepare_exec(current, bprm);
>
> +	/* If the binary is not readable then enforce mm->dumpable=0 */
> +	would_dump(bprm, bprm->file);
> +	if (bprm->have_execfd)
> +		would_dump(bprm, bprm->executable);
> +
> +	/*
> +	 * Figure out dumpability. Note that this checking only of current
> +	 * is wrong, but userspace depends on it. This should be testing
> +	 * bprm->secureexec instead.
> +	 */
> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> +	    !(uid_eq(current_euid(), current_uid()) &&
> +	      gid_eq(current_egid(), current_gid())))
> +		set_dumpable(bprm->mm, suid_dumpable);
> +	else
> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> +

OK, we need to do this before de_thread() drops cred_guard_mutex.
But imo this too should be done in a separate patch, the changelog should
explain this change.

> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>  		return -ERESTARTNOINTR;
>
> +	if (unlikely(current->signal->exec_bprm)) {
> +		mutex_unlock(&current->signal->cred_guard_mutex);
> +		return -ERESTARTNOINTR;
> +	}

OK, if signal->exec_bprm != NULL, then current is already killed. But
proc_pid_attr_write() and ptrace_traceme() do the same. So how about
something like

	int lock_current_cgm(void)
	{
		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
			return -ERESTARTNOINTR;

		if (!current->signal->group_exec_task)
			return 0;

		WARN_ON(!fatal_signal_pending(current));
		mutex_unlock(&current->signal->cred_guard_mutex);
		return -ERESTARTNOINTR;
	}

?

Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
come in a separate patch too, but I won't insist.

> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>  				return retval;
>  		}
>
> +		if (unlikely(task == task->signal->group_exec_task)) {
> +			retval = down_write_killable(&task->signal->exec_update_lock);
> +			if (retval)
> +				return retval;
> +
> +			scoped_guard (task_lock, task) {
> +				struct linux_binprm *bprm = task->signal->exec_bprm;
> +				const struct cred __rcu *old_cred = task->real_cred;
> +				struct mm_struct *old_mm = task->mm;
> +
> +				rcu_assign_pointer(task->real_cred, bprm->cred);
> +				task->mm = bprm->mm;
> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +				rcu_assign_pointer(task->real_cred, old_cred);
> +				task->mm = old_mm;
> +			}

This is the most problematic change which I can't review...

Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
looks dangerous to me.

Say, current_is_single_threaded() called by another CLONE_VM process can
miss group_exec_task and falsely return true. Probably not that bad, in
this case old_mm should go away soon, but still...

And I don't know if this can fool the users of task_cred_xxx/__task_cred
somehow.

Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
ptrace the execing task after that? I have no idea what the security hooks
can do...

Again, can't review this part.

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach()
  2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
                                 ` (3 preceding siblings ...)
  2025-11-05 14:32               ` Oleg Nesterov
@ 2025-11-09 17:14               ` Oleg Nesterov
  2025-11-09 17:14                 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
                                   ` (3 more replies)
  4 siblings, 4 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:14 UTC (permalink / raw)
  To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

Not for inclusion yet. 2/2 is untested, incomplete, possibly buggy.

But could you review at least the intent? Do you see any problem with
this approach?

This problem is very, very old. It seems that nobody can suggest a
simple/clean fix...

Oleg.
---

 fs/binfmt_elf.c         |   4 +-
 fs/binfmt_elf_fdpic.c   |   4 +-
 fs/binfmt_flat.c        |   4 +-
 fs/exec.c               | 142 +++++++++++++++++++++++-------------------------
 include/linux/binfmts.h |   2 +-
 kernel/exit.c           |   9 +--
 kernel/signal.c         |   6 +-
 7 files changed, 87 insertions(+), 84 deletions(-)



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH 1/3] exec: make setup_new_exec() return int
  2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
@ 2025-11-09 17:14                 ` Oleg Nesterov
  2025-11-09 17:15                 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:14 UTC (permalink / raw)
  To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

Preparation. After the next change setup_new_exec() can fail.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/binfmt_elf.c         | 4 +++-
 fs/binfmt_elf_fdpic.c   | 4 +++-
 fs/binfmt_flat.c        | 4 +++-
 fs/exec.c               | 4 +++-
 include/linux/binfmts.h | 2 +-
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e4653bb99946..8a38689ae4b0 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1021,7 +1021,9 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (!(current->personality & ADDR_NO_RANDOMIZE) && snapshot_randomize_va_space)
 		current->flags |= PF_RANDOMIZE;
 
-	setup_new_exec(bprm);
+	retval = setup_new_exec(bprm);
+	if (retval)
+		goto out_free_dentry;
 
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 48fd2de3bca0..7ad98b8132fc 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -351,7 +351,9 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 	if (elf_read_implies_exec(&exec_params.hdr, executable_stack))
 		current->personality |= READ_IMPLIES_EXEC;
 
-	setup_new_exec(bprm);
+	retval = setup_new_exec(bprm);
+	if (retval)
+		goto error;
 
 	set_binfmt(&elf_fdpic_format);
 
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b5b5ca1a44f7..78e9ca128ea7 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -512,7 +512,9 @@ static int load_flat_file(struct linux_binprm *bprm,
 
 	/* OK, This is the point of no return */
 	set_personality(PER_LINUX_32BIT);
-	setup_new_exec(bprm);
+	ret = setup_new_exec(bprm);
+	if (ret)
+		goto err;
 
 	/*
 	 * calculate the extra space we need to map in
diff --git a/fs/exec.c b/fs/exec.c
index 6b70c6726d31..136a7ab5d91c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1321,7 +1321,7 @@ void would_dump(struct linux_binprm *bprm, struct file *file)
 }
 EXPORT_SYMBOL(would_dump);
 
-void setup_new_exec(struct linux_binprm * bprm)
+int setup_new_exec(struct linux_binprm * bprm)
 {
 	/* Setup things that can depend upon the personality */
 	struct task_struct *me = current;
@@ -1337,6 +1337,8 @@ void setup_new_exec(struct linux_binprm * bprm)
 	me->mm->task_size = TASK_SIZE;
 	up_write(&me->signal->exec_update_lock);
 	mutex_unlock(&me->signal->cred_guard_mutex);
+
+	return 0;
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 65abd5ab8836..678b7525ac5a 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -123,7 +123,7 @@ extern void unregister_binfmt(struct linux_binfmt *);
 
 extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int begin_new_exec(struct linux_binprm * bprm);
-extern void setup_new_exec(struct linux_binprm * bprm);
+extern int setup_new_exec(struct linux_binprm * bprm);
 extern void finalize_exec(struct linux_binprm *bprm);
 extern void would_dump(struct linux_binprm *, struct file *);
 
-- 
2.25.1.362.g51ebf55




^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
  2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
  2025-11-09 17:14                 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
@ 2025-11-09 17:15                 ` Oleg Nesterov
  2025-11-10 10:58                   ` Cyrill Gorcunov
  2025-11-09 17:16                 ` [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
  2025-11-10  5:28                 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
  3 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:15 UTC (permalink / raw)
  To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

This simple program

    #include <unistd.h>
    #include <signal.h>
    #include <sys/ptrace.h>
    #include <pthread.h>

    void *thread(void *arg)
    {
	    ptrace(PTRACE_TRACEME, 0,0,0);
	    return NULL;
    }

    int main(void)
    {
	    int pid = fork();

	    if (!pid) {
		    pthread_t pt;
		    pthread_create(&pt, NULL, thread, NULL);
		    pthread_join(pt, NULL);
		    execlp("echo", "echo", "passed", NULL);
	    }

	    sleep(1);
	    ptrace(PTRACE_ATTACH, pid, 0,0);
	    kill(pid, SIGCONT);

	    return 0;
    }

hangs because de_thread() waits for debugger which should release the killed
thread with cred_guard_mutex held, while the debugger sleeps waiting for the
same mutex. Not really that bad, the tracer can be killed, but still this is
a bug and people hit it in practice.

With this patch:

	- de_thread() waits until all the sub-threads pass exit_notify() and
	  become zombies.

	- setup_new_exec() waits until all the sub-threads are reaped without
	  cred_guard_mutex held.

	- unshare_sighand() and flush_signal_handlers() are moved from
	  begin_new_exec() to setup_new_exec(), we can't call them until all
	  sub-threads go away.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/exec.c       | 140 +++++++++++++++++++++++-------------------------
 kernel/exit.c   |   9 ++--
 kernel/signal.c |   2 +-
 3 files changed, 71 insertions(+), 80 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 136a7ab5d91c..2bac7deb9a98 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -905,42 +905,56 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int kill_sub_threads(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
-	struct sighand_struct *oldsighand = tsk->sighand;
-	spinlock_t *lock = &oldsighand->siglock;
-
-	if (thread_group_empty(tsk))
-		goto no_thread_group;
+	int err = -EINTR;
 
-	/*
-	 * Kill all other threads in the thread group.
-	 */
-	spin_lock_irq(lock);
-	if ((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task) {
-		/*
-		 * Another group action in progress, just
-		 * return so that the signal is processed.
-		 */
-		spin_unlock_irq(lock);
-		return -EAGAIN;
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
+		sig->group_exec_task = tsk;
+		sig->notify_count = -zap_other_threads(tsk);
+		err = 0;
 	}
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
 
-	sig->group_exec_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
-	if (!thread_group_leader(tsk))
-		sig->notify_count--;
+	return err;
+}
 
-	while (sig->notify_count) {
-		__set_current_state(TASK_KILLABLE);
-		spin_unlock_irq(lock);
-		schedule();
+static int wait_for_notify_count(struct task_struct *tsk)
+{
+	for (;;) {
 		if (__fatal_signal_pending(tsk))
-			goto killed;
-		spin_lock_irq(lock);
+			return -EINTR;
+		set_current_state(TASK_KILLABLE);
+		if (!tsk->signal->notify_count)
+			break;
+		schedule();
 	}
-	spin_unlock_irq(lock);
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
+static void clear_group_exec_task(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exec_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+}
+
+static int de_thread(struct task_struct *tsk)
+{
+	if (thread_group_empty(tsk))
+		goto no_thread_group;
+
+	if (kill_sub_threads(tsk) || wait_for_notify_count(tsk))
+		return -EINTR;
 
 	/*
 	 * At this point all other threads have exited, all we have to
@@ -948,26 +962,10 @@ static int de_thread(struct task_struct *tsk)
 	 * and to assume its PID:
 	 */
 	if (!thread_group_leader(tsk)) {
-		struct task_struct *leader = tsk->group_leader;
-
-		for (;;) {
-			cgroup_threadgroup_change_begin(tsk);
-			write_lock_irq(&tasklist_lock);
-			/*
-			 * Do this under tasklist_lock to ensure that
-			 * exit_notify() can't miss ->group_exec_task
-			 */
-			sig->notify_count = -1;
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_KILLABLE);
-			write_unlock_irq(&tasklist_lock);
-			cgroup_threadgroup_change_end(tsk);
-			schedule();
-			if (__fatal_signal_pending(tsk))
-				goto killed;
-		}
+		struct task_struct *leader = tsk->group_leader, *t;
 
+		cgroup_threadgroup_change_begin(tsk);
+		write_lock_irq(&tasklist_lock);
 		/*
 		 * The only record we have of the real-time age of a
 		 * process, regardless of execs it's done, is start_time.
@@ -1000,8 +998,8 @@ static int de_thread(struct task_struct *tsk)
 		list_replace_rcu(&leader->tasks, &tsk->tasks);
 		list_replace_init(&leader->sibling, &tsk->sibling);
 
-		tsk->group_leader = tsk;
-		leader->group_leader = tsk;
+		for_each_thread(tsk, t)
+			t->group_leader = tsk;
 
 		tsk->exit_signal = SIGCHLD;
 		leader->exit_signal = -1;
@@ -1021,23 +1019,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
-	sig->group_exec_task = NULL;
-	sig->notify_count = 0;
-
 no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
-
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exec_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 
@@ -1171,13 +1157,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	flush_itimer_signals();
 #endif
 
-	/*
-	 * Make the signal table private.
-	 */
-	retval = unshare_sighand(me);
-	if (retval)
-		goto out_unlock;
-
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
 	flush_thread();
@@ -1249,7 +1228,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
 	WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
-	flush_signal_handlers(me, 0);
 
 	retval = set_cred_ucounts(bprm->cred);
 	if (retval < 0)
@@ -1293,8 +1271,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 	up_write(&me->signal->exec_update_lock);
 	if (!bprm->cred)
 		mutex_unlock(&me->signal->cred_guard_mutex);
-
 out:
+	if (me->signal->group_exec_task == me)
+		clear_group_exec_task(me);
 	return retval;
 }
 EXPORT_SYMBOL(begin_new_exec);
@@ -1325,6 +1304,8 @@ int setup_new_exec(struct linux_binprm * bprm)
 {
 	/* Setup things that can depend upon the personality */
 	struct task_struct *me = current;
+	struct signal_struct *sig = me->signal;
+	int err = 0;
 
 	arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
 
@@ -1335,10 +1316,23 @@ int setup_new_exec(struct linux_binprm * bprm)
 	 * some architectures like powerpc
 	 */
 	me->mm->task_size = TASK_SIZE;
-	up_write(&me->signal->exec_update_lock);
-	mutex_unlock(&me->signal->cred_guard_mutex);
+	up_write(&sig->exec_update_lock);
+	mutex_unlock(&sig->cred_guard_mutex);
 
-	return 0;
+	if (sig->group_exec_task) {
+		spin_lock_irq(&me->sighand->siglock);
+		sig->notify_count = sig->nr_threads - 1;
+		spin_unlock_irq(&me->sighand->siglock);
+
+		err = wait_for_notify_count(me);
+		clear_group_exec_task(me);
+	}
+
+	if (!err)
+		err = unshare_sighand(me);
+	if (!err)
+		flush_signal_handlers(me, 0);
+	return err;
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index f041f0c05ebb..bcde78c97253 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -178,10 +178,7 @@ static void __exit_signal(struct release_task_post *post, struct task_struct *ts
 		tty = sig->tty;
 		sig->tty = NULL;
 	} else {
-		/*
-		 * If there is any task waiting for the group exit
-		 * then notify it:
-		 */
+		/* mt-exec, setup_new_exec() -> wait_for_notify_count() */
 		if (sig->notify_count > 0 && !--sig->notify_count)
 			wake_up_process(sig->group_exec_task);
 
@@ -766,8 +763,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		list_add(&tsk->ptrace_entry, &dead);
 	}
 
-	/* mt-exec, de_thread() is waiting for group leader */
-	if (unlikely(tsk->signal->notify_count < 0))
+	/* mt-exec, de_thread() -> wait_for_notify_count() */
+	if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count)
 		wake_up_process(tsk->signal->group_exec_task);
 	write_unlock_irq(&tasklist_lock);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index fe9190d84f28..334212044940 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1343,13 +1343,13 @@ int zap_other_threads(struct task_struct *p)
 
 	for_other_threads(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
 		sigaddset(&t->pending.signal, SIGKILL);
 		signal_wake_up(t, 1);
+		count++;
 	}
 
 	return count;
-- 
2.25.1.362.g51ebf55




^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec
  2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
  2025-11-09 17:14                 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
  2025-11-09 17:15                 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
@ 2025-11-09 17:16                 ` Oleg Nesterov
  2025-11-10  5:28                 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
  3 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-09 17:16 UTC (permalink / raw)
  To: Bernd Edlinger, Linus Torvalds, Dmitry Levin
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

The previous patch fixed the deadlock when mt-exec waits for debugger
which should reap a zombie thread, but we can hit the same problem if
the killed sub-thread stops in ptrace_event(PTRACE_EVENT_EXIT). Change
ptrace_stop() to check signal->group_exit_task.

This is a user-visible change. But hopefully it can't break anything.
Note that the semantics of PTRACE_EVENT_EXIT was never really defined,
it depends on /dev/random. Just for example, currently a sub-thread
killed by exec will stop, but if it exits on its own and races with
exec it will not stop, so nobody can rely on PTRACE_EVENT_EXIT anyway.

We really need to finally define what PTRACE_EVENT_EXIT should actually
do, but this needs other changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/signal.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 334212044940..59f61e07905b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2376,6 +2376,10 @@ static int ptrace_stop(int exit_code, int why, unsigned long message,
 	if (!current->ptrace || __fatal_signal_pending(current))
 		return exit_code;
 
+	/* de_thread() -> wait_for_notify_count() waits for us */
+	if (current->signal->group_exec_task)
+		return exit_code;
+
 	set_special_state(TASK_TRACED);
 	current->jobctl |= JOBCTL_TRACED;
 
-- 
2.25.1.362.g51ebf55




^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach()
  2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
                                   ` (2 preceding siblings ...)
  2025-11-09 17:16                 ` [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
@ 2025-11-10  5:28                 ` Bernd Edlinger
  2025-11-10 14:47                   ` Oleg Nesterov
  3 siblings, 1 reply; 39+ messages in thread
From: Bernd Edlinger @ 2025-11-10  5:28 UTC (permalink / raw)
  To: Oleg Nesterov, Linus Torvalds, Dmitry Levin
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Christian Brauner, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

Hi Oleg,

I have not been able to update my patch with your and Peter Peter Zijlstra's
kind suggestions, because I am currently too busy with my role as openssl maintainer.

Just for clarification, my patch is 10% about deadlocks, and 90% about security.
The idea is that if the de_thread is blocked, and the debugger may be trying to
ptrace the exec thread.  That must succeed or fail. So the debugger can release
the zombie threads.

The security issue is when the debugged process tries to exec a SUID process
like /usr/bin/passwd

In that case the new credentials are determined differently when the PTRACE is
already attached (i.e. non-root), than when it is not yet attached (root user).
My attempt at fixing this, determines the new credentials and the new dumpability
as root user when the debugger did not yet attach before the de_thread.
And keeps this decision.

When the debugger wants to attach the de_thread the debug-user access rights are
checked against the current user and additionally against the new user credentials.
This I did by quickly switching the user credenitals to the next user and back again,
under the cred_guard_mutex, which should make that safe.

So at this time I have only one request for you.
Could you please try out how the test case in my patch behaves with your fix?

Thanks
Bernd.

On 11/9/25 18:14, Oleg Nesterov wrote:
> Not for inclusion yet. 2/2 is untested, incomplete, possibly buggy.
> 
> But could you review at least the intent? Do you see any problem with
> this approach?
> 
> This problem is very, very old. It seems that nobody can suggest a
> simple/clean fix...
> 
> Oleg.
> ---
> 
>  fs/binfmt_elf.c         |   4 +-
>  fs/binfmt_elf_fdpic.c   |   4 +-
>  fs/binfmt_flat.c        |   4 +-
>  fs/exec.c               | 142 +++++++++++++++++++++++-------------------------
>  include/linux/binfmts.h |   2 +-
>  kernel/exit.c           |   9 +--
>  kernel/signal.c         |   6 +-
>  7 files changed, 87 insertions(+), 84 deletions(-)
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
  2025-11-09 17:15                 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
@ 2025-11-10 10:58                   ` Cyrill Gorcunov
  2025-11-10 15:09                     ` Oleg Nesterov
  0 siblings, 1 reply; 39+ messages in thread
From: Cyrill Gorcunov @ 2025-11-10 10:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module

On Sun, Nov 09, 2025 at 06:15:33PM +0100, Oleg Nesterov wrote:
..
> static int kill_sub_threads(struct task_struct *tsk)
> {
>  	struct signal_struct *sig = tsk->signal;
> 	int err = -EINTR;
> 
> 	read_lock(&tasklist_lock);
> 	spin_lock_irq(&tsk->sighand->siglock);
> 	if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
> 		sig->group_exec_task = tsk;
> 		sig->notify_count = -zap_other_threads(tsk);

Hi Oleg! I somehow manage to miss a moment -- why negative result here?

> 		err = 0;
> 	}
> 	spin_unlock_irq(&tsk->sighand->siglock);
> 	read_unlock(&tasklist_lock);
> 
> 	return err;
> }

p.s. i've dropped long CC but left ML intact)

	Cyrill


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach()
  2025-11-10  5:28                 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
@ 2025-11-10 14:47                   ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-10 14:47 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Linus Torvalds, Dmitry Levin, Alexander Viro, Alexey Dobriyan,
	Kees Cook, Andy Lutomirski, Will Drewry, Christian Brauner,
	Andrew Morton, Michal Hocko, Serge Hallyn, James Morris,
	Randy Dunlap, Suren Baghdasaryan, Yafang Shao, Helge Deller,
	Eric W. Biederman, Adrian Reber, Thomas Gleixner, Jens Axboe,
	Alexei Starovoitov, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest, linux-mm,
	linux-security-module, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Elena Reshetova, David Windsor, Mateusz Guzik,
	Ard Biesheuvel, Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

Hi Bernd,

On 11/10, Bernd Edlinger wrote:
>
> When the debugger wants to attach the de_thread the debug-user access rights are
> checked against the current user and additionally against the new user credentials.
> This I did by quickly switching the user credenitals to the next user and back again,
> under the cred_guard_mutex, which should make that safe.

Let me repeat, I can't really comment this part, I don't know if it is
actually safe. But the very fact your patch changes ->mm and ->cred of
the execing task in ptrace_attach() makes me worry... At least I think
you should update or remove this comment in begin_new_exec:

	/*
	 * cred_guard_mutex must be held at least to this point to prevent
	 * ptrace_attach() from altering our determination of the task's
	 * credentials; any time after this it may be unlocked.
	 */
	security_bprm_committed_creds(bprm);

> So at this time I have only one request for you.
> Could you please try out how the test case in my patch behaves with your fix?

The new TEST(attach2) added by your patch fails as expected, see 3/3.

   128  static long thread2_tid;
   129  static void *thread2(void *arg)
   130  {
   131          thread2_tid = syscall(__NR_gettid);
   132          sleep(2);
   133          execlp("false", "false", NULL);
   134          return NULL;
   135  }
   136
   137  TEST(attach2)
   138  {
   139          int s, k, pid = fork();
   140
   141          if (!pid) {
   142                  pthread_t pt;
   143
   144                  pthread_create(&pt, NULL, thread2, NULL);
   145                  pthread_join(pt, NULL);
   146                  return;
   147          }
   148
   149          sleep(1);
   150          k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
   151          ASSERT_EQ(k, 0);
   152          k = waitpid(-1, &s, 0);
   153          ASSERT_EQ(k, pid);
   154          ASSERT_EQ(WIFSTOPPED(s), 1);
   155          ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
   156          k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
   157          ASSERT_EQ(k, 0);
   158          thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
   159          ASSERT_NE(thread2_tid, -1);
   160          ASSERT_NE(thread2_tid, 0);
   161          ASSERT_NE(thread2_tid, pid);
   162          k = waitpid(-1, &s, WNOHANG);
   163          ASSERT_EQ(k, 0);
   164          sleep(2);
   165          /* deadlock may happen here */
   166          k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);

PTRACE_ATTACH fails.

thread2() kills the old leader, takes it pid, execlp() succeeds.

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
  2025-11-10 10:58                   ` Cyrill Gorcunov
@ 2025-11-10 15:09                     ` Oleg Nesterov
  2025-11-10 21:49                       ` Cyrill Gorcunov
  0 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-10 15:09 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module

Hi Cyrill,

On 11/10, Cyrill Gorcunov wrote:
>
> On Sun, Nov 09, 2025 at 06:15:33PM +0100, Oleg Nesterov wrote:
> ..
> > static int kill_sub_threads(struct task_struct *tsk)
> > {
> >  	struct signal_struct *sig = tsk->signal;
> > 	int err = -EINTR;
> >
> > 	read_lock(&tasklist_lock);
> > 	spin_lock_irq(&tsk->sighand->siglock);
> > 	if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
> > 		sig->group_exec_task = tsk;
> > 		sig->notify_count = -zap_other_threads(tsk);
>
> Hi Oleg! I somehow manage to miss a moment -- why negative result here?

You know, initially I wrote

		sig->notify_count = 0 - zap_other_threads(tsk);

to make it clear that this is not a typo ;)


This is for exit_notify() which does

	/* mt-exec, de_thread() -> wait_for_notify_count() */
	if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count)
		wake_up_process(tsk->signal->group_exec_task);

Then setup_new_exec() sets notify_count > 0 for __exit_signal() which does

	/* mt-exec, setup_new_exec() -> wait_for_notify_count() */
	if (sig->notify_count > 0 && !--sig->notify_count)
		wake_up_process(sig->group_exec_task);

Yes this needs more comments and (with or without this patch) cleanups.
Note that exit_notify() and __exit_signal() already (before this patch)
use ->notify_count almost the same way, just exit_notify() assumes that
notify_count < 0 means the !thread_group_leader() case in de_thread().

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
  2025-11-10 15:09                     ` Oleg Nesterov
@ 2025-11-10 21:49                       ` Cyrill Gorcunov
  2025-11-11 14:09                         ` Oleg Nesterov
  0 siblings, 1 reply; 39+ messages in thread
From: Cyrill Gorcunov @ 2025-11-10 21:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module

On Mon, Nov 10, 2025 at 04:09:05PM +0100, Oleg Nesterov wrote:
...
> > > 	if (!((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task)) {
> > > 		sig->group_exec_task = tsk;
> > > 		sig->notify_count = -zap_other_threads(tsk);
> >
> > Hi Oleg! I somehow manage to miss a moment -- why negative result here?
> 
> You know, initially I wrote
> 
> 		sig->notify_count = 0 - zap_other_threads(tsk);
> 
> to make it clear that this is not a typo ;)

Aha! Thanks a huge for explanation :)

> 
> This is for exit_notify() which does
> 
> 	/* mt-exec, de_thread() -> wait_for_notify_count() */
> 	if (tsk->signal->notify_count < 0 && !++tsk->signal->notify_count)
> 		wake_up_process(tsk->signal->group_exec_task);
> 
> Then setup_new_exec() sets notify_count > 0 for __exit_signal() which does
> 
> 	/* mt-exec, setup_new_exec() -> wait_for_notify_count() */
> 	if (sig->notify_count > 0 && !--sig->notify_count)
> 		wake_up_process(sig->group_exec_task);
> 
> Yes this needs more comments and (with or without this patch) cleanups.
> Note that exit_notify() and __exit_signal() already (before this patch)
> use ->notify_count almost the same way, just exit_notify() assumes that
> notify_count < 0 means the !thread_group_leader() case in de_thread().

Yeah, just realized. It's been a long time since I looked into this signals
and tasks related code so to be honest don't think I would be helpful here)
Anyway while looking into patch I got wonder why

+static int wait_for_notify_count(struct task_struct *tsk)
+{
+	for (;;) {
+			return -EINTR;
+		set_current_state(TASK_KILLABLE);
+		if (!tsk->signal->notify_count)
+			break;

We have no any barrier here in fetching @notify_count? I mean updating
this value is done under locks (spin or read/write) in turn condition
test is a raw one. Not a big deal since set_current_state() and schedule()
are buffer flushers by themselves and after all not immediate update of
notify_count simply force us to yield one more schedule() call but I've
been a bit confused that we don't use some read_once here or something.
Another (more likely) that I've just said something stupid)

+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}

	Cyrill


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-05 14:32               ` Oleg Nesterov
@ 2025-11-11  9:21                 ` Christian Brauner
  2025-11-11 11:07                   ` Bernd Edlinger
  2025-11-17  6:31                   ` Bernd Edlinger
  0 siblings, 2 replies; 39+ messages in thread
From: Christian Brauner @ 2025-11-11  9:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Bernd Edlinger, Alexander Viro, Alexey Dobriyan, Kees Cook,
	Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
> I am still thinking about another approach, will write another email.
> But let me take a closer look at your patch.
> 
> First of all, can you split it? See below.
> 
> On 08/21, Bernd Edlinger wrote:
> >
> > -static int de_thread(struct task_struct *tsk)
> > +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
> >  {
> >  	struct signal_struct *sig = tsk->signal;
> >  	struct sighand_struct *oldsighand = tsk->sighand;
> >  	spinlock_t *lock = &oldsighand->siglock;
> > +	struct task_struct *t;
> > +	bool unsafe_execve_in_progress = false;
> >
> >  	if (thread_group_empty(tsk))
> >  		goto no_thread_group;
> > @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
> >  	if (!thread_group_leader(tsk))
> >  		sig->notify_count--;
> >
> > +	for_other_threads(tsk, t) {
> > +		if (unlikely(t->ptrace)
> > +		    && (t != tsk->group_leader || !t->exit_state))
> > +			unsafe_execve_in_progress = true;
> 
> you can add "break" into the "if ()" block...
> 
> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
> unconditionally.
> 
> If you really think it makes sense, please make another patch with the
> changelog.
> 
> I'd certainly prefer to avoid this boolean at least for the start. If nothing
> else to catch the potential problems earlier.
> 
> > +	if (unlikely(unsafe_execve_in_progress)) {
> > +		spin_unlock_irq(lock);
> > +		sig->exec_bprm = bprm;
> > +		mutex_unlock(&sig->cred_guard_mutex);
> > +		spin_lock_irq(lock);
> 
> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
> 
> > @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> >  	 */
> >  	trace_sched_prepare_exec(current, bprm);
> >
> > +	/* If the binary is not readable then enforce mm->dumpable=0 */
> > +	would_dump(bprm, bprm->file);
> > +	if (bprm->have_execfd)
> > +		would_dump(bprm, bprm->executable);
> > +
> > +	/*
> > +	 * Figure out dumpability. Note that this checking only of current
> > +	 * is wrong, but userspace depends on it. This should be testing
> > +	 * bprm->secureexec instead.
> > +	 */
> > +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> > +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> > +	    !(uid_eq(current_euid(), current_uid()) &&
> > +	      gid_eq(current_egid(), current_gid())))
> > +		set_dumpable(bprm->mm, suid_dumpable);
> > +	else
> > +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> > +
> 
> OK, we need to do this before de_thread() drops cred_guard_mutex.
> But imo this too should be done in a separate patch, the changelog should
> explain this change.
> 
> > @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
> >  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> >  		return -ERESTARTNOINTR;
> >
> > +	if (unlikely(current->signal->exec_bprm)) {
> > +		mutex_unlock(&current->signal->cred_guard_mutex);
> > +		return -ERESTARTNOINTR;
> > +	}
> 
> OK, if signal->exec_bprm != NULL, then current is already killed. But
> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
> something like
> 
> 	int lock_current_cgm(void)
> 	{
> 		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> 			return -ERESTARTNOINTR;
> 
> 		if (!current->signal->group_exec_task)
> 			return 0;
> 
> 		WARN_ON(!fatal_signal_pending(current));
> 		mutex_unlock(&current->signal->cred_guard_mutex);
> 		return -ERESTARTNOINTR;
> 	}
> 
> ?
> 
> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
> come in a separate patch too, but I won't insist.
> 
> > @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
> >  				return retval;
> >  		}
> >
> > +		if (unlikely(task == task->signal->group_exec_task)) {
> > +			retval = down_write_killable(&task->signal->exec_update_lock);
> > +			if (retval)
> > +				return retval;
> > +
> > +			scoped_guard (task_lock, task) {
> > +				struct linux_binprm *bprm = task->signal->exec_bprm;
> > +				const struct cred __rcu *old_cred = task->real_cred;
> > +				struct mm_struct *old_mm = task->mm;
> > +
> > +				rcu_assign_pointer(task->real_cred, bprm->cred);
> > +				task->mm = bprm->mm;
> > +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> > +				rcu_assign_pointer(task->real_cred, old_cred);
> > +				task->mm = old_mm;
> > +			}
> 
> This is the most problematic change which I can't review...
> 
> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> looks dangerous to me.

Yeah, that is not ok. This is effectively override_creds for real_cred
and that is not a pattern I want to see us establish at all! Temporary
credential overrides for the subjective credentials is already terrible
but at least we have the explicit split between real_cred and cred
expressely for that. So no, that's not an acceptable solution.

> 
> Say, current_is_single_threaded() called by another CLONE_VM process can
> miss group_exec_task and falsely return true. Probably not that bad, in
> this case old_mm should go away soon, but still...
> 
> And I don't know if this can fool the users of task_cred_xxx/__task_cred
> somehow.
> 
> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
> ptrace the execing task after that? I have no idea what the security hooks
> can do...
> 
> Again, can't review this part.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-11  9:21                 ` Christian Brauner
@ 2025-11-11 11:07                   ` Bernd Edlinger
  2025-11-11 13:12                     ` Oleg Nesterov
  2025-11-17  6:31                   ` Bernd Edlinger
  1 sibling, 1 reply; 39+ messages in thread
From: Bernd Edlinger @ 2025-11-11 11:07 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Andrew Morton, Michal Hocko, Serge Hallyn,
	James Morris, Randy Dunlap, Suren Baghdasaryan, Yafang Shao,
	Helge Deller, Eric W. Biederman, Adrian Reber, Thomas Gleixner,
	Jens Axboe, Alexei Starovoitov, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest, linux-mm,
	linux-security-module, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Elena Reshetova, David Windsor, Mateusz Guzik,
	Ard Biesheuvel, Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 11/11/25 10:21, Christian Brauner wrote:
> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>> I am still thinking about another approach, will write another email.
>> But let me take a closer look at your patch.
>>
>> First of all, can you split it? See below.
>>
>> On 08/21, Bernd Edlinger wrote:
>>>
>>> -static int de_thread(struct task_struct *tsk)
>>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>>  {
>>>  	struct signal_struct *sig = tsk->signal;
>>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>>  	spinlock_t *lock = &oldsighand->siglock;
>>> +	struct task_struct *t;
>>> +	bool unsafe_execve_in_progress = false;
>>>
>>>  	if (thread_group_empty(tsk))
>>>  		goto no_thread_group;
>>> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>>>  	if (!thread_group_leader(tsk))
>>>  		sig->notify_count--;
>>>
>>> +	for_other_threads(tsk, t) {
>>> +		if (unlikely(t->ptrace)
>>> +		    && (t != tsk->group_leader || !t->exit_state))
>>> +			unsafe_execve_in_progress = true;
>>
>> you can add "break" into the "if ()" block...
>>
>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>> unconditionally.
>>
>> If you really think it makes sense, please make another patch with the
>> changelog.
>>
>> I'd certainly prefer to avoid this boolean at least for the start. If nothing
>> else to catch the potential problems earlier.
>>
>>> +	if (unlikely(unsafe_execve_in_progress)) {
>>> +		spin_unlock_irq(lock);
>>> +		sig->exec_bprm = bprm;
>>> +		mutex_unlock(&sig->cred_guard_mutex);
>>> +		spin_lock_irq(lock);
>>
>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>
>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>>  	 */
>>>  	trace_sched_prepare_exec(current, bprm);
>>>
>>> +	/* If the binary is not readable then enforce mm->dumpable=0 */
>>> +	would_dump(bprm, bprm->file);
>>> +	if (bprm->have_execfd)
>>> +		would_dump(bprm, bprm->executable);
>>> +
>>> +	/*
>>> +	 * Figure out dumpability. Note that this checking only of current
>>> +	 * is wrong, but userspace depends on it. This should be testing
>>> +	 * bprm->secureexec instead.
>>> +	 */
>>> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
>>> +	    !(uid_eq(current_euid(), current_uid()) &&
>>> +	      gid_eq(current_egid(), current_gid())))
>>> +		set_dumpable(bprm->mm, suid_dumpable);
>>> +	else
>>> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
>>> +
>>
>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>> But imo this too should be done in a separate patch, the changelog should
>> explain this change.
>>
>>> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>>>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>>  		return -ERESTARTNOINTR;
>>>
>>> +	if (unlikely(current->signal->exec_bprm)) {
>>> +		mutex_unlock(&current->signal->cred_guard_mutex);
>>> +		return -ERESTARTNOINTR;
>>> +	}
>>
>> OK, if signal->exec_bprm != NULL, then current is already killed. But
>> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
>> something like
>>
>> 	int lock_current_cgm(void)
>> 	{
>> 		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> 			return -ERESTARTNOINTR;
>>
>> 		if (!current->signal->group_exec_task)
>> 			return 0;
>>
>> 		WARN_ON(!fatal_signal_pending(current));
>> 		mutex_unlock(&current->signal->cred_guard_mutex);
>> 		return -ERESTARTNOINTR;
>> 	}
>>
>> ?
>>
>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>> come in a separate patch too, but I won't insist.
>>
>>> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>>>  				return retval;
>>>  		}
>>>
>>> +		if (unlikely(task == task->signal->group_exec_task)) {
>>> +			retval = down_write_killable(&task->signal->exec_update_lock);
>>> +			if (retval)
>>> +				return retval;
>>> +
>>> +			scoped_guard (task_lock, task) {
>>> +				struct linux_binprm *bprm = task->signal->exec_bprm;
>>> +				const struct cred __rcu *old_cred = task->real_cred;
>>> +				struct mm_struct *old_mm = task->mm;
>>> +
>>> +				rcu_assign_pointer(task->real_cred, bprm->cred);
>>> +				task->mm = bprm->mm;
>>> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>>> +				rcu_assign_pointer(task->real_cred, old_cred);
>>> +				task->mm = old_mm;
>>> +			}
>>
>> This is the most problematic change which I can't review...
>>
>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>> looks dangerous to me.
> 
> Yeah, that is not ok. This is effectively override_creds for real_cred
> and that is not a pattern I want to see us establish at all! Temporary
> credential overrides for the subjective credentials is already terrible
> but at least we have the explicit split between real_cred and cred
> expressely for that. So no, that's not an acceptable solution.
> 

Well when this is absolutely not acceptable then I would have to change
all security engines to be aware of the current and the new credentials.
That may be as well be possible but would be a rather big change.
Of course that was only meant as a big exception, and somehow safe
as long as it is protected under the right mutexes: cred_guard_mutex,
exec_update_lock and task_lock at least.

>>
>> Say, current_is_single_threaded() called by another CLONE_VM process can
>> miss group_exec_task and falsely return true. Probably not that bad, in
>> this case old_mm should go away soon, but still...
>>
>> And I don't know if this can fool the users of task_cred_xxx/__task_cred
>> somehow.
>>
>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>> ptrace the execing task after that? I have no idea what the security hooks
>> can do...
>>
>> Again, can't review this part.
>>

Never mind, your review was really helpful.  At the very least it pointed
out some places where better comments are needed.

Thanks
Bernd.

>> Oleg.
>>



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-11 11:07                   ` Bernd Edlinger
@ 2025-11-11 13:12                     ` Oleg Nesterov
  2025-11-11 13:45                       ` Bernd Edlinger
  0 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-11 13:12 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
	Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 11/11, Bernd Edlinger wrote:
>
> On 11/11/25 10:21, Christian Brauner wrote:
> > On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
> >>
> >> This is the most problematic change which I can't review...
> >>
> >> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> >> looks dangerous to me.
> >
> > Yeah, that is not ok. This is effectively override_creds for real_cred
> > and that is not a pattern I want to see us establish at all! Temporary
> > credential overrides for the subjective credentials is already terrible
> > but at least we have the explicit split between real_cred and cred
> > expressely for that. So no, that's not an acceptable solution.
> >
>
> Well when this is absolutely not acceptable then I would have to change
> all security engines to be aware of the current and the new credentials.

Hmm... even if we find another way to avoid the deadlock? Say, the patches
I sent...

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-11 13:12                     ` Oleg Nesterov
@ 2025-11-11 13:45                       ` Bernd Edlinger
  2025-11-12  9:52                         ` Oleg Nesterov
  0 siblings, 1 reply; 39+ messages in thread
From: Bernd Edlinger @ 2025-11-11 13:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
	Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 11/11/25 14:12, Oleg Nesterov wrote:
> On 11/11, Bernd Edlinger wrote:
>>
>> On 11/11/25 10:21, Christian Brauner wrote:
>>> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>>>>
>>>> This is the most problematic change which I can't review...
>>>>
>>>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>>>> looks dangerous to me.
>>>
>>> Yeah, that is not ok. This is effectively override_creds for real_cred
>>> and that is not a pattern I want to see us establish at all! Temporary
>>> credential overrides for the subjective credentials is already terrible
>>> but at least we have the explicit split between real_cred and cred
>>> expressely for that. So no, that's not an acceptable solution.
>>>
>>
>> Well when this is absolutely not acceptable then I would have to change
>> all security engines to be aware of the current and the new credentials.
> 
> Hmm... even if we find another way to avoid the deadlock? Say, the patches
> I sent...
> 

Maybe, but it looks almost too simple ;-)

   164          sleep(2);
   165          /* deadlock may happen here */
   166          k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);

what happens if you change the test expectation here, that the
ptrace may fail instead of succeed?

What signals does the debugger receive after that point?
Is the debugger notified that the debugged process continues,
has the same PID, and is no longer ptraced?

Thanks
Bernd.

> Oleg.
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held
  2025-11-10 21:49                       ` Cyrill Gorcunov
@ 2025-11-11 14:09                         ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-11 14:09 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module

On 11/11, Cyrill Gorcunov wrote:
>
> Anyway while looking into patch I got wonder why
>
> +static int wait_for_notify_count(struct task_struct *tsk)
> +{
> +	for (;;) {
> +			return -EINTR;
> +		set_current_state(TASK_KILLABLE);
> +		if (!tsk->signal->notify_count)
> +			break;
>
> We have no any barrier here in fetching @notify_count? I mean updating
> this value is done under locks (spin or read/write) in turn condition
> test is a raw one. Not a big deal since set_current_state() and schedule()

Yes, so I think that, correctness-wise, this doesn't need additional barriers.

> but I've
> been a bit confused that we don't use some read_once here or something.

Yes, this needs READ_ONCE() to avoid the warnings from KCSAN. And in fact
this code was written with READ_ONCE() but I removed it before sending this
RFC.

I was going to do this later. I always forget how KCSAN works, IIUC I also
need to add WRITE_ONCE() into exit_notify() and __exit_signal() to make
KCSAN happy, even if ->notify_count is always updated under the lock.

Same for the "if (me->signal->group_exec_task == me)" check in begin_new_exec().

Right now I would like to know if this RFC (approach) makes any sense,
especially because 3/3 adds a user-visible change.

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-11 13:45                       ` Bernd Edlinger
@ 2025-11-12  9:52                         ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-12  9:52 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
	Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 11/11, Bernd Edlinger wrote:
>
> On 11/11/25 14:12, Oleg Nesterov wrote:
> > On 11/11, Bernd Edlinger wrote:
> >>
> >> Well when this is absolutely not acceptable then I would have to change
> >> all security engines to be aware of the current and the new credentials.
> >
> > Hmm... even if we find another way to avoid the deadlock? Say, the patches
> > I sent...
> >
>
> Maybe, but it looks almost too simple ;-)
>
>    164          sleep(2);
>    165          /* deadlock may happen here */
>    166          k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
>
> what happens if you change the test expectation here, that the
> ptrace may fail instead of succeed?
>
> What signals does the debugger receive after that point?
> Is the debugger notified that the debugged process continues,
> has the same PID, and is no longer ptraced?

Ah, but this is another thing... OK, you dislike 3/3 and I have to agree.

Yes, de_thread() silently untraces/reaps the old leader and after 3/3 debugger
can't rely on PTRACE_EVENT_EXIT, so unless the debugger has already attached to
all sub-threads (at least to execing thread) it looks as if the leader was just
untraced somehow.

OK, this is probably too bad, we need another solution...

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-11  9:21                 ` Christian Brauner
  2025-11-11 11:07                   ` Bernd Edlinger
@ 2025-11-17  6:31                   ` Bernd Edlinger
  2025-11-17 15:01                     ` Oleg Nesterov
  1 sibling, 1 reply; 39+ messages in thread
From: Bernd Edlinger @ 2025-11-17  6:31 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: Alexander Viro, Alexey Dobriyan, Kees Cook, Andy Lutomirski,
	Will Drewry, Andrew Morton, Michal Hocko, Serge Hallyn,
	James Morris, Randy Dunlap, Suren Baghdasaryan, Yafang Shao,
	Helge Deller, Eric W. Biederman, Adrian Reber, Thomas Gleixner,
	Jens Axboe, Alexei Starovoitov, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest, linux-mm,
	linux-security-module, tiozhang, Luis Chamberlain,
	Paulo Alcantara (SUSE), Sergey Senozhatsky, Frederic Weisbecker,
	YueHaibing, Paul Moore, Aleksa Sarai, Stefan Roesch, Chao Yu,
	xu xin, Jeff Layton, Jan Kara, David Hildenbrand, Dave Chinner,
	Shuah Khan, Elena Reshetova, David Windsor, Mateusz Guzik,
	Ard Biesheuvel, Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 11/11/25 10:21, Christian Brauner wrote:
> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>> I am still thinking about another approach, will write another email.
>> But let me take a closer look at your patch.
>>
>> First of all, can you split it? See below.
>>
>> On 08/21, Bernd Edlinger wrote:
>>>
>>> -static int de_thread(struct task_struct *tsk)
>>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>>  {
>>>  	struct signal_struct *sig = tsk->signal;
>>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>>  	spinlock_t *lock = &oldsighand->siglock;
>>> +	struct task_struct *t;
>>> +	bool unsafe_execve_in_progress = false;
>>>
>>>  	if (thread_group_empty(tsk))
>>>  		goto no_thread_group;
>>> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>>>  	if (!thread_group_leader(tsk))
>>>  		sig->notify_count--;
>>>
>>> +	for_other_threads(tsk, t) {
>>> +		if (unlikely(t->ptrace)
>>> +		    && (t != tsk->group_leader || !t->exit_state))
>>> +			unsafe_execve_in_progress = true;
>>
>> you can add "break" into the "if ()" block...
>>

ok.

>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>> unconditionally.
>>

I would not like to drop the mutex when no absolutely necessary for performance reasons.
But I can at least try out if something crashes immedately if that is always done.

>> If you really think it makes sense, please make another patch with the
>> changelog.
>>
>> I'd certainly prefer to avoid this boolean at least for the start. If nothing
>> else to catch the potential problems earlier.
>>
>>> +	if (unlikely(unsafe_execve_in_progress)) {
>>> +		spin_unlock_irq(lock);
>>> +		sig->exec_bprm = bprm;
>>> +		mutex_unlock(&sig->cred_guard_mutex);
>>> +		spin_lock_irq(lock);
>>
>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>

Since the spin lock was acquired while holding the mutex, both should be
unlocked in reverse sequence and the spin lock re-acquired after releasing
the mutex.
I'd expect the scheduler to do a task switch after the cred_guard_mutex is
unlocked, at least in the RT-linux variant, while the spin lock is not yet
unlocked.

>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>>  	 */
>>>  	trace_sched_prepare_exec(current, bprm);
>>>
>>> +	/* If the binary is not readable then enforce mm->dumpable=0 */
>>> +	would_dump(bprm, bprm->file);
>>> +	if (bprm->have_execfd)
>>> +		would_dump(bprm, bprm->executable);
>>> +
>>> +	/*
>>> +	 * Figure out dumpability. Note that this checking only of current
>>> +	 * is wrong, but userspace depends on it. This should be testing
>>> +	 * bprm->secureexec instead.
>>> +	 */
>>> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
>>> +	    !(uid_eq(current_euid(), current_uid()) &&
>>> +	      gid_eq(current_egid(), current_gid())))
>>> +		set_dumpable(bprm->mm, suid_dumpable);
>>> +	else
>>> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
>>> +
>>
>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>> But imo this too should be done in a separate patch, the changelog should
>> explain this change.
>>

The dumpability need to be determined before de_thread, because ptrace_may_access
needs this information to determine if the tracer is allowed to ptrace. That is
part of the core of the patch, it would not work without that.
I will add more comments to make that more easy to understand. 

>>> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>>>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>>  		return -ERESTARTNOINTR;
>>>
>>> +	if (unlikely(current->signal->exec_bprm)) {
>>> +		mutex_unlock(&current->signal->cred_guard_mutex);
>>> +		return -ERESTARTNOINTR;
>>> +	}
>>
>> OK, if signal->exec_bprm != NULL, then current is already killed. But
>> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
>> something like
>>
>> 	int lock_current_cgm(void)
>> 	{
>> 		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> 			return -ERESTARTNOINTR;
>>
>> 		if (!current->signal->group_exec_task)
>> 			return 0;
>>
>> 		WARN_ON(!fatal_signal_pending(current));
>> 		mutex_unlock(&current->signal->cred_guard_mutex);
>> 		return -ERESTARTNOINTR;
>> 	}
>>
>> ?
>>

Some use mutex_lock_interruptible and some use mutex_lock_killable here,
so it wont work for all of them.  I would not consider this a new kind
of dead-lock free mutex, but just an open-coded state machine, handling
the state that the tasks have whild de_thread is running.

>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>> come in a separate patch too, but I won't insist.
>>
>>> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>>>  				return retval;
>>>  		}
>>>
>>> +		if (unlikely(task == task->signal->group_exec_task)) {
>>> +			retval = down_write_killable(&task->signal->exec_update_lock);
>>> +			if (retval)
>>> +				return retval;
>>> +
>>> +			scoped_guard (task_lock, task) {
>>> +				struct linux_binprm *bprm = task->signal->exec_bprm;
>>> +				const struct cred __rcu *old_cred = task->real_cred;
>>> +				struct mm_struct *old_mm = task->mm;
>>> +
>>> +				rcu_assign_pointer(task->real_cred, bprm->cred);
>>> +				task->mm = bprm->mm;
>>> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>>> +				rcu_assign_pointer(task->real_cred, old_cred);
>>> +				task->mm = old_mm;
>>> +			}
>>
>> This is the most problematic change which I can't review...
>>
>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>> looks dangerous to me.
> 
> Yeah, that is not ok. This is effectively override_creds for real_cred
> and that is not a pattern I want to see us establish at all! Temporary
> credential overrides for the subjective credentials is already terrible
> but at least we have the explicit split between real_cred and cred
> expressely for that. So no, that's not an acceptable solution.
> 

Okay I understand your point.
I did this originally just to avoid to have to change the interface to all
the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
the ptrace_may_access which must be handled in all security engines, to use
child->signal->exec_bprm->creds instead of __task_cred(child).

>>
>> Say, current_is_single_threaded() called by another CLONE_VM process can
>> miss group_exec_task and falsely return true. Probably not that bad, in
>> this case old_mm should go away soon, but still...
>>

Oh, that's nice, I was not aware of that one.
Access to current are not a problem, since the task is trapped in de_thread,
however by code review I found also other places where task credentials are
checked and then used without holdning any lock, e.g. in rdtgroup_task_write_permission
and in quite similar code in __cgroup1_procs_write, I dont know if that is a
security problem.

>> And I don't know if this can fool the users of task_cred_xxx/__task_cred
>> somehow.
>>
>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>> ptrace the execing task after that? I have no idea what the security hooks
>> can do...

That means the tracee is already ptraced before the execve, and SUID-bits
do not work as usual, and are more or less ignored.  But in this patch
the tracee is not yet ptraced.  Only some sibling threads.  But they will
either be zapped and go away or the tracer wants to attach to the main thread,
and in that case the tracer is only able to ptrace the main thread if he has
also access permissions for the credentials that are in effect after the execve
completes.

>>
>> Again, can't review this part.
>>
>> Oleg.
>>

Thanks
Bernd.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
  2025-11-17  6:31                   ` Bernd Edlinger
@ 2025-11-17 15:01                     ` Oleg Nesterov
  0 siblings, 0 replies; 39+ messages in thread
From: Oleg Nesterov @ 2025-11-17 15:01 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Christian Brauner, Alexander Viro, Alexey Dobriyan, Kees Cook,
	Andy Lutomirski, Will Drewry, Andrew Morton, Michal Hocko,
	Serge Hallyn, James Morris, Randy Dunlap, Suren Baghdasaryan,
	Yafang Shao, Helge Deller, Eric W. Biederman, Adrian Reber,
	Thomas Gleixner, Jens Axboe, Alexei Starovoitov,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest, linux-mm, linux-security-module, tiozhang,
	Luis Chamberlain, Paulo Alcantara (SUSE), Sergey Senozhatsky,
	Frederic Weisbecker, YueHaibing, Paul Moore, Aleksa Sarai,
	Stefan Roesch, Chao Yu, xu xin, Jeff Layton, Jan Kara,
	David Hildenbrand, Dave Chinner, Shuah Khan, Elena Reshetova,
	David Windsor, Mateusz Guzik, Ard Biesheuvel,
	Joel Fernandes (Google), Matthew Wilcox (Oracle),
	Hans Liljestrand, Penglei Jiang, Lorenzo Stoakes, Adrian Ratiu,
	Ingo Molnar, Peter Zijlstra (Intel), Cyrill Gorcunov,
	Eric Dumazet

On 11/17, Bernd Edlinger wrote:
>
> On 11/11/25 10:21, Christian Brauner wrote:
> > On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>
> >> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
> >> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
> >> unconditionally.
> >>
>
> I would not like to drop the mutex when no absolutely necessary for performance reasons.

OK, I won't insist... But I don't really understand how this can help to
improve the performance. If nothing else, this adds another for_other_threads()
loop.

And again, the unsafe_execve_in_progress == T case is unlikely. I'm afraid this
case (de_thread() without cred_guard_mutex) won't have enough testing.

In any case, why you dislike the suggestion to add this unsafe_execve_in_progress
logic in a separate patch?

> >>> +	if (unlikely(unsafe_execve_in_progress)) {
> >>> +		spin_unlock_irq(lock);
> >>> +		sig->exec_bprm = bprm;
> >>> +		mutex_unlock(&sig->cred_guard_mutex);
> >>> +		spin_lock_irq(lock);
> >>
> >> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
> >>
>
> Since the spin lock was acquired while holding the mutex, both should be
> unlocked in reverse sequence and the spin lock re-acquired after releasing
> the mutex.

Why?

> I'd expect the scheduler to do a task switch after the cred_guard_mutex is
> unlocked, at least in the RT-linux variant, while the spin lock is not yet
> unlocked.

I must have missed something, but I still don't understand why this would
be wrong...

> >>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> >>>  	 */
> >>>  	trace_sched_prepare_exec(current, bprm);
> >>>
> >>> +	/* If the binary is not readable then enforce mm->dumpable=0 */
> >>> +	would_dump(bprm, bprm->file);
> >>> +	if (bprm->have_execfd)
> >>> +		would_dump(bprm, bprm->executable);
> >>> +
> >>> +	/*
> >>> +	 * Figure out dumpability. Note that this checking only of current
> >>> +	 * is wrong, but userspace depends on it. This should be testing
> >>> +	 * bprm->secureexec instead.
> >>> +	 */
> >>> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> >>> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
> >>> +	    !(uid_eq(current_euid(), current_uid()) &&
> >>> +	      gid_eq(current_egid(), current_gid())))
> >>> +		set_dumpable(bprm->mm, suid_dumpable);
> >>> +	else
> >>> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
> >>> +
> >>
> >> OK, we need to do this before de_thread() drops cred_guard_mutex.
> >> But imo this too should be done in a separate patch, the changelog should
> >> explain this change.
> >>
>
> The dumpability need to be determined before de_thread, because ptrace_may_access
> needs this information to determine if the tracer is allowed to ptrace. That is
> part of the core of the patch, it would not work without that.

Yes,

> I will add more comments to make that more easy to understand.

But again, why this change can't come in a separate patch? Before the patch which
drops cred_guard_mutex in de_thread().

> >> 	int lock_current_cgm(void)
> >> 	{
> >> 		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> >> 			return -ERESTARTNOINTR;
> >>
> >> 		if (!current->signal->group_exec_task)
> >> 			return 0;
> >>
> >> 		WARN_ON(!fatal_signal_pending(current));
> >> 		mutex_unlock(&current->signal->cred_guard_mutex);
> >> 		return -ERESTARTNOINTR;
> >> 	}
> >>
> >> ?
> >>
>
> Some use mutex_lock_interruptible and some use mutex_lock_killable here,
> so it wont work for all of them.  I would not consider this a new kind
> of dead-lock free mutex, but just an open-coded state machine, handling
> the state that the tasks have whild de_thread is running.

OK. and we don't have mutex_lock_state(). I think that all users could
use mutex_lock_killable(), but you are right anyway, and this is minor.

> >> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
> >> come in a separate patch too, but I won't insist.

Yes. Although this is minor too ;)

> >> This is the most problematic change which I can't review...
> >>
> >> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> >> looks dangerous to me.
> >
> > Yeah, that is not ok. This is effectively override_creds for real_cred
> > and that is not a pattern I want to see us establish at all! Temporary
> > credential overrides for the subjective credentials is already terrible
> > but at least we have the explicit split between real_cred and cred
> > expressely for that. So no, that's not an acceptable solution.
> >
>
> Okay I understand your point.
> I did this originally just to avoid to have to change the interface to all
> the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
> the ptrace_may_access which must be handled in all security engines, to use
> child->signal->exec_bprm->creds instead of __task_cred(child).

Can't comment... I don't understand your idea, but this is my fault. I guess
this needs more changes, in particular __ptrace_may_access_mm_cred(), but
most probably I misunderstood your idea.

>
> >> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
> >> ptrace the execing task after that? I have no idea what the security hooks
> >> can do...
>
> That means the tracee is already ptraced before the execve, and SUID-bits
> do not work as usual, and are more or less ignored.  But in this patch
> the tracee is not yet ptraced.

Well. I meant that if LSM_UNSAFE_PTRACE is not set, then currently (say)
security_bprm_committing_creds() has all rights to assume that the execing
task is not ptraced. Yes, I don't see any potential problem right now, but
still.

And just in case... Lets look at this code

	+                               rcu_assign_pointer(task->real_cred, bprm->cred);
	+                               task->mm = bprm->mm;
	+                               retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
	+                               rcu_assign_pointer(task->real_cred, old_cred);
	+                               task->mm = old_mm;

again.

This is mostly theoretical, but what if begin_new_exec() fails after de_thread()
and before exec_mmap() and/or commit_creds(bprm->cred) ? In this case the execing
thread will report SIGSEGV to debugger which can (say) read old_mm.

No?

I am starting to think that ptrace_attach() should simply fail with -EWOULDBLOCK
if it detects "unsafe_execve_in_progress" ... And perhaps this is what you already
tried to do in the past, I can't recall :/

Oleg.



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-11-17 15:02 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <AM8PR10MB470801D01A0CF24BC32C25E7E40E9@AM8PR10MB4708.EURPRD10.PROD.OUTLOOK.COM>
     [not found] ` <AM8PR10MB470875B22B4C08BEAEC3F77FE4169@AM8PR10MB4708.EURPRD10.PROD.OUTLOOK.COM>
2023-10-30  5:20   ` [PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach Bernd Edlinger
2023-10-30  9:00     ` kernel test robot
     [not found]     ` <AS8P193MB12851AC1F862B97FCE9B3F4FE4AAA@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
2024-01-15 19:22       ` [PATCH v14] " Bernd Edlinger
2024-01-15 19:37         ` Matthew Wilcox
2024-01-17  9:51           ` Bernd Edlinger
2024-01-16 15:22         ` Oleg Nesterov
2024-01-17 15:07           ` Bernd Edlinger
2024-01-17 16:38             ` Oleg Nesterov
2024-01-22 13:24               ` Bernd Edlinger
2024-01-22 13:44                 ` Oleg Nesterov
2024-01-22 21:30                 ` Kees Cook
2024-01-23 18:30                   ` Bernd Edlinger
2024-01-24  0:09                     ` Kees Cook
     [not found]         ` <AS8P193MB1285937F9831CECAF2A9EEE2E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM>
2025-08-18  6:04           ` [PATCH v15] " Jain, Ayush
2025-08-18 20:53           ` [PATCH v16] " Bernd Edlinger
2025-08-19  4:36             ` Kees Cook
2025-08-19 18:53               ` Bernd Edlinger
2025-08-21 17:34             ` [PATCH v17] " Bernd Edlinger
2025-10-27  6:26               ` Bernd Edlinger
2025-10-27 12:06               ` Peter Zijlstra
2025-11-02 16:17               ` Oleg Nesterov
2025-11-05 14:32               ` Oleg Nesterov
2025-11-11  9:21                 ` Christian Brauner
2025-11-11 11:07                   ` Bernd Edlinger
2025-11-11 13:12                     ` Oleg Nesterov
2025-11-11 13:45                       ` Bernd Edlinger
2025-11-12  9:52                         ` Oleg Nesterov
2025-11-17  6:31                   ` Bernd Edlinger
2025-11-17 15:01                     ` Oleg Nesterov
2025-11-09 17:14               ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Oleg Nesterov
2025-11-09 17:14                 ` [RFC PATCH 1/3] exec: make setup_new_exec() return int Oleg Nesterov
2025-11-09 17:15                 ` [RFC PATCH 2/3] exec: don't wait for zombie threads with cred_guard_mutex held Oleg Nesterov
2025-11-10 10:58                   ` Cyrill Gorcunov
2025-11-10 15:09                     ` Oleg Nesterov
2025-11-10 21:49                       ` Cyrill Gorcunov
2025-11-11 14:09                         ` Oleg Nesterov
2025-11-09 17:16                 ` [RFC PATCH 3/3] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec Oleg Nesterov
2025-11-10  5:28                 ` [RFC PATCH 0/3] mt-exec: fix deadlock with ptrace_attach() Bernd Edlinger
2025-11-10 14:47                   ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).