[PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit
@ 2025-03-20  9:29 Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Christian Brauner @ 2025-03-20  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Oleg,

Don't kill me but this is another attempt at trying to make pidfd
polling for multi-threaded exec and premature thread-group leader exit
consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
premature as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

This tiny series tries to address this problem. If that works correctly
then no exit notifications are generated for a PIDFD_THREAD pidfd for a
thread-group leader until all subthreads have been reaped. If a
subthread should exec before no exit notification will be generated
until that task exits or it creates subthreads and repeates the cycle.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v3:
- Simplify the implementation based on Oleg's suggestion.
- Link to v2: https://lore.kernel.org/r/20250318-work-pidfs-thread_group-v2-0-2677898ffa2e@kernel.org

Changes in v2:
- This simplifies the implementation of pidfs_poll() a bit since the
  check for multi-threaded exec I added is wrong and actually not even
  needed afaict because pid->delayed_leader cleanly handles both
  multi-threaded exec and premature thread-group leader exec not
  followed by a subthread exec.
- Link to v1: https://lore.kernel.org/r/20250317-work-pidfs-thread_group-v1-0-bc9e5ed283e9@kernel.org

---
Christian Brauner (4):
      pidfs: improve multi-threaded exec and premature thread-group leader exit polling
      selftests/pidfd: first test for multi-threaded exec polling
      selftests/pidfd: second test for multi-threaded exec polling
      selftests/pidfd: third test for multi-threaded exec polling

 fs/pidfs.c                                      |  22 ++-
 kernel/exit.c                                   |  12 +-
 kernel/signal.c                                 |   6 +-
 tools/testing/selftests/pidfd/pidfd_info_test.c | 237 +++++++++++++++++++++---
 4 files changed, 250 insertions(+), 27 deletions(-)
---
base-commit: 68db272741834d5e528b5e1af6b83b479fec6cbb
change-id: 20250317-work-pidfs-thread_group-141682f9a50a

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20  9:29 [PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
@ 2025-03-20  9:29 ` Christian Brauner
  2025-03-20 10:57   ` Oleg Nesterov
  2025-03-20  9:29 ` [PATCH v3 2/4] selftests/pidfd: first test for multi-threaded exec polling Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-03-20  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec aftewards no exit notification
will be generated until that task exits or it creates subthreads and
repeates the cycle.

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c      | 22 +++++++++++++++++++++-
 kernel/exit.c   | 12 +++++++++---
 kernel/signal.c |  6 ++++--
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index a48cc44ced6f..f1c49a7540f3 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 	/*
 	 * Depending on PIDFD_THREAD, inform pollers when the thread
 	 * or the whole thread-group exits.
+	 *
+	 * There are two corner cases to consider:
+	 *
+	 * (1) If a thread-group leader of a thread-group with
+	 *     subthreads exits prematurely, i.e., before all of the
+	 *     subthreads of the thread-group have exited then no
+	 *     notification will be generated for PIDFD_THREAD pidfds
+	 *     referring to the thread-group leader.
+	 *
+	 *     The exit notification for the thread-group leader will be
+	 *     delayed until the last subthread of the thread-group
+	 *     exits.
+	 *
+	 * (2) If a subthread of a thread-group execs then the
+	 *     current thread-group leader will be SIGKILLed and the
+	 *     subthread will assume the struct pid of the now defunct
+	 *     old thread-group leader. No exit notification will be
+	 *     generated for PIDFD_THREAD pidfds referring to the old
+	 *     thread-group leader as they continue referring to the new
+	 *     thread-group leader.
 	 */
 	guard(rcu)();
 	task = pid_task(pid, PIDTYPE_PID);
 	if (!task)
 		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
+	else if (task->exit_state && !delay_group_leader(task))
 		poll_flags = EPOLLIN | EPOLLRDNORM;
 
 	return poll_flags;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9916305e34d3..ce5cdad5ba9c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -271,6 +271,9 @@ void release_task(struct task_struct *p)
 		 * If we were the last child thread and the leader has
 		 * exited already, and the leader's parent ignores SIGCHLD,
 		 * then we are the one who should release the leader.
+		 *
+		 * This will also wake PIDFD_THREAD pidfds for the
+		 * thread-group leader that already exited.
 		 */
 		zap_leader = do_notify_parent(leader, leader->exit_signal);
 		if (zap_leader)
@@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 
 	tsk->exit_state = EXIT_ZOMBIE;
 	/*
-	 * sub-thread or delay_group_leader(), wake up the
-	 * PIDFD_THREAD waiters.
+	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
+	 * exit. If this is a premature thread-group leader exit delay
+	 * the notification until the last subthread exits. If a
+	 * subthread should exec before then no notification will be
+	 * generated.
 	 */
-	if (!thread_group_empty(tsk))
+	if (!delay_group_leader(tsk))
 		do_notify_pidfd(tsk);
 
 	if (unlikely(tsk->ptrace)) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 081f19a24506..0ccef8783dff 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	WARN_ON_ONCE(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 	/*
-	 * tsk is a group leader and has no threads, wake up the
-	 * non-PIDFD_THREAD waiters.
+	 * This is a thread-group leader without subthreads so wake up
+	 * the non-PIDFD_THREAD waiters. This also wakes the
+	 * PIDFD_THREAD waiters for the thread-group leader in case it
+	 * exited prematurely from release_task().
 	 */
 	if (thread_group_empty(tsk))
 		do_notify_pidfd(tsk);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 2/4] selftests/pidfd: first test for multi-threaded exec polling
  2025-03-20  9:29 [PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
@ 2025-03-20  9:29 ` Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 3/4] selftests/pidfd: second " Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 4/4] selftests/pidfd: third " Christian Brauner
  3 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-03-20  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add first test for premature thread-group leader exit.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 38 ++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 09bc4ae7aed5..28a28ae4686a 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -236,7 +236,7 @@ static void *pidfd_info_pause_thread(void *arg)
 
 TEST_F(pidfd_info, thread_group)
 {
-	pid_t pid_leader, pid_thread;
+	pid_t pid_leader, pid_poller, pid_thread;
 	pthread_t thread;
 	int nevents, pidfd_leader, pidfd_thread, pidfd_leader_thread, ret;
 	int ipc_sockets[2];
@@ -262,6 +262,35 @@ TEST_F(pidfd_info, thread_group)
 		syscall(__NR_exit, EXIT_SUCCESS);
 	}
 
+	/*
+	 * Opening a PIDFD_THREAD aka thread-specific pidfd based on a
+	 * thread-group leader must succeed.
+	 */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	pid_poller = fork();
+	ASSERT_GE(pid_poller, 0);
+	if (pid_poller == 0) {
+		/*
+		 * We can't poll and wait for the old thread-group
+		 * leader to exit using a thread-specific pidfd. The
+		 * thread-group leader exited prematurely and
+		 * notification is delayed until all subthreads have
+		 * exited.
+		 */
+		fds.events = POLLIN;
+		fds.fd = pidfd_leader_thread;
+		nevents = poll(&fds, 1, 10000 /* wait 5 seconds */);
+		if (nevents != 0)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLIN)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLHUP)
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
 	/* Retrieve the tid of the thread. */
 	EXPECT_EQ(close(ipc_sockets[1]), 0);
 	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
@@ -275,12 +304,7 @@ TEST_F(pidfd_info, thread_group)
 	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
 	ASSERT_GE(pidfd_thread, 0);
 
-	/*
-	 * Opening a PIDFD_THREAD aka thread-specific pidfd based on a
-	 * thread-group leader must succeed.
-	 */
-	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
-	ASSERT_GE(pidfd_leader_thread, 0);
+	ASSERT_EQ(wait_for_pid(pid_poller), 0);
 
 	/*
 	 * Note that pidfd_leader is a thread-group pidfd, so polling on it

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 3/4] selftests/pidfd: second test for multi-threaded exec polling
  2025-03-20  9:29 [PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 2/4] selftests/pidfd: first test for multi-threaded exec polling Christian Brauner
@ 2025-03-20  9:29 ` Christian Brauner
  2025-03-20  9:29 ` [PATCH v3 4/4] selftests/pidfd: third " Christian Brauner
  3 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-03-20  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Ensure that during a multi-threaded exec and premature thread-group
leader exit no exit notification is generated.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 72 ++++++++++++++++---------
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 28a28ae4686a..4169780c9e55 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -413,7 +413,7 @@ static void *pidfd_info_thread_exec(void *arg)
 
 TEST_F(pidfd_info, thread_group_exec)
 {
-	pid_t pid_leader, pid_thread;
+	pid_t pid_leader, pid_poller, pid_thread;
 	pthread_t thread;
 	int nevents, pidfd_leader, pidfd_leader_thread, pidfd_thread, ret;
 	int ipc_sockets[2];
@@ -439,6 +439,37 @@ TEST_F(pidfd_info, thread_group_exec)
 		syscall(__NR_exit, EXIT_SUCCESS);
 	}
 
+	/* Open a thread-specific pidfd for the thread-group leader. */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	pid_poller = fork();
+	ASSERT_GE(pid_poller, 0);
+	if (pid_poller == 0) {
+		/*
+		 * We can't poll and wait for the old thread-group
+		 * leader to exit using a thread-specific pidfd. The
+		 * thread-group leader exited prematurely and
+		 * notification is delayed until all subthreads have
+		 * exited.
+		 *
+		 * When the thread has execed it will taken over the old
+		 * thread-group leaders struct pid. Calling poll after
+		 * the thread execed will thus block again because a new
+		 * thread-group has started.
+		 */
+		fds.events = POLLIN;
+		fds.fd = pidfd_leader_thread;
+		nevents = poll(&fds, 1, 10000 /* wait 5 seconds */);
+		if (nevents != 0)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLIN)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLHUP)
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
 	/* Retrieve the tid of the thread. */
 	EXPECT_EQ(close(ipc_sockets[1]), 0);
 	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
@@ -447,33 +478,12 @@ TEST_F(pidfd_info, thread_group_exec)
 	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
 	ASSERT_GE(pidfd_thread, 0);
 
-	/* Open a thread-specific pidfd for the thread-group leader. */
-	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
-	ASSERT_GE(pidfd_leader_thread, 0);
-
-	/*
-	 * We can poll and wait for the old thread-group leader to exit
-	 * using a thread-specific pidfd.
-	 *
-	 * This only works until the thread has execed. When the thread
-	 * has execed it will have taken over the old thread-group
-	 * leaders struct pid. Calling poll after the thread execed will
-	 * thus block again because a new thread-group has started (Yes,
-	 * it's fscked.).
-	 */
-	fds.events = POLLIN;
-	fds.fd = pidfd_leader_thread;
-	nevents = poll(&fds, 1, -1);
-	ASSERT_EQ(nevents, 1);
-	/* The thread-group leader has exited. */
-	ASSERT_TRUE(!!(fds.revents & POLLIN));
-	/* The thread-group leader hasn't been reaped. */
-	ASSERT_FALSE(!!(fds.revents & POLLHUP));
-
 	/* Now that we've opened a thread-specific pidfd the thread can exec. */
 	ASSERT_EQ(write_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
 	EXPECT_EQ(close(ipc_sockets[0]), 0);
 
+	ASSERT_EQ(wait_for_pid(pid_poller), 0);
+
 	/* Wait until the kernel has SIGKILLed the thread. */
 	fds.events = POLLHUP;
 	fds.fd = pidfd_thread;
@@ -506,6 +516,20 @@ TEST_F(pidfd_info, thread_group_exec)
 
 	/* Take down the thread-group leader. */
 	EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
+
+	/*
+	 * Afte the exec we're dealing with an empty thread-group so now
+	 * we must see an exit notification on the thread-specific pidfd
+	 * for the thread-group leader as there's no subthread that can
+	 * revive the struct pid.
+	 */
+	fds.events = POLLIN;
+	fds.fd = pidfd_leader_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	ASSERT_TRUE(!!(fds.revents & POLLIN));
+	ASSERT_FALSE(!!(fds.revents & POLLHUP));
+
 	EXPECT_EQ(sys_waitid(P_PIDFD, pidfd_leader, NULL, WEXITED), 0);
 
 	/* Retrieve exit information for the thread-group leader. */

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 4/4] selftests/pidfd: third test for multi-threaded exec polling
  2025-03-20  9:29 [PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
                   ` (2 preceding siblings ...)
  2025-03-20  9:29 ` [PATCH v3 3/4] selftests/pidfd: second " Christian Brauner
@ 2025-03-20  9:29 ` Christian Brauner
  3 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-03-20  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Ensure that during a multi-threaded exec and premature thread-group
leader exit no exit notification is generated.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 147 ++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 4169780c9e55..1758a1b0457b 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -542,4 +542,151 @@ TEST_F(pidfd_info, thread_group_exec)
 	EXPECT_EQ(close(pidfd_thread), 0);
 }
 
+static void *pidfd_info_thread_exec_sane(void *arg)
+{
+	pid_t pid_thread = gettid();
+	int ipc_socket = *(int *)arg;
+
+	/* Inform the grand-parent what the tid of this thread is. */
+	if (write_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	if (read_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	close(ipc_socket);
+
+	sys_execveat(AT_FDCWD, "pidfd_exec_helper", NULL, NULL, 0);
+	return NULL;
+}
+
+TEST_F(pidfd_info, thread_group_exec_thread)
+{
+	pid_t pid_leader, pid_poller, pid_thread;
+	pthread_t thread;
+	int nevents, pidfd_leader, pidfd_leader_thread, pidfd_thread, ret;
+	int ipc_sockets[2];
+	struct pollfd fds = {};
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
+	};
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	pid_leader = create_child(&pidfd_leader, 0);
+	EXPECT_GE(pid_leader, 0);
+
+	if (pid_leader == 0) {
+		close(ipc_sockets[0]);
+
+		/* The thread will outlive the thread-group leader. */
+		if (pthread_create(&thread, NULL, pidfd_info_thread_exec_sane, &ipc_sockets[1]))
+			syscall(__NR_exit, EXIT_FAILURE);
+
+		/*
+		 * Pause the thread-group leader. It will be killed once
+		 * the subthread execs.
+		 */
+		pause();
+		syscall(__NR_exit, EXIT_SUCCESS);
+	}
+
+	/* Retrieve the tid of the thread. */
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+
+	/* Opening a thread as a PIDFD_THREAD must succeed. */
+	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
+	ASSERT_GE(pidfd_thread, 0);
+
+	/* Open a thread-specific pidfd for the thread-group leader. */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	pid_poller = fork();
+	ASSERT_GE(pid_poller, 0);
+	if (pid_poller == 0) {
+		/*
+		 * The subthread will now exec. The struct pid of the old
+		 * thread-group leader will be assumed by the subthread which
+		 * becomes the new thread-group leader. So no exit notification
+		 * must be generated. Wait for 5 seconds and call it a success
+		 * if no notification has been received.
+		 */
+		fds.events = POLLIN;
+		fds.fd = pidfd_leader_thread;
+		nevents = poll(&fds, 1, 10000 /* wait 5 seconds */);
+		if (nevents != 0)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLIN)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLHUP)
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	/* Now that we've opened a thread-specific pidfd the thread can exec. */
+	ASSERT_EQ(write_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+	ASSERT_EQ(wait_for_pid(pid_poller), 0);
+
+	/* Wait until the kernel has SIGKILLed the thread. */
+	fds.events = POLLHUP;
+	fds.fd = pidfd_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	/* The thread has been reaped. */
+	ASSERT_TRUE(!!(fds.revents & POLLHUP));
+
+	/* Retrieve thread-specific exit info from pidfd. */
+	ASSERT_EQ(ioctl(pidfd_thread, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	/*
+	 * While the kernel will have SIGKILLed the whole thread-group
+	 * during exec it will cause the individual threads to exit
+	 * cleanly.
+	 */
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+
+	/*
+	 * The thread-group leader is still alive, the thread has taken
+	 * over its struct pid and thus its pid number.
+	 */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_EQ(info.pid, pid_leader);
+
+	/* Take down the thread-group leader. */
+	EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
+
+	/*
+	 * Afte the exec we're dealing with an empty thread-group so now
+	 * we must see an exit notification on the thread-specific pidfd
+	 * for the thread-group leader as there's no subthread that can
+	 * revive the struct pid.
+	 */
+	fds.events = POLLIN;
+	fds.fd = pidfd_leader_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	ASSERT_TRUE(!!(fds.revents & POLLIN));
+	ASSERT_FALSE(!!(fds.revents & POLLHUP));
+
+	EXPECT_EQ(sys_waitid(P_PIDFD, pidfd_leader, NULL, WEXITED), 0);
+
+	/* Retrieve exit information for the thread-group leader. */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+
+	EXPECT_EQ(close(pidfd_leader), 0);
+	EXPECT_EQ(close(pidfd_thread), 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20  9:29 ` [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
@ 2025-03-20 10:57   ` Oleg Nesterov
  2025-03-20 11:21     ` Oleg Nesterov
  2025-03-20 12:36     ` Christian Brauner
  0 siblings, 2 replies; 11+ messages in thread
From: Oleg Nesterov @ 2025-03-20 10:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

Christian,

All the comments look misleading (and overcomplicated) to me.

See below, but first lets recall the commit 64bef697d33b75fc06c5789
("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says

    pidfd: implement PIDFD_THREAD flag for pidfd_open()

    With this flag:

            ....

            - pidfd_poll() succeeds when the task exits and becomes a
              zombie (iow, passes exit_notify()), even if it is a leader
              and thread-group is not empty.

This patch simply reverts this behaviour, the exiting leader will not
report the exit if it has sub-threads (alive or not). And afaics your
V1 tried to do the same. And this eliminates the

              This means that the behaviour of pidfd_poll(PIDFD_THREAD,
              pid-of-group-leader) is not well defined if it races with
              exec() from its sub-thread; ...

problem mentioned in the changelog. That is all.

IOW, with this change PIDFD_THREAD has no effect.

Except the pid_has_task() checks in sys_pidfd_open() paths, without
PIDFD_THREAD the target task must be a group leader.

On 03/20, Christian Brauner wrote:
>
> @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)

Your forgot to remove the no longer used

	bool thread = file->f_flags & PIDFD_THREAD;

above ;)

>  	/*
>  	 * Depending on PIDFD_THREAD, inform pollers when the thread
>  	 * or the whole thread-group exits.

See above (and below), this no longer depends on PIDFD_THREAD.

> +	else if (task->exit_state && !delay_group_leader(task))
>  		poll_flags = EPOLLIN | EPOLLRDNORM;

So with this change:

If the exiting task is a sub-thread, report EPOLLIN as before.
delay_group_leader() can't be true. In this case PIDFD_THREAD
must be set.

If the exiting task is a leader, we do not care about PIDFD_THREAD.
We report EPOLLIN only if it is the last/only thread.

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9916305e34d3..ce5cdad5ba9c 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -271,6 +271,9 @@ void release_task(struct task_struct *p)
>  		 * If we were the last child thread and the leader has
>  		 * exited already, and the leader's parent ignores SIGCHLD,
>  		 * then we are the one who should release the leader.
> +		 *
> +		 * This will also wake PIDFD_THREAD pidfds for the
> +		 * thread-group leader that already exited.
>  		 */
>  		zap_leader = do_notify_parent(leader, leader->exit_signal);

Again, this doesn't depend on PIDFD_THREAD.

> @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>
>  	tsk->exit_state = EXIT_ZOMBIE;
>  	/*
> -	 * sub-thread or delay_group_leader(), wake up the
> -	 * PIDFD_THREAD waiters.
> +	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
> +	 * exit. If this is a premature thread-group leader exit delay
> +	 * the notification until the last subthread exits. If a
> +	 * subthread should exec before then no notification will be
> +	 * generated.
>  	 */
> -	if (!thread_group_empty(tsk))
> +	if (!delay_group_leader(tsk))
>  		do_notify_pidfd(tsk);

The same...

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  	WARN_ON_ONCE(!tsk->ptrace &&
>  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
>  	/*
> -	 * tsk is a group leader and has no threads, wake up the
> -	 * non-PIDFD_THREAD waiters.
> +	 * This is a thread-group leader without subthreads so wake up
> +	 * the non-PIDFD_THREAD waiters. This also wakes the
> +	 * PIDFD_THREAD waiters for the thread-group leader in case it
> +	 * exited prematurely from release_task().
>  	 */

This too.

Oleg.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 10:57   ` Oleg Nesterov
@ 2025-03-20 11:21     ` Oleg Nesterov
  2025-03-20 13:16       ` Christian Brauner
  2025-03-20 12:36     ` Christian Brauner
  1 sibling, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2025-03-20 11:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

And just for the record...

Consider the simplest case: a single-threaded not-ptraced process
exits. In this case do_notify_pidfd() will be called twice, from
exit_notify() and right after that from do_notify_parent().

We can cleanup this logic, but I don't think this is important and
this needs a separate patch.

(With or without this change: if the exiting task is ptraced or its
 parent exits without wait(), do_notify_pidfd() will be called even
 more times, but I think we do not care at all).

Oleg.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 10:57   ` Oleg Nesterov
  2025-03-20 11:21     ` Oleg Nesterov
@ 2025-03-20 12:36     ` Christian Brauner
  2025-03-20 14:02       ` Oleg Nesterov
  1 sibling, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-03-20 12:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Thu, Mar 20, 2025 at 11:57:02AM +0100, Oleg Nesterov wrote:
> Christian,

Oleg!

> 
> All the comments look misleading (and overcomplicated) to me.
> 
> See below, but first lets recall the commit 64bef697d33b75fc06c5789
> ("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says
> 
>     pidfd: implement PIDFD_THREAD flag for pidfd_open()
> 
>     With this flag:
> 
>             ....
> 
>             - pidfd_poll() succeeds when the task exits and becomes a
>               zombie (iow, passes exit_notify()), even if it is a leader
>               and thread-group is not empty.
> 
> This patch simply reverts this behaviour, the exiting leader will not
> report the exit if it has sub-threads (alive or not). And afaics your
> V1 tried to do the same. And this eliminates the
> 
>               This means that the behaviour of pidfd_poll(PIDFD_THREAD,
>               pid-of-group-leader) is not well defined if it races with
>               exec() from its sub-thread; ...
> 
> problem mentioned in the changelog. That is all.
> 
> IOW, with this change PIDFD_THREAD has no effect.

But that's what I'm trying to say: This patch aligns the behavior of
thread-specific and non-thread-specific thread-group leader pidfds. IOW,
the behavior of:

pidfd_open(<thread-group-leader-pid>, 0)
pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)

is the same wrt to polling after this patch. That's also what the
selftests are designed to test and show.

> 
> Except the pid_has_task() checks in sys_pidfd_open() paths, without
> PIDFD_THREAD the target task must be a group leader.
> 
> On 03/20, Christian Brauner wrote:
> >
> > @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> 
> Your forgot to remove the no longer used
> 
> 	bool thread = file->f_flags & PIDFD_THREAD;
> 
> above ;)

Thanks, fixed.

> 
> >  	/*
> >  	 * Depending on PIDFD_THREAD, inform pollers when the thread
> >  	 * or the whole thread-group exits.
> 
> See above (and below), this no longer depends on PIDFD_THREAD.
> 
> > +	else if (task->exit_state && !delay_group_leader(task))
> >  		poll_flags = EPOLLIN | EPOLLRDNORM;
> 
> So with this change:
> 
> If the exiting task is a sub-thread, report EPOLLIN as before.
> delay_group_leader() can't be true. In this case PIDFD_THREAD
> must be set.
> 
> If the exiting task is a leader, we do not care about PIDFD_THREAD.
> We report EPOLLIN only if it is the last/only thread.
> 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 9916305e34d3..ce5cdad5ba9c 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -271,6 +271,9 @@ void release_task(struct task_struct *p)
> >  		 * If we were the last child thread and the leader has
> >  		 * exited already, and the leader's parent ignores SIGCHLD,
> >  		 * then we are the one who should release the leader.
> > +		 *
> > +		 * This will also wake PIDFD_THREAD pidfds for the
> > +		 * thread-group leader that already exited.
> >  		 */
> >  		zap_leader = do_notify_parent(leader, leader->exit_signal);
> 
> Again, this doesn't depend on PIDFD_THREAD.

The comment is literally just saying that we delayed notification of
PIDFD_THREAD pidfds for a thread-group leader until now. After all its
subthreads are released instead of when the thread-group leader did
actually exit earlier.

> 
> > @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >
> >  	tsk->exit_state = EXIT_ZOMBIE;
> >  	/*
> > -	 * sub-thread or delay_group_leader(), wake up the
> > -	 * PIDFD_THREAD waiters.
> > +	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
> > +	 * exit. If this is a premature thread-group leader exit delay
> > +	 * the notification until the last subthread exits. If a
> > +	 * subthread should exec before then no notification will be
> > +	 * generated.
> >  	 */
> > -	if (!thread_group_empty(tsk))
> > +	if (!delay_group_leader(tsk))
> >  		do_notify_pidfd(tsk);
> 
> The same...

What you seem to be saying is that you want all references to
PIDFD_THREAD to be dropped in the comments because the behavior is now
identical. But what I would like to have is comments in the code that
illustrate where and how we guarantee this behavioral equivalency.

Because where the notifications happen does differ.

The delayed thread-group leader stuff is literally only apparent to
anyone who has stared and lived with these horrible behavioral warts of
early thread-group leader exit and subthread exec for a really long
time. For anyone else this isn't self-explanatory at all and each time
one has to go look at it it requires jumping around all the locations
where and how exit notifications are generated and piece together the
whole picture. It is laughably complex to follow.

So I'm wiping the comments but I very much disagree that they are
misleading/useless.

> 
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> >  	WARN_ON_ONCE(!tsk->ptrace &&
> >  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> >  	/*
> > -	 * tsk is a group leader and has no threads, wake up the
> > -	 * non-PIDFD_THREAD waiters.
> > +	 * This is a thread-group leader without subthreads so wake up
> > +	 * the non-PIDFD_THREAD waiters. This also wakes the
> > +	 * PIDFD_THREAD waiters for the thread-group leader in case it
> > +	 * exited prematurely from release_task().
> >  	 */
> 
> This too.

This one I agree is misplaced. It would be sufficient to have the
comment in release_task().

Christian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 11:21     ` Oleg Nesterov
@ 2025-03-20 13:16       ` Christian Brauner
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-03-20 13:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Thu, Mar 20, 2025 at 12:21:26PM +0100, Oleg Nesterov wrote:
> And just for the record...
> 
> Consider the simplest case: a single-threaded not-ptraced process
> exits. In this case do_notify_pidfd() will be called twice, from
> exit_notify() and right after that from do_notify_parent().

Yes.

> 
> We can cleanup this logic, but I don't think this is important and
> this needs a separate patch.

I would actually clean this up.

> 
> (With or without this change: if the exiting task is ptraced or its
>  parent exits without wait(), do_notify_pidfd() will be called even
>  more times, but I think we do not care at all).

Yes, though the ptrace codeflow is even more cursed so I'm not too
worried about making that clean (I wouldn't know how.).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 12:36     ` Christian Brauner
@ 2025-03-20 14:02       ` Oleg Nesterov
  2025-03-20 14:32         ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2025-03-20 14:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/20, Christian Brauner wrote:
>
> What you seem to be saying is that you want all references to
> PIDFD_THREAD to be dropped in the comments because the behavior is now
> identical.

yes, to me the references to PIDFD_THREAD look as if PIDFD_THREAD
has some subtle differences in behavior.

With or without PIDFD_THREAD, do_notify_pidfd() is called and pidfd_poll()
returns EPOLLIN when this thread (leader or not) is ready for wait() from
the parent or debugger.

But!

> So I'm wiping the comments but I very much disagree that they are
> misleading/useless.

No, if you don't agree than do not remove the comments ;)

And... can you explain the motivation for this patch?

I mean... Again, the current PIDFD_THREAD/group-leader behavior is
not well defined, this is clear.

But if user-space does sys_pidfd_open(group_leader_pid) and needs the
"correct" EPOLLIN when the whole process exits, then it should not use
PIDFD_THREAD ?

Just in case, I am not arguing, I am just trying to understand.

Oleg.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 14:02       ` Oleg Nesterov
@ 2025-03-20 14:32         ` Christian Brauner
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-03-20 14:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Thu, Mar 20, 2025 at 03:02:00PM +0100, Oleg Nesterov wrote:
> On 03/20, Christian Brauner wrote:
> >
> > What you seem to be saying is that you want all references to
> > PIDFD_THREAD to be dropped in the comments because the behavior is now
> > identical.
> 
> yes, to me the references to PIDFD_THREAD look as if PIDFD_THREAD
> has some subtle differences in behavior.
> 
> With or without PIDFD_THREAD, do_notify_pidfd() is called and pidfd_poll()
> returns EPOLLIN when this thread (leader or not) is ready for wait() from
> the parent or debugger.
> 
> But!
> 
> > So I'm wiping the comments but I very much disagree that they are
> > misleading/useless.
> 
> No, if you don't agree than do not remove the comments ;)

No, it's fine. We always find some compromise and I've reworded the
comments substantially to not rely on PIDFD_THREAD at all. I always
appreciate the feedback, don't get me wrong!

> And... can you explain the motivation for this patch?

Yes, sure.

> 
> I mean... Again, the current PIDFD_THREAD/group-leader behavior is
> not well defined, this is clear.
> 
> But if user-space does sys_pidfd_open(group_leader_pid) and needs the
> "correct" EPOLLIN when the whole process exits, then it should not use
> PIDFD_THREAD ?
> 
> Just in case, I am not arguing, I am just trying to understand.

One driver is consistency. It's really weird to sometimes get exit
notifications and sometimes don't. It's easier to understand that we
delay notification until the thread-group is empty for a thread-based
pidfd for a thread-group leader rather than explaining de_thread()
timing issues for subthread exec.

But also remembering our earlier discussion on PIDFD_INFO_EXIT: If the
thread-group leader exits prematurely and userspace gets an exit
notification they end up with a Zombie they cannot (yet) reap.

I don't think we should carry that behavior over into the pidfd api. I'd
rather have it be so that if you get an exit notification it means that
you can may now reap the thing (I'm probably unaware of some ptrace
induced behavior that render this statement wrong.).

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-03-20 14:32 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20  9:29 [PATCH v3 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
2025-03-20  9:29 ` [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
2025-03-20 10:57   ` Oleg Nesterov
2025-03-20 11:21     ` Oleg Nesterov
2025-03-20 13:16       ` Christian Brauner
2025-03-20 12:36     ` Christian Brauner
2025-03-20 14:02       ` Oleg Nesterov
2025-03-20 14:32         ` Christian Brauner
2025-03-20  9:29 ` [PATCH v3 2/4] selftests/pidfd: first test for multi-threaded exec polling Christian Brauner
2025-03-20  9:29 ` [PATCH v3 3/4] selftests/pidfd: second " Christian Brauner
2025-03-20  9:29 ` [PATCH v3 4/4] selftests/pidfd: third " Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).