[PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit
@ 2025-03-20 13:24 Christian Brauner
  2025-03-20 13:24 ` [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-20 13:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Oleg,

Don't kill me but this is another attempt at trying to make pidfd
polling for multi-threaded exec and premature thread-group leader exit
consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
premature as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

This tiny series tries to address this problem. If that works correctly
then no exit notifications are generated for a PIDFD_THREAD pidfd for a
thread-group leader until all subthreads have been reaped. If a
subthread should exec before no exit notification will be generated
until that task exits or it creates subthreads and repeates the cycle.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v4:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v3: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v3-0-b7e5f7e2c3b1@kernel.org

Changes in v3:
- Simplify the implementation based on Oleg's suggestion.
- Link to v2: https://lore.kernel.org/r/20250318-work-pidfs-thread_group-v2-0-2677898ffa2e@kernel.org

Changes in v2:
- This simplifies the implementation of pidfs_poll() a bit since the
  check for multi-threaded exec I added is wrong and actually not even
  needed afaict because pid->delayed_leader cleanly handles both
  multi-threaded exec and premature thread-group leader exec not
  followed by a subthread exec.
- Link to v1: https://lore.kernel.org/r/20250317-work-pidfs-thread_group-v1-0-bc9e5ed283e9@kernel.org

---
Christian Brauner (4):
      pidfs: improve multi-threaded exec and premature thread-group leader exit polling
      selftests/pidfd: first test for multi-threaded exec polling
      selftests/pidfd: second test for multi-threaded exec polling
      selftests/pidfd: third test for multi-threaded exec polling

 fs/pidfs.c                                      |   9 +-
 kernel/exit.c                                   |   6 +-
 kernel/signal.c                                 |   3 +-
 tools/testing/selftests/pidfd/pidfd_info_test.c | 237 +++++++++++++++++++++---
 4 files changed, 225 insertions(+), 30 deletions(-)
---
base-commit: 68db272741834d5e528b5e1af6b83b479fec6cbb
change-id: 20250317-work-pidfs-thread_group-141682f9a50a

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
@ 2025-03-20 13:24 ` Christian Brauner
  2025-03-20 14:28   ` Oleg Nesterov
  2025-03-20 13:24 ` [PATCH v4 2/4] selftests/pidfd: first test for multi-threaded exec polling Christian Brauner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2025-03-20 13:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

This is another attempt trying to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subhthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated but the subthreads may continue to run for an
indeterminate amount of time and thus also may exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec aftewards no exit notification
will be generated until that task exits or it creates subthreads and
repeates the cycle.

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c      | 9 +++++----
 kernel/exit.c   | 6 +++---
 kernel/signal.c | 3 +--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index a48cc44ced6f..1b3d23e0ffdd 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -210,20 +210,21 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
 	struct pid *pid = pidfd_pid(file);
-	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;

 	poll_wait(file, &pid->wait_pidfd, pts);
 	/*
-	 * Depending on PIDFD_THREAD, inform pollers when the thread
-	 * or the whole thread-group exits.
+	 * Don't wake waiters if the thread-group leader exited
+	 * prematurely. They either get notified when the last subthread
+	 * exits or not at all if one of the remaining subthreads execs
+	 * and assumes the struct pid of the old thread-group leader.
 	 */
 	guard(rcu)();
 	task = pid_task(pid, PIDTYPE_PID);
 	if (!task)
 		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
+	else if (task->exit_state && !delay_group_leader(task))
 		poll_flags = EPOLLIN | EPOLLRDNORM;

 	return poll_flags;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9916305e34d3..683766316a3d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -743,10 +743,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)

 	tsk->exit_state = EXIT_ZOMBIE;
 	/*
-	 * sub-thread or delay_group_leader(), wake up the
-	 * PIDFD_THREAD waiters.
+	 * Ignore thread-group leaders that exited before all
+	 * subthreads did.
 	 */
-	if (!thread_group_empty(tsk))
+	if (!delay_group_leader(tsk))
 		do_notify_pidfd(tsk);

 	if (unlikely(tsk->ptrace)) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 081f19a24506..027ad9e97417 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2180,8 +2180,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	WARN_ON_ONCE(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 	/*
-	 * tsk is a group leader and has no threads, wake up the
-	 * non-PIDFD_THREAD waiters.
+	 * Notify for thread-group leaders without subthreads.
 	 */
 	if (thread_group_empty(tsk))
 		do_notify_pidfd(tsk);

-- 
2.47.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v4 2/4] selftests/pidfd: first test for multi-threaded exec polling
  2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
  2025-03-20 13:24 ` [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
@ 2025-03-20 13:24 ` Christian Brauner
  2025-03-20 13:24 ` [PATCH v4 3/4] selftests/pidfd: second " Christian Brauner
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-20 13:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add first test for premature thread-group leader exit.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 38 ++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 09bc4ae7aed5..28a28ae4686a 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -236,7 +236,7 @@ static void *pidfd_info_pause_thread(void *arg)
 
 TEST_F(pidfd_info, thread_group)
 {
-	pid_t pid_leader, pid_thread;
+	pid_t pid_leader, pid_poller, pid_thread;
 	pthread_t thread;
 	int nevents, pidfd_leader, pidfd_thread, pidfd_leader_thread, ret;
 	int ipc_sockets[2];
@@ -262,6 +262,35 @@ TEST_F(pidfd_info, thread_group)
 		syscall(__NR_exit, EXIT_SUCCESS);
 	}
 
+	/*
+	 * Opening a PIDFD_THREAD aka thread-specific pidfd based on a
+	 * thread-group leader must succeed.
+	 */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	pid_poller = fork();
+	ASSERT_GE(pid_poller, 0);
+	if (pid_poller == 0) {
+		/*
+		 * We can't poll and wait for the old thread-group
+		 * leader to exit using a thread-specific pidfd. The
+		 * thread-group leader exited prematurely and
+		 * notification is delayed until all subthreads have
+		 * exited.
+		 */
+		fds.events = POLLIN;
+		fds.fd = pidfd_leader_thread;
+		nevents = poll(&fds, 1, 10000 /* wait 5 seconds */);
+		if (nevents != 0)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLIN)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLHUP)
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
 	/* Retrieve the tid of the thread. */
 	EXPECT_EQ(close(ipc_sockets[1]), 0);
 	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
@@ -275,12 +304,7 @@ TEST_F(pidfd_info, thread_group)
 	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
 	ASSERT_GE(pidfd_thread, 0);
 
-	/*
-	 * Opening a PIDFD_THREAD aka thread-specific pidfd based on a
-	 * thread-group leader must succeed.
-	 */
-	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
-	ASSERT_GE(pidfd_leader_thread, 0);
+	ASSERT_EQ(wait_for_pid(pid_poller), 0);
 
 	/*
 	 * Note that pidfd_leader is a thread-group pidfd, so polling on it

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v4 3/4] selftests/pidfd: second test for multi-threaded exec polling
  2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
  2025-03-20 13:24 ` [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
  2025-03-20 13:24 ` [PATCH v4 2/4] selftests/pidfd: first test for multi-threaded exec polling Christian Brauner
@ 2025-03-20 13:24 ` Christian Brauner
  2025-03-20 13:24 ` [PATCH v4 4/4] selftests/pidfd: third " Christian Brauner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-20 13:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Ensure that during a multi-threaded exec and premature thread-group
leader exit no exit notification is generated.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 72 ++++++++++++++++---------
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 28a28ae4686a..4169780c9e55 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -413,7 +413,7 @@ static void *pidfd_info_thread_exec(void *arg)
 
 TEST_F(pidfd_info, thread_group_exec)
 {
-	pid_t pid_leader, pid_thread;
+	pid_t pid_leader, pid_poller, pid_thread;
 	pthread_t thread;
 	int nevents, pidfd_leader, pidfd_leader_thread, pidfd_thread, ret;
 	int ipc_sockets[2];
@@ -439,6 +439,37 @@ TEST_F(pidfd_info, thread_group_exec)
 		syscall(__NR_exit, EXIT_SUCCESS);
 	}
 
+	/* Open a thread-specific pidfd for the thread-group leader. */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	pid_poller = fork();
+	ASSERT_GE(pid_poller, 0);
+	if (pid_poller == 0) {
+		/*
+		 * We can't poll and wait for the old thread-group
+		 * leader to exit using a thread-specific pidfd. The
+		 * thread-group leader exited prematurely and
+		 * notification is delayed until all subthreads have
+		 * exited.
+		 *
+		 * When the thread has execed it will taken over the old
+		 * thread-group leaders struct pid. Calling poll after
+		 * the thread execed will thus block again because a new
+		 * thread-group has started.
+		 */
+		fds.events = POLLIN;
+		fds.fd = pidfd_leader_thread;
+		nevents = poll(&fds, 1, 10000 /* wait 5 seconds */);
+		if (nevents != 0)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLIN)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLHUP)
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
 	/* Retrieve the tid of the thread. */
 	EXPECT_EQ(close(ipc_sockets[1]), 0);
 	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
@@ -447,33 +478,12 @@ TEST_F(pidfd_info, thread_group_exec)
 	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
 	ASSERT_GE(pidfd_thread, 0);
 
-	/* Open a thread-specific pidfd for the thread-group leader. */
-	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
-	ASSERT_GE(pidfd_leader_thread, 0);
-
-	/*
-	 * We can poll and wait for the old thread-group leader to exit
-	 * using a thread-specific pidfd.
-	 *
-	 * This only works until the thread has execed. When the thread
-	 * has execed it will have taken over the old thread-group
-	 * leaders struct pid. Calling poll after the thread execed will
-	 * thus block again because a new thread-group has started (Yes,
-	 * it's fscked.).
-	 */
-	fds.events = POLLIN;
-	fds.fd = pidfd_leader_thread;
-	nevents = poll(&fds, 1, -1);
-	ASSERT_EQ(nevents, 1);
-	/* The thread-group leader has exited. */
-	ASSERT_TRUE(!!(fds.revents & POLLIN));
-	/* The thread-group leader hasn't been reaped. */
-	ASSERT_FALSE(!!(fds.revents & POLLHUP));
-
 	/* Now that we've opened a thread-specific pidfd the thread can exec. */
 	ASSERT_EQ(write_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
 	EXPECT_EQ(close(ipc_sockets[0]), 0);
 
+	ASSERT_EQ(wait_for_pid(pid_poller), 0);
+
 	/* Wait until the kernel has SIGKILLed the thread. */
 	fds.events = POLLHUP;
 	fds.fd = pidfd_thread;
@@ -506,6 +516,20 @@ TEST_F(pidfd_info, thread_group_exec)
 
 	/* Take down the thread-group leader. */
 	EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
+
+	/*
+	 * Afte the exec we're dealing with an empty thread-group so now
+	 * we must see an exit notification on the thread-specific pidfd
+	 * for the thread-group leader as there's no subthread that can
+	 * revive the struct pid.
+	 */
+	fds.events = POLLIN;
+	fds.fd = pidfd_leader_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	ASSERT_TRUE(!!(fds.revents & POLLIN));
+	ASSERT_FALSE(!!(fds.revents & POLLHUP));
+
 	EXPECT_EQ(sys_waitid(P_PIDFD, pidfd_leader, NULL, WEXITED), 0);
 
 	/* Retrieve exit information for the thread-group leader. */

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v4 4/4] selftests/pidfd: third test for multi-threaded exec polling
  2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
                   ` (2 preceding siblings ...)
  2025-03-20 13:24 ` [PATCH v4 3/4] selftests/pidfd: second " Christian Brauner
@ 2025-03-20 13:24 ` Christian Brauner
  2025-03-23 17:19 ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Oleg Nesterov
  2025-03-24 17:19 ` [PATCH] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit() Oleg Nesterov
  5 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-20 13:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Ensure that during a multi-threaded exec and premature thread-group
leader exit no exit notification is generated.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 147 ++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 4169780c9e55..1758a1b0457b 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -542,4 +542,151 @@ TEST_F(pidfd_info, thread_group_exec)
 	EXPECT_EQ(close(pidfd_thread), 0);
 }
 
+static void *pidfd_info_thread_exec_sane(void *arg)
+{
+	pid_t pid_thread = gettid();
+	int ipc_socket = *(int *)arg;
+
+	/* Inform the grand-parent what the tid of this thread is. */
+	if (write_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	if (read_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	close(ipc_socket);
+
+	sys_execveat(AT_FDCWD, "pidfd_exec_helper", NULL, NULL, 0);
+	return NULL;
+}
+
+TEST_F(pidfd_info, thread_group_exec_thread)
+{
+	pid_t pid_leader, pid_poller, pid_thread;
+	pthread_t thread;
+	int nevents, pidfd_leader, pidfd_leader_thread, pidfd_thread, ret;
+	int ipc_sockets[2];
+	struct pollfd fds = {};
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
+	};
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	pid_leader = create_child(&pidfd_leader, 0);
+	EXPECT_GE(pid_leader, 0);
+
+	if (pid_leader == 0) {
+		close(ipc_sockets[0]);
+
+		/* The thread will outlive the thread-group leader. */
+		if (pthread_create(&thread, NULL, pidfd_info_thread_exec_sane, &ipc_sockets[1]))
+			syscall(__NR_exit, EXIT_FAILURE);
+
+		/*
+		 * Pause the thread-group leader. It will be killed once
+		 * the subthread execs.
+		 */
+		pause();
+		syscall(__NR_exit, EXIT_SUCCESS);
+	}
+
+	/* Retrieve the tid of the thread. */
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+
+	/* Opening a thread as a PIDFD_THREAD must succeed. */
+	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
+	ASSERT_GE(pidfd_thread, 0);
+
+	/* Open a thread-specific pidfd for the thread-group leader. */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	pid_poller = fork();
+	ASSERT_GE(pid_poller, 0);
+	if (pid_poller == 0) {
+		/*
+		 * The subthread will now exec. The struct pid of the old
+		 * thread-group leader will be assumed by the subthread which
+		 * becomes the new thread-group leader. So no exit notification
+		 * must be generated. Wait for 5 seconds and call it a success
+		 * if no notification has been received.
+		 */
+		fds.events = POLLIN;
+		fds.fd = pidfd_leader_thread;
+		nevents = poll(&fds, 1, 10000 /* wait 5 seconds */);
+		if (nevents != 0)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLIN)
+			_exit(EXIT_FAILURE);
+		if (fds.revents & POLLHUP)
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	/* Now that we've opened a thread-specific pidfd the thread can exec. */
+	ASSERT_EQ(write_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+	ASSERT_EQ(wait_for_pid(pid_poller), 0);
+
+	/* Wait until the kernel has SIGKILLed the thread. */
+	fds.events = POLLHUP;
+	fds.fd = pidfd_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	/* The thread has been reaped. */
+	ASSERT_TRUE(!!(fds.revents & POLLHUP));
+
+	/* Retrieve thread-specific exit info from pidfd. */
+	ASSERT_EQ(ioctl(pidfd_thread, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	/*
+	 * While the kernel will have SIGKILLed the whole thread-group
+	 * during exec it will cause the individual threads to exit
+	 * cleanly.
+	 */
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+
+	/*
+	 * The thread-group leader is still alive, the thread has taken
+	 * over its struct pid and thus its pid number.
+	 */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_EQ(info.pid, pid_leader);
+
+	/* Take down the thread-group leader. */
+	EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
+
+	/*
+	 * Afte the exec we're dealing with an empty thread-group so now
+	 * we must see an exit notification on the thread-specific pidfd
+	 * for the thread-group leader as there's no subthread that can
+	 * revive the struct pid.
+	 */
+	fds.events = POLLIN;
+	fds.fd = pidfd_leader_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	ASSERT_TRUE(!!(fds.revents & POLLIN));
+	ASSERT_FALSE(!!(fds.revents & POLLHUP));
+
+	EXPECT_EQ(sys_waitid(P_PIDFD, pidfd_leader, NULL, WEXITED), 0);
+
+	/* Retrieve exit information for the thread-group leader. */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+
+	EXPECT_EQ(close(pidfd_leader), 0);
+	EXPECT_EQ(close(pidfd_thread), 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 13:24 ` [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
@ 2025-03-20 14:28   ` Oleg Nesterov
  2025-03-20 15:26     ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-20 14:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/20, Christian Brauner wrote:
>
> Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c      | 9 +++++----
>  kernel/exit.c   | 6 +++---
>  kernel/signal.c | 3 +--
>  3 files changed, 9 insertions(+), 9 deletions(-)

Thanks, LGTM.

Todo:

	- As we already discussed, do_notify_pidfd() can be called
	  twice from exit_notify() paths, we can avoid this.

	  But this connects to another minor issue:

	- With this change, debuggers can no longer use PIDFD_THREAD.
	  I guess we don't really care, I don't think PIDFD_THREAD was
	  ever used for this purpose. but perhaps we can change this
	  or at least document somewhere...

I'll try to do this but not today and (most probably) not tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  2025-03-20 14:28   ` Oleg Nesterov
@ 2025-03-20 15:26     ` Christian Brauner
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-20 15:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Thu, Mar 20, 2025 at 03:28:17PM +0100, Oleg Nesterov wrote:
> On 03/20, Christian Brauner wrote:
> >
> > Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/pidfs.c      | 9 +++++----
> >  kernel/exit.c   | 6 +++---
> >  kernel/signal.c | 3 +--
> >  3 files changed, 9 insertions(+), 9 deletions(-)
> 
> Thanks, LGTM.
> 
> Todo:
> 
> 	- As we already discussed, do_notify_pidfd() can be called
> 	  twice from exit_notify() paths, we can avoid this.
> 
> 	  But this connects to another minor issue:
> 
> 	- With this change, debuggers can no longer use PIDFD_THREAD.
> 	  I guess we don't really care, I don't think PIDFD_THREAD was
> 	  ever used for this purpose. but perhaps we can change this

If you have ideas how to let them use it I'm all for it!

> 	  or at least document somewhere...
> 
> I'll try to do this but not today and (most probably) not tomorrow.

Cool, sounds great!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] pidfs: cleanup the usage of do_notify_pidfd()
  2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
                   ` (3 preceding siblings ...)
  2025-03-20 13:24 ` [PATCH v4 4/4] selftests/pidfd: third " Christian Brauner
@ 2025-03-23 17:19 ` Oleg Nesterov
  2025-03-23 17:45   ` selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()) Oleg Nesterov
                     ` (2 more replies)
  2025-03-24 17:19 ` [PATCH] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit() Oleg Nesterov
  5 siblings, 3 replies; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-23 17:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

If a single-threaded process exits do_notify_pidfd() will be called twice,
from exit_notify() and right after that from do_notify_parent().

1. Change exit_notify() to call do_notify_pidfd() if the exiting task is
   not ptraced and it is not a group leader.

2. Change do_notify_parent() to call do_notify_pidfd() unconditionally.

   If tsk is not ptraced, do_notify_parent() will only be called when it
   is a group-leader and thread_group_empty() is true.

This means that if tsk is ptraced, do_notify_pidfd() will be called from
do_notify_parent() even if tsk is a delay_group_leader(). But this case is
less common, and apart from the unnecessary __wake_up() is harmless.

Granted, this unnecessary __wake_up() can be avoided, but I don't want to
do it in this patch because it's just a consequence of another historical
oddity: we notify the tracer even if !thread_group_empty(), but do_wait()
from debugger can't work until all other threads exit. With or without this
patch we should either eliminate do_notify_parent() in this case, or change
do_wait(WEXITED) to untrace the ptraced delay_group_leader() at least when
ptrace_reparented().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c   | 8 ++------
 kernel/signal.c | 8 +++-----
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 683766316a3d..d0ebccb9dec0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -742,12 +742,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		kill_orphaned_pgrp(tsk->group_leader, NULL);
 
 	tsk->exit_state = EXIT_ZOMBIE;
-	/*
-	 * Ignore thread-group leaders that exited before all
-	 * subthreads did.
-	 */
-	if (!delay_group_leader(tsk))
-		do_notify_pidfd(tsk);
 
 	if (unlikely(tsk->ptrace)) {
 		int sig = thread_group_leader(tsk) &&
@@ -760,6 +754,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 			do_notify_parent(tsk, tsk->exit_signal);
 	} else {
 		autoreap = true;
+		/* untraced sub-thread */
+		do_notify_pidfd(tsk);
 	}
 
 	if (autoreap) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 027ad9e97417..1d8db0dabb71 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2179,11 +2179,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 
 	WARN_ON_ONCE(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
-	/*
-	 * Notify for thread-group leaders without subthreads.
-	 */
-	if (thread_group_empty(tsk))
-		do_notify_pidfd(tsk);
+
+	/* ptraced, or group-leader without sub-threads */
+	do_notify_pidfd(tsk);
 
 	if (sig != SIGCHLD) {
 		/*
-- 
2.25.1.362.g51ebf55



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd())
  2025-03-23 17:19 ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Oleg Nesterov
@ 2025-03-23 17:45   ` Oleg Nesterov
  2025-03-23 20:36     ` Christian Brauner
  2025-03-23 20:40   ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Christian Brauner
  2025-03-23 20:42   ` Christian Brauner
  2 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-23 17:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

Christian,

I had to spend some (a lot;) time to understand why pidfd_info_test
(and more) fails with my patch under qemu on my machine ;) Until I
applied the patch below.

I think it is a bad idea to do the things like

	#ifndef __NR_clone3
	#define __NR_clone3 -1
	#endif

because this can hide a problem. My working laptop runs Fedora-23 which
doesn't have __NR_clone3/etc in /usr/include/. So "make" happily succeeds,
but everything fails and it is not clear why.

Oleg.
---

diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index 3d2663fe50ba..eeca8005723f 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -16,7 +16,7 @@
 #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
 
 #ifndef __NR_clone3
-#define __NR_clone3 -1
+#define __NR_clone3 435
 #endif
 
 struct __clone_args {
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index cec22aa11cdf..55bcf81a2b9a 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -32,19 +32,19 @@
 #endif
 
 #ifndef __NR_pidfd_open
-#define __NR_pidfd_open -1
+#define __NR_pidfd_open 434
 #endif
 
 #ifndef __NR_pidfd_send_signal
-#define __NR_pidfd_send_signal -1
+#define __NR_pidfd_send_signal 424
 #endif
 
 #ifndef __NR_clone3
-#define __NR_clone3 -1
+#define __NR_clone3 435
 #endif
 
 #ifndef __NR_pidfd_getfd
-#define __NR_pidfd_getfd -1
+#define __NR_pidfd_getfd 438
 #endif
 
 #ifndef PIDFD_NONBLOCK


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd())
  2025-03-23 17:45   ` selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()) Oleg Nesterov
@ 2025-03-23 20:36     ` Christian Brauner
  2025-03-23 21:17       ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2025-03-23 20:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Sun, Mar 23, 2025 at 06:45:18PM +0100, Oleg Nesterov wrote:
> Christian,
> 
> I had to spend some (a lot;) time to understand why pidfd_info_test
> (and more) fails with my patch under qemu on my machine ;) Until I
> applied the patch below.
> 
> I think it is a bad idea to do the things like
> 
> 	#ifndef __NR_clone3
> 	#define __NR_clone3 -1
> 	#endif
> 
> because this can hide a problem. My working laptop runs Fedora-23 which
> doesn't have __NR_clone3/etc in /usr/include/. So "make" happily succeeds,
> but everything fails and it is not clear why.

Yeah, I agree. You want to send your small patch as a quick fix?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()
  2025-03-23 17:19 ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Oleg Nesterov
  2025-03-23 17:45   ` selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()) Oleg Nesterov
@ 2025-03-23 20:40   ` Christian Brauner
  2025-03-23 20:42   ` Christian Brauner
  2 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-23 20:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan

On Sun, 23 Mar 2025 18:19:55 +0100, Oleg Nesterov wrote:
> If a single-threaded process exits do_notify_pidfd() will be called twice,
> from exit_notify() and right after that from do_notify_parent().
> 
> 1. Change exit_notify() to call do_notify_pidfd() if the exiting task is
>    not ptraced and it is not a group leader.
> 
> 2. Change do_notify_parent() to call do_notify_pidfd() unconditionally.
> 
> [...]

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/1] pidfs: cleanup the usage of do_notify_pidfd()
      https://git.kernel.org/vfs/vfs/c/98ce463bc6f4
[1/1] selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd())
      https://git.kernel.org/vfs/vfs/c/cc8c2e338a25

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()
  2025-03-23 17:19 ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Oleg Nesterov
  2025-03-23 17:45   ` selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()) Oleg Nesterov
  2025-03-23 20:40   ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Christian Brauner
@ 2025-03-23 20:42   ` Christian Brauner
  2 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-23 20:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Sun, Mar 23, 2025 at 06:19:55PM +0100, Oleg Nesterov wrote:
> If a single-threaded process exits do_notify_pidfd() will be called twice,
> from exit_notify() and right after that from do_notify_parent().
> 
> 1. Change exit_notify() to call do_notify_pidfd() if the exiting task is
>    not ptraced and it is not a group leader.
> 
> 2. Change do_notify_parent() to call do_notify_pidfd() unconditionally.
> 
>    If tsk is not ptraced, do_notify_parent() will only be called when it
>    is a group-leader and thread_group_empty() is true.
> 
> This means that if tsk is ptraced, do_notify_pidfd() will be called from
> do_notify_parent() even if tsk is a delay_group_leader(). But this case is
> less common, and apart from the unnecessary __wake_up() is harmless.
> 
> Granted, this unnecessary __wake_up() can be avoided, but I don't want to
> do it in this patch because it's just a consequence of another historical
> oddity: we notify the tracer even if !thread_group_empty(), but do_wait()
> from debugger can't work until all other threads exit. With or without this
> patch we should either eliminate do_notify_parent() in this case, or change
> do_wait(WEXITED) to untrace the ptraced delay_group_leader() at least when
> ptrace_reparented().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---

Thanks for doing this! I'll send this together with the first set of
fixes after the merge window closes.

>  kernel/exit.c   | 8 ++------
>  kernel/signal.c | 8 +++-----
>  2 files changed, 5 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 683766316a3d..d0ebccb9dec0 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -742,12 +742,6 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  		kill_orphaned_pgrp(tsk->group_leader, NULL);
>  
>  	tsk->exit_state = EXIT_ZOMBIE;
> -	/*
> -	 * Ignore thread-group leaders that exited before all
> -	 * subthreads did.
> -	 */
> -	if (!delay_group_leader(tsk))
> -		do_notify_pidfd(tsk);
>  
>  	if (unlikely(tsk->ptrace)) {
>  		int sig = thread_group_leader(tsk) &&
> @@ -760,6 +754,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  			do_notify_parent(tsk, tsk->exit_signal);
>  	} else {
>  		autoreap = true;
> +		/* untraced sub-thread */
> +		do_notify_pidfd(tsk);
>  	}
>  
>  	if (autoreap) {
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 027ad9e97417..1d8db0dabb71 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2179,11 +2179,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  
>  	WARN_ON_ONCE(!tsk->ptrace &&
>  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> -	/*
> -	 * Notify for thread-group leaders without subthreads.
> -	 */
> -	if (thread_group_empty(tsk))
> -		do_notify_pidfd(tsk);
> +
> +	/* ptraced, or group-leader without sub-threads */
> +	do_notify_pidfd(tsk);
>  
>  	if (sig != SIGCHLD) {
>  		/*
> -- 
> 2.25.1.362.g51ebf55
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd())
  2025-03-23 20:36     ` Christian Brauner
@ 2025-03-23 21:17       ` Oleg Nesterov
  0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-23 21:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/23, Christian Brauner wrote:
>
> On Sun, Mar 23, 2025 at 06:45:18PM +0100, Oleg Nesterov wrote:
> >
> > I think it is a bad idea to do the things like
> >
> > 	#ifndef __NR_clone3
> > 	#define __NR_clone3 -1
> > 	#endif
> >
> > because this can hide a problem. My working laptop runs Fedora-23 which
> > doesn't have __NR_clone3/etc in /usr/include/. So "make" happily succeeds,
> > but everything fails and it is not clear why.
>
> Yeah, I agree. You want to send your small patch as a quick fix?

OK, will do.

FYI, I have another fixlet,

	exit_info->exit_code = tsk->exit_code;

in pidfs_exit() no longer looks correct with the recent changes.
In fact this was probably wrong after we decided to move pidfs_exit()
to release_task(), but I didn't notice this when I reviewed the last
version of "pidfs: record exit code and cgroupid at exit".

But I need to write the changelog which should explain the uglyness
we already have, and the reason for the change. Perhaps we don't
really care, but I think this should be discussed at least.

Oleg.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit()
  2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
                   ` (4 preceding siblings ...)
  2025-03-23 17:19 ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Oleg Nesterov
@ 2025-03-24 17:19 ` Oleg Nesterov
  2025-03-25 14:57   ` Christian Brauner
  5 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2025-03-24 17:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

Consider a process with a group leader L and a sub-thread T.
L does sys_exit(1), then T does sys_exit_group(2).

In this case wait_task_zombie(L) will notice SIGNAL_GROUP_EXIT and use
L->signal->group_exit_code, this is correct.

But, before that, do_notify_parent(L) called by release_task(T) will use
L->exit_code != L->signal->group_exit_code, and this is not consistent.
We don't really care, I think that nobody relies on the info which comes
with SIGCHLD, if nothing else SIGCHLD < SIGRTMIN can be queued only once.

But pidfs_exit() is more problematic, I think pidfs_exit_info->exit_code
should report ->group_exit_code in this case, just like wait_task_zombie().

TODO: with this change we can hopefully cleanup (or may be even kill) the
similar SIGNAL_GROUP_EXIT checks, at least in wait_task_zombie().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index d0ebccb9dec0..4a0604f5cedd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -267,6 +267,9 @@ void release_task(struct task_struct *p)
 	leader = p->group_leader;
 	if (leader != p && thread_group_empty(leader)
 			&& leader->exit_state == EXIT_ZOMBIE) {
+		/* for pidfs_exit() and do_notify_parent() */
+		if (leader->signal->flags & SIGNAL_GROUP_EXIT)
+			leader->exit_code = leader->signal->group_exit_code;
 		/*
 		 * If we were the last child thread and the leader has
 		 * exited already, and the leader's parent ignores SIGCHLD,
-- 
2.25.1.362.g51ebf55

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit()
  2025-03-24 17:19 ` [PATCH] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit() Oleg Nesterov
@ 2025-03-25 14:57   ` Christian Brauner
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-03-25 14:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan

On Mon, 24 Mar 2025 18:19:41 +0100, Oleg Nesterov wrote:
> Consider a process with a group leader L and a sub-thread T.
> L does sys_exit(1), then T does sys_exit_group(2).
> 
> In this case wait_task_zombie(L) will notice SIGNAL_GROUP_EXIT and use
> L->signal->group_exit_code, this is correct.
> 
> But, before that, do_notify_parent(L) called by release_task(T) will use
> L->exit_code != L->signal->group_exit_code, and this is not consistent.
> We don't really care, I think that nobody relies on the info which comes
> with SIGCHLD, if nothing else SIGCHLD < SIGRTMIN can be queued only once.
> 
> [...]

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/1] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit()
      https://git.kernel.org/vfs/vfs/c/9133607de37a

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-03-25 14:57 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20 13:24 [PATCH v4 0/4] pidfs: handle multi-threaded exec and premature thread-group leader exit Christian Brauner
2025-03-20 13:24 ` [PATCH v4 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling Christian Brauner
2025-03-20 14:28   ` Oleg Nesterov
2025-03-20 15:26     ` Christian Brauner
2025-03-20 13:24 ` [PATCH v4 2/4] selftests/pidfd: first test for multi-threaded exec polling Christian Brauner
2025-03-20 13:24 ` [PATCH v4 3/4] selftests/pidfd: second " Christian Brauner
2025-03-20 13:24 ` [PATCH v4 4/4] selftests/pidfd: third " Christian Brauner
2025-03-23 17:19 ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Oleg Nesterov
2025-03-23 17:45   ` selftests/pidfd: (Was: [PATCH] pidfs: cleanup the usage of do_notify_pidfd()) Oleg Nesterov
2025-03-23 20:36     ` Christian Brauner
2025-03-23 21:17       ` Oleg Nesterov
2025-03-23 20:40   ` [PATCH] pidfs: cleanup the usage of do_notify_pidfd() Christian Brauner
2025-03-23 20:42   ` Christian Brauner
2025-03-24 17:19 ` [PATCH] exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit() Oleg Nesterov
2025-03-25 14:57   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).