* [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
@ 2023-11-30 16:39 Tycho Andersen
  2023-11-30 16:39 ` [RFC 2/3] selftests/pidfd: add non-thread-group leader tests Tycho Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-11-30 16:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen, Tycho Andersen
From: Tycho Andersen <tandersen@netflix.com>
We are using the pidfd family of syscalls with the seccomp userspace
notifier. When some thread triggers a seccomp notification, we want to do
some things to its context (munge fd tables via pidfd_getfd(), maybe write
to its memory, etc.). However, threads created with ~CLONE_FILES or
~CLONE_VM mean that we can't use the pidfd family of syscalls for this
purpose, since their fd table or mm are distinct from the thread group
leader's. In this patch, we relax this restriction for pidfd_open().
In order to avoid dangling poll() users we need to notify pidfd waiters
when individual threads die, but once we do that all the other machinery
seems to work ok viz. the tests. But I suppose there are more cases than
just this one.
Another weirdness is the open-coding of this vs. exporting using
do_notify_pidfd(). This particular location is after __exit_signal() is
called, which does __unhash_process() which kills ->thread_pid, so we need
to use the copy we have locally, vs do_notify_pid() which accesses it via
task_pid(). Maybe this suggests that the notification should live somewhere
in __exit_signals()? I just put it here because I saw we were already
testing if this task was the leader.
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 kernel/exit.c | 29 +++++++++++++++++++----------
 kernel/fork.c |  4 +---
 kernel/pid.c  | 11 +----------
 3 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..34eeefc7ee21 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -263,16 +263,25 @@ void release_task(struct task_struct *p)
 	 */
 	zap_leader = 0;
 	leader = p->group_leader;
-	if (leader != p && thread_group_empty(leader)
-			&& leader->exit_state == EXIT_ZOMBIE) {
-		/*
-		 * If we were the last child thread and the leader has
-		 * exited already, and the leader's parent ignores SIGCHLD,
-		 * then we are the one who should release the leader.
-		 */
-		zap_leader = do_notify_parent(leader, leader->exit_signal);
-		if (zap_leader)
-			leader->exit_state = EXIT_DEAD;
+	if (leader != p) {
+		if (thread_group_empty(leader)
+				&& leader->exit_state == EXIT_ZOMBIE) {
+			/*
+			 * If we were the last child thread and the leader has
+			 * exited already, and the leader's parent ignores SIGCHLD,
+			 * then we are the one who should release the leader.
+			 */
+			zap_leader = do_notify_parent(leader,
+						      leader->exit_signal);
+			if (zap_leader)
+				leader->exit_state = EXIT_DEAD;
+		} else {
+			/*
+			 * wake up pidfd pollers anyway, they want to know this
+			 * thread is dying.
+			 */
+			wake_up_all(&thread_pid->wait_pidfd);
+		}
 	}
 
 	write_unlock_irq(&tasklist_lock);
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..eef15c93f6cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2163,8 +2163,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  * Allocate a new file that stashes @pid and reserve a new pidfd number in the
  * caller's file descriptor table. The pidfd is reserved but not installed yet.
  *
- * The helper verifies that @pid is used as a thread group leader.
- *
  * If this function returns successfully the caller is responsible to either
  * call fd_install() passing the returned pidfd and pidfd file as arguments in
  * order to install the pidfd into its file descriptor table or they must use
@@ -2182,7 +2180,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+	if (!pid)
 		return -EINVAL;
 
 	return __pidfd_prepare(pid, flags, ret);
diff --git a/kernel/pid.c b/kernel/pid.c
index 6500ef956f2f..4806798022d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -552,11 +552,6 @@ struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
  * Return the task associated with @pidfd. The function takes a reference on
  * the returned task. The caller is responsible for releasing that reference.
  *
- * Currently, the process identified by @pidfd is always a thread-group leader.
- * This restriction currently exists for all aspects of pidfds including pidfd
- * creation (CLONE_PIDFD cannot be used with CLONE_THREAD) and pidfd polling
- * (only supports thread group leaders).
- *
  * Return: On success, the task_struct associated with the pidfd.
  *	   On error, a negative errno number will be returned.
  */
@@ -615,11 +610,7 @@ int pidfd_create(struct pid *pid, unsigned int flags)
  * @flags: flags to pass
  *
  * This creates a new pid file descriptor with the O_CLOEXEC flag set for
- * the process identified by @pid. Currently, the process identified by
- * @pid must be a thread-group leader. This restriction currently exists
- * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
- * be used with CLONE_THREAD) and pidfd polling (only supports thread group
- * leaders).
+ * the process identified by @pid.
  *
  * Return: On success, a cloexec pidfd is returned.
  *         On error, a negative errno number will be returned.
base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
-- 
2.34.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [RFC 2/3] selftests/pidfd: add non-thread-group leader tests
  2023-11-30 16:39 [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Tycho Andersen
@ 2023-11-30 16:39 ` Tycho Andersen
  2023-11-30 16:39 ` [RFC 3/3] clone: allow CLONE_THREAD | CLONE_PIDFD together Tycho Andersen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-11-30 16:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen, Tycho Andersen
From: Tycho Andersen <tandersen@netflix.com>
This adds a family of tests for various behaviors of non-thread-group
leaders. Maybe this should live in pidfd_open_test.c instead? We'd need
some hoisting there then.
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 tools/testing/selftests/pidfd/.gitignore      |   1 +
 tools/testing/selftests/pidfd/Makefile        |   3 +-
 .../selftests/pidfd/pidfd_non_tgl_test.c      | 339 ++++++++++++++++++
 3 files changed, 342 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 973198a3ec3d..e7532e84a34a 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -6,3 +6,4 @@ pidfd_wait
 pidfd_fdinfo_test
 pidfd_getfd_test
 pidfd_setns_test
+pidfd_non_tgl_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index d731e3e76d5b..50e3aa9de05a 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -2,7 +2,8 @@
 CFLAGS += -g $(KHDR_INCLUDES) -pthread -Wall
 
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
-	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test
+	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
+	pidfd_non_tgl_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c b/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c
new file mode 100644
index 000000000000..e3992f2d88cf
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/socket.h>
+#include <limits.h>
+#include <string.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sched.h>
+#include <poll.h>
+
+#include "../kselftest.h"
+#include "pidfd.h"
+
+// glibc defaults to 8MB stacks
+#define STACK_SIZE (8 * 1024 * 1024)
+static char stack[STACK_SIZE];
+
+static int thread_sleep(void *)
+{
+	while (1)
+		sleep(100);
+	return 1;
+}
+
+static int fork_task_with_thread(int (*fn)(void *), int sk_pair[2],
+				 pid_t *tgl, pid_t *thread, int *tgl_pidfd,
+				 int *thread_pidfd)
+{
+	*tgl_pidfd = *thread_pidfd = -1;
+
+	*tgl = fork();
+	if (*tgl < 0) {
+		perror("fork");
+		return -1;
+	}
+
+	if (!*tgl) {
+		int flags = CLONE_THREAD | CLONE_VM | CLONE_SIGHAND;
+		pid_t t;
+
+		t = clone(fn, stack + STACK_SIZE, flags, sk_pair);
+		if (t < 0) {
+			perror("clone");
+			exit(1);
+		}
+
+		close(sk_pair[1]);
+
+		if (write(sk_pair[0], &t, sizeof(t)) != sizeof(t)) {
+			perror("read");
+			exit(1);
+		}
+
+		// wait to get killed for various reasons by the tests.
+		while (1)
+			sleep(100);
+	}
+
+	errno = 0;
+	if (read(sk_pair[1], thread, sizeof(*thread)) != sizeof(*thread)) {
+		perror("read");
+		goto cleanup;
+	}
+
+	*tgl_pidfd = sys_pidfd_open(*tgl, 0);
+	if (*tgl_pidfd < 0) {
+		perror("pidfd_open tgl");
+		goto cleanup;
+	}
+
+	*thread_pidfd = sys_pidfd_open(*thread, 0);
+	if (*thread_pidfd < 0) {
+		perror("pidfd");
+		goto cleanup;
+	}
+
+	return 0;
+
+cleanup:
+	kill(*tgl, SIGKILL);
+	if (*tgl_pidfd >= 0)
+		close(*tgl_pidfd);
+	if (*thread_pidfd >= 0)
+		close(*thread_pidfd);
+	return -1;
+}
+
+static int test_non_tgl_basic(void)
+{
+	pid_t tgl, thread;
+	int sk_pair[2], status;
+	int tgl_pidfd = -1, thread_pidfd = -1;
+	int ret = KSFT_FAIL;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		ksft_print_msg("socketpair failed %s\n", strerror(errno));
+		return KSFT_FAIL;
+	}
+
+	if (fork_task_with_thread(thread_sleep, sk_pair, &tgl, &thread,
+				  &tgl_pidfd, &thread_pidfd) < 0) {
+		return KSFT_FAIL;
+	}
+
+	/*
+	 * KILL of a thread should still kill everyone
+	 */
+	if (sys_pidfd_send_signal(thread_pidfd, SIGKILL, NULL, 0)) {
+		perror("pidfd_send_signal");
+		goto cleanup;
+	}
+
+	errno = 0;
+	if (waitpid(tgl, &status, 0) != tgl) {
+		perror("waitpid tgl");
+		goto cleanup;
+	}
+
+	if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGKILL) {
+		ksft_print_msg("bad exit status %x\n", status);
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+
+cleanup:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	close(tgl_pidfd);
+	close(thread_pidfd);
+	return ret;
+}
+
+static int thread_exec(void *arg)
+{
+	int *sk_pair = arg;
+	pid_t thread;
+
+	if (read(sk_pair[0], &thread, sizeof(thread)) != sizeof(thread)) {
+		perror("read");
+		exit(1);
+	}
+
+	execlp("/bin/true", "/bin/true", NULL);
+	return 1;
+}
+
+static int test_non_tgl_exec(void)
+{
+	pid_t tgl, thread;
+	int sk_pair[2];
+	int tgl_pidfd = -1, thread_pidfd = -1;
+	int ret = KSFT_FAIL, ready;
+	struct pollfd pollfd;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		ksft_print_msg("socketpair failed %s\n", strerror(errno));
+		return KSFT_FAIL;
+	}
+
+	if (fork_task_with_thread(thread_exec, sk_pair, &tgl, &thread,
+				  &tgl_pidfd, &thread_pidfd) < 0) {
+		return KSFT_FAIL;
+	}
+
+	if (write(sk_pair[1], &thread, sizeof(thread)) != sizeof(thread)) {
+		perror("read");
+		goto cleanup;
+	}
+
+	// thread will exec(), so this pidfd should eventually be dead (i.e.
+	// poll should return).
+	pollfd.fd = thread_pidfd;
+	pollfd.events = POLLIN;
+
+	ready = poll(&pollfd, 1, -1);
+	if (ready == -1) {
+		perror("poll");
+		goto cleanup;
+	}
+
+	if (ready != 1) {
+		ksft_print_msg("bad poll result %d\n", ready);
+		goto cleanup;
+	}
+
+	if (pollfd.revents != POLLIN) {
+		ksft_print_msg("bad poll revents: %x\n", pollfd.revents);
+		goto cleanup;
+	}
+
+	errno = 0;
+	if (sys_pidfd_getfd(thread_pidfd, 0, 0) > 0) {
+		ksft_print_msg("got a real fd");
+		goto cleanup;
+	}
+
+	if (errno != ESRCH) {
+		ksft_print_msg("polling invalid thread didn't give ESRCH");
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+
+cleanup:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	close(tgl_pidfd);
+	close(thread_pidfd);
+	return ret;
+}
+
+int thread_wait_exit(void *arg)
+{
+	int *sk_pair = arg;
+	pid_t thread;
+
+	if (read(sk_pair[0], &thread, sizeof(thread)) != sizeof(thread)) {
+		perror("read");
+		exit(1);
+	}
+
+	return 0;
+}
+
+static int test_non_tgl_exit(void)
+{
+	pid_t tgl, thread;
+	int sk_pair[2], status;
+	int tgl_pidfd = -1, thread_pidfd = -1;
+	int ret = KSFT_FAIL, ready;
+	struct pollfd pollfd;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		ksft_print_msg("socketpair failed %s\n", strerror(errno));
+		return KSFT_FAIL;
+	}
+
+	if (fork_task_with_thread(thread_wait_exit, sk_pair, &tgl, &thread,
+				  &tgl_pidfd, &thread_pidfd) < 0) {
+		return KSFT_FAIL;
+	}
+
+	if (write(sk_pair[1], &thread, sizeof(thread)) != sizeof(thread)) {
+		perror("write");
+		goto cleanup;
+	}
+
+	// thread will exit, so this pidfd should eventually be dead (i.e.
+	// poll should return).
+	pollfd.fd = thread_pidfd;
+	pollfd.events = POLLIN;
+
+	ready = poll(&pollfd, 1, -1);
+	if (ready == -1) {
+		perror("poll");
+		goto cleanup;
+	}
+
+	if (ready != 1) {
+		ksft_print_msg("bad poll result %d\n", ready);
+		goto cleanup;
+	}
+
+	if (pollfd.revents != POLLIN) {
+		ksft_print_msg("bad poll revents: %x\n", pollfd.revents);
+		goto cleanup;
+	}
+
+	errno = 0;
+	if (sys_pidfd_getfd(thread_pidfd, 0, 0) > 0) {
+		ksft_print_msg("got a real pidfd");
+		goto cleanup;
+	}
+
+	if (errno != ESRCH) {
+		ksft_print_msg("polling invalid thread didn't give ESRCH");
+		goto cleanup;
+	}
+
+	close(thread_pidfd);
+
+	// ok, but the thread group *leader* should still be alive
+	pollfd.fd = tgl_pidfd;
+	ready = poll(&pollfd, 1, 1);
+	if (ready == -1) {
+		perror("poll");
+		goto cleanup;
+	}
+
+	if (ready != 0) {
+		ksft_print_msg("polling leader returned something?! %x", pollfd.revents);
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+
+cleanup:
+	kill(tgl, SIGKILL);
+	if (waitpid(tgl, &status, 0) != tgl) {
+		perror("waitpid");
+		return KSFT_FAIL;
+	}
+
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct pidfd_non_tgl_test {
+	int (*fn)();
+	const char *name;
+} tests[] = {
+	T(test_non_tgl_basic),
+	T(test_non_tgl_exec),
+	T(test_non_tgl_exit),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+	int i, ret = EXIT_SUCCESS;
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn()) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.34.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [RFC 3/3] clone: allow CLONE_THREAD | CLONE_PIDFD together
  2023-11-30 16:39 [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Tycho Andersen
  2023-11-30 16:39 ` [RFC 2/3] selftests/pidfd: add non-thread-group leader tests Tycho Andersen
@ 2023-11-30 16:39 ` Tycho Andersen
  2023-11-30 17:39 ` [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Oleg Nesterov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-11-30 16:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen, Tycho Andersen
From: Tycho Andersen <tandersen@netflix.com>
This removes the restriction of CLONE_THREAD | CLONE_PIDFD being specified
together. Assuming the previous patch sorts out all the thorny issues this
should be safe. I've left it as a separate patch since it is not strictly
necessary as a usecase for us, but might be nice? Perhaps we want to wait
until someone actually needs it though.
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 kernel/fork.c                                   |  3 +--
 .../selftests/pidfd/pidfd_non_tgl_test.c        | 17 +++++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index eef15c93f6cf..ada476f38b56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2302,9 +2302,8 @@ __latent_entropy struct task_struct *copy_process(
 		/*
 		 * - CLONE_DETACHED is blocked so that we can potentially
 		 *   reuse it later for CLONE_PIDFD.
-		 * - CLONE_THREAD is blocked until someone really needs it.
 		 */
-		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
+		if (clone_flags & CLONE_DETACHED)
 			return ERR_PTR(-EINVAL);
 	}
 
diff --git a/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c b/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c
index e3992f2d88cf..dfd6a2cd85a3 100644
--- a/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_non_tgl_test.c
@@ -305,6 +305,22 @@ static int test_non_tgl_exit(void)
 	return ret;
 }
 
+static int test_clone_thread_pidfd(void)
+{
+	pid_t pid;
+	int flags = CLONE_THREAD | CLONE_VM | CLONE_SIGHAND | CLONE_PIDFD;
+	int pidfd;
+
+	pid = clone(thread_sleep, stack + STACK_SIZE, flags, NULL, &pidfd);
+	if (pid < 0) {
+		perror("clone");
+		return KSFT_FAIL;
+	}
+
+	close(pidfd);
+	return KSFT_PASS;
+}
+
 #define T(x) { x, #x }
 struct pidfd_non_tgl_test {
 	int (*fn)();
@@ -313,6 +329,7 @@ struct pidfd_non_tgl_test {
 	T(test_non_tgl_basic),
 	T(test_non_tgl_exec),
 	T(test_non_tgl_exit),
+	T(test_clone_thread_pidfd),
 };
 #undef T
 
-- 
2.34.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 16:39 [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Tycho Andersen
  2023-11-30 16:39 ` [RFC 2/3] selftests/pidfd: add non-thread-group leader tests Tycho Andersen
  2023-11-30 16:39 ` [RFC 3/3] clone: allow CLONE_THREAD | CLONE_PIDFD together Tycho Andersen
@ 2023-11-30 17:39 ` Oleg Nesterov
  2023-11-30 17:56   ` Tycho Andersen
  2023-11-30 18:37 ` Florian Weimer
  2023-12-07 17:21 ` Christian Brauner
  4 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2023-11-30 17:39 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Christian Brauner, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen
Hi Tycho,
I can't really read this patch now, possibly I am wrong, but...
On 11/30, Tycho Andersen wrote:
>
> @@ -263,16 +263,25 @@ void release_task(struct task_struct *p)
>  	 */
>  	zap_leader = 0;
>  	leader = p->group_leader;
> -	if (leader != p && thread_group_empty(leader)
> -			&& leader->exit_state == EXIT_ZOMBIE) {
> -		/*
> -		 * If we were the last child thread and the leader has
> -		 * exited already, and the leader's parent ignores SIGCHLD,
> -		 * then we are the one who should release the leader.
> -		 */
> -		zap_leader = do_notify_parent(leader, leader->exit_signal);
> -		if (zap_leader)
> -			leader->exit_state = EXIT_DEAD;
> +	if (leader != p) {
> +		if (thread_group_empty(leader)
> +				&& leader->exit_state == EXIT_ZOMBIE) {
> +			/*
> +			 * If we were the last child thread and the leader has
> +			 * exited already, and the leader's parent ignores SIGCHLD,
> +			 * then we are the one who should release the leader.
> +			 */
> +			zap_leader = do_notify_parent(leader,
> +						      leader->exit_signal);
> +			if (zap_leader)
> +				leader->exit_state = EXIT_DEAD;
> +		} else {
> +			/*
> +			 * wake up pidfd pollers anyway, they want to know this
> +			 * thread is dying.
> +			 */
> +			wake_up_all(&thread_pid->wait_pidfd);
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
somehow I can't believe this is a good change after a quick glance ;)
I think that wake_up_all(wait_pidfd) should have a single caller,
do_notify_pidfd(). This probably means it should be shiftef from
do_notify_parent() to exit_notify(), I am not sure...
No?
Oleg.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 17:39 ` [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Oleg Nesterov
@ 2023-11-30 17:56   ` Tycho Andersen
  2023-12-01 16:31     ` Tycho Andersen
  0 siblings, 1 reply; 23+ messages in thread
From: Tycho Andersen @ 2023-11-30 17:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen
On Thu, Nov 30, 2023 at 06:39:39PM +0100, Oleg Nesterov wrote:
> Hi Tycho,
> 
> I can't really read this patch now, possibly I am wrong, but...
No worries, no rush here.
> On 11/30, Tycho Andersen wrote:
> >
> > @@ -263,16 +263,25 @@ void release_task(struct task_struct *p)
> >  	 */
> >  	zap_leader = 0;
> >  	leader = p->group_leader;
> > -	if (leader != p && thread_group_empty(leader)
> > -			&& leader->exit_state == EXIT_ZOMBIE) {
> > -		/*
> > -		 * If we were the last child thread and the leader has
> > -		 * exited already, and the leader's parent ignores SIGCHLD,
> > -		 * then we are the one who should release the leader.
> > -		 */
> > -		zap_leader = do_notify_parent(leader, leader->exit_signal);
> > -		if (zap_leader)
> > -			leader->exit_state = EXIT_DEAD;
> > +	if (leader != p) {
> > +		if (thread_group_empty(leader)
> > +				&& leader->exit_state == EXIT_ZOMBIE) {
> > +			/*
> > +			 * If we were the last child thread and the leader has
> > +			 * exited already, and the leader's parent ignores SIGCHLD,
> > +			 * then we are the one who should release the leader.
> > +			 */
> > +			zap_leader = do_notify_parent(leader,
> > +						      leader->exit_signal);
> > +			if (zap_leader)
> > +				leader->exit_state = EXIT_DEAD;
> > +		} else {
> > +			/*
> > +			 * wake up pidfd pollers anyway, they want to know this
> > +			 * thread is dying.
> > +			 */
> > +			wake_up_all(&thread_pid->wait_pidfd);
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> somehow I can't believe this is a good change after a quick glance ;)
Yeah, I figured it would raise some eyebrows :)
> I think that wake_up_all(wait_pidfd) should have a single caller,
> do_notify_pidfd(). This probably means it should be shiftef from
> do_notify_parent() to exit_notify(), I am not sure...
__exit_signals() is what I was thinking in the patch description, but
I'll look at exit_notify() too.
Tycho
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 16:39 [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Tycho Andersen
                   ` (2 preceding siblings ...)
  2023-11-30 17:39 ` [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Oleg Nesterov
@ 2023-11-30 18:37 ` Florian Weimer
  2023-11-30 18:54   ` Tycho Andersen
  2023-12-07 17:21 ` Christian Brauner
  4 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer @ 2023-11-30 18:37 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Christian Brauner, Oleg Nesterov, Eric W . Biederman,
	linux-kernel, linux-api, Tycho Andersen, mathieu.desnoyers
* Tycho Andersen:
> From: Tycho Andersen <tandersen@netflix.com>
>
> We are using the pidfd family of syscalls with the seccomp userspace
> notifier. When some thread triggers a seccomp notification, we want to do
> some things to its context (munge fd tables via pidfd_getfd(), maybe write
> to its memory, etc.). However, threads created with ~CLONE_FILES or
> ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
> purpose, since their fd table or mm are distinct from the thread group
> leader's. In this patch, we relax this restriction for pidfd_open().
Does this mean that pidfd_getfd cannot currently be used to get
descriptors for a TID if that TID doesn't happen to share its descriptor
set with the thread group leader?
I'd like to offer a userspace API which allows safe stashing of
unreachable file descriptors on a service thread.
Cc:ing Mathieu because of our previous discussions?
Thanks,
Florian
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 18:37 ` Florian Weimer
@ 2023-11-30 18:54   ` Tycho Andersen
  2023-11-30 19:00     ` Mathieu Desnoyers
  0 siblings, 1 reply; 23+ messages in thread
From: Tycho Andersen @ 2023-11-30 18:54 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christian Brauner, Oleg Nesterov, Eric W . Biederman,
	linux-kernel, linux-api, Tycho Andersen, mathieu.desnoyers
On Thu, Nov 30, 2023 at 07:37:02PM +0100, Florian Weimer wrote:
> * Tycho Andersen:
> 
> > From: Tycho Andersen <tandersen@netflix.com>
> >
> > We are using the pidfd family of syscalls with the seccomp userspace
> > notifier. When some thread triggers a seccomp notification, we want to do
> > some things to its context (munge fd tables via pidfd_getfd(), maybe write
> > to its memory, etc.). However, threads created with ~CLONE_FILES or
> > ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
> > purpose, since their fd table or mm are distinct from the thread group
> > leader's. In this patch, we relax this restriction for pidfd_open().
> 
> Does this mean that pidfd_getfd cannot currently be used to get
> descriptors for a TID if that TID doesn't happen to share its descriptor
> set with the thread group leader?
Correct, that's what I'm trying to solve.
> I'd like to offer a userspace API which allows safe stashing of
> unreachable file descriptors on a service thread.
By "safe" here do you mean not accessible via pidfd_getfd()?
Tycho
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 18:54   ` Tycho Andersen
@ 2023-11-30 19:00     ` Mathieu Desnoyers
  2023-11-30 19:17       ` Tycho Andersen
  2023-11-30 19:43       ` Florian Weimer
  0 siblings, 2 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2023-11-30 19:00 UTC (permalink / raw)
  To: Tycho Andersen, Florian Weimer
  Cc: Christian Brauner, Oleg Nesterov, Eric W . Biederman,
	linux-kernel, linux-api, Tycho Andersen
On 2023-11-30 13:54, Tycho Andersen wrote:
> On Thu, Nov 30, 2023 at 07:37:02PM +0100, Florian Weimer wrote:
>> * Tycho Andersen:
>>
>>> From: Tycho Andersen <tandersen@netflix.com>
>>>
>>> We are using the pidfd family of syscalls with the seccomp userspace
>>> notifier. When some thread triggers a seccomp notification, we want to do
>>> some things to its context (munge fd tables via pidfd_getfd(), maybe write
>>> to its memory, etc.). However, threads created with ~CLONE_FILES or
>>> ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
>>> purpose, since their fd table or mm are distinct from the thread group
>>> leader's. In this patch, we relax this restriction for pidfd_open().
>>
>> Does this mean that pidfd_getfd cannot currently be used to get
>> descriptors for a TID if that TID doesn't happen to share its descriptor
>> set with the thread group leader?
> 
> Correct, that's what I'm trying to solve.
> 
>> I'd like to offer a userspace API which allows safe stashing of
>> unreachable file descriptors on a service thread.
> 
> By "safe" here do you mean not accessible via pidfd_getfd()?
For the LTTng-UST use-case, we need to be able to create and
use a file descriptor from an agent thread injected within the target
process in a way that is safe against patterns where the application
blindly close all file descriptors (for-loop doing close(2),
closefrom(2) or closeall(2)).
The main issue here is that even though we could handle errors
(-1, errno=EBADF) in the sendmsg/recvmsg calls, re-use of a file
descriptor by the application can lead to data corruption, which
is certainly an unwanted consequence.
AFAIU glibc has similar requirements with respect to io_uring
file descriptors.
Thanks,
Mathieu
-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 19:00     ` Mathieu Desnoyers
@ 2023-11-30 19:17       ` Tycho Andersen
  2023-11-30 19:43       ` Florian Weimer
  1 sibling, 0 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-11-30 19:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Florian Weimer, Christian Brauner, Oleg Nesterov,
	Eric W . Biederman, linux-kernel, linux-api, Tycho Andersen,
	Kees Cook
On Thu, Nov 30, 2023 at 02:00:01PM -0500, Mathieu Desnoyers wrote:
> On 2023-11-30 13:54, Tycho Andersen wrote:
> > On Thu, Nov 30, 2023 at 07:37:02PM +0100, Florian Weimer wrote:
> > > * Tycho Andersen:
> > > 
> > > > From: Tycho Andersen <tandersen@netflix.com>
> > > > 
> > > > We are using the pidfd family of syscalls with the seccomp userspace
> > > > notifier. When some thread triggers a seccomp notification, we want to do
> > > > some things to its context (munge fd tables via pidfd_getfd(), maybe write
> > > > to its memory, etc.). However, threads created with ~CLONE_FILES or
> > > > ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
> > > > purpose, since their fd table or mm are distinct from the thread group
> > > > leader's. In this patch, we relax this restriction for pidfd_open().
> > > 
> > > Does this mean that pidfd_getfd cannot currently be used to get
> > > descriptors for a TID if that TID doesn't happen to share its descriptor
> > > set with the thread group leader?
> > 
> > Correct, that's what I'm trying to solve.
> > 
> > > I'd like to offer a userspace API which allows safe stashing of
> > > unreachable file descriptors on a service thread.
> > 
> > By "safe" here do you mean not accessible via pidfd_getfd()?
> 
> For the LTTng-UST use-case, we need to be able to create and
> use a file descriptor from an agent thread injected within the target
> process in a way that is safe against patterns where the application
> blindly close all file descriptors (for-loop doing close(2),
> closefrom(2) or closeall(2)).
> 
> The main issue here is that even though we could handle errors
> (-1, errno=EBADF) in the sendmsg/recvmsg calls, re-use of a file
> descriptor by the application can lead to data corruption, which
> is certainly an unwanted consequence.
> 
> AFAIU glibc has similar requirements with respect to io_uring
> file descriptors.
I see, thanks. And this introduces another problem: what if one of
these things is a memfd, then that memory needs to be invisible to the
process as well presumably?
This "invisible to the process" mapping would solve another
longstanding problem with seccomp: handlers could copy syscall
arguments to this safe memory area and then _CONTINUE things safely...
Tycho
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 19:00     ` Mathieu Desnoyers
  2023-11-30 19:17       ` Tycho Andersen
@ 2023-11-30 19:43       ` Florian Weimer
  2023-12-06 15:27         ` Tycho Andersen
  2023-12-07 22:58         ` Christian Brauner
  1 sibling, 2 replies; 23+ messages in thread
From: Florian Weimer @ 2023-11-30 19:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tycho Andersen, Christian Brauner, Oleg Nesterov,
	Eric W . Biederman, linux-kernel, linux-api, Tycho Andersen
* Mathieu Desnoyers:
>>> I'd like to offer a userspace API which allows safe stashing of
>>> unreachable file descriptors on a service thread.
>> By "safe" here do you mean not accessible via pidfd_getfd()?
No, unreachable by close/close_range/dup2/dup3.  I expect we can do an
intra-process transfer using /proc, but I'm hoping for something nicer.
Thanks,
Florian
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 17:56   ` Tycho Andersen
@ 2023-12-01 16:31     ` Tycho Andersen
  2023-12-07 17:57       ` Christian Brauner
  0 siblings, 1 reply; 23+ messages in thread
From: Tycho Andersen @ 2023-12-01 16:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen
On Thu, Nov 30, 2023 at 10:57:01AM -0700, Tycho Andersen wrote:
> On Thu, Nov 30, 2023 at 06:39:39PM +0100, Oleg Nesterov wrote:
> > I think that wake_up_all(wait_pidfd) should have a single caller,
> > do_notify_pidfd(). This probably means it should be shiftef from
> > do_notify_parent() to exit_notify(), I am not sure...
Indeed, below passes the tests without issue and is much less ugly.
I'll respin with that later next week sometime.
Thanks,
Tycho
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3499c1a8b929..04c4423ebed0 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -332,6 +332,7 @@ extern int kill_pid_usb_asyncio(int sig, int errno, sigval_t addr, struct pid *,
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
+extern void do_notify_pidfd(struct task_struct *);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int);
 extern void force_fatal_sig(int);
diff --git a/kernel/exit.c b/kernel/exit.c
index 34eeefc7ee21..fd6048c20c48 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -769,6 +769,8 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		wake_up_process(tsk->signal->group_exec_task);
 	write_unlock_irq(&tasklist_lock);
 
+	do_notify_pidfd(tsk);
+
 	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
 		list_del_init(&p->ptrace_entry);
 		release_task(p);
diff --git a/kernel/signal.c b/kernel/signal.c
index 47a7602dfe8d..7b3a1e147225 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2028,7 +2028,7 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	return ret;
 }
 
-static void do_notify_pidfd(struct task_struct *task)
+void do_notify_pidfd(struct task_struct *task)
 {
 	struct pid *pid;
 
@@ -2060,9 +2060,6 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	WARN_ON_ONCE(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
-	/* Wake up all pidfd waiters */
-	do_notify_pidfd(tsk);
-
 	if (sig != SIGCHLD) {
 		/*
 		 * This is only possible if parent == real_parent.
^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 19:43       ` Florian Weimer
@ 2023-12-06 15:27         ` Tycho Andersen
  2023-12-07 22:58         ` Christian Brauner
  1 sibling, 0 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-12-06 15:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, Christian Brauner, Oleg Nesterov,
	Eric W . Biederman, linux-kernel, linux-api, Tycho Andersen
On Thu, Nov 30, 2023 at 08:43:18PM +0100, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
> >>> I'd like to offer a userspace API which allows safe stashing of
> >>> unreachable file descriptors on a service thread.
> 
> >> By "safe" here do you mean not accessible via pidfd_getfd()?
> 
> No, unreachable by close/close_range/dup2/dup3.  I expect we can do an
> intra-process transfer using /proc, but I'm hoping for something nicer.
It occurred to me that we could get the seccomp() protected-memory
functionality almost all the way via some combination of
memfd_create(MFD_ALLOW_SEALING), fcntl(F_SEAL_WRITE|F_SEAL_SEAL), and
mmap(PROT_NONE). Some other thread could come along and unmap/remap,
but perhaps with some kind of F_SEAL_NOUNMAP married to one of these
special files we could both get what we want?
I submitted a talk to FOSDEM just for grins, if anyone is planning to
attend that.
Tycho
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 16:39 [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Tycho Andersen
                   ` (3 preceding siblings ...)
  2023-11-30 18:37 ` Florian Weimer
@ 2023-12-07 17:21 ` Christian Brauner
  2023-12-07 17:52   ` Tycho Andersen
  2023-12-08 17:47   ` Jan Kara
  4 siblings, 2 replies; 23+ messages in thread
From: Christian Brauner @ 2023-12-07 17:21 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Oleg Nesterov, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen, Jan Kara, linux-fsdevel, Joel Fernandes
[Cc fsdevel & Jan because we had some discussions about fanotify
returning non-thread-group pidfds. That's just for awareness or in case
this might need special handling.]
On Thu, Nov 30, 2023 at 09:39:44AM -0700, Tycho Andersen wrote:
> From: Tycho Andersen <tandersen@netflix.com>
> 
> We are using the pidfd family of syscalls with the seccomp userspace
> notifier. When some thread triggers a seccomp notification, we want to do
> some things to its context (munge fd tables via pidfd_getfd(), maybe write
> to its memory, etc.). However, threads created with ~CLONE_FILES or
> ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
> purpose, since their fd table or mm are distinct from the thread group
> leader's. In this patch, we relax this restriction for pidfd_open().
> 
> In order to avoid dangling poll() users we need to notify pidfd waiters
> when individual threads die, but once we do that all the other machinery
> seems to work ok viz. the tests. But I suppose there are more cases than
> just this one.
> 
> Another weirdness is the open-coding of this vs. exporting using
> do_notify_pidfd(). This particular location is after __exit_signal() is
> called, which does __unhash_process() which kills ->thread_pid, so we need
> to use the copy we have locally, vs do_notify_pid() which accesses it via
> task_pid(). Maybe this suggests that the notification should live somewhere
> in __exit_signals()? I just put it here because I saw we were already
> testing if this task was the leader.
> 
> Signed-off-by: Tycho Andersen <tandersen@netflix.com>
> ---
So we've always said that if there's a use-case for this then we're
willing to support it. And I think that stance hasn't changed. I know
that others have expressed interest in this as well.
So currently the series only enables pidfds for threads to be created
and allows notifications for threads. But all places that currently make
use of pidfds refuse non-thread-group leaders. We can certainly proceed
with a patch series that only enables creation and exit notification but
we should also consider unlocking additional functionality:
* audit of all callers that use pidfd_get_task()
  (1) process_madvise()
  (2) process_mrlease()
  I expect that both can handle threads just fine but we'd need an Ack
  from mm people.
* pidfd_prepare() is used to create pidfds for:
  (1) CLONE_PIDFD via clone() and clone3()
  (2) SCM_PIDFD and SO_PEERPIDFD
  (3) fanotify
  
  (1) is what this series here is about.
  For (2) we need to check whether fanotify would be ok to handle pidfds
  for threads. It might be fine but Jan will probably know more.
  For (3) the change doesn't matter because SCM_CREDS always use the
  thread-group leader. So even if we allowed the creation of pidfds for
  threads it wouldn't matter.
* audit all callers of pidfd_pid() whether they could simply be switched
  to handle individual threads:
  (1) setns() handles threads just fine so this is safe to allow.
  (2) pidfd_getfd() I would like to keep restricted and essentially
      freeze new features for it.
      I'm not happy that we did didn't just implement it as an ioctl to
      the seccomp notifier. And I wouldn't oppose a patch that would add
      that functionality to the seccomp notifier itself. But that's a
      separate topic.
  (3) pidfd_send_signal(). I think that one is the most interesting on
      to allow signaling individual threads. I'm not sure that you need
      to do this right now in this patch but we need to think about what
      we want to do there.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-07 17:21 ` Christian Brauner
@ 2023-12-07 17:52   ` Tycho Andersen
  2023-12-08 17:47   ` Jan Kara
  1 sibling, 0 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-12-07 17:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen, Jan Kara, linux-fsdevel, Joel Fernandes
On Thu, Dec 07, 2023 at 06:21:18PM +0100, Christian Brauner wrote:
> [Cc fsdevel & Jan because we had some discussions about fanotify
> returning non-thread-group pidfds. That's just for awareness or in case
> this might need special handling.]
> 
> On Thu, Nov 30, 2023 at 09:39:44AM -0700, Tycho Andersen wrote:
> > From: Tycho Andersen <tandersen@netflix.com>
> > 
> > We are using the pidfd family of syscalls with the seccomp userspace
> > notifier. When some thread triggers a seccomp notification, we want to do
> > some things to its context (munge fd tables via pidfd_getfd(), maybe write
> > to its memory, etc.). However, threads created with ~CLONE_FILES or
> > ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
> > purpose, since their fd table or mm are distinct from the thread group
> > leader's. In this patch, we relax this restriction for pidfd_open().
> > 
> > In order to avoid dangling poll() users we need to notify pidfd waiters
> > when individual threads die, but once we do that all the other machinery
> > seems to work ok viz. the tests. But I suppose there are more cases than
> > just this one.
> > 
> > Another weirdness is the open-coding of this vs. exporting using
> > do_notify_pidfd(). This particular location is after __exit_signal() is
> > called, which does __unhash_process() which kills ->thread_pid, so we need
> > to use the copy we have locally, vs do_notify_pid() which accesses it via
> > task_pid(). Maybe this suggests that the notification should live somewhere
> > in __exit_signals()? I just put it here because I saw we were already
> > testing if this task was the leader.
> > 
> > Signed-off-by: Tycho Andersen <tandersen@netflix.com>
> > ---
> 
> So we've always said that if there's a use-case for this then we're
> willing to support it. And I think that stance hasn't changed. I know
> that others have expressed interest in this as well.
> 
> So currently the series only enables pidfds for threads to be created
> and allows notifications for threads. But all places that currently make
> use of pidfds refuse non-thread-group leaders. We can certainly proceed
> with a patch series that only enables creation and exit notification but
> we should also consider unlocking additional functionality:
> 
> * audit of all callers that use pidfd_get_task()
> 
>   (1) process_madvise()
>   (2) process_mrlease()
> 
>   I expect that both can handle threads just fine but we'd need an Ack
>   from mm people.
> 
> * pidfd_prepare() is used to create pidfds for:
> 
>   (1) CLONE_PIDFD via clone() and clone3()
>   (2) SCM_PIDFD and SO_PEERPIDFD
>   (3) fanotify
>   
>   (1) is what this series here is about.
> 
>   For (2) we need to check whether fanotify would be ok to handle pidfds
>   for threads. It might be fine but Jan will probably know more.
> 
>   For (3) the change doesn't matter because SCM_CREDS always use the
>   thread-group leader. So even if we allowed the creation of pidfds for
>   threads it wouldn't matter.
> * audit all callers of pidfd_pid() whether they could simply be switched
>   to handle individual threads:
> 
>   (1) setns() handles threads just fine so this is safe to allow.
>   (2) pidfd_getfd() I would like to keep restricted and essentially
>       freeze new features for it.
> 
>       I'm not happy that we did didn't just implement it as an ioctl to
>       the seccomp notifier. And I wouldn't oppose a patch that would add
>       that functionality to the seccomp notifier itself. But that's a
>       separate topic.
>   (3) pidfd_send_signal(). I think that one is the most interesting on
>       to allow signaling individual threads. I'm not sure that you need
>       to do this right now in this patch but we need to think about what
>       we want to do there.
This all sounds reasonable to me, I can take a look as time permits.
pidfd_send_signal() at the very least would have been useful while
writing these tests.
Tycho
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-01 16:31     ` Tycho Andersen
@ 2023-12-07 17:57       ` Christian Brauner
  2023-12-07 21:25         ` Christian Brauner
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2023-12-07 17:57 UTC (permalink / raw)
  To: Tycho Andersen, Oleg Nesterov
  Cc: Eric W . Biederman, linux-kernel, linux-api, Tycho Andersen,
	Jan Kara, linux-fsdevel, Joel Fernandes
On Fri, Dec 01, 2023 at 09:31:40AM -0700, Tycho Andersen wrote:
> On Thu, Nov 30, 2023 at 10:57:01AM -0700, Tycho Andersen wrote:
> > On Thu, Nov 30, 2023 at 06:39:39PM +0100, Oleg Nesterov wrote:
> > > I think that wake_up_all(wait_pidfd) should have a single caller,
> > > do_notify_pidfd(). This probably means it should be shiftef from
> > > do_notify_parent() to exit_notify(), I am not sure...
> 
> Indeed, below passes the tests without issue and is much less ugly.
So I think I raised that question on another medium already but what
does the interaction with de_thread() look like?
Say some process creates pidfd for a thread in a non-empty thread-group
is created via CLONE_PIDFD. The pidfd_file->private_data is set to
struct pid of that task. The task the pidfd refers to later exec's.
Once it passed de_thread() the task the pidfd refers to assumes the
struct pid of the old thread-group leader and continues.
At the same time, the old thread-group leader now assumes the struct pid
of the task that just exec'd.
So after de_thread() the pidfd now referes to the old thread-group
leaders struct pid. Any subsequent operation will fail because the
process has already exited.
Basically, the pidfd now refers to the old thread-group leader and any
subsequent operation will fail even though the task still exists.
Conversely, if someone had created a pidfd that referred to the old
thread-group leader task then this pidfd will now suddenly refer to the
new thread-group leader task for the same reason: the struct pid's were
exchanged.
So this also means, iiuc, that the pidfd could now be passed to
waitid(P_PIFD) to retrieve the status of the old thread-group leader
that just got zapped.
And for the case where the pidfd referred to the old thread-group leader
task you would now suddenly _not_ be able to wait on that task anymore.
If these concerns are correct, then I think we need to decide what
semantics we want and how to handle this because that's not ok.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-07 17:57       ` Christian Brauner
@ 2023-12-07 21:25         ` Christian Brauner
  2023-12-08 20:04           ` Tycho Andersen
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2023-12-07 21:25 UTC (permalink / raw)
  To: Tycho Andersen, Oleg Nesterov
  Cc: Eric W . Biederman, linux-kernel, linux-api, Tycho Andersen,
	Jan Kara, linux-fsdevel, Joel Fernandes
> If these concerns are correct
So, ok. I misremebered this. The scenario I had been thinking of is
basically the following.
We have a thread-group with thread-group leader 1234 and a thread with
4567 in that thread-group. Assume current thread-group leader is tsk1
and the non-thread-group leader is tsk2. tsk1 uses struct pid *tg_pid
and tsk2 uses struct pid *t_pid. The struct pids look like this after
creation of both thread-group leader tsk1 and thread tsk2:
	TGID 1234				TID 4567 
	tg_pid[PIDTYPE_PID]  = tsk1		t_pid[PIDTYPE_PID]  = tsk2
	tg_pid[PIDTYPE_TGID] = tsk1		t_pid[PIDTYPE_TGID] = NULL
IOW, tsk2's struct pid has never been used as a thread-group leader and
thus PIDTYPE_TGID is NULL. Now assume someone does create pidfds for
tsk1 and for tsk2:
	
	tg_pidfd = pidfd_open(tsk1)		t_pidfd = pidfd_open(tsk2)
	-> tg_pidfd->private_data = tg_pid	-> t_pidfd->private_data = t_pid
So we stash away struct pid *tg_pid for a pidfd_open() on tsk1 and we
stash away struct pid *t_pid for a pidfd_open() on tsk2.
If we wait on that task via P_PIDFD we get:
				/* waiting through pidfd */
	waitid(P_PIDFD, tg_pidfd)		waitid(P_PIDFD, t_pidfd)
	tg_pid[PIDTYPE_TGID] == tsk1		t_pid[PIDTYPE_TGID] == NULL
	=> succeeds				=> fails
Because struct pid *tg_pid is used a thread-group leader struct pid we
can wait on that tsk1. But we can't via the non-thread-group leader
pidfd because the struct pid *t_pid has never been used as a
thread-group leader.
Now assume, t_pid exec's and the struct pids are transfered. IIRC, we
get:
	tg_pid[PIDTYPE_PID]   = tsk2		t_pid[PIDTYPE_PID]   = tsk1
	tg_pid[PIDTYPE_TGID]  = tsk2		t_pid[PIDTYPE_TGID]  = NULL
If we wait on that task via P_PIDFD we get:
	
				/* waiting through pidfd */
	waitid(P_PIDFD, tg_pidfd)		waitid(P_PIDFD, t_pid)
	tg_pid[PIDTYPE_TGID] == tsk2		t_pid[PIDTYPE_TGID] == NULL
	=> succeeds				=> fails
Which is what we want. So effectively this should all work and I
misremembered the struct pid linkage. So afaict we don't even have a
problem here which is great.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-11-30 19:43       ` Florian Weimer
  2023-12-06 15:27         ` Tycho Andersen
@ 2023-12-07 22:58         ` Christian Brauner
  2023-12-08  3:16           ` Jens Axboe
  2023-12-08 13:15           ` Florian Weimer
  1 sibling, 2 replies; 23+ messages in thread
From: Christian Brauner @ 2023-12-07 22:58 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, Tycho Andersen, linux-kernel, linux-api,
	Jan Kara, linux-fsdevel, Jens Axboe
[adjusting Cc as that's really a separate topic]
On Thu, Nov 30, 2023 at 08:43:18PM +0100, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
> >>> I'd like to offer a userspace API which allows safe stashing of
> >>> unreachable file descriptors on a service thread.
Fwiw, systemd has a concept called the fdstore:
https://systemd.io/FILE_DESCRIPTOR_STORE
"The file descriptor store [...] allows services to upload during
runtime additional fds to the service manager that it shall keep on its
behalf. File descriptors are passed back to the service on subsequent
activations, the same way as any socket activation fds are passed.
[...]
The primary use-case of this logic is to permit services to restart
seamlessly (for example to update them to a newer version), without
losing execution context, dropping pinned resources, terminating
established connections or even just momentarily losing connectivity. In
fact, as the file descriptors can be uploaded freely at any time during
the service runtime, this can even be used to implement services that
robustly handle abnormal termination and can recover from that without
losing pinned resources."
> 
> >> By "safe" here do you mean not accessible via pidfd_getfd()?
> 
> No, unreachable by close/close_range/dup2/dup3.  I expect we can do an
> intra-process transfer using /proc, but I'm hoping for something nicer.
File descriptors are reachable for all processes/threads that share a
file descriptor table. Changing that means breaking core userspace
assumptions about how file descriptors work. That's not going to happen
as far as I'm concerned.
We may consider additional security_* hooks in close*() and dup*(). That
would allow you to utilize Landlock or BPF LSM to prevent file
descriptors from being closed or duplicated. pidfd_getfd() is already
blockable via security_file_receive().
In general, messing with fds in that way is really not a good idea.
If you need something that awkward, then you should go all the way and
look at io_uring which basically has a separate fd-like handle called
"fixed files".
Fixed file indexes are separate file-descriptor like handles that can
only be used from io_uring calls but not with the regular system call
interface.
IOW, you can refer to a file using an io_uring fixed index. The index to
use can be chosen by userspace and can't be used with any regular
fd-based system calls.
The io_uring fd itself can be made a fixed file itself
The only thing missing would be to turn an io_uring fixed file back into
a regular file descriptor. That could probably be done by using
receive_fd() and then installing that fd back into the caller's file
descriptor table. But that would require an io_uring patch.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-07 22:58         ` Christian Brauner
@ 2023-12-08  3:16           ` Jens Axboe
  2023-12-08 13:15           ` Florian Weimer
  1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2023-12-08  3:16 UTC (permalink / raw)
  To: Christian Brauner, Florian Weimer
  Cc: Mathieu Desnoyers, Tycho Andersen, linux-kernel, linux-api,
	Jan Kara, linux-fsdevel
On 12/7/23 3:58 PM, Christian Brauner wrote:
> [adjusting Cc as that's really a separate topic]
> 
> On Thu, Nov 30, 2023 at 08:43:18PM +0100, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>>
>>>>> I'd like to offer a userspace API which allows safe stashing of
>>>>> unreachable file descriptors on a service thread.
> 
> Fwiw, systemd has a concept called the fdstore:
> 
> https://systemd.io/FILE_DESCRIPTOR_STORE
> 
> "The file descriptor store [...] allows services to upload during
> runtime additional fds to the service manager that it shall keep on its
> behalf. File descriptors are passed back to the service on subsequent
> activations, the same way as any socket activation fds are passed.
> 
> [...]
> 
> The primary use-case of this logic is to permit services to restart
> seamlessly (for example to update them to a newer version), without
> losing execution context, dropping pinned resources, terminating
> established connections or even just momentarily losing connectivity. In
> fact, as the file descriptors can be uploaded freely at any time during
> the service runtime, this can even be used to implement services that
> robustly handle abnormal termination and can recover from that without
> losing pinned resources."
> 
>>
>>>> By "safe" here do you mean not accessible via pidfd_getfd()?
>>
>> No, unreachable by close/close_range/dup2/dup3.  I expect we can do an
>> intra-process transfer using /proc, but I'm hoping for something nicer.
> 
> File descriptors are reachable for all processes/threads that share a
> file descriptor table. Changing that means breaking core userspace
> assumptions about how file descriptors work. That's not going to happen
> as far as I'm concerned.
> 
> We may consider additional security_* hooks in close*() and dup*(). That
> would allow you to utilize Landlock or BPF LSM to prevent file
> descriptors from being closed or duplicated. pidfd_getfd() is already
> blockable via security_file_receive().
> 
> In general, messing with fds in that way is really not a good idea.
> 
> If you need something that awkward, then you should go all the way and
> look at io_uring which basically has a separate fd-like handle called
> "fixed files".
> 
> Fixed file indexes are separate file-descriptor like handles that can
> only be used from io_uring calls but not with the regular system call
> interface.
> 
> IOW, you can refer to a file using an io_uring fixed index. The index to
> use can be chosen by userspace and can't be used with any regular
> fd-based system calls.
> 
> The io_uring fd itself can be made a fixed file itself
> 
> The only thing missing would be to turn an io_uring fixed file back into
> a regular file descriptor. That could probably be done by using
> receive_fd() and then installing that fd back into the caller's file
> descriptor table. But that would require an io_uring patch.
FWIW, since it was very trivial, I posted an rfc/test patch for just
that with a test case. It's here:
https://lore.kernel.org/io-uring/df0e24ff-f3a0-4818-8282-2a4e03b7b5a6@kernel.dk/
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-07 22:58         ` Christian Brauner
  2023-12-08  3:16           ` Jens Axboe
@ 2023-12-08 13:15           ` Florian Weimer
  2023-12-08 13:48             ` Christian Brauner
  1 sibling, 1 reply; 23+ messages in thread
From: Florian Weimer @ 2023-12-08 13:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mathieu Desnoyers, Tycho Andersen, linux-kernel, linux-api,
	Jan Kara, linux-fsdevel, Jens Axboe
* Christian Brauner:
> File descriptors are reachable for all processes/threads that share a
> file descriptor table. Changing that means breaking core userspace
> assumptions about how file descriptors work. That's not going to happen
> as far as I'm concerned.
It already has happened, though?  Threads are free to call
unshare(CLONE_FILES).  I'm sure that we have applications out there that
expect this to work.  At this point, the question is about whether we
want to acknowledge this possibility at the libc level or not.
Thanks,
Florian
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-08 13:15           ` Florian Weimer
@ 2023-12-08 13:48             ` Christian Brauner
  2023-12-08 13:58               ` Florian Weimer
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2023-12-08 13:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, Tycho Andersen, linux-kernel, linux-api,
	Jan Kara, linux-fsdevel, Jens Axboe
On Fri, Dec 08, 2023 at 02:15:58PM +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > File descriptors are reachable for all processes/threads that share a
> > file descriptor table. Changing that means breaking core userspace
> > assumptions about how file descriptors work. That's not going to happen
> > as far as I'm concerned.
> 
> It already has happened, though?  Threads are free to call
> unshare(CLONE_FILES).  I'm sure that we have applications out there that
If you unshare a file descriptor table it will affect all file
descriptors of a given task. We don't allow hiding individual or ranges
of file descriptors from close/dup. That's akin to a partially shared
file descriptor table which is conceptually probably doable but just
plain weird and nasty to get right imho.
This really is either LSM territory to block such operations or use
stuff like io_uring gives you.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-08 13:48             ` Christian Brauner
@ 2023-12-08 13:58               ` Florian Weimer
  0 siblings, 0 replies; 23+ messages in thread
From: Florian Weimer @ 2023-12-08 13:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mathieu Desnoyers, Tycho Andersen, linux-kernel, linux-api,
	Jan Kara, linux-fsdevel, Jens Axboe
* Christian Brauner:
> On Fri, Dec 08, 2023 at 02:15:58PM +0100, Florian Weimer wrote:
>> * Christian Brauner:
>> 
>> > File descriptors are reachable for all processes/threads that share a
>> > file descriptor table. Changing that means breaking core userspace
>> > assumptions about how file descriptors work. That's not going to happen
>> > as far as I'm concerned.
>> 
>> It already has happened, though?  Threads are free to call
>> unshare(CLONE_FILES).  I'm sure that we have applications out there that
>
> If you unshare a file descriptor table it will affect all file
> descriptors of a given task. We don't allow hiding individual or ranges
> of file descriptors from close/dup. That's akin to a partially shared
> file descriptor table which is conceptually probably doable but just
> plain weird and nasty to get right imho.
>
> This really is either LSM territory to block such operations or use
> stuff like io_uring gives you.
Sorry, I misunderstood.  I'm imagining for something that doesn't share
partial tables and relies on explicit action to make available a
descriptor from a separate different table in another table, based on
some unique identifier (that is a bit more random than a file
descriptor).  So a bit similar to the the existing systemd service, but
not targeted at service restarts.
Thanks,
Florian
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-07 17:21 ` Christian Brauner
  2023-12-07 17:52   ` Tycho Andersen
@ 2023-12-08 17:47   ` Jan Kara
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Kara @ 2023-12-08 17:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tycho Andersen, Oleg Nesterov, Eric W . Biederman, linux-kernel,
	linux-api, Tycho Andersen, Jan Kara, linux-fsdevel,
	Joel Fernandes
On Thu 07-12-23 18:21:18, Christian Brauner wrote:
> [Cc fsdevel & Jan because we had some discussions about fanotify
> returning non-thread-group pidfds. That's just for awareness or in case
> this might need special handling.]
Thanks!
> On Thu, Nov 30, 2023 at 09:39:44AM -0700, Tycho Andersen wrote:
> > From: Tycho Andersen <tandersen@netflix.com>
> > 
> > We are using the pidfd family of syscalls with the seccomp userspace
> > notifier. When some thread triggers a seccomp notification, we want to do
> > some things to its context (munge fd tables via pidfd_getfd(), maybe write
> > to its memory, etc.). However, threads created with ~CLONE_FILES or
> > ~CLONE_VM mean that we can't use the pidfd family of syscalls for this
> > purpose, since their fd table or mm are distinct from the thread group
> > leader's. In this patch, we relax this restriction for pidfd_open().
> > 
> > In order to avoid dangling poll() users we need to notify pidfd waiters
> > when individual threads die, but once we do that all the other machinery
> > seems to work ok viz. the tests. But I suppose there are more cases than
> > just this one.
> > 
> > Another weirdness is the open-coding of this vs. exporting using
> > do_notify_pidfd(). This particular location is after __exit_signal() is
> > called, which does __unhash_process() which kills ->thread_pid, so we need
> > to use the copy we have locally, vs do_notify_pid() which accesses it via
> > task_pid(). Maybe this suggests that the notification should live somewhere
> > in __exit_signals()? I just put it here because I saw we were already
> > testing if this task was the leader.
> > 
> > Signed-off-by: Tycho Andersen <tandersen@netflix.com>
> > ---
> 
> So we've always said that if there's a use-case for this then we're
> willing to support it. And I think that stance hasn't changed. I know
> that others have expressed interest in this as well.
> 
> So currently the series only enables pidfds for threads to be created
> and allows notifications for threads. But all places that currently make
> use of pidfds refuse non-thread-group leaders. We can certainly proceed
> with a patch series that only enables creation and exit notification but
> we should also consider unlocking additional functionality:
 
...
> * pidfd_prepare() is used to create pidfds for:
> 
>   (1) CLONE_PIDFD via clone() and clone3()
>   (2) SCM_PIDFD and SO_PEERPIDFD
>   (3) fanotify
So for fanotify there's no problem I can think of. All we do is return the
pidfd we get to userspace with the event to identify the task generating
the event. So in practice this would mean userspace will get proper pidfd
instead of error value (FAN_EPIDFD) for events generated by
non-thread-group leader. IMO a win.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
  2023-12-07 21:25         ` Christian Brauner
@ 2023-12-08 20:04           ` Tycho Andersen
  0 siblings, 0 replies; 23+ messages in thread
From: Tycho Andersen @ 2023-12-08 20:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Eric W . Biederman, linux-kernel, linux-api,
	Tycho Andersen, Jan Kara, linux-fsdevel, Joel Fernandes
On Thu, Dec 07, 2023 at 10:25:09PM +0100, Christian Brauner wrote:
> > If these concerns are correct
> 
> So, ok. I misremebered this. The scenario I had been thinking of is
> basically the following.
> 
> We have a thread-group with thread-group leader 1234 and a thread with
> 4567 in that thread-group. Assume current thread-group leader is tsk1
> and the non-thread-group leader is tsk2. tsk1 uses struct pid *tg_pid
> and tsk2 uses struct pid *t_pid. The struct pids look like this after
> creation of both thread-group leader tsk1 and thread tsk2:
> 
> 	TGID 1234				TID 4567 
> 	tg_pid[PIDTYPE_PID]  = tsk1		t_pid[PIDTYPE_PID]  = tsk2
> 	tg_pid[PIDTYPE_TGID] = tsk1		t_pid[PIDTYPE_TGID] = NULL
> 
> IOW, tsk2's struct pid has never been used as a thread-group leader and
> thus PIDTYPE_TGID is NULL. Now assume someone does create pidfds for
> tsk1 and for tsk2:
> 	
> 	tg_pidfd = pidfd_open(tsk1)		t_pidfd = pidfd_open(tsk2)
> 	-> tg_pidfd->private_data = tg_pid	-> t_pidfd->private_data = t_pid
> 
> So we stash away struct pid *tg_pid for a pidfd_open() on tsk1 and we
> stash away struct pid *t_pid for a pidfd_open() on tsk2.
> 
> If we wait on that task via P_PIDFD we get:
> 
> 				/* waiting through pidfd */
> 	waitid(P_PIDFD, tg_pidfd)		waitid(P_PIDFD, t_pidfd)
> 	tg_pid[PIDTYPE_TGID] == tsk1		t_pid[PIDTYPE_TGID] == NULL
> 	=> succeeds				=> fails
> 
> Because struct pid *tg_pid is used a thread-group leader struct pid we
> can wait on that tsk1. But we can't via the non-thread-group leader
> pidfd because the struct pid *t_pid has never been used as a
> thread-group leader.
> 
> Now assume, t_pid exec's and the struct pids are transfered. IIRC, we
> get:
> 
> 	tg_pid[PIDTYPE_PID]   = tsk2		t_pid[PIDTYPE_PID]   = tsk1
> 	tg_pid[PIDTYPE_TGID]  = tsk2		t_pid[PIDTYPE_TGID]  = NULL
> 
> If we wait on that task via P_PIDFD we get:
> 	
> 				/* waiting through pidfd */
> 	waitid(P_PIDFD, tg_pidfd)		waitid(P_PIDFD, t_pid)
> 	tg_pid[PIDTYPE_TGID] == tsk2		t_pid[PIDTYPE_TGID] == NULL
> 	=> succeeds				=> fails
> 
> Which is what we want. So effectively this should all work and I
> misremembered the struct pid linkage. So afaict we don't even have a
> problem here which is great.
It sounds like we need some tests for waitpid() directly though, to
ensure the semantics stay stable. I can add those and send a v3,
assuming the location of do_notify_pidfd() looks ok to you in v2:
https://lore.kernel.org/all/20231207170946.130823-1-tycho@tycho.pizza/
Tycho
^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2023-12-08 20:04 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-30 16:39 [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Tycho Andersen
2023-11-30 16:39 ` [RFC 2/3] selftests/pidfd: add non-thread-group leader tests Tycho Andersen
2023-11-30 16:39 ` [RFC 3/3] clone: allow CLONE_THREAD | CLONE_PIDFD together Tycho Andersen
2023-11-30 17:39 ` [RFC 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Oleg Nesterov
2023-11-30 17:56   ` Tycho Andersen
2023-12-01 16:31     ` Tycho Andersen
2023-12-07 17:57       ` Christian Brauner
2023-12-07 21:25         ` Christian Brauner
2023-12-08 20:04           ` Tycho Andersen
2023-11-30 18:37 ` Florian Weimer
2023-11-30 18:54   ` Tycho Andersen
2023-11-30 19:00     ` Mathieu Desnoyers
2023-11-30 19:17       ` Tycho Andersen
2023-11-30 19:43       ` Florian Weimer
2023-12-06 15:27         ` Tycho Andersen
2023-12-07 22:58         ` Christian Brauner
2023-12-08  3:16           ` Jens Axboe
2023-12-08 13:15           ` Florian Weimer
2023-12-08 13:48             ` Christian Brauner
2023-12-08 13:58               ` Florian Weimer
2023-12-07 17:21 ` Christian Brauner
2023-12-07 17:52   ` Tycho Andersen
2023-12-08 17:47   ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).