[PATCH v2 00/15] pidfs: provide information after task has been reaped

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 00/15] pidfs: provide information after task has been reaped
@ 2025-03-04  9:41 Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 01/15] pidfs: switch to copy_struct_to_user() Christian Brauner
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Hey,

Various tools need access to information about a process/task even after
it has already been reaped. For example, systemd's journal logs
and uses such information as the cgroup id and exit status to deal with
processes that have been sent via SCM_PIDFD or SCM_PEERPIDFD. By the
time the pidfd is received the process might have already been reaped.

This series aims to provide information by extending the PIDFD_GET_INFO
ioctl to retrieve the exit code and cgroup id. There might be other
stuff that we would want in the future.

Pidfd
polling allows waiting on either task exit or for a task to have been
reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP must
be observed before exit information can be retrieved, i.e., exit
information is only provided once the task has been reaped. 

Note, that if a thread-group leader exits before other threads in the
thread-group then exit information will only be available once the
thread-group is empty. This aligns with wait() as well, where reaping of
a thread-group leader that exited before the thread-group was empty is
delayed until the thread-group is empty.

With PIDFD_INFO_EXIT autoreaping might actually become usable because it
means a parent can ignore SIGCHLD or set SA_NOCLDWAIT and simply use
pidfd polling and PIDFD_INFO_EXIT to get get status information for its
children. The kernel will autocleanup right away instead of delaying.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Call pidfs_exit() from release_task().
- Don't provide exit information once the task has exited but once the
  task has been reaped. This makes for simpler semantics. Thus, call
  pidfs_exit() from release_task().
- Link to v1: https://lore.kernel.org/r/20250228-work-pidfs-kill_on_last_close-v1-0-5bd7e6bb428e@kernel.org

---
Christian Brauner (15):
      pidfs: switch to copy_struct_to_user()
      pidfd: rely on automatic cleanup in __pidfd_prepare()
      pidfs: move setting flags into pidfs_alloc_file()
      pidfs: add inode allocation
      pidfs: record exit code and cgroupid at exit
      pidfs: allow to retrieve exit information
      selftests/pidfd: fix header inclusion
      pidfs/selftests: ensure correct headers for ioctl handling
      selftests/pidfd: move more defines to common header
      selftests/pidfd: add first PIDFD_INFO_EXIT selftest
      selftests/pidfd: add second PIDFD_INFO_EXIT selftest
      selftests/pidfd: add third PIDFD_INFO_EXIT selftest
      selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
      selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
      selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest

 fs/internal.h                                     |   1 +
 fs/libfs.c                                        |   4 +-
 fs/pidfs.c                                        | 182 +++++++++--
 include/linux/pidfs.h                             |   1 +
 include/uapi/linux/pidfd.h                        |   3 +-
 kernel/exit.c                                     |   2 +
 kernel/fork.c                                     |  15 +-
 tools/testing/selftests/pidfd/.gitignore          |   1 +
 tools/testing/selftests/pidfd/Makefile            |   2 +-
 tools/testing/selftests/pidfd/pidfd.h             |  86 +++++
 tools/testing/selftests/pidfd/pidfd_fdinfo_test.c |   1 +
 tools/testing/selftests/pidfd/pidfd_info_test.c   | 372 ++++++++++++++++++++++
 tools/testing/selftests/pidfd/pidfd_open_test.c   |  26 --
 tools/testing/selftests/pidfd/pidfd_setns_test.c  |  45 ---
 14 files changed, 636 insertions(+), 105 deletions(-)
---
base-commit: b1e809e7f64ad47dd232ff072d8ef59c1fe414c5
change-id: 20250227-work-pidfs-kill_on_last_close-a23ddf21db47

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/15] pidfs: switch to copy_struct_to_user()
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 12:42   ` Jeff Layton
  2025-03-04  9:41 ` [PATCH v2 02/15] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

We have a helper that deals with all the required logic.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 049352f973de..aa8c8bda8c8f 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -276,10 +276,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	 * userspace knows about will be copied. If userspace provides a new
 	 * struct, only the bits that the kernel knows about will be copied.
 	 */
-	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
-		return -EFAULT;
-
-	return 0;
+	return copy_struct_to_user(uinfo, usize, &kinfo, sizeof(kinfo), NULL);
 }
 
 static bool pidfs_ioctl_valid(unsigned int cmd)

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 02/15] pidfd: rely on automatic cleanup in __pidfd_prepare()
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 01/15] pidfs: switch to copy_struct_to_user() Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 12:44   ` Jeff Layton
  2025-03-04  9:41 ` [PATCH v2 03/15] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Rely on scope-based cleanup for the allocated file descriptor.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/fork.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..6230f5256bc5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2032,25 +2032,23 @@ static inline void rcu_copy_process(struct task_struct *p)
  */
 static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	int pidfd;
 	struct file *pidfd_file;
 
-	pidfd = get_unused_fd_flags(O_CLOEXEC);
+	CLASS(get_unused_fd, pidfd)(O_CLOEXEC);
 	if (pidfd < 0)
 		return pidfd;
 
 	pidfd_file = pidfs_alloc_file(pid, flags | O_RDWR);
-	if (IS_ERR(pidfd_file)) {
-		put_unused_fd(pidfd);
+	if (IS_ERR(pidfd_file))
 		return PTR_ERR(pidfd_file);
-	}
+
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
 	 */
 	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
 	*ret = pidfd_file;
-	return pidfd;
+	return take_fd(pidfd);
 }
 
 /**

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 03/15] pidfs: move setting flags into pidfs_alloc_file()
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 01/15] pidfs: switch to copy_struct_to_user() Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 02/15] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 12:53   ` Jeff Layton
  2025-03-04  9:41 ` [PATCH v2 04/15] pidfs: add inode allocation Christian Brauner
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Instead od adding it into __pidfd_prepare() place it where the actual
file allocation happens and update the outdated comment.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c    | 4 ++++
 kernel/fork.c | 5 -----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index aa8c8bda8c8f..ecc0dd886714 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 		return ERR_PTR(ret);
 
 	pidfd_file = dentry_open(&path, flags, current_cred());
+	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
+	if (!IS_ERR(pidfd_file))
+		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
+
 	path_put(&path);
 	return pidfd_file;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 6230f5256bc5..8eac9cd3385b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,11 +2042,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (IS_ERR(pidfd_file))
 		return PTR_ERR(pidfd_file);
 
-	/*
-	 * anon_inode_getfile() ignores everything outside of the
-	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
-	 */
-	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
 	*ret = pidfd_file;
 	return take_fd(pidfd);
 }

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 04/15] pidfs: add inode allocation
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (2 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 03/15] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 13:06   ` Jeff Layton
  2025-03-04  9:41 ` [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit Christian Brauner
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index ecc0dd886714..eaecb0a947f0 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -24,6 +24,27 @@
 #include "internal.h"
 #include "mount.h"
 
+static struct kmem_cache *pidfs_cachep __ro_after_init;
+
+/*
+ * Stashes information that userspace needs to access even after the
+ * process has been reaped.
+ */
+struct pidfs_exit_info {
+	__u64 cgroupid;
+	__u64 exit_code;
+};
+
+struct pidfs_inode {
+	struct pidfs_exit_info exit_info;
+	struct inode vfs_inode;
+};
+
+static inline struct pidfs_inode *pidfs_i(struct inode *inode)
+{
+	return container_of(inode, struct pidfs_inode, vfs_inode);
+}
+
 static struct rb_root pidfs_ino_tree = RB_ROOT;
 
 #if BITS_PER_LONG == 32
@@ -492,9 +513,29 @@ static void pidfs_evict_inode(struct inode *inode)
 	put_pid(pid);
 }
 
+static struct inode *pidfs_alloc_inode(struct super_block *sb)
+{
+	struct pidfs_inode *pi;
+
+	pi = alloc_inode_sb(sb, pidfs_cachep, GFP_KERNEL);
+	if (!pi)
+		return NULL;
+
+	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
+
+	return &pi->vfs_inode;
+}
+
+static void pidfs_free_inode(struct inode *inode)
+{
+	kmem_cache_free(pidfs_cachep, pidfs_i(inode));
+}
+
 static const struct super_operations pidfs_sops = {
+	.alloc_inode	= pidfs_alloc_inode,
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= pidfs_evict_inode,
+	.free_inode	= pidfs_free_inode,
 	.statfs		= simple_statfs,
 };
 
@@ -704,8 +745,19 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	return pidfd_file;
 }
 
+static void pidfs_inode_init_once(void *data)
+{
+	struct pidfs_inode *pi = data;
+
+	inode_init_once(&pi->vfs_inode);
+}
+
 void __init pidfs_init(void)
 {
+	pidfs_cachep = kmem_cache_create("pidfs_cache", sizeof(struct pidfs_inode), 0,
+					 (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT |
+					  SLAB_ACCOUNT | SLAB_PANIC),
+					 pidfs_inode_init_once);
 	pidfs_mnt = kern_mount(&pidfs_type);
 	if (IS_ERR(pidfs_mnt))
 		panic("Failed to mount pidfs pseudo filesystem");

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (3 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 04/15] pidfs: add inode allocation Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 13:05   ` Jeff Layton
  2025-03-04 13:10   ` Oleg Nesterov
  2025-03-04  9:41 ` [PATCH v2 06/15] pidfs: allow to retrieve exit information Christian Brauner
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Record the exit code and cgroupid in do_exit() and stash in struct
pidfs_exit_info so it can be retrieved even after the task has been
reaped.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/internal.h         |  1 +
 fs/libfs.c            |  4 ++--
 fs/pidfs.c            | 41 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/pidfs.h |  1 +
 kernel/exit.c         |  2 ++
 5 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index e7f02ae1e098..c1e6d8b294cb 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -325,6 +325,7 @@ struct stashed_operations {
 int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
 		      struct path *path);
 void stashed_dentry_prune(struct dentry *dentry);
+struct dentry *stashed_dentry_get(struct dentry **stashed);
 /**
  * path_mounted - check whether path is mounted
  * @path: path to check
diff --git a/fs/libfs.c b/fs/libfs.c
index 8444f5cc4064..cf5a267aafe4 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -2113,7 +2113,7 @@ struct timespec64 simple_inode_init_ts(struct inode *inode)
 }
 EXPORT_SYMBOL(simple_inode_init_ts);
 
-static inline struct dentry *get_stashed_dentry(struct dentry **stashed)
+struct dentry *stashed_dentry_get(struct dentry **stashed)
 {
 	struct dentry *dentry;
 
@@ -2215,7 +2215,7 @@ int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
 	const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;
 
 	/* See if dentry can be reused. */
-	path->dentry = get_stashed_dentry(stashed);
+	path->dentry = stashed_dentry_get(stashed);
 	if (path->dentry) {
 		sops->put_data(data);
 		goto out_path;
diff --git a/fs/pidfs.c b/fs/pidfs.c
index eaecb0a947f0..258e1c13ee56 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -32,7 +32,7 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
  */
 struct pidfs_exit_info {
 	__u64 cgroupid;
-	__u64 exit_code;
+	__s32 exit_code;
 };
 
 struct pidfs_inode {
@@ -458,6 +458,45 @@ struct pid *pidfd_pid(const struct file *file)
 	return file_inode(file)->i_private;
 }
 
+/*
+ * We're called from release_task(). We know there's at least one
+ * reference to struct pid being held that won't be released until the
+ * task has been reaped which cannot happen until we're out of
+ * release_task().
+ *
+ * If this struct pid is refered to by a pidfd then stashed_dentry_get()
+ * will return the dentry and inode for that struct pid. Since we've
+ * taken a reference on it there's now an additional reference from the
+ * exit path on it. Which is fine. We're going to put it again in a
+ * second and we know that the pid is kept alive anyway.
+ *
+ * Worst case is that we've filled in the info and immediately free the
+ * dentry and inode afterwards since the pidfd has been closed. Since
+ * pidfs_exit() currently is placed after exit_task_work() we know that
+ * it cannot be us aka the exiting task holding a pidfd to ourselves.
+ */
+void pidfs_exit(struct task_struct *tsk)
+{
+	struct dentry *dentry;
+
+	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
+	if (dentry) {
+		struct inode *inode = d_inode(dentry);
+		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->exit_info;
+#ifdef CONFIG_CGROUPS
+		struct cgroup *cgrp;
+
+		rcu_read_lock();
+		cgrp = task_dfl_cgroup(tsk);
+		exit_info->cgroupid = cgroup_id(cgrp);
+		rcu_read_unlock();
+#endif
+		exit_info->exit_code = tsk->exit_code;
+
+		dput(dentry);
+	}
+}
+
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
 /*
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 7c830d0dec9a..05e6f8f4a026 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -6,6 +6,7 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
 void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
+void pidfs_exit(struct task_struct *tsk);
 extern const struct dentry_operations pidfs_dentry_operations;
 
 #endif /* _LINUX_PID_FS_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..98d292120296 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -69,6 +69,7 @@
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
 #include <linux/uaccess.h>
+#include <linux/pidfs.h>
 
 #include <uapi/linux/wait.h>
 
@@ -254,6 +255,7 @@ void release_task(struct task_struct *p)
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
 	thread_pid = get_pid(p->thread_pid);
+	pidfs_exit(p);
 	__exit_signal(p);
 
 	/*

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (4 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 13:27   ` Jeff Layton
                     ` (2 more replies)
  2025-03-04  9:41 ` [PATCH v2 07/15] selftests/pidfd: fix header inclusion Christian Brauner
                   ` (8 subsequent siblings)
  14 siblings, 3 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Some tools like systemd's jounral need to retrieve the exit and cgroup
information after a process has already been reaped. This can e.g.,
happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c                 | 88 ++++++++++++++++++++++++++++++++++++----------
 include/uapi/linux/pidfd.h |  3 +-
 kernel/exit.c              |  2 +-
 3 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 258e1c13ee56..11744d7fe177 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -36,7 +36,8 @@ struct pidfs_exit_info {
 };
 
 struct pidfs_inode {
-	struct pidfs_exit_info exit_info;
+	struct pidfs_exit_info __pei;
+	struct pidfs_exit_info *exit_info;
 	struct inode vfs_inode;
 };
 
@@ -228,17 +229,28 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 	return poll_flags;
 }
 
-static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
+static inline bool current_in_pidns(struct pid *pid)
+{
+	const struct pid_namespace *ns = task_active_pid_ns(current);
+
+	if (ns->level <= pid->level)
+		return pid->numbers[ns->level].ns == ns;
+
+	return false;
+}
+
+static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
+	struct pid *pid = pidfd_pid(file);
 	size_t usize = _IOC_SIZE(cmd);
 	struct pidfd_info kinfo = {};
+	struct pidfs_exit_info *exit_info;
+	struct inode *inode = file_inode(file);
 	struct user_namespace *user_ns;
+	struct task_struct *task;
 	const struct cred *c;
 	__u64 mask;
-#ifdef CONFIG_CGROUPS
-	struct cgroup *cgrp;
-#endif
 
 	if (!uinfo)
 		return -EINVAL;
@@ -248,6 +260,37 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
 		return -EFAULT;
 
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		if (!(mask & PIDFD_INFO_EXIT))
+			return -ESRCH;
+
+		if (!current_in_pidns(pid))
+			return -ESRCH;
+	}
+
+	if (mask & PIDFD_INFO_EXIT) {
+		exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
+		if (exit_info) {
+#ifdef CONFIG_CGROUPS
+			kinfo.cgroupid = exit_info->cgroupid;
+			kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
+#endif
+			kinfo.exit_code = exit_info->exit_code;
+		}
+	}
+
+	/*
+	 * If the task has already been reaped only exit information
+	 * can be provided. It's entirely possible that the task has
+	 * already been reaped but we managed to grab a reference to it
+	 * before that. So a full set of information about @task doesn't
+	 * mean it hasn't been waited upon. Similarly, a full set of
+	 * information doesn't mean that the task hasn't already exited.
+	 */
+	if (!task)
+		goto copy_out;
+
 	c = get_task_cred(task);
 	if (!c)
 		return -ESRCH;
@@ -267,11 +310,15 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	put_cred(c);
 
 #ifdef CONFIG_CGROUPS
-	rcu_read_lock();
-	cgrp = task_dfl_cgroup(task);
-	kinfo.cgroupid = cgroup_id(cgrp);
-	kinfo.mask |= PIDFD_INFO_CGROUPID;
-	rcu_read_unlock();
+	if (!kinfo.cgroupid) {
+		struct cgroup *cgrp;
+
+		rcu_read_lock();
+		cgrp = task_dfl_cgroup(task);
+		kinfo.cgroupid = cgroup_id(cgrp);
+		kinfo.mask |= PIDFD_INFO_CGROUPID;
+		rcu_read_unlock();
+	}
 #endif
 
 	/*
@@ -291,6 +338,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
 		return -ESRCH;
 
+copy_out:
 	/*
 	 * If userspace and the kernel have the same struct size it can just
 	 * be copied. If userspace provides an older struct, only the bits that
@@ -325,7 +373,6 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct task_struct *task __free(put_task) = NULL;
 	struct nsproxy *nsp __free(put_nsproxy) = NULL;
-	struct pid *pid = pidfd_pid(file);
 	struct ns_common *ns_common = NULL;
 	struct pid_namespace *pid_ns;
 
@@ -340,13 +387,13 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return put_user(file_inode(file)->i_generation, argp);
 	}
 
-	task = get_pid_task(pid, PIDTYPE_PID);
-	if (!task)
-		return -ESRCH;
-
 	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
 	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
-		return pidfd_info(task, cmd, arg);
+		return pidfd_info(file, cmd, arg);
+
+	task = get_pid_task(pidfd_pid(file), PIDTYPE_PID);
+	if (!task)
+		return -ESRCH;
 
 	if (arg)
 		return -EINVAL;
@@ -479,10 +526,12 @@ void pidfs_exit(struct task_struct *tsk)
 {
 	struct dentry *dentry;
 
+	might_sleep();
+
 	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
 	if (dentry) {
 		struct inode *inode = d_inode(dentry);
-		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->exit_info;
+		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->__pei;
 #ifdef CONFIG_CGROUPS
 		struct cgroup *cgrp;
 
@@ -493,6 +542,8 @@ void pidfs_exit(struct task_struct *tsk)
 #endif
 		exit_info->exit_code = tsk->exit_code;
 
+		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
+		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
 		dput(dentry);
 	}
 }
@@ -560,7 +611,8 @@ static struct inode *pidfs_alloc_inode(struct super_block *sb)
 	if (!pi)
 		return NULL;
 
-	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
+	memset(&pi->__pei, 0, sizeof(pi->__pei));
+	pi->exit_info = NULL;
 
 	return &pi->vfs_inode;
 }
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index e0abd0b18841..e5966f1a7743 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -20,6 +20,7 @@
 #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
 #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
+#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
 
 #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
 
@@ -86,7 +87,7 @@ struct pidfd_info {
 	__u32 sgid;
 	__u32 fsuid;
 	__u32 fsgid;
-	__u32 spare0[1];
+	__s32 exit_code;
 };
 
 #define PIDFS_IOCTL_MAGIC 0xFF
diff --git a/kernel/exit.c b/kernel/exit.c
index 98d292120296..9916305e34d3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -250,12 +250,12 @@ void release_task(struct task_struct *p)
 	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
 	rcu_read_unlock();
 
+	pidfs_exit(p);
 	cgroup_release(p);
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
 	thread_pid = get_pid(p->thread_pid);
-	pidfs_exit(p);
 	__exit_signal(p);
 
 	/*

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 07/15] selftests/pidfd: fix header inclusion
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (5 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 06/15] pidfs: allow to retrieve exit information Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 08/15] pidfs/selftests: ensure correct headers for ioctl handling Christian Brauner
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Ensure that necessary defines are present.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_fdinfo_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c b/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
index f062a986e382..f718aac75068 100644
--- a/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
@@ -13,6 +13,7 @@
 #include <syscall.h>
 #include <sys/wait.h>
 #include <sys/mman.h>
+#include <sys/mount.h>
 
 #include "pidfd.h"
 #include "../kselftest.h"

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 08/15] pidfs/selftests: ensure correct headers for ioctl handling
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (6 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 07/15] selftests/pidfd: fix header inclusion Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 09/15] selftests/pidfd: move more defines to common header Christian Brauner
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_setns_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_setns_test.c b/tools/testing/selftests/pidfd/pidfd_setns_test.c
index 222f8131283b..d9e715de68b3 100644
--- a/tools/testing/selftests/pidfd/pidfd_setns_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_setns_test.c
@@ -16,7 +16,7 @@
 #include <unistd.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
-#include <linux/ioctl.h>
+#include <sys/ioctl.h>
 
 #include "pidfd.h"
 #include "../kselftest_harness.h"

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 09/15] selftests/pidfd: move more defines to common header
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (7 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 08/15] pidfs/selftests: ensure correct headers for ioctl handling Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 10/15] selftests/pidfd: add first PIDFD_INFO_EXIT selftest Christian Brauner
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h            | 78 ++++++++++++++++++++++++
 tools/testing/selftests/pidfd/pidfd_open_test.c  | 26 --------
 tools/testing/selftests/pidfd/pidfd_setns_test.c | 45 --------------
 3 files changed, 78 insertions(+), 71 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 027ebaf14844..bad518766aa5 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -12,6 +12,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <syscall.h>
+#include <sys/ioctl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 
@@ -66,6 +67,83 @@
 #define PIDFD_SELF_PROCESS	PIDFD_SELF_THREAD_GROUP
 #endif
 
+#ifndef PIDFS_IOCTL_MAGIC
+#define PIDFS_IOCTL_MAGIC 0xFF
+#endif
+
+#ifndef PIDFD_GET_CGROUP_NAMESPACE
+#define PIDFD_GET_CGROUP_NAMESPACE            _IO(PIDFS_IOCTL_MAGIC, 1)
+#endif
+
+#ifndef PIDFD_GET_IPC_NAMESPACE
+#define PIDFD_GET_IPC_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 2)
+#endif
+
+#ifndef PIDFD_GET_MNT_NAMESPACE
+#define PIDFD_GET_MNT_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 3)
+#endif
+
+#ifndef PIDFD_GET_NET_NAMESPACE
+#define PIDFD_GET_NET_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 4)
+#endif
+
+#ifndef PIDFD_GET_PID_NAMESPACE
+#define PIDFD_GET_PID_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 5)
+#endif
+
+#ifndef PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE
+#define PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE  _IO(PIDFS_IOCTL_MAGIC, 6)
+#endif
+
+#ifndef PIDFD_GET_TIME_NAMESPACE
+#define PIDFD_GET_TIME_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 7)
+#endif
+
+#ifndef PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE
+#define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8)
+#endif
+
+#ifndef PIDFD_GET_USER_NAMESPACE
+#define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
+#endif
+
+#ifndef PIDFD_GET_UTS_NAMESPACE
+#define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
+#endif
+
+#ifndef PIDFD_GET_INFO
+#define PIDFD_GET_INFO			      _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
+#endif
+
+#ifndef PIDFD_INFO_PID
+#define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
+#endif
+
+#ifndef PIDFD_INFO_CREDS
+#define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
+#endif
+
+#ifndef PIDFD_INFO_CGROUPID
+#define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
+#endif
+
+struct pidfd_info {
+	__u64 mask;
+	__u64 cgroupid;
+	__u32 pid;
+	__u32 tgid;
+	__u32 ppid;
+	__u32 ruid;
+	__u32 rgid;
+	__u32 euid;
+	__u32 egid;
+	__u32 suid;
+	__u32 sgid;
+	__u32 fsuid;
+	__u32 fsgid;
+	__u32 spare0[1];
+};
+
 /*
  * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
  * That means, when it wraps around any pid < 300 will be skipped.
diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
index 9a40ccb1ff6d..cd3de40e4977 100644
--- a/tools/testing/selftests/pidfd/pidfd_open_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
@@ -22,32 +22,6 @@
 #include "pidfd.h"
 #include "../kselftest.h"
 
-#ifndef PIDFS_IOCTL_MAGIC
-#define PIDFS_IOCTL_MAGIC 0xFF
-#endif
-
-#ifndef PIDFD_GET_INFO
-#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
-#define PIDFD_INFO_CGROUPID		(1UL << 0)
-
-struct pidfd_info {
-	__u64 mask;
-	__u64 cgroupid;
-	__u32 pid;
-	__u32 tgid;
-	__u32 ppid;
-	__u32 ruid;
-	__u32 rgid;
-	__u32 euid;
-	__u32 egid;
-	__u32 suid;
-	__u32 sgid;
-	__u32 fsuid;
-	__u32 fsgid;
-	__u32 spare0[1];
-};
-#endif
-
 static int safe_int(const char *numstr, int *converted)
 {
 	char *err = NULL;
diff --git a/tools/testing/selftests/pidfd/pidfd_setns_test.c b/tools/testing/selftests/pidfd/pidfd_setns_test.c
index d9e715de68b3..e6a079b3d5e2 100644
--- a/tools/testing/selftests/pidfd/pidfd_setns_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_setns_test.c
@@ -16,55 +16,10 @@
 #include <unistd.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
-#include <sys/ioctl.h>
 
 #include "pidfd.h"
 #include "../kselftest_harness.h"
 
-#ifndef PIDFS_IOCTL_MAGIC
-#define PIDFS_IOCTL_MAGIC 0xFF
-#endif
-
-#ifndef PIDFD_GET_CGROUP_NAMESPACE
-#define PIDFD_GET_CGROUP_NAMESPACE            _IO(PIDFS_IOCTL_MAGIC, 1)
-#endif
-
-#ifndef PIDFD_GET_IPC_NAMESPACE
-#define PIDFD_GET_IPC_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 2)
-#endif
-
-#ifndef PIDFD_GET_MNT_NAMESPACE
-#define PIDFD_GET_MNT_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 3)
-#endif
-
-#ifndef PIDFD_GET_NET_NAMESPACE
-#define PIDFD_GET_NET_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 4)
-#endif
-
-#ifndef PIDFD_GET_PID_NAMESPACE
-#define PIDFD_GET_PID_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 5)
-#endif
-
-#ifndef PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE
-#define PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE  _IO(PIDFS_IOCTL_MAGIC, 6)
-#endif
-
-#ifndef PIDFD_GET_TIME_NAMESPACE
-#define PIDFD_GET_TIME_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 7)
-#endif
-
-#ifndef PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE
-#define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8)
-#endif
-
-#ifndef PIDFD_GET_USER_NAMESPACE
-#define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
-#endif
-
-#ifndef PIDFD_GET_UTS_NAMESPACE
-#define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
-#endif
-
 enum {
 	PIDFD_NS_USER,
 	PIDFD_NS_MNT,

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 10/15] selftests/pidfd: add first PIDFD_INFO_EXIT selftest
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (8 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 09/15] selftests/pidfd: move more defines to common header Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 11/15] selftests/pidfd: add second " Christian Brauner
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/.gitignore        |   1 +
 tools/testing/selftests/pidfd/Makefile          |   2 +-
 tools/testing/selftests/pidfd/pidfd.h           |   6 +-
 tools/testing/selftests/pidfd/pidfd_info_test.c | 146 ++++++++++++++++++++++++
 4 files changed, 153 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index bf92481f925c..bddae1d4d7e4 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -8,3 +8,4 @@ pidfd_getfd_test
 pidfd_setns_test
 pidfd_file_handle_test
 pidfd_bind_mount
+pidfd_info_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 301343a11b62..a94c2bc8d594 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -3,7 +3,7 @@ CFLAGS += -g $(KHDR_INCLUDES) -pthread -Wall
 
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
-	pidfd_file_handle_test pidfd_bind_mount
+	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index bad518766aa5..cc8e381978df 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -127,6 +127,10 @@
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
 #endif
 
+#ifndef PIDFD_INFO_EXIT
+#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
+#endif
+
 struct pidfd_info {
 	__u64 mask;
 	__u64 cgroupid;
@@ -141,7 +145,7 @@ struct pidfd_info {
 	__u32 sgid;
 	__u32 fsuid;
 	__u32 fsgid;
-	__u32 spare0[1];
+	__s32 exit_code;
 };
 
 /*
diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
new file mode 100644
index 000000000000..cc1d3d5eba59
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <poll.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <linux/kcmp.h>
+#include <sys/stat.h>
+
+#include "pidfd.h"
+#include "../kselftest_harness.h"
+
+FIXTURE(pidfd_info)
+{
+	pid_t child_pid1;
+	int child_pidfd1;
+
+	pid_t child_pid2;
+	int child_pidfd2;
+
+	pid_t child_pid3;
+	int child_pidfd3;
+
+	pid_t child_pid4;
+	int child_pidfd4;
+};
+
+FIXTURE_SETUP(pidfd_info)
+{
+	int ret;
+	int ipc_sockets[2];
+	char c;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	self->child_pid1 = create_child(&self->child_pidfd1, 0);
+	EXPECT_GE(self->child_pid1, 0);
+
+	if (self->child_pid1 == 0) {
+		close(ipc_sockets[0]);
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0)
+			_exit(EXIT_FAILURE);
+
+		close(ipc_sockets[1]);
+
+		pause();
+		_exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	/* SIGKILL but don't reap. */
+	EXPECT_EQ(sys_pidfd_send_signal(self->child_pidfd1, SIGKILL, NULL, 0), 0);
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	self->child_pid2 = create_child(&self->child_pidfd2, 0);
+	EXPECT_GE(self->child_pid2, 0);
+
+	if (self->child_pid2 == 0) {
+		close(ipc_sockets[0]);
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0)
+			_exit(EXIT_FAILURE);
+
+		close(ipc_sockets[1]);
+
+		pause();
+		_exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	/* SIGKILL and reap. */
+	EXPECT_EQ(sys_pidfd_send_signal(self->child_pidfd2, SIGKILL, NULL, 0), 0);
+	EXPECT_EQ(sys_waitid(P_PID, self->child_pid2, NULL, WEXITED), 0);
+
+	self->child_pid3 = create_child(&self->child_pidfd3, CLONE_NEWUSER | CLONE_NEWPID);
+	EXPECT_GE(self->child_pid3, 0);
+
+	if (self->child_pid3 == 0)
+		_exit(EXIT_SUCCESS);
+
+	self->child_pid4 = create_child(&self->child_pidfd4, CLONE_NEWUSER | CLONE_NEWPID);
+	EXPECT_GE(self->child_pid4, 0);
+
+	if (self->child_pid4 == 0)
+		_exit(EXIT_SUCCESS);
+
+	EXPECT_EQ(sys_waitid(P_PID, self->child_pid4, NULL, WEXITED), 0);
+}
+
+FIXTURE_TEARDOWN(pidfd_info)
+{
+	sys_pidfd_send_signal(self->child_pidfd1, SIGKILL, NULL, 0);
+	if (self->child_pidfd1 >= 0)
+		EXPECT_EQ(0, close(self->child_pidfd1));
+
+	sys_waitid(P_PID, self->child_pid1, NULL, WEXITED);
+
+	sys_pidfd_send_signal(self->child_pidfd2, SIGKILL, NULL, 0);
+	if (self->child_pidfd2 >= 0)
+		EXPECT_EQ(0, close(self->child_pidfd2));
+
+	sys_waitid(P_PID, self->child_pid2, NULL, WEXITED);
+	sys_waitid(P_PID, self->child_pid3, NULL, WEXITED);
+	sys_waitid(P_PID, self->child_pid4, NULL, WEXITED);
+}
+
+TEST_F(pidfd_info, sigkill_exit)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+	};
+
+	/* Process has exited but not been reaped so this must work. */
+	ASSERT_EQ(ioctl(self->child_pidfd1, PIDFD_GET_INFO, &info), 0);
+
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(self->child_pidfd1, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	/* Process has exited but not been reaped, so no PIDFD_INFO_EXIT information yet. */
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 11/15] selftests/pidfd: add second PIDFD_INFO_EXIT selftest
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (9 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 10/15] selftests/pidfd: add first PIDFD_INFO_EXIT selftest Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 12/15] selftests/pidfd: add third " Christian Brauner
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index cc1d3d5eba59..2a5742a2a55f 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -143,4 +143,22 @@ TEST_F(pidfd_info, sigkill_exit)
 	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
 }
 
+TEST_F(pidfd_info, sigkill_reaped)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+	};
+
+	/* Process has already been reaped and PIDFD_INFO_EXIT hasn't been set. */
+	ASSERT_NE(ioctl(self->child_pidfd2, PIDFD_GET_INFO, &info), 0);
+	ASSERT_EQ(errno, ESRCH);
+
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(self->child_pidfd2, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 12/15] selftests/pidfd: add third PIDFD_INFO_EXIT selftest
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (10 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 11/15] selftests/pidfd: add second " Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 13/15] selftests/pidfd: add fourth " Christian Brauner
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 2a5742a2a55f..2917e7a03b31 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -161,4 +161,20 @@ TEST_F(pidfd_info, sigkill_reaped)
 	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
 }
 
+TEST_F(pidfd_info, success_exit)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+	};
+
+	/* Process has exited but not been reaped so this must work. */
+	ASSERT_EQ(ioctl(self->child_pidfd3, PIDFD_GET_INFO, &info), 0);
+
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(self->child_pidfd3, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	/* Process has exited but not been reaped, so no PIDFD_INFO_EXIT information yet. */
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 13/15] selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (11 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 12/15] selftests/pidfd: add third " Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 14/15] selftests/pidfd: add fifth " Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 15/15] selftests/pidfd: add sixth " Christian Brauner
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 2917e7a03b31..0d0af4c2a84d 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -177,4 +177,22 @@ TEST_F(pidfd_info, success_exit)
 	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
 }
 
+TEST_F(pidfd_info, success_reaped)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+	};
+
+	/* Process has already been reaped and PIDFD_INFO_EXIT hasn't been set. */
+	ASSERT_NE(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
+	ASSERT_EQ(errno, ESRCH);
+
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 14/15] selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (12 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 13/15] selftests/pidfd: add fourth " Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04  9:41 ` [PATCH v2 15/15] selftests/pidfd: add sixth " Christian Brauner
  14 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 0d0af4c2a84d..16e4be2364df 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -195,4 +195,27 @@ TEST_F(pidfd_info, success_reaped)
 	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
 }
 
+TEST_F(pidfd_info, success_reaped_poll)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
+	};
+	struct pollfd fds = {};
+	int nevents;
+
+	fds.events = POLLIN;
+	fds.fd = self->child_pidfd2;
+
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	ASSERT_TRUE(!!(fds.revents & POLLIN));
+	ASSERT_TRUE(!!(fds.revents & POLLHUP));
+
+	ASSERT_EQ(ioctl(self->child_pidfd2, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 15/15] selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest
  2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
                   ` (13 preceding siblings ...)
  2025-03-04  9:41 ` [PATCH v2 14/15] selftests/pidfd: add fifth " Christian Brauner
@ 2025-03-04  9:41 ` Christian Brauner
  2025-03-04 20:18   ` [PATCH v2 17/16] selftests/pidfd: test multi-threaded exec with PPIDFD_INFO_EXIT Christian Brauner
  14 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-04  9:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h           |   4 +
 tools/testing/selftests/pidfd/pidfd_info_test.c | 151 ++++++++++++++++++++++++
 2 files changed, 155 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index cc8e381978df..fee6fd3e67dd 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -131,6 +131,10 @@
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
 #endif
 
+#ifndef PIDFD_THREAD
+#define PIDFD_THREAD O_EXCL
+#endif
+
 struct pidfd_info {
 	__u64 mask;
 	__u64 cgroupid;
diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 16e4be2364df..5e86e3df323b 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -218,4 +218,155 @@ TEST_F(pidfd_info, success_reaped_poll)
 	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
 }
 
+static void *pidfd_info_pause_thread(void *arg)
+{
+	pid_t pid_thread = gettid();
+	int ipc_socket = *(int *)arg;
+
+	/* Inform the grand-parent what the tid of this thread is. */
+	if (write_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	close(ipc_socket);
+
+	/* Sleep untill we're killed. */
+	pause();
+	return NULL;
+}
+
+TEST_F(pidfd_info, thread_group)
+{
+	pid_t pid_leader, pid_thread;
+	pthread_t thread;
+	int nevents, pidfd_leader, pidfd_thread, pidfd_leader_thread, ret;
+	int ipc_sockets[2];
+	struct pollfd fds = {};
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
+	}, info2;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	pid_leader = create_child(&pidfd_leader, 0);
+	EXPECT_GE(pid_leader, 0);
+
+	if (pid_leader == 0) {
+		close(ipc_sockets[0]);
+
+		/* The thread will outlive the thread-group leader. */
+		if (pthread_create(&thread, NULL, pidfd_info_pause_thread, &ipc_sockets[1]))
+			syscall(__NR_exit, EXIT_FAILURE);
+
+		/* Make the thread-group leader exit prematurely. */
+		syscall(__NR_exit, EXIT_SUCCESS);
+	}
+
+	/* Retrieve the tid of the thread. */
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	/* Opening a thread as a thread-group leader must fail. */
+	pidfd_thread = sys_pidfd_open(pid_thread, 0);
+	ASSERT_LT(pidfd_thread, 0);
+
+	/* Opening a thread as a PIDFD_THREAD must succeed. */
+	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
+	ASSERT_GE(pidfd_thread, 0);
+
+	/*
+	 * Opening a PIDFD_THREAD aka thread-specific pidfd based on a
+	 * thread-group leader must succeed.
+	 */
+	pidfd_leader_thread = sys_pidfd_open(pid_leader, PIDFD_THREAD);
+	ASSERT_GE(pidfd_leader_thread, 0);
+
+	/*
+	 * Note that pidfd_leader is a thread-group pidfd, so polling on it
+	 * would only notify us once all thread in the thread-group have
+	 * exited. So we can't poll before we have taken down the whole
+	 * thread-group.
+	 */
+
+	/* Get PIDFD_GET_INFO using the thread-group leader pidfd. */
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	/* Process has exited but not been reaped, so no PIDFD_INFO_EXIT information yet. */
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_EQ(info.pid, pid_leader);
+
+	/*
+	 * Now retrieve the same info using the thread specific pidfd
+	 * for the thread-group leader.
+	 */
+	info2.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader_thread, PIDFD_GET_INFO, &info2), 0);
+	ASSERT_TRUE(!!(info2.mask & PIDFD_INFO_CREDS));
+	/* Process has exited but not been reaped, so no PIDFD_INFO_EXIT information yet. */
+	ASSERT_FALSE(!!(info2.mask & PIDFD_INFO_EXIT));
+	ASSERT_EQ(info2.pid, pid_leader);
+
+	/* Now try the thread-specific pidfd. */
+	ASSERT_EQ(ioctl(pidfd_thread, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	/* The thread hasn't exited, so no PIDFD_INFO_EXIT information yet. */
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_EQ(info.pid, pid_thread);
+
+	/*
+	 * Take down the whole thread-group. The thread-group leader
+	 * exited successfully but the thread will now be SIGKILLed.
+	 * This must be reflected in the recorded exit information.
+	 */
+	EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
+	EXPECT_EQ(sys_waitid(P_PIDFD, pidfd_leader, NULL, WEXITED), 0);
+
+	fds.events = POLLIN;
+	fds.fd = pidfd_leader;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	ASSERT_TRUE(!!(fds.revents & POLLIN));
+	/* The thread-group leader has been reaped. */
+	ASSERT_TRUE(!!(fds.revents & POLLHUP));
+
+	/*
+	 * Retrieve exit information for the thread-group leader via the
+	 * thread-group leader pidfd.
+	 */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	/* The thread-group leader exited successfully. Only the specific thread was SIGKILLed. */
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+
+	/*
+	 * Retrieve exit information for the thread-group leader via the
+	 * thread-specific pidfd.
+	 */
+	info2.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader_thread, PIDFD_GET_INFO, &info2), 0);
+	ASSERT_FALSE(!!(info2.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info2.mask & PIDFD_INFO_EXIT));
+
+	/* The thread-group leader exited successfully. Only the specific thread was SIGKILLed. */
+	ASSERT_TRUE(WIFEXITED(info2.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info2.exit_code), 0);
+
+	/* Retrieve exit information for the thread. */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_thread, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+
+	/* The thread got SIGKILLed. */
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+
+	EXPECT_EQ(close(pidfd_leader), 0);
+	EXPECT_EQ(close(pidfd_thread), 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/15] pidfs: switch to copy_struct_to_user()
  2025-03-04  9:41 ` [PATCH v2 01/15] pidfs: switch to copy_struct_to_user() Christian Brauner
@ 2025-03-04 12:42   ` Jeff Layton
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2025-03-04 12:42 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: linux-fsdevel, Lennart Poettering, Daan De Meyer, Mike Yuan

On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> We have a helper that deals with all the required logic.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 049352f973de..aa8c8bda8c8f 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -276,10 +276,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>  	 * userspace knows about will be copied. If userspace provides a new
>  	 * struct, only the bits that the kernel knows about will be copied.
>  	 */
> -	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> -		return -EFAULT;
> -
> -	return 0;
> +	return copy_struct_to_user(uinfo, usize, &kinfo, sizeof(kinfo), NULL);
>  }
>  
>  static bool pidfs_ioctl_valid(unsigned int cmd)
> 

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 02/15] pidfd: rely on automatic cleanup in __pidfd_prepare()
  2025-03-04  9:41 ` [PATCH v2 02/15] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
@ 2025-03-04 12:44   ` Jeff Layton
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2025-03-04 12:44 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: linux-fsdevel, Lennart Poettering, Daan De Meyer, Mike Yuan

On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> Rely on scope-based cleanup for the allocated file descriptor.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  kernel/fork.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 735405a9c5f3..6230f5256bc5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2032,25 +2032,23 @@ static inline void rcu_copy_process(struct task_struct *p)
>   */
>  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -	int pidfd;
>  	struct file *pidfd_file;
>  
> -	pidfd = get_unused_fd_flags(O_CLOEXEC);
> +	CLASS(get_unused_fd, pidfd)(O_CLOEXEC);
>  	if (pidfd < 0)
>  		return pidfd;
>  
>  	pidfd_file = pidfs_alloc_file(pid, flags | O_RDWR);
> -	if (IS_ERR(pidfd_file)) {
> -		put_unused_fd(pidfd);
> +	if (IS_ERR(pidfd_file))
>  		return PTR_ERR(pidfd_file);
> -	}
> +
>  	/*
>  	 * anon_inode_getfile() ignores everything outside of the
>  	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
>  	 */
>  	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
>  	*ret = pidfd_file;
> -	return pidfd;
> +	return take_fd(pidfd);
>  }
>  
>  /**
> 

I'll Ack this, but I will say that I find the result to be less
readable than what was there before.

Acked-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 03/15] pidfs: move setting flags into pidfs_alloc_file()
  2025-03-04  9:41 ` [PATCH v2 03/15] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
@ 2025-03-04 12:53   ` Jeff Layton
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2025-03-04 12:53 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: linux-fsdevel, Lennart Poettering, Daan De Meyer, Mike Yuan

On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> Instead od adding it into __pidfd_prepare() place it where the actual
> file allocation happens and update the outdated comment.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c    | 4 ++++
>  kernel/fork.c | 5 -----
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index aa8c8bda8c8f..ecc0dd886714 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
>  		return ERR_PTR(ret);
>  
>  	pidfd_file = dentry_open(&path, flags, current_cred());
> +	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
> +	if (!IS_ERR(pidfd_file))
> +		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
> +
>  	path_put(&path);
>  	return pidfd_file;
>  }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6230f5256bc5..8eac9cd3385b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2042,11 +2042,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>  	if (IS_ERR(pidfd_file))
>  		return PTR_ERR(pidfd_file);
>  
> -	/*
> -	 * anon_inode_getfile() ignores everything outside of the
> -	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
> -	 */
> -	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
>  	*ret = pidfd_file;
>  	return take_fd(pidfd);
>  }
> 

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit
  2025-03-04  9:41 ` [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit Christian Brauner
@ 2025-03-04 13:05   ` Jeff Layton
  2025-03-04 13:10   ` Oleg Nesterov
  1 sibling, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2025-03-04 13:05 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: linux-fsdevel, Lennart Poettering, Daan De Meyer, Mike Yuan

On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> Record the exit code and cgroupid in do_exit() and stash in struct
> pidfs_exit_info so it can be retrieved even after the task has been
> reaped.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/internal.h         |  1 +
>  fs/libfs.c            |  4 ++--
>  fs/pidfs.c            | 41 ++++++++++++++++++++++++++++++++++++++++-
>  include/linux/pidfs.h |  1 +
>  kernel/exit.c         |  2 ++
>  5 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index e7f02ae1e098..c1e6d8b294cb 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -325,6 +325,7 @@ struct stashed_operations {
>  int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
>  		      struct path *path);
>  void stashed_dentry_prune(struct dentry *dentry);
> +struct dentry *stashed_dentry_get(struct dentry **stashed);
>  /**
>   * path_mounted - check whether path is mounted
>   * @path: path to check
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 8444f5cc4064..cf5a267aafe4 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -2113,7 +2113,7 @@ struct timespec64 simple_inode_init_ts(struct inode *inode)
>  }
>  EXPORT_SYMBOL(simple_inode_init_ts);
>  
> -static inline struct dentry *get_stashed_dentry(struct dentry **stashed)
> +struct dentry *stashed_dentry_get(struct dentry **stashed)
>  {
>  	struct dentry *dentry;
>  
> @@ -2215,7 +2215,7 @@ int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
>  	const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;
>  
>  	/* See if dentry can be reused. */
> -	path->dentry = get_stashed_dentry(stashed);
> +	path->dentry = stashed_dentry_get(stashed);
>  	if (path->dentry) {
>  		sops->put_data(data);
>  		goto out_path;
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index eaecb0a947f0..258e1c13ee56 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -32,7 +32,7 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
>   */
>  struct pidfs_exit_info {
>  	__u64 cgroupid;
> -	__u64 exit_code;
> +	__s32 exit_code;

^^^
The above delta should be folded into the previous patch.

>  };
>  
>  struct pidfs_inode {
> @@ -458,6 +458,45 @@ struct pid *pidfd_pid(const struct file *file)
>  	return file_inode(file)->i_private;
>  }
>  
> +/*
> + * We're called from release_task(). We know there's at least one
> + * reference to struct pid being held that won't be released until the
> + * task has been reaped which cannot happen until we're out of
> + * release_task().
> + *
> + * If this struct pid is refered to by a pidfd then stashed_dentry_get()
> + * will return the dentry and inode for that struct pid. Since we've
> + * taken a reference on it there's now an additional reference from the
> + * exit path on it. Which is fine. We're going to put it again in a
> + * second and we know that the pid is kept alive anyway.
> + *
> + * Worst case is that we've filled in the info and immediately free the
> + * dentry and inode afterwards since the pidfd has been closed. Since
> + * pidfs_exit() currently is placed after exit_task_work() we know that
> + * it cannot be us aka the exiting task holding a pidfd to ourselves.
> + */

That is a subtle interaction. Thanks for the comment!

> +void pidfs_exit(struct task_struct *tsk)
> +{
> +	struct dentry *dentry;
> +
> +	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
> +	if (dentry) {
> +		struct inode *inode = d_inode(dentry);
> +		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->exit_info;
> +#ifdef CONFIG_CGROUPS
> +		struct cgroup *cgrp;
> +
> +		rcu_read_lock();
> +		cgrp = task_dfl_cgroup(tsk);
> +		exit_info->cgroupid = cgroup_id(cgrp);
> +		rcu_read_unlock();
> +#endif
> +		exit_info->exit_code = tsk->exit_code;
> +
> +		dput(dentry);
> +	}
> +}
> +
>  static struct vfsmount *pidfs_mnt __ro_after_init;
>  
>  /*
> diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
> index 7c830d0dec9a..05e6f8f4a026 100644
> --- a/include/linux/pidfs.h
> +++ b/include/linux/pidfs.h
> @@ -6,6 +6,7 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
>  void __init pidfs_init(void);
>  void pidfs_add_pid(struct pid *pid);
>  void pidfs_remove_pid(struct pid *pid);
> +void pidfs_exit(struct task_struct *tsk);
>  extern const struct dentry_operations pidfs_dentry_operations;
>  
>  #endif /* _LINUX_PID_FS_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 3485e5fc499e..98d292120296 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -69,6 +69,7 @@
>  #include <linux/sysfs.h>
>  #include <linux/user_events.h>
>  #include <linux/uaccess.h>
> +#include <linux/pidfs.h>
>  
>  #include <uapi/linux/wait.h>
>  
> @@ -254,6 +255,7 @@ void release_task(struct task_struct *p)
>  	write_lock_irq(&tasklist_lock);
>  	ptrace_release_task(p);
>  	thread_pid = get_pid(p->thread_pid);
> +	pidfs_exit(p);
>  	__exit_signal(p);
>  
>  	/*
> 

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/15] pidfs: add inode allocation
  2025-03-04  9:41 ` [PATCH v2 04/15] pidfs: add inode allocation Christian Brauner
@ 2025-03-04 13:06   ` Jeff Layton
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2025-03-04 13:06 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: linux-fsdevel, Lennart Poettering, Daan De Meyer, Mike Yuan

On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index ecc0dd886714..eaecb0a947f0 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -24,6 +24,27 @@
>  #include "internal.h"
>  #include "mount.h"
>  
> +static struct kmem_cache *pidfs_cachep __ro_after_init;
> +
> +/*
> + * Stashes information that userspace needs to access even after the
> + * process has been reaped.
> + */
> +struct pidfs_exit_info {
> +	__u64 cgroupid;
> +	__u64 exit_code;

This is changed to __s32 in the next patch. Might as well do that here.

> +};
> +
> +struct pidfs_inode {
> +	struct pidfs_exit_info exit_info;
> +	struct inode vfs_inode;
> +};
> +
> +static inline struct pidfs_inode *pidfs_i(struct inode *inode)
> +{
> +	return container_of(inode, struct pidfs_inode, vfs_inode);
> +}
> +
>  static struct rb_root pidfs_ino_tree = RB_ROOT;
>  
>  #if BITS_PER_LONG == 32
> @@ -492,9 +513,29 @@ static void pidfs_evict_inode(struct inode *inode)
>  	put_pid(pid);
>  }
>  
> +static struct inode *pidfs_alloc_inode(struct super_block *sb)
> +{
> +	struct pidfs_inode *pi;
> +
> +	pi = alloc_inode_sb(sb, pidfs_cachep, GFP_KERNEL);
> +	if (!pi)
> +		return NULL;
> +
> +	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
> +
> +	return &pi->vfs_inode;
> +}
> +
> +static void pidfs_free_inode(struct inode *inode)
> +{
> +	kmem_cache_free(pidfs_cachep, pidfs_i(inode));
> +}
> +
>  static const struct super_operations pidfs_sops = {
> +	.alloc_inode	= pidfs_alloc_inode,
>  	.drop_inode	= generic_delete_inode,
>  	.evict_inode	= pidfs_evict_inode,
> +	.free_inode	= pidfs_free_inode,
>  	.statfs		= simple_statfs,
>  };
>  
> @@ -704,8 +745,19 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
>  	return pidfd_file;
>  }
>  
> +static void pidfs_inode_init_once(void *data)
> +{
> +	struct pidfs_inode *pi = data;
> +
> +	inode_init_once(&pi->vfs_inode);
> +}
> +
>  void __init pidfs_init(void)
>  {
> +	pidfs_cachep = kmem_cache_create("pidfs_cache", sizeof(struct pidfs_inode), 0,
> +					 (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT |
> +					  SLAB_ACCOUNT | SLAB_PANIC),
> +					 pidfs_inode_init_once);
>  	pidfs_mnt = kern_mount(&pidfs_type);
>  	if (IS_ERR(pidfs_mnt))
>  		panic("Failed to mount pidfs pseudo filesystem");
> 

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit
  2025-03-04  9:41 ` [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit Christian Brauner
  2025-03-04 13:05   ` Jeff Layton
@ 2025-03-04 13:10   ` Oleg Nesterov
  2025-03-04 19:10     ` Christian Brauner
  1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2025-03-04 13:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

I will read this series later, but I see nothing wrong after a quick glance.

Minor nit below...

On 03/04, Christian Brauner wrote:
>
> Record the exit code and cgroupid in do_exit()
                                    ^^^^^^^^^^^^
this is no longer true. In release_task().

> @@ -254,6 +255,7 @@ void release_task(struct task_struct *p)
>  	write_lock_irq(&tasklist_lock);
>  	ptrace_release_task(p);
>  	thread_pid = get_pid(p->thread_pid);
> +	pidfs_exit(p);
>  	__exit_signal(p);

And the next patch rightly moves pidfs_exit() up outside of tasklist.

Why not call it before write_lock_irq(&tasklist_lock) from the very
beginning?

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04  9:41 ` [PATCH v2 06/15] pidfs: allow to retrieve exit information Christian Brauner
@ 2025-03-04 13:27   ` Jeff Layton
  2025-03-04 17:23     ` Christian Brauner
  2025-03-04 17:22   ` Oleg Nesterov
  2025-03-04 17:34   ` Oleg Nesterov
  2 siblings, 1 reply; 32+ messages in thread
From: Jeff Layton @ 2025-03-04 13:27 UTC (permalink / raw)
  To: Christian Brauner, Oleg Nesterov
  Cc: linux-fsdevel, Lennart Poettering, Daan De Meyer, Mike Yuan

On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> Some tools like systemd's jounral need to retrieve the exit and cgroup
> information after a process has already been reaped. This can e.g.,
> happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c                 | 88 ++++++++++++++++++++++++++++++++++++----------
>  include/uapi/linux/pidfd.h |  3 +-
>  kernel/exit.c              |  2 +-
>  3 files changed, 73 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 258e1c13ee56..11744d7fe177 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -36,7 +36,8 @@ struct pidfs_exit_info {
>  };
>  
>  struct pidfs_inode {
> -	struct pidfs_exit_info exit_info;
> +	struct pidfs_exit_info __pei;
> +	struct pidfs_exit_info *exit_info;
>  	struct inode vfs_inode;
>  };
>  
> @@ -228,17 +229,28 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  	return poll_flags;
>  }
>  
> -static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> +static inline bool current_in_pidns(struct pid *pid)
> +{
> +	const struct pid_namespace *ns = task_active_pid_ns(current);
> +
> +	if (ns->level <= pid->level)
> +		return pid->numbers[ns->level].ns == ns;
> +
> +	return false;
> +}
> +
> +static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
>  {
>  	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> +	struct pid *pid = pidfd_pid(file);
>  	size_t usize = _IOC_SIZE(cmd);
>  	struct pidfd_info kinfo = {};
> +	struct pidfs_exit_info *exit_info;
> +	struct inode *inode = file_inode(file);
>  	struct user_namespace *user_ns;
> +	struct task_struct *task;
>  	const struct cred *c;
>  	__u64 mask;
> -#ifdef CONFIG_CGROUPS
> -	struct cgroup *cgrp;
> -#endif
>  
>  	if (!uinfo)
>  		return -EINVAL;
> @@ -248,6 +260,37 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>  	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
>  		return -EFAULT;
>  
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		if (!(mask & PIDFD_INFO_EXIT))
> +			return -ESRCH;
> +
> +		if (!current_in_pidns(pid))
> +			return -ESRCH;
> +	}
> +
> +	if (mask & PIDFD_INFO_EXIT) {
> +		exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> +		if (exit_info) {
> +#ifdef CONFIG_CGROUPS
> +			kinfo.cgroupid = exit_info->cgroupid;
> +			kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
> +#endif
> +			kinfo.exit_code = exit_info->exit_code;
> +		}
> +	}
> +
> +	/*
> +	 * If the task has already been reaped only exit information
> +	 * can be provided. It's entirely possible that the task has
> +	 * already been reaped but we managed to grab a reference to it
> +	 * before that. So a full set of information about @task doesn't
> +	 * mean it hasn't been waited upon. Similarly, a full set of
> +	 * information doesn't mean that the task hasn't already exited.
> +	 */
> +	if (!task)
> +		goto copy_out;
> +
>  	c = get_task_cred(task);
>  	if (!c)
>  		return -ESRCH;
> @@ -267,11 +310,15 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>  	put_cred(c);
>  
>  #ifdef CONFIG_CGROUPS
> -	rcu_read_lock();
> -	cgrp = task_dfl_cgroup(task);
> -	kinfo.cgroupid = cgroup_id(cgrp);
> -	kinfo.mask |= PIDFD_INFO_CGROUPID;
> -	rcu_read_unlock();
> +	if (!kinfo.cgroupid) {
> +		struct cgroup *cgrp;
> +
> +		rcu_read_lock();
> +		cgrp = task_dfl_cgroup(task);
> +		kinfo.cgroupid = cgroup_id(cgrp);
> +		kinfo.mask |= PIDFD_INFO_CGROUPID;
> +		rcu_read_unlock();
> +	}
>  #endif
>  
>  	/*
> @@ -291,6 +338,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>  	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
>  		return -ESRCH;
>  
> +copy_out:
>  	/*
>  	 * If userspace and the kernel have the same struct size it can just
>  	 * be copied. If userspace provides an older struct, only the bits that
> @@ -325,7 +373,6 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  {
>  	struct task_struct *task __free(put_task) = NULL;
>  	struct nsproxy *nsp __free(put_nsproxy) = NULL;
> -	struct pid *pid = pidfd_pid(file);
>  	struct ns_common *ns_common = NULL;
>  	struct pid_namespace *pid_ns;
>  
> @@ -340,13 +387,13 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  		return put_user(file_inode(file)->i_generation, argp);
>  	}
>  
> -	task = get_pid_task(pid, PIDTYPE_PID);
> -	if (!task)
> -		return -ESRCH;
> -
>  	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
>  	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> -		return pidfd_info(task, cmd, arg);
> +		return pidfd_info(file, cmd, arg);
> +
> +	task = get_pid_task(pidfd_pid(file), PIDTYPE_PID);
> +	if (!task)
> +		return -ESRCH;
>  
>  	if (arg)
>  		return -EINVAL;
> @@ -479,10 +526,12 @@ void pidfs_exit(struct task_struct *tsk)
>  {
>  	struct dentry *dentry;
>  
> +	might_sleep();
> +
>  	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
>  	if (dentry) {
>  		struct inode *inode = d_inode(dentry);
> -		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->exit_info;
> +		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->__pei;
>  #ifdef CONFIG_CGROUPS
>  		struct cgroup *cgrp;
>  
> @@ -493,6 +542,8 @@ void pidfs_exit(struct task_struct *tsk)
>  #endif
>  		exit_info->exit_code = tsk->exit_code;
>  
> +		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
> +		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
>  		dput(dentry);
>  	}
>  }
> @@ -560,7 +611,8 @@ static struct inode *pidfs_alloc_inode(struct super_block *sb)
>  	if (!pi)
>  		return NULL;
>  
> -	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
> +	memset(&pi->__pei, 0, sizeof(pi->__pei));
> +	pi->exit_info = NULL;
>  
>  	return &pi->vfs_inode;
>  }
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index e0abd0b18841..e5966f1a7743 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -20,6 +20,7 @@
>  #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
>  #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
>  #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
> +#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
>  
>  #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
>  
> @@ -86,7 +87,7 @@ struct pidfd_info {
>  	__u32 sgid;
>  	__u32 fsuid;
>  	__u32 fsgid;
> -	__u32 spare0[1];
> +	__s32 exit_code;
>  };
>  
>  #define PIDFS_IOCTL_MAGIC 0xFF
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 98d292120296..9916305e34d3 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -250,12 +250,12 @@ void release_task(struct task_struct *p)
>  	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>  	rcu_read_unlock();
>  
> +	pidfs_exit(p);
>  	cgroup_release(p);
>  
>  	write_lock_irq(&tasklist_lock);
>  	ptrace_release_task(p);
>  	thread_pid = get_pid(p->thread_pid);
> -	pidfs_exit(p);

Why move this after you just added it?

>  	__exit_signal(p);
>  
>  	/*
> 

Everything else looks fine though. Assuming you fix the nit above, you
can add:

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04  9:41 ` [PATCH v2 06/15] pidfs: allow to retrieve exit information Christian Brauner
  2025-03-04 13:27   ` Jeff Layton
@ 2025-03-04 17:22   ` Oleg Nesterov
  2025-03-04 20:16     ` Christian Brauner
  2025-03-04 17:34   ` Oleg Nesterov
  2 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2025-03-04 17:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/04, Christian Brauner wrote:
>
> @@ -248,6 +260,37 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>  	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
>  		return -EFAULT;
>
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		if (!(mask & PIDFD_INFO_EXIT))
> +			return -ESRCH;
> +
> +		if (!current_in_pidns(pid))
> +			return -ESRCH;
> +	}
> +
> +	if (mask & PIDFD_INFO_EXIT) {
> +		exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> +		if (exit_info) {
> +#ifdef CONFIG_CGROUPS
> +			kinfo.cgroupid = exit_info->cgroupid;
> +			kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
> +#endif
> +			kinfo.exit_code = exit_info->exit_code;
> +		}

Confused... so, without CONFIG_CGROUPS pidfd_info() will never report
PIDFD_INFO_EXIT in kinfo.mask ?

> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -20,6 +20,7 @@
>  #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
>  #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
>  #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
> +#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The comment doesn't match the "if (mask & PIDFD_INFO_EXIT)" check above...

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04 13:27   ` Jeff Layton
@ 2025-03-04 17:23     ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04 17:23 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Oleg Nesterov, linux-fsdevel, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Tue, Mar 04, 2025 at 08:27:29AM -0500, Jeff Layton wrote:
> On Tue, 2025-03-04 at 10:41 +0100, Christian Brauner wrote:
> > Some tools like systemd's jounral need to retrieve the exit and cgroup
> > information after a process has already been reaped. This can e.g.,
> > happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/pidfs.c                 | 88 ++++++++++++++++++++++++++++++++++++----------
> >  include/uapi/linux/pidfd.h |  3 +-
> >  kernel/exit.c              |  2 +-
> >  3 files changed, 73 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index 258e1c13ee56..11744d7fe177 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -36,7 +36,8 @@ struct pidfs_exit_info {
> >  };
> >  
> >  struct pidfs_inode {
> > -	struct pidfs_exit_info exit_info;
> > +	struct pidfs_exit_info __pei;
> > +	struct pidfs_exit_info *exit_info;
> >  	struct inode vfs_inode;
> >  };
> >  
> > @@ -228,17 +229,28 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> >  	return poll_flags;
> >  }
> >  
> > -static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > +static inline bool current_in_pidns(struct pid *pid)
> > +{
> > +	const struct pid_namespace *ns = task_active_pid_ns(current);
> > +
> > +	if (ns->level <= pid->level)
> > +		return pid->numbers[ns->level].ns == ns;
> > +
> > +	return false;
> > +}
> > +
> > +static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
> >  {
> >  	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > +	struct pid *pid = pidfd_pid(file);
> >  	size_t usize = _IOC_SIZE(cmd);
> >  	struct pidfd_info kinfo = {};
> > +	struct pidfs_exit_info *exit_info;
> > +	struct inode *inode = file_inode(file);
> >  	struct user_namespace *user_ns;
> > +	struct task_struct *task;
> >  	const struct cred *c;
> >  	__u64 mask;
> > -#ifdef CONFIG_CGROUPS
> > -	struct cgroup *cgrp;
> > -#endif
> >  
> >  	if (!uinfo)
> >  		return -EINVAL;
> > @@ -248,6 +260,37 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >  	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
> >  		return -EFAULT;
> >  
> > +	task = get_pid_task(pid, PIDTYPE_PID);
> > +	if (!task) {
> > +		if (!(mask & PIDFD_INFO_EXIT))
> > +			return -ESRCH;
> > +
> > +		if (!current_in_pidns(pid))
> > +			return -ESRCH;
> > +	}
> > +
> > +	if (mask & PIDFD_INFO_EXIT) {
> > +		exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> > +		if (exit_info) {
> > +#ifdef CONFIG_CGROUPS
> > +			kinfo.cgroupid = exit_info->cgroupid;
> > +			kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
> > +#endif
> > +			kinfo.exit_code = exit_info->exit_code;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * If the task has already been reaped only exit information
> > +	 * can be provided. It's entirely possible that the task has
> > +	 * already been reaped but we managed to grab a reference to it
> > +	 * before that. So a full set of information about @task doesn't
> > +	 * mean it hasn't been waited upon. Similarly, a full set of
> > +	 * information doesn't mean that the task hasn't already exited.
> > +	 */
> > +	if (!task)
> > +		goto copy_out;
> > +
> >  	c = get_task_cred(task);
> >  	if (!c)
> >  		return -ESRCH;
> > @@ -267,11 +310,15 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >  	put_cred(c);
> >  
> >  #ifdef CONFIG_CGROUPS
> > -	rcu_read_lock();
> > -	cgrp = task_dfl_cgroup(task);
> > -	kinfo.cgroupid = cgroup_id(cgrp);
> > -	kinfo.mask |= PIDFD_INFO_CGROUPID;
> > -	rcu_read_unlock();
> > +	if (!kinfo.cgroupid) {
> > +		struct cgroup *cgrp;
> > +
> > +		rcu_read_lock();
> > +		cgrp = task_dfl_cgroup(task);
> > +		kinfo.cgroupid = cgroup_id(cgrp);
> > +		kinfo.mask |= PIDFD_INFO_CGROUPID;
> > +		rcu_read_unlock();
> > +	}
> >  #endif
> >  
> >  	/*
> > @@ -291,6 +338,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >  	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
> >  		return -ESRCH;
> >  
> > +copy_out:
> >  	/*
> >  	 * If userspace and the kernel have the same struct size it can just
> >  	 * be copied. If userspace provides an older struct, only the bits that
> > @@ -325,7 +373,6 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  {
> >  	struct task_struct *task __free(put_task) = NULL;
> >  	struct nsproxy *nsp __free(put_nsproxy) = NULL;
> > -	struct pid *pid = pidfd_pid(file);
> >  	struct ns_common *ns_common = NULL;
> >  	struct pid_namespace *pid_ns;
> >  
> > @@ -340,13 +387,13 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  		return put_user(file_inode(file)->i_generation, argp);
> >  	}
> >  
> > -	task = get_pid_task(pid, PIDTYPE_PID);
> > -	if (!task)
> > -		return -ESRCH;
> > -
> >  	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
> >  	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> > -		return pidfd_info(task, cmd, arg);
> > +		return pidfd_info(file, cmd, arg);
> > +
> > +	task = get_pid_task(pidfd_pid(file), PIDTYPE_PID);
> > +	if (!task)
> > +		return -ESRCH;
> >  
> >  	if (arg)
> >  		return -EINVAL;
> > @@ -479,10 +526,12 @@ void pidfs_exit(struct task_struct *tsk)
> >  {
> >  	struct dentry *dentry;
> >  
> > +	might_sleep();
> > +
> >  	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
> >  	if (dentry) {
> >  		struct inode *inode = d_inode(dentry);
> > -		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->exit_info;
> > +		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->__pei;
> >  #ifdef CONFIG_CGROUPS
> >  		struct cgroup *cgrp;
> >  
> > @@ -493,6 +542,8 @@ void pidfs_exit(struct task_struct *tsk)
> >  #endif
> >  		exit_info->exit_code = tsk->exit_code;
> >  
> > +		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
> > +		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
> >  		dput(dentry);
> >  	}
> >  }
> > @@ -560,7 +611,8 @@ static struct inode *pidfs_alloc_inode(struct super_block *sb)
> >  	if (!pi)
> >  		return NULL;
> >  
> > -	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
> > +	memset(&pi->__pei, 0, sizeof(pi->__pei));
> > +	pi->exit_info = NULL;
> >  
> >  	return &pi->vfs_inode;
> >  }
> > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> > index e0abd0b18841..e5966f1a7743 100644
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -20,6 +20,7 @@
> >  #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
> >  #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
> >  #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
> > +#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
> >  
> >  #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
> >  
> > @@ -86,7 +87,7 @@ struct pidfd_info {
> >  	__u32 sgid;
> >  	__u32 fsuid;
> >  	__u32 fsgid;
> > -	__u32 spare0[1];
> > +	__s32 exit_code;
> >  };
> >  
> >  #define PIDFS_IOCTL_MAGIC 0xFF
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 98d292120296..9916305e34d3 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -250,12 +250,12 @@ void release_task(struct task_struct *p)
> >  	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
> >  	rcu_read_unlock();
> >  
> > +	pidfs_exit(p);
> >  	cgroup_release(p);
> >  
> >  	write_lock_irq(&tasklist_lock);
> >  	ptrace_release_task(p);
> >  	thread_pid = get_pid(p->thread_pid);
> > -	pidfs_exit(p);
> 
> Why move this after you just added it?

Ah sorry that should've been done in the previous patch. pidfs_exit()
needs to be called outside of tasklist_lock as it can sleep due to dput().

> 
> >  	__exit_signal(p);
> >  
> >  	/*
> > 
> 
> Everything else looks fine though. Assuming you fix the nit above, you
> can add:
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04  9:41 ` [PATCH v2 06/15] pidfs: allow to retrieve exit information Christian Brauner
  2025-03-04 13:27   ` Jeff Layton
  2025-03-04 17:22   ` Oleg Nesterov
@ 2025-03-04 17:34   ` Oleg Nesterov
  2025-03-04 20:09     ` Christian Brauner
  2 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2025-03-04 17:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/04, Christian Brauner wrote:
>
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		if (!(mask & PIDFD_INFO_EXIT))
> +			return -ESRCH;
> +
> +		if (!current_in_pidns(pid))
> +			return -ESRCH;

Damn ;) could you explain the current_in_pidns() check to me ?
I am puzzled...

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit
  2025-03-04 13:10   ` Oleg Nesterov
@ 2025-03-04 19:10     ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04 19:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Tue, Mar 04, 2025 at 02:10:18PM +0100, Oleg Nesterov wrote:
> I will read this series later, but I see nothing wrong after a quick glance.
> 
> Minor nit below...
> 
> On 03/04, Christian Brauner wrote:
> >
> > Record the exit code and cgroupid in do_exit()
>                                     ^^^^^^^^^^^^
> this is no longer true. In release_task().

Yes, thanks for catching this!

> 
> > @@ -254,6 +255,7 @@ void release_task(struct task_struct *p)
> >  	write_lock_irq(&tasklist_lock);
> >  	ptrace_release_task(p);
> >  	thread_pid = get_pid(p->thread_pid);
> > +	pidfs_exit(p);
> >  	__exit_signal(p);
> 
> And the next patch rightly moves pidfs_exit() up outside of tasklist.
> 
> Why not call it before write_lock_irq(&tasklist_lock) from the very
> beginning?

Yes, sorry that was my intention. pidfs_exit() can sleep due to dput().
Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04 17:34   ` Oleg Nesterov
@ 2025-03-04 20:09     ` Christian Brauner
  2025-03-04 21:47       ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-04 20:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Tue, Mar 04, 2025 at 06:34:56PM +0100, Oleg Nesterov wrote:
> On 03/04, Christian Brauner wrote:
> >
> > +	task = get_pid_task(pid, PIDTYPE_PID);
> > +	if (!task) {
> > +		if (!(mask & PIDFD_INFO_EXIT))
> > +			return -ESRCH;
> > +
> > +		if (!current_in_pidns(pid))
> > +			return -ESRCH;
> 
> Damn ;) could you explain the current_in_pidns() check to me ?
> I am puzzled...

So we currently restrict interactions with pidfd by pid namespace
hierarchy. Meaning that we ensure that the pidfd is part of the caller's
pid namespace hierarchy. So this check is similar to:

pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
{
        struct upid *upid;
        pid_t nr = 0;

        if (pid && ns->level <= pid->level) {
                upid = &pid->numbers[ns->level];
                if (upid->ns == ns)
                        nr = upid->nr;
        }
        return nr;
}

IOW, we verify that the caller's pid namespace can be found within the
pid namespace hierarchy that the struct pid had a pid numbers allocated
in.

Only that by the time we perform this check the pid numbers have already
been freed so we can't use that function directly. But the pid namespace
hierarchy is still alive as that won't be released until the pidfd has
put the reference on struct @pid. So we can use current_in_pidns().

Is that understandable?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04 17:22   ` Oleg Nesterov
@ 2025-03-04 20:16     ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04 20:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Tue, Mar 04, 2025 at 06:22:55PM +0100, Oleg Nesterov wrote:
> On 03/04, Christian Brauner wrote:
> >
> > @@ -248,6 +260,37 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >  	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
> >  		return -EFAULT;
> >
> > +	task = get_pid_task(pid, PIDTYPE_PID);
> > +	if (!task) {
> > +		if (!(mask & PIDFD_INFO_EXIT))
> > +			return -ESRCH;
> > +
> > +		if (!current_in_pidns(pid))
> > +			return -ESRCH;
> > +	}
> > +
> > +	if (mask & PIDFD_INFO_EXIT) {
> > +		exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> > +		if (exit_info) {
> > +#ifdef CONFIG_CGROUPS
> > +			kinfo.cgroupid = exit_info->cgroupid;
> > +			kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
> > +#endif
> > +			kinfo.exit_code = exit_info->exit_code;
> > +		}
> 
> Confused... so, without CONFIG_CGROUPS pidfd_info() will never report
> PIDFD_INFO_EXIT in kinfo.mask ?

Fixed.

> 
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -20,6 +20,7 @@
> >  #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
> >  #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
> >  #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
> > +#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
>                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> The comment doesn't match the "if (mask & PIDFD_INFO_EXIT)" check above...

Also fixed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 17/16] selftests/pidfd: test multi-threaded exec with PPIDFD_INFO_EXIT
  2025-03-04  9:41 ` [PATCH v2 15/15] selftests/pidfd: add sixth " Christian Brauner
@ 2025-03-04 20:18   ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-04 20:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan

Add a selftest for PIDFD_INFO_EXIT behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/.gitignore      |   1 +
 tools/testing/selftests/pidfd/Makefile        |   2 +
 tools/testing/selftests/pidfd/pidfd.h         |   7 ++
 .../selftests/pidfd/pidfd_exec_helper.c       |  12 +++
 .../testing/selftests/pidfd/pidfd_info_test.c | 102 ++++++++++++++++++
 5 files changed, 124 insertions(+)
 create mode 100644 tools/testing/selftests/pidfd/pidfd_exec_helper.c

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index bddae1d4d7e4..0406a065deb4 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -9,3 +9,4 @@ pidfd_setns_test
 pidfd_file_handle_test
 pidfd_bind_mount
 pidfd_info_test
+pidfd_exec_helper
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index a94c2bc8d594..fcbefc0d77f6 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -5,5 +5,7 @@ TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
 	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test
 
+TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper
+
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index fee6fd3e67dd..cec22aa11cdf 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -254,4 +254,11 @@ static inline ssize_t write_nointr(int fd, const void *buf, size_t count)
 	return ret;
 }
 
+static inline int sys_execveat(int dirfd, const char *pathname,
+			       char *const argv[], char *const envp[],
+			       int flags)
+{
+        return syscall(__NR_execveat, dirfd, pathname, argv, envp, flags);
+}
+
 #endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_exec_helper.c b/tools/testing/selftests/pidfd/pidfd_exec_helper.c
new file mode 100644
index 000000000000..5516808c95f2
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_exec_helper.c
@@ -0,0 +1,12 @@
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+int main(int argc, char *argv[])
+{
+	if (pause())
+		_exit(EXIT_FAILURE);
+
+	_exit(EXIT_SUCCESS);
+}
diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 5e86e3df323b..f159a6705a07 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -369,4 +369,106 @@ TEST_F(pidfd_info, thread_group)
 	EXPECT_EQ(close(pidfd_thread), 0);
 }
 
+static void *pidfd_info_thread_exec(void *arg)
+{
+	pid_t pid_thread = gettid();
+	int ipc_socket = *(int *)arg;
+
+	/* Inform the grand-parent what the tid of this thread is. */
+	if (write_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	if (read_nointr(ipc_socket, &pid_thread, sizeof(pid_thread)) != sizeof(pid_thread))
+		return NULL;
+
+	close(ipc_socket);
+
+	sys_execveat(AT_FDCWD, "pidfd_exec_helper", NULL, NULL, 0);
+	return NULL;
+}
+
+TEST_F(pidfd_info, thread_group_exec)
+{
+	pid_t pid_leader, pid_thread;
+	pthread_t thread;
+	int nevents, pidfd_leader, pidfd_thread, ret;
+	int ipc_sockets[2];
+	struct pollfd fds = {};
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
+	};
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	pid_leader = create_child(&pidfd_leader, 0);
+	EXPECT_GE(pid_leader, 0);
+
+	if (pid_leader == 0) {
+		close(ipc_sockets[0]);
+
+		/* The thread will outlive the thread-group leader. */
+		if (pthread_create(&thread, NULL, pidfd_info_thread_exec, &ipc_sockets[1]))
+			syscall(__NR_exit, EXIT_FAILURE);
+
+		/* Make the thread-group leader exit prematurely. */
+		syscall(__NR_exit, EXIT_SUCCESS);
+	}
+
+	/* Retrieve the tid of the thread. */
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+
+	/* Opening a thread as a PIDFD_THREAD must succeed. */
+	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);
+	ASSERT_GE(pidfd_thread, 0);
+
+	/* Now that we've opened a thread-specific pidfd the thread can exec. */
+	ASSERT_EQ(write_nointr(ipc_sockets[0], &pid_thread, sizeof(pid_thread)), sizeof(pid_thread));
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	/* Wait until the kernel has SIGKILLed the thread. */
+	fds.events = POLLHUP;
+	fds.fd = pidfd_thread;
+	nevents = poll(&fds, 1, -1);
+	ASSERT_EQ(nevents, 1);
+	/* The thread has been reaped. */
+	ASSERT_TRUE(!!(fds.revents & POLLHUP));
+
+	/* Retrieve thread-specific exit info from pidfd. */
+	ASSERT_EQ(ioctl(pidfd_thread, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	/*
+	 * While the kernel will have SIGKILLed the whole thread-group
+	 * during exec it will cause the individual threads to exit
+	 * cleanly.
+	 */
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+
+	/*
+	 * The thread-group leader is still alive, the thread has taken
+	 * over its struct pid and thus its pid number.
+	 */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_EQ(info.pid, pid_leader);
+
+	/* Take down the thread-group leader. */
+	EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
+	EXPECT_EQ(sys_waitid(P_PIDFD, pidfd_leader, NULL, WEXITED), 0);
+
+	/* Retrieve exit information for the thread-group leader. */
+	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;
+	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+
+	EXPECT_EQ(close(pidfd_leader), 0);
+	EXPECT_EQ(close(pidfd_thread), 0);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04 20:09     ` Christian Brauner
@ 2025-03-04 21:47       ` Oleg Nesterov
  2025-03-05  8:54         ` Christian Brauner
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2025-03-04 21:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/04, Christian Brauner wrote:
>
> On Tue, Mar 04, 2025 at 06:34:56PM +0100, Oleg Nesterov wrote:
> > On 03/04, Christian Brauner wrote:
> > >
> > > +	task = get_pid_task(pid, PIDTYPE_PID);
> > > +	if (!task) {
> > > +		if (!(mask & PIDFD_INFO_EXIT))
> > > +			return -ESRCH;
> > > +
> > > +		if (!current_in_pidns(pid))
> > > +			return -ESRCH;
> >
> > Damn ;) could you explain the current_in_pidns() check to me ?
> > I am puzzled...
>
> So we currently restrict interactions with pidfd by pid namespace
> hierarchy. Meaning that we ensure that the pidfd is part of the caller's
> pid namespace hierarchy.

Well this is clear... but sorry I still can't understand.

Why do we check current_in_pidns() only if get_pid_task(PIDTYPE_PID)
returns NULL?

And, unless (quite possibly) I am totally confused, if task != NULL
but current_in_pidns() would return false, then

	kinfo.pid = task_pid_vnr(task);

below will set kinfo.pid = 0, and pidfd_info() will return -ESRCH anyway?

> So this check is similar to:
>
> pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
> {
>         struct upid *upid;
>         pid_t nr = 0;
>
>         if (pid && ns->level <= pid->level) {
>                 upid = &pid->numbers[ns->level];
>                 if (upid->ns == ns)
>                         nr = upid->nr;
>         }
>         return nr;
> }
>
> Only that by the time we perform this check the pid numbers have already
> been freed so we can't use that function directly.

Confused again... Yes, the [u]pid numbers can be already "freed" in that
upid->nr can be already idr_remove()'ed, but

> But the pid namespace
> hierarchy is still alive as that won't be released until the pidfd has
> put the reference on struct @pid.

Yes, so I still don't undestand, sorry :/

IOW. Why not check current_in_pidns() at the start? and do
task = get_pid_task() later, right before

	if (!task)
		goto copy_out;

?

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/15] pidfs: allow to retrieve exit information
  2025-03-04 21:47       ` Oleg Nesterov
@ 2025-03-05  8:54         ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-05  8:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Tue, Mar 04, 2025 at 10:47:11PM +0100, Oleg Nesterov wrote:
> On 03/04, Christian Brauner wrote:
> >
> > On Tue, Mar 04, 2025 at 06:34:56PM +0100, Oleg Nesterov wrote:
> > > On 03/04, Christian Brauner wrote:
> > > >
> > > > +	task = get_pid_task(pid, PIDTYPE_PID);
> > > > +	if (!task) {
> > > > +		if (!(mask & PIDFD_INFO_EXIT))
> > > > +			return -ESRCH;
> > > > +
> > > > +		if (!current_in_pidns(pid))
> > > > +			return -ESRCH;
> > >
> > > Damn ;) could you explain the current_in_pidns() check to me ?
> > > I am puzzled...
> >
> > So we currently restrict interactions with pidfd by pid namespace
> > hierarchy. Meaning that we ensure that the pidfd is part of the caller's
> > pid namespace hierarchy.
> 
> Well this is clear... but sorry I still can't understand.

Ok, it sounded like you wanted me to explain what current_in_pidns()
does not why it's placed where it is placed. :)

> 
> Why do we check current_in_pidns() only if get_pid_task(PIDTYPE_PID)
> returns NULL?

Because if task != NULL we already catch it kinfo.pid and since we can't
skip the call to task_pid_vnr() it seemed redundant to do that check
twice. But if we do it before get_task_pid() it makes more sense ofc.

> 
> And, unless (quite possibly) I am totally confused, if task != NULL
> but current_in_pidns() would return false, then
> 
> 	kinfo.pid = task_pid_vnr(task);
> 
> below will set kinfo.pid = 0, and pidfd_info() will return -ESRCH anyway?

Yes.

> 
> > So this check is similar to:
> >
> > pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
> > {
> >         struct upid *upid;
> >         pid_t nr = 0;
> >
> >         if (pid && ns->level <= pid->level) {
> >                 upid = &pid->numbers[ns->level];
> >                 if (upid->ns == ns)
> >                         nr = upid->nr;
> >         }
> >         return nr;
> > }
> >
> > Only that by the time we perform this check the pid numbers have already
> > been freed so we can't use that function directly.
> 
> Confused again... Yes, the [u]pid numbers can be already "freed" in that
> upid->nr can be already idr_remove()'ed, but
> 
> > But the pid namespace
> > hierarchy is still alive as that won't be released until the pidfd has
> > put the reference on struct @pid.
> 
> Yes, so I still don't undestand, sorry :/
> 
> IOW. Why not check current_in_pidns() at the start? and do
> task = get_pid_task() later, right before

Sure, that's a good suggestion.

> 
> 	if (!task)
> 		goto copy_out;
> 
> ?
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-03-05  8:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-04  9:41 [PATCH v2 00/15] pidfs: provide information after task has been reaped Christian Brauner
2025-03-04  9:41 ` [PATCH v2 01/15] pidfs: switch to copy_struct_to_user() Christian Brauner
2025-03-04 12:42   ` Jeff Layton
2025-03-04  9:41 ` [PATCH v2 02/15] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
2025-03-04 12:44   ` Jeff Layton
2025-03-04  9:41 ` [PATCH v2 03/15] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
2025-03-04 12:53   ` Jeff Layton
2025-03-04  9:41 ` [PATCH v2 04/15] pidfs: add inode allocation Christian Brauner
2025-03-04 13:06   ` Jeff Layton
2025-03-04  9:41 ` [PATCH v2 05/15] pidfs: record exit code and cgroupid at exit Christian Brauner
2025-03-04 13:05   ` Jeff Layton
2025-03-04 13:10   ` Oleg Nesterov
2025-03-04 19:10     ` Christian Brauner
2025-03-04  9:41 ` [PATCH v2 06/15] pidfs: allow to retrieve exit information Christian Brauner
2025-03-04 13:27   ` Jeff Layton
2025-03-04 17:23     ` Christian Brauner
2025-03-04 17:22   ` Oleg Nesterov
2025-03-04 20:16     ` Christian Brauner
2025-03-04 17:34   ` Oleg Nesterov
2025-03-04 20:09     ` Christian Brauner
2025-03-04 21:47       ` Oleg Nesterov
2025-03-05  8:54         ` Christian Brauner
2025-03-04  9:41 ` [PATCH v2 07/15] selftests/pidfd: fix header inclusion Christian Brauner
2025-03-04  9:41 ` [PATCH v2 08/15] pidfs/selftests: ensure correct headers for ioctl handling Christian Brauner
2025-03-04  9:41 ` [PATCH v2 09/15] selftests/pidfd: move more defines to common header Christian Brauner
2025-03-04  9:41 ` [PATCH v2 10/15] selftests/pidfd: add first PIDFD_INFO_EXIT selftest Christian Brauner
2025-03-04  9:41 ` [PATCH v2 11/15] selftests/pidfd: add second " Christian Brauner
2025-03-04  9:41 ` [PATCH v2 12/15] selftests/pidfd: add third " Christian Brauner
2025-03-04  9:41 ` [PATCH v2 13/15] selftests/pidfd: add fourth " Christian Brauner
2025-03-04  9:41 ` [PATCH v2 14/15] selftests/pidfd: add fifth " Christian Brauner
2025-03-04  9:41 ` [PATCH v2 15/15] selftests/pidfd: add sixth " Christian Brauner
2025-03-04 20:18   ` [PATCH v2 17/16] selftests/pidfd: test multi-threaded exec with PPIDFD_INFO_EXIT Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.