[PATCH RFC 00/10] pidfs: provide information after task has been reaped

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 00/10] pidfs: provide information after task has been reaped
@ 2025-02-28 12:44 Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 01/10] pidfs: switch to copy_struct_to_user() Christian Brauner
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Hey,

Various tools need access to information about a process/task even after
it has already been reaped. For example, systemd's journal logs
and uses such information as the cgroup id and exit status to deal with
processes that have been sent via SCM_PIDFD or SCM_PEERPIDFD. By the
time the pidfd is received the process might've already exited or even
been reaped.

This series aims to provide information by extending the PIDFD_GET_INFO
ioctl to retrieve the exit code and cgroup id. There might be other
stuff that we would want in the future.

Note, this is and RFC and it has a bunch of TODOs/questions in comments.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (10):
      pidfs: switch to copy_struct_to_user()
      pidfd: rely on automatic cleanup in __pidfd_prepare()
      pidfs: move setting flags into pidfs_alloc_file()
      pidfs: add inode allocation
      pidfs: record exit code and cgroupid at exit
      pidfs: allow to retrieve exit information
      selftests/pidfd: fix header inclusion
      pidfs/selftests: ensure correct headers for ioctl handling
      selftests/pidfd: move more defines to common header
      selftests/pidfd: add PIDFD_INFO_EXIT tests

 fs/internal.h                                     |   1 +
 fs/libfs.c                                        |   4 +-
 fs/pidfs.c                                        | 170 ++++++++++++++++++--
 include/linux/pidfs.h                             |   1 +
 include/uapi/linux/pidfd.h                        |   3 +-
 kernel/exit.c                                     |   2 +
 kernel/fork.c                                     |  15 +-
 tools/testing/selftests/pidfd/.gitignore          |   1 +
 tools/testing/selftests/pidfd/Makefile            |   2 +-
 tools/testing/selftests/pidfd/pidfd.h             |  82 ++++++++++
 tools/testing/selftests/pidfd/pidfd_fdinfo_test.c |   1 +
 tools/testing/selftests/pidfd/pidfd_info_test.c   | 185 ++++++++++++++++++++++
 tools/testing/selftests/pidfd/pidfd_open_test.c   |  26 ---
 tools/testing/selftests/pidfd/pidfd_setns_test.c  |  45 ------
 14 files changed, 439 insertions(+), 99 deletions(-)
---
base-commit: b1e809e7f64ad47dd232ff072d8ef59c1fe414c5
change-id: 20250227-work-pidfs-kill_on_last_close-a23ddf21db47


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 01/10] pidfs: switch to copy_struct_to_user()
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 02/10] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

We have a helper that deals with all the required logic.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 049352f973de..aa8c8bda8c8f 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -276,10 +276,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	 * userspace knows about will be copied. If userspace provides a new
 	 * struct, only the bits that the kernel knows about will be copied.
 	 */
-	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
-		return -EFAULT;
-
-	return 0;
+	return copy_struct_to_user(uinfo, usize, &kinfo, sizeof(kinfo), NULL);
 }
 
 static bool pidfs_ioctl_valid(unsigned int cmd)

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 02/10] pidfd: rely on automatic cleanup in __pidfd_prepare()
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 01/10] pidfs: switch to copy_struct_to_user() Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Rely on scope-based cleanup for the allocated file descriptor.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/fork.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..6230f5256bc5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2032,25 +2032,23 @@ static inline void rcu_copy_process(struct task_struct *p)
  */
 static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	int pidfd;
 	struct file *pidfd_file;
 
-	pidfd = get_unused_fd_flags(O_CLOEXEC);
+	CLASS(get_unused_fd, pidfd)(O_CLOEXEC);
 	if (pidfd < 0)
 		return pidfd;
 
 	pidfd_file = pidfs_alloc_file(pid, flags | O_RDWR);
-	if (IS_ERR(pidfd_file)) {
-		put_unused_fd(pidfd);
+	if (IS_ERR(pidfd_file))
 		return PTR_ERR(pidfd_file);
-	}
+
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
 	 */
 	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
 	*ret = pidfd_file;
-	return pidfd;
+	return take_fd(pidfd);
 }
 
 /**

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file()
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 01/10] pidfs: switch to copy_struct_to_user() Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 02/10] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-03-02 13:09   ` Oleg Nesterov
  2025-02-28 12:44 ` [PATCH RFC 04/10] pidfs: add inode allocation Christian Brauner
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Instead od adding it into __pidfd_prepare() place it where the actual
file allocation happens and update the outdated comment.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c    | 4 ++++
 kernel/fork.c | 5 -----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index aa8c8bda8c8f..61be98f7ad0b 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 		return ERR_PTR(ret);
 
 	pidfd_file = dentry_open(&path, flags, current_cred());
+	/* Raise PIDFD_THREAD explicitly as dentry_open() strips it. */
+	if (!IS_ERR(pidfd_file))
+		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
+
 	path_put(&path);
 	return pidfd_file;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 6230f5256bc5..8eac9cd3385b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,11 +2042,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (IS_ERR(pidfd_file))
 		return PTR_ERR(pidfd_file);
 
-	/*
-	 * anon_inode_getfile() ignores everything outside of the
-	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
-	 */
-	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
 	*ret = pidfd_file;
 	return take_fd(pidfd);
 }

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 04/10] pidfs: add inode allocation
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (2 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 05/10] pidfs: record exit code and cgroupid at exit Christian Brauner
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 61be98f7ad0b..64428697996f 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -24,6 +24,27 @@
 #include "internal.h"
 #include "mount.h"
 
+static struct kmem_cache *pidfs_cachep __ro_after_init;
+
+/*
+ * Stashes information that userspace needs to access even after the
+ * process has been reaped.
+ */
+struct pidfs_exit_info {
+	__u64 cgroupid;
+	__u64 exit_code;
+};
+
+struct pidfs_inode {
+	struct pidfs_exit_info exit_info;
+	struct inode vfs_inode;
+};
+
+static inline struct pidfs_inode *pidfs_i(struct inode *inode)
+{
+	return container_of(inode, struct pidfs_inode, vfs_inode);
+}
+
 static struct rb_root pidfs_ino_tree = RB_ROOT;
 
 #if BITS_PER_LONG == 32
@@ -492,9 +513,29 @@ static void pidfs_evict_inode(struct inode *inode)
 	put_pid(pid);
 }
 
+static struct inode *pidfs_alloc_inode(struct super_block *sb)
+{
+	struct pidfs_inode *pi;
+
+	pi = alloc_inode_sb(sb, pidfs_cachep, GFP_KERNEL);
+	if (!pi)
+		return NULL;
+
+	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
+
+	return &pi->vfs_inode;
+}
+
+static void pidfs_free_inode(struct inode *inode)
+{
+	kmem_cache_free(pidfs_cachep, pidfs_i(inode));
+}
+
 static const struct super_operations pidfs_sops = {
+	.alloc_inode	= pidfs_alloc_inode,
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= pidfs_evict_inode,
+	.free_inode	= pidfs_free_inode,
 	.statfs		= simple_statfs,
 };
 
@@ -704,8 +745,19 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	return pidfd_file;
 }
 
+static void pidfs_inode_init_once(void *data)
+{
+	struct pidfs_inode *pi = data;
+
+	inode_init_once(&pi->vfs_inode);
+}
+
 void __init pidfs_init(void)
 {
+	pidfs_cachep = kmem_cache_create("pidfs_cache", sizeof(struct pidfs_inode), 0,
+					 (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT |
+					  SLAB_ACCOUNT | SLAB_PANIC),
+					 pidfs_inode_init_once);
 	pidfs_mnt = kern_mount(&pidfs_type);
 	if (IS_ERR(pidfs_mnt))
 		panic("Failed to mount pidfs pseudo filesystem");

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 05/10] pidfs: record exit code and cgroupid at exit
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (3 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 04/10] pidfs: add inode allocation Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-03-02 15:19   ` Oleg Nesterov
  2025-02-28 12:44 ` [PATCH RFC 06/10] pidfs: allow to retrieve exit information Christian Brauner
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Record the exit code and cgroupid in do_exit() and stash in struct
pidfs_exit_info so it can be retrieved even after the task has been
reaped.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/internal.h         |  1 +
 fs/libfs.c            |  4 ++--
 fs/pidfs.c            | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pidfs.h |  1 +
 kernel/exit.c         |  2 ++
 5 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index e7f02ae1e098..c1e6d8b294cb 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -325,6 +325,7 @@ struct stashed_operations {
 int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
 		      struct path *path);
 void stashed_dentry_prune(struct dentry *dentry);
+struct dentry *stashed_dentry_get(struct dentry **stashed);
 /**
  * path_mounted - check whether path is mounted
  * @path: path to check
diff --git a/fs/libfs.c b/fs/libfs.c
index 8444f5cc4064..cf5a267aafe4 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -2113,7 +2113,7 @@ struct timespec64 simple_inode_init_ts(struct inode *inode)
 }
 EXPORT_SYMBOL(simple_inode_init_ts);
 
-static inline struct dentry *get_stashed_dentry(struct dentry **stashed)
+struct dentry *stashed_dentry_get(struct dentry **stashed)
 {
 	struct dentry *dentry;
 
@@ -2215,7 +2215,7 @@ int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
 	const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;
 
 	/* See if dentry can be reused. */
-	path->dentry = get_stashed_dentry(stashed);
+	path->dentry = stashed_dentry_get(stashed);
 	if (path->dentry) {
 		sops->put_data(data);
 		goto out_path;
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 64428697996f..433f676c066c 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -458,6 +458,53 @@ struct pid *pidfd_pid(const struct file *file)
 	return file_inode(file)->i_private;
 }
 
+/*
+ * We're called from do_exit(). We know there's at least one reference
+ * to struct pid being held that won't be released until the task has
+ * been reaped which cannot happen until we're out of do_exit().
+ *
+ * If this struct pid is refered to by a pidfd then stashed_dentry_get()
+ * will return the dentry and inode for that struct pid. Since we've
+ * taken a reference on it there's now an additional reference from the
+ * exit path on it. Which is fine. We're going to put it again in a
+ * second and we know that the pid is kept alive anyway.
+ *
+ * Worst case is that we've filled in the info and immediately free the
+ * dentry and inode afterwards since the pidfd has been closed. Since
+ * pidfs_exit() currently is placed after exit_task_work() we know that
+ * it cannot be us aka the exiting task holding a pidfd to ourselves.
+ */
+void pidfs_exit(struct task_struct *tsk)
+{
+	struct dentry *dentry;
+
+	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
+	if (dentry) {
+		struct inode *inode;
+		struct pidfs_exit_info *exit_info;
+#ifdef CONFIG_CGROUPS
+		struct cgroup *cgrp;
+#endif
+		inode = d_inode(dentry);
+		exit_info = &pidfs_i(inode)->exit_info;
+
+		/* TODO: Annoy Oleg to tell me how to do this correctly. */
+		if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
+			exit_info->exit_code = tsk->signal->group_exit_code;
+		else
+			exit_info->exit_code = tsk->exit_code;
+
+#ifdef CONFIG_CGROUPS
+		rcu_read_lock();
+		cgrp = task_dfl_cgroup(tsk);
+		exit_info->cgroupid = cgroup_id(cgrp);
+		rcu_read_unlock();
+#endif
+
+		dput(dentry);
+	}
+}
+
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
 /*
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 7c830d0dec9a..05e6f8f4a026 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -6,6 +6,7 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
 void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
+void pidfs_exit(struct task_struct *tsk);
 extern const struct dentry_operations pidfs_dentry_operations;
 
 #endif /* _LINUX_PID_FS_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..cae475e7858c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -69,6 +69,7 @@
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
 #include <linux/uaccess.h>
+#include <linux/pidfs.h>
 
 #include <uapi/linux/wait.h>
 
@@ -948,6 +949,7 @@ void __noreturn do_exit(long code)
 
 	sched_autogroup_exit_task(tsk);
 	cgroup_exit(tsk);
+	pidfs_exit(tsk);
 
 	/*
 	 * FIXME: do that only when needed, using sched_exit tracepoint

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (4 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 05/10] pidfs: record exit code and cgroupid at exit Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-03-02  2:40   ` Mike Yuan
  2025-03-02 15:53   ` Oleg Nesterov
  2025-02-28 12:44 ` [PATCH RFC 07/10] selftests/pidfd: fix header inclusion Christian Brauner
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Some tools like systemd's jounral need to retrieve the exit and cgroup
information after a process has already been reaped. This can e.g.,
happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c                 | 70 +++++++++++++++++++++++++++++++++++++---------
 include/uapi/linux/pidfd.h |  3 +-
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 433f676c066c..e500bc4c5af2 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -32,11 +32,12 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
  */
 struct pidfs_exit_info {
 	__u64 cgroupid;
-	__u64 exit_code;
+	__s32 exit_code;
 };
 
 struct pidfs_inode {
-	struct pidfs_exit_info exit_info;
+	struct pidfs_exit_info __pei;
+	struct pidfs_exit_info *exit_info;
 	struct inode vfs_inode;
 };
 
@@ -228,11 +229,14 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 	return poll_flags;
 }
 
-static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
+static long pidfd_info(struct file *file, struct task_struct *task,
+		       unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
 	size_t usize = _IOC_SIZE(cmd);
 	struct pidfd_info kinfo = {};
+	struct pidfs_exit_info *exit_info;
+	struct inode *inode = file_inode(file);
 	struct user_namespace *user_ns;
 	const struct cred *c;
 	__u64 mask;
@@ -248,6 +252,39 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
 		return -EFAULT;
 
+	exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
+	if (exit_info) {
+		/*
+		 * TODO: Oleg, I didn't see a reason for putting
+		 * retrieval of the exit status of a task behind some
+		 * form of permission check. Maybe there's some
+		 * potential concerns with seeing the exit status of a
+		 * SIGKILLed suid binary or something but even then I'm
+		 * not sure that's a problem.
+		 *
+		 * If we want this we could put this behind some *uid
+		 * check similar to what ptrace access does by recording
+		 * parts of the creds we'd need for checking this. But
+		 * only if we really need it.
+		 */
+		kinfo.exit_code = exit_info->exit_code;
+#ifdef CONFIG_CGROUPS
+		kinfo.cgroupid = exit_info->cgroupid;
+		kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
+#endif
+	}
+
+	/*
+	 * If the task has already been reaped only exit information
+	 * can be provided. It's entirely possible that the task has
+	 * already been reaped but we managed to grab a reference to it
+	 * before that. So a full set of information about @task doesn't
+	 * mean it hasn't been waited upon. Similarly, a full set of
+	 * information doesn't mean that the task hasn't already exited.
+	 */
+	if (!task)
+		goto copy_out;
+
 	c = get_task_cred(task);
 	if (!c)
 		return -ESRCH;
@@ -267,11 +304,13 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	put_cred(c);
 
 #ifdef CONFIG_CGROUPS
-	rcu_read_lock();
-	cgrp = task_dfl_cgroup(task);
-	kinfo.cgroupid = cgroup_id(cgrp);
-	kinfo.mask |= PIDFD_INFO_CGROUPID;
-	rcu_read_unlock();
+	if (!kinfo.cgroupid) {
+		rcu_read_lock();
+		cgrp = task_dfl_cgroup(task);
+		kinfo.cgroupid = cgroup_id(cgrp);
+		kinfo.mask |= PIDFD_INFO_CGROUPID;
+		rcu_read_unlock();
+	}
 #endif
 
 	/*
@@ -291,6 +330,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
 	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
 		return -ESRCH;
 
+copy_out:
 	/*
 	 * If userspace and the kernel have the same struct size it can just
 	 * be copied. If userspace provides an older struct, only the bits that
@@ -341,12 +381,13 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	}
 
 	task = get_pid_task(pid, PIDTYPE_PID);
-	if (!task)
-		return -ESRCH;
 
 	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
 	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
-		return pidfd_info(task, cmd, arg);
+		return pidfd_info(file, task, cmd, arg);
+
+	if (!task)
+		return -ESRCH;
 
 	if (arg)
 		return -EINVAL;
@@ -486,7 +527,7 @@ void pidfs_exit(struct task_struct *tsk)
 		struct cgroup *cgrp;
 #endif
 		inode = d_inode(dentry);
-		exit_info = &pidfs_i(inode)->exit_info;
+		exit_info = &pidfs_i(inode)->__pei;
 
 		/* TODO: Annoy Oleg to tell me how to do this correctly. */
 		if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
@@ -501,6 +542,8 @@ void pidfs_exit(struct task_struct *tsk)
 		rcu_read_unlock();
 #endif
 
+		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
+		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
 		dput(dentry);
 	}
 }
@@ -568,7 +611,8 @@ static struct inode *pidfs_alloc_inode(struct super_block *sb)
 	if (!pi)
 		return NULL;
 
-	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
+	memset(&pi->__pei, 0, sizeof(pi->__pei));
+	pi->exit_info = NULL;
 
 	return &pi->vfs_inode;
 }
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index e0abd0b18841..e5966f1a7743 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -20,6 +20,7 @@
 #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
 #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
+#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
 
 #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
 
@@ -86,7 +87,7 @@ struct pidfd_info {
 	__u32 sgid;
 	__u32 fsuid;
 	__u32 fsgid;
-	__u32 spare0[1];
+	__s32 exit_code;
 };
 
 #define PIDFS_IOCTL_MAGIC 0xFF

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 07/10] selftests/pidfd: fix header inclusion
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (5 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 06/10] pidfs: allow to retrieve exit information Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 08/10] pidfs/selftests: ensure correct headers for ioctl handling Christian Brauner
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Ensure that necessary defines are present.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_fdinfo_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c b/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
index f062a986e382..f718aac75068 100644
--- a/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
@@ -13,6 +13,7 @@
 #include <syscall.h>
 #include <sys/wait.h>
 #include <sys/mman.h>
+#include <sys/mount.h>
 
 #include "pidfd.h"
 #include "../kselftest.h"

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 08/10] pidfs/selftests: ensure correct headers for ioctl handling
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (6 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 07/10] selftests/pidfd: fix header inclusion Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 09/10] selftests/pidfd: move more defines to common header Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 10/10] selftests/pidfd: add PIDFD_INFO_EXIT tests Christian Brauner
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_setns_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_setns_test.c b/tools/testing/selftests/pidfd/pidfd_setns_test.c
index 222f8131283b..d9e715de68b3 100644
--- a/tools/testing/selftests/pidfd/pidfd_setns_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_setns_test.c
@@ -16,7 +16,7 @@
 #include <unistd.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
-#include <linux/ioctl.h>
+#include <sys/ioctl.h>
 
 #include "pidfd.h"
 #include "../kselftest_harness.h"

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 09/10] selftests/pidfd: move more defines to common header
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (7 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 08/10] pidfs/selftests: ensure correct headers for ioctl handling Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  2025-02-28 12:44 ` [PATCH RFC 10/10] selftests/pidfd: add PIDFD_INFO_EXIT tests Christian Brauner
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h            | 78 ++++++++++++++++++++++++
 tools/testing/selftests/pidfd/pidfd_open_test.c  | 26 --------
 tools/testing/selftests/pidfd/pidfd_setns_test.c | 45 --------------
 3 files changed, 78 insertions(+), 71 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 027ebaf14844..bad518766aa5 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -12,6 +12,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <syscall.h>
+#include <sys/ioctl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 
@@ -66,6 +67,83 @@
 #define PIDFD_SELF_PROCESS	PIDFD_SELF_THREAD_GROUP
 #endif
 
+#ifndef PIDFS_IOCTL_MAGIC
+#define PIDFS_IOCTL_MAGIC 0xFF
+#endif
+
+#ifndef PIDFD_GET_CGROUP_NAMESPACE
+#define PIDFD_GET_CGROUP_NAMESPACE            _IO(PIDFS_IOCTL_MAGIC, 1)
+#endif
+
+#ifndef PIDFD_GET_IPC_NAMESPACE
+#define PIDFD_GET_IPC_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 2)
+#endif
+
+#ifndef PIDFD_GET_MNT_NAMESPACE
+#define PIDFD_GET_MNT_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 3)
+#endif
+
+#ifndef PIDFD_GET_NET_NAMESPACE
+#define PIDFD_GET_NET_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 4)
+#endif
+
+#ifndef PIDFD_GET_PID_NAMESPACE
+#define PIDFD_GET_PID_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 5)
+#endif
+
+#ifndef PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE
+#define PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE  _IO(PIDFS_IOCTL_MAGIC, 6)
+#endif
+
+#ifndef PIDFD_GET_TIME_NAMESPACE
+#define PIDFD_GET_TIME_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 7)
+#endif
+
+#ifndef PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE
+#define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8)
+#endif
+
+#ifndef PIDFD_GET_USER_NAMESPACE
+#define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
+#endif
+
+#ifndef PIDFD_GET_UTS_NAMESPACE
+#define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
+#endif
+
+#ifndef PIDFD_GET_INFO
+#define PIDFD_GET_INFO			      _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
+#endif
+
+#ifndef PIDFD_INFO_PID
+#define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
+#endif
+
+#ifndef PIDFD_INFO_CREDS
+#define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
+#endif
+
+#ifndef PIDFD_INFO_CGROUPID
+#define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
+#endif
+
+struct pidfd_info {
+	__u64 mask;
+	__u64 cgroupid;
+	__u32 pid;
+	__u32 tgid;
+	__u32 ppid;
+	__u32 ruid;
+	__u32 rgid;
+	__u32 euid;
+	__u32 egid;
+	__u32 suid;
+	__u32 sgid;
+	__u32 fsuid;
+	__u32 fsgid;
+	__u32 spare0[1];
+};
+
 /*
  * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
  * That means, when it wraps around any pid < 300 will be skipped.
diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
index 9a40ccb1ff6d..cd3de40e4977 100644
--- a/tools/testing/selftests/pidfd/pidfd_open_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
@@ -22,32 +22,6 @@
 #include "pidfd.h"
 #include "../kselftest.h"
 
-#ifndef PIDFS_IOCTL_MAGIC
-#define PIDFS_IOCTL_MAGIC 0xFF
-#endif
-
-#ifndef PIDFD_GET_INFO
-#define PIDFD_GET_INFO _IOWR(PIDFS_IOCTL_MAGIC, 11, struct pidfd_info)
-#define PIDFD_INFO_CGROUPID		(1UL << 0)
-
-struct pidfd_info {
-	__u64 mask;
-	__u64 cgroupid;
-	__u32 pid;
-	__u32 tgid;
-	__u32 ppid;
-	__u32 ruid;
-	__u32 rgid;
-	__u32 euid;
-	__u32 egid;
-	__u32 suid;
-	__u32 sgid;
-	__u32 fsuid;
-	__u32 fsgid;
-	__u32 spare0[1];
-};
-#endif
-
 static int safe_int(const char *numstr, int *converted)
 {
 	char *err = NULL;
diff --git a/tools/testing/selftests/pidfd/pidfd_setns_test.c b/tools/testing/selftests/pidfd/pidfd_setns_test.c
index d9e715de68b3..e6a079b3d5e2 100644
--- a/tools/testing/selftests/pidfd/pidfd_setns_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_setns_test.c
@@ -16,55 +16,10 @@
 #include <unistd.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
-#include <sys/ioctl.h>
 
 #include "pidfd.h"
 #include "../kselftest_harness.h"
 
-#ifndef PIDFS_IOCTL_MAGIC
-#define PIDFS_IOCTL_MAGIC 0xFF
-#endif
-
-#ifndef PIDFD_GET_CGROUP_NAMESPACE
-#define PIDFD_GET_CGROUP_NAMESPACE            _IO(PIDFS_IOCTL_MAGIC, 1)
-#endif
-
-#ifndef PIDFD_GET_IPC_NAMESPACE
-#define PIDFD_GET_IPC_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 2)
-#endif
-
-#ifndef PIDFD_GET_MNT_NAMESPACE
-#define PIDFD_GET_MNT_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 3)
-#endif
-
-#ifndef PIDFD_GET_NET_NAMESPACE
-#define PIDFD_GET_NET_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 4)
-#endif
-
-#ifndef PIDFD_GET_PID_NAMESPACE
-#define PIDFD_GET_PID_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 5)
-#endif
-
-#ifndef PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE
-#define PIDFD_GET_PID_FOR_CHILDREN_NAMESPACE  _IO(PIDFS_IOCTL_MAGIC, 6)
-#endif
-
-#ifndef PIDFD_GET_TIME_NAMESPACE
-#define PIDFD_GET_TIME_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 7)
-#endif
-
-#ifndef PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE
-#define PIDFD_GET_TIME_FOR_CHILDREN_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 8)
-#endif
-
-#ifndef PIDFD_GET_USER_NAMESPACE
-#define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
-#endif
-
-#ifndef PIDFD_GET_UTS_NAMESPACE
-#define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)
-#endif
-
 enum {
 	PIDFD_NS_USER,
 	PIDFD_NS_MNT,

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 10/10] selftests/pidfd: add PIDFD_INFO_EXIT tests
  2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
                   ` (8 preceding siblings ...)
  2025-02-28 12:44 ` [PATCH RFC 09/10] selftests/pidfd: move more defines to common header Christian Brauner
@ 2025-02-28 12:44 ` Christian Brauner
  9 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-02-28 12:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, Christian Brauner

Test that PIDFD_INFO_EXIT works correctly.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/.gitignore        |   1 +
 tools/testing/selftests/pidfd/Makefile          |   2 +-
 tools/testing/selftests/pidfd/pidfd.h           |   6 +-
 tools/testing/selftests/pidfd/pidfd_info_test.c | 185 ++++++++++++++++++++++++
 4 files changed, 192 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index bf92481f925c..bddae1d4d7e4 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -8,3 +8,4 @@ pidfd_getfd_test
 pidfd_setns_test
 pidfd_file_handle_test
 pidfd_bind_mount
+pidfd_info_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 301343a11b62..a94c2bc8d594 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -3,7 +3,7 @@ CFLAGS += -g $(KHDR_INCLUDES) -pthread -Wall
 
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
-	pidfd_file_handle_test pidfd_bind_mount
+	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index bad518766aa5..cc8e381978df 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -127,6 +127,10 @@
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
 #endif
 
+#ifndef PIDFD_INFO_EXIT
+#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
+#endif
+
 struct pidfd_info {
 	__u64 mask;
 	__u64 cgroupid;
@@ -141,7 +145,7 @@ struct pidfd_info {
 	__u32 sgid;
 	__u32 fsuid;
 	__u32 fsgid;
-	__u32 spare0[1];
+	__s32 exit_code;
 };
 
 /*
diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
new file mode 100644
index 000000000000..e2368ba271fb
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <poll.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <linux/kcmp.h>
+#include <sys/stat.h>
+
+#include "pidfd.h"
+#include "../kselftest_harness.h"
+
+FIXTURE(pidfd_info)
+{
+	pid_t child_pid1;
+	int child_pidfd1;
+
+	pid_t child_pid2;
+	int child_pidfd2;
+
+	pid_t child_pid3;
+	int child_pidfd3;
+
+	pid_t child_pid4;
+	int child_pidfd4;
+};
+
+FIXTURE_SETUP(pidfd_info)
+{
+	int ret;
+	int ipc_sockets[2];
+	char c;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	self->child_pid1 = create_child(&self->child_pidfd1, 0);
+	EXPECT_GE(self->child_pid1, 0);
+
+	if (self->child_pid1 == 0) {
+		close(ipc_sockets[0]);
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0)
+			_exit(EXIT_FAILURE);
+
+		close(ipc_sockets[1]);
+
+		pause();
+		_exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	/* SIGKILL but don't reap. */
+	EXPECT_EQ(sys_pidfd_send_signal(self->child_pidfd1, SIGKILL, NULL, 0), 0);
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	EXPECT_EQ(ret, 0);
+
+	self->child_pid2 = create_child(&self->child_pidfd2, 0);
+	EXPECT_GE(self->child_pid2, 0);
+
+	if (self->child_pid2 == 0) {
+		close(ipc_sockets[0]);
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0)
+			_exit(EXIT_FAILURE);
+
+		close(ipc_sockets[1]);
+
+		pause();
+		_exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	/* SIGKILL and reap. */
+	EXPECT_EQ(sys_pidfd_send_signal(self->child_pidfd2, SIGKILL, NULL, 0), 0);
+	EXPECT_EQ(sys_waitid(P_PID, self->child_pid2, NULL, WEXITED), 0);
+
+	self->child_pid3 = create_child(&self->child_pidfd3, CLONE_NEWUSER | CLONE_NEWPID);
+	EXPECT_GE(self->child_pid3, 0);
+
+	if (self->child_pid3 == 0)
+		_exit(EXIT_SUCCESS);
+
+	self->child_pid4 = create_child(&self->child_pidfd4, CLONE_NEWUSER | CLONE_NEWPID);
+	EXPECT_GE(self->child_pid4, 0);
+
+	if (self->child_pid4 == 0)
+		_exit(EXIT_SUCCESS);
+
+	EXPECT_EQ(sys_waitid(P_PID, self->child_pid4, NULL, WEXITED), 0);
+}
+
+FIXTURE_TEARDOWN(pidfd_info)
+{
+	sys_pidfd_send_signal(self->child_pidfd1, SIGKILL, NULL, 0);
+	if (self->child_pidfd1 >= 0)
+		EXPECT_EQ(0, close(self->child_pidfd1));
+
+	sys_waitid(P_PID, self->child_pid1, NULL, WEXITED);
+
+	sys_pidfd_send_signal(self->child_pidfd2, SIGKILL, NULL, 0);
+	if (self->child_pidfd2 >= 0)
+		EXPECT_EQ(0, close(self->child_pidfd2));
+
+	sys_waitid(P_PID, self->child_pid2, NULL, WEXITED);
+	sys_waitid(P_PID, self->child_pid3, NULL, WEXITED);
+	sys_waitid(P_PID, self->child_pid4, NULL, WEXITED);
+}
+
+TEST_F(pidfd_info, sigkill_exit)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+		.mask = PIDFD_INFO_EXIT,
+	};
+
+	ASSERT_EQ(ioctl(self->child_pidfd1, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+}
+
+TEST_F(pidfd_info, sigkill_reaped)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+		.mask = PIDFD_INFO_EXIT,
+	};
+
+	ASSERT_EQ(ioctl(self->child_pidfd2, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+}
+
+TEST_F(pidfd_info, success_exit)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+		.mask = PIDFD_INFO_EXIT,
+	};
+
+	ASSERT_EQ(ioctl(self->child_pidfd3, PIDFD_GET_INFO, &info), 0);
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+}
+
+TEST_F(pidfd_info, success_reaped)
+{
+	struct pidfd_info info = {
+		.mask = PIDFD_INFO_CGROUPID,
+		.mask = PIDFD_INFO_EXIT,
+	};
+
+	ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
+	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
+	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-02-28 12:44 ` [PATCH RFC 06/10] pidfs: allow to retrieve exit information Christian Brauner
@ 2025-03-02  2:40   ` Mike Yuan
  2025-03-02 12:33     ` Christian Brauner
  2025-03-02 15:53   ` Oleg Nesterov
  1 sibling, 1 reply; 25+ messages in thread
From: Mike Yuan @ 2025-03-02  2:40 UTC (permalink / raw)
  To: brauner@kernel.org
  Cc: oleg@redhat.com, linux-fsdevel@vger.kernel.org,
	jlayton@kernel.org, lennart@poettering.net,
	daan.j.demeyer@gmail.com

On 2/28/25 13:44, Christian Brauner <brauner@kernel.org> wrote:

>  Some tools like systemd's jounral need to retrieve the exit and cgroup
>  information after a process has already been reaped. This can e.g.,
>  happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.
>  
>  Signed-off-by: Christian Brauner <brauner@kernel.org>
>  ---
>   fs/pidfs.c                 | 70 +++++++++++++++++++++++++++++++++++++---------
>   include/uapi/linux/pidfd.h |  3 +-
>   2 files changed, 59 insertions(+), 14 deletions(-)
>  
>  diff --git a/fs/pidfs.c b/fs/pidfs.c
>  index 433f676c066c..e500bc4c5af2 100644
>  --- a/fs/pidfs.c
>  +++ b/fs/pidfs.c
>  @@ -32,11 +32,12 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
>    */
>   struct pidfs_exit_info {
>   	__u64 cgroupid;
>  -	__u64 exit_code;
>  +	__s32 exit_code;
>   };
>  
>   struct pidfs_inode {
>  -	struct pidfs_exit_info exit_info;
>  +	struct pidfs_exit_info __pei;
>  +	struct pidfs_exit_info *exit_info;
>   	struct inode vfs_inode;
>   };
>  
>  @@ -228,11 +229,14 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>   	return poll_flags;
>   }
>  
>  -static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
>  +static long pidfd_info(struct file *file, struct task_struct *task,
>  +		       unsigned int cmd, unsigned long arg)
>   {
>   	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
>   	size_t usize = _IOC_SIZE(cmd);
>   	struct pidfd_info kinfo = {};
>  +	struct pidfs_exit_info *exit_info;
>  +	struct inode *inode = file_inode(file);
>   	struct user_namespace *user_ns;
>   	const struct cred *c;
>   	__u64 mask;
>  @@ -248,6 +252,39 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>   	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
>   		return -EFAULT;
>  
>  +	exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
>  +	if (exit_info) {
>  +		/*
>  +		 * TODO: Oleg, I didn't see a reason for putting
>  +		 * retrieval of the exit status of a task behind some
>  +		 * form of permission check. Maybe there's some
>  +		 * potential concerns with seeing the exit status of a
>  +		 * SIGKILLed suid binary or something but even then I'm
>  +		 * not sure that's a problem.
>  +		 *
>  +		 * If we want this we could put this behind some *uid
>  +		 * check similar to what ptrace access does by recording
>  +		 * parts of the creds we'd need for checking this. But
>  +		 * only if we really need it.
>  +		 */
>  +		kinfo.exit_code = exit_info->exit_code;
>  +#ifdef CONFIG_CGROUPS
>  +		kinfo.cgroupid = exit_info->cgroupid;
>  +		kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
>  +#endif
>  +	}
>  +
>  +	/*
>  +	 * If the task has already been reaped only exit information
>  +	 * can be provided. It's entirely possible that the task has
>  +	 * already been reaped but we managed to grab a reference to it
>  +	 * before that. So a full set of information about @task doesn't
>  +	 * mean it hasn't been waited upon. Similarly, a full set of
>  +	 * information doesn't mean that the task hasn't already exited.
>  +	 */
>  +	if (!task)
>  +		goto copy_out;
>  +
>   	c = get_task_cred(task);
>   	if (!c)
>   		return -ESRCH;
>  @@ -267,11 +304,13 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>   	put_cred(c);
>  
>   #ifdef CONFIG_CGROUPS
>  -	rcu_read_lock();
>  -	cgrp = task_dfl_cgroup(task);
>  -	kinfo.cgroupid = cgroup_id(cgrp);
>  -	kinfo.mask |= PIDFD_INFO_CGROUPID;
>  -	rcu_read_unlock();
>  +	if (!kinfo.cgroupid) {
>  +		rcu_read_lock();
>  +		cgrp = task_dfl_cgroup(task);
>  +		kinfo.cgroupid = cgroup_id(cgrp);
>  +		kinfo.mask |= PIDFD_INFO_CGROUPID;
>  +		rcu_read_unlock();
>  +	}
>   #endif
>  
>   	/*
>  @@ -291,6 +330,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
>   	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
>   		return -ESRCH;
>  
>  +copy_out:
>   	/*
>   	 * If userspace and the kernel have the same struct size it can just
>   	 * be copied. If userspace provides an older struct, only the bits that
>  @@ -341,12 +381,13 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>   	}
>  
>   	task = get_pid_task(pid, PIDTYPE_PID);
>  -	if (!task)
>  -		return -ESRCH;

Hmm, this breaks our current assumption/assertion on the API in systemd (see pidfd_get_pid_ioctl() in basic/pidfd-util.c).
Moreover, it now imposes an inconsistency: if the pidfd refers to a process from foreign pidns, the current impl treats it as if the process didn't exist, and returns -ESRCH. Now a truly exited task deviates from that...

I'd prefer to retain the current behavior of returning -ESRCH unless PIDFD_INFO_EXIT is specified in mask, in which case it's then guaranteed that -ESRCH would never be seen. IOW the caller should be explicit on what they want, which feels semantically more reasonable to me and probably even simpler?

>  
>   	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
>   	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
>  -		return pidfd_info(task, cmd, arg);
>  +		return pidfd_info(file, task, cmd, arg);
>  +
>  +	if (!task)
>  +		return -ESRCH;
>  
>   	if (arg)
>   		return -EINVAL;
>  @@ -486,7 +527,7 @@ void pidfs_exit(struct task_struct *tsk)
>   		struct cgroup *cgrp;
>   #endif
>   		inode = d_inode(dentry);
>  -		exit_info = &pidfs_i(inode)->exit_info;
>  +		exit_info = &pidfs_i(inode)->__pei;
>  
>   		/* TODO: Annoy Oleg to tell me how to do this correctly. */
>   		if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
>  @@ -501,6 +542,8 @@ void pidfs_exit(struct task_struct *tsk)
>   		rcu_read_unlock();
>   #endif
>  
>  +		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
>  +		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
>   		dput(dentry);
>   	}
>   }
>  @@ -568,7 +611,8 @@ static struct inode *pidfs_alloc_inode(struct super_block *sb)
>   	if (!pi)
>   		return NULL;
>  
>  -	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
>  +	memset(&pi->__pei, 0, sizeof(pi->__pei));
>  +	pi->exit_info = NULL;
>  
>   	return &pi->vfs_inode;
>   }
>  diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
>  index e0abd0b18841..e5966f1a7743 100644
>  --- a/include/uapi/linux/pidfd.h
>  +++ b/include/uapi/linux/pidfd.h
>  @@ -20,6 +20,7 @@
>   #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
>   #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
>   #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
>  +#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
>  
>   #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
>  
>  @@ -86,7 +87,7 @@ struct pidfd_info {
>   	__u32 sgid;
>   	__u32 fsuid;
>   	__u32 fsgid;
>  -	__u32 spare0[1];
>  +	__s32 exit_code;
>   };
>  
>   #define PIDFS_IOCTL_MAGIC 0xFF
>  
>  --
>  2.47.2
>  
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-02  2:40   ` Mike Yuan
@ 2025-03-02 12:33     ` Christian Brauner
  0 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-03-02 12:33 UTC (permalink / raw)
  To: Mike Yuan
  Cc: oleg@redhat.com, linux-fsdevel@vger.kernel.org,
	jlayton@kernel.org, lennart@poettering.net,
	daan.j.demeyer@gmail.com

On Sun, Mar 02, 2025 at 02:40:16AM +0000, Mike Yuan wrote:
> On 2/28/25 13:44, Christian Brauner <brauner@kernel.org> wrote:
> 
> >  Some tools like systemd's jounral need to retrieve the exit and cgroup
> >  information after a process has already been reaped. This can e.g.,
> >  happen when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD.
> >  
> >  Signed-off-by: Christian Brauner <brauner@kernel.org>
> >  ---
> >   fs/pidfs.c                 | 70 +++++++++++++++++++++++++++++++++++++---------
> >   include/uapi/linux/pidfd.h |  3 +-
> >   2 files changed, 59 insertions(+), 14 deletions(-)
> >  
> >  diff --git a/fs/pidfs.c b/fs/pidfs.c
> >  index 433f676c066c..e500bc4c5af2 100644
> >  --- a/fs/pidfs.c
> >  +++ b/fs/pidfs.c
> >  @@ -32,11 +32,12 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
> >    */
> >   struct pidfs_exit_info {
> >   	__u64 cgroupid;
> >  -	__u64 exit_code;
> >  +	__s32 exit_code;
> >   };
> >  
> >   struct pidfs_inode {
> >  -	struct pidfs_exit_info exit_info;
> >  +	struct pidfs_exit_info __pei;
> >  +	struct pidfs_exit_info *exit_info;
> >   	struct inode vfs_inode;
> >   };
> >  
> >  @@ -228,11 +229,14 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> >   	return poll_flags;
> >   }
> >  
> >  -static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> >  +static long pidfd_info(struct file *file, struct task_struct *task,
> >  +		       unsigned int cmd, unsigned long arg)
> >   {
> >   	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> >   	size_t usize = _IOC_SIZE(cmd);
> >   	struct pidfd_info kinfo = {};
> >  +	struct pidfs_exit_info *exit_info;
> >  +	struct inode *inode = file_inode(file);
> >   	struct user_namespace *user_ns;
> >   	const struct cred *c;
> >   	__u64 mask;
> >  @@ -248,6 +252,39 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >   	if (copy_from_user(&mask, &uinfo->mask, sizeof(mask)))
> >   		return -EFAULT;
> >  
> >  +	exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> >  +	if (exit_info) {
> >  +		/*
> >  +		 * TODO: Oleg, I didn't see a reason for putting
> >  +		 * retrieval of the exit status of a task behind some
> >  +		 * form of permission check. Maybe there's some
> >  +		 * potential concerns with seeing the exit status of a
> >  +		 * SIGKILLed suid binary or something but even then I'm
> >  +		 * not sure that's a problem.
> >  +		 *
> >  +		 * If we want this we could put this behind some *uid
> >  +		 * check similar to what ptrace access does by recording
> >  +		 * parts of the creds we'd need for checking this. But
> >  +		 * only if we really need it.
> >  +		 */
> >  +		kinfo.exit_code = exit_info->exit_code;
> >  +#ifdef CONFIG_CGROUPS
> >  +		kinfo.cgroupid = exit_info->cgroupid;
> >  +		kinfo.mask |= PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID;
> >  +#endif
> >  +	}
> >  +
> >  +	/*
> >  +	 * If the task has already been reaped only exit information
> >  +	 * can be provided. It's entirely possible that the task has
> >  +	 * already been reaped but we managed to grab a reference to it
> >  +	 * before that. So a full set of information about @task doesn't
> >  +	 * mean it hasn't been waited upon. Similarly, a full set of
> >  +	 * information doesn't mean that the task hasn't already exited.
> >  +	 */
> >  +	if (!task)
> >  +		goto copy_out;
> >  +
> >   	c = get_task_cred(task);
> >   	if (!c)
> >   		return -ESRCH;
> >  @@ -267,11 +304,13 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >   	put_cred(c);
> >  
> >   #ifdef CONFIG_CGROUPS
> >  -	rcu_read_lock();
> >  -	cgrp = task_dfl_cgroup(task);
> >  -	kinfo.cgroupid = cgroup_id(cgrp);
> >  -	kinfo.mask |= PIDFD_INFO_CGROUPID;
> >  -	rcu_read_unlock();
> >  +	if (!kinfo.cgroupid) {
> >  +		rcu_read_lock();
> >  +		cgrp = task_dfl_cgroup(task);
> >  +		kinfo.cgroupid = cgroup_id(cgrp);
> >  +		kinfo.mask |= PIDFD_INFO_CGROUPID;
> >  +		rcu_read_unlock();
> >  +	}
> >   #endif
> >  
> >   	/*
> >  @@ -291,6 +330,7 @@ static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long
> >   	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
> >   		return -ESRCH;
> >  
> >  +copy_out:
> >   	/*
> >   	 * If userspace and the kernel have the same struct size it can just
> >   	 * be copied. If userspace provides an older struct, only the bits that
> >  @@ -341,12 +381,13 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >   	}
> >  
> >   	task = get_pid_task(pid, PIDTYPE_PID);
> >  -	if (!task)
> >  -		return -ESRCH;
> 
> Hmm, this breaks our current assumption/assertion on the API in
> systemd (see pidfd_get_pid_ioctl() in basic/pidfd-util.c).
> Moreover, it now imposes an inconsistency: if the pidfd refers to a
> process from foreign pidns, the current impl treats it as if the
> process didn't exist, and returns -ESRCH. Now a truly exited task
> deviates from that...

Thanks for spotting that. It should not be possible to retrieve
PIDFD_INFO_EXIT if the dead task is outside the pid namespace hierarchy.
That's easy to handle though.

> 
> I'd prefer to retain the current behavior of returning -ESRCH unless
> PIDFD_INFO_EXIT is specified in mask, in which case it's then
> guaranteed that -ESRCH would never be seen. IOW the caller should be
> explicit on what they want, which feels semantically more reasonable
> to me and probably even simpler?

Sure.

> 
> >  
> >   	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
> >   	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> >  -		return pidfd_info(task, cmd, arg);
> >  +		return pidfd_info(file, task, cmd, arg);
> >  +
> >  +	if (!task)
> >  +		return -ESRCH;
> >  
> >   	if (arg)
> >   		return -EINVAL;
> >  @@ -486,7 +527,7 @@ void pidfs_exit(struct task_struct *tsk)
> >   		struct cgroup *cgrp;
> >   #endif
> >   		inode = d_inode(dentry);
> >  -		exit_info = &pidfs_i(inode)->exit_info;
> >  +		exit_info = &pidfs_i(inode)->__pei;
> >  
> >   		/* TODO: Annoy Oleg to tell me how to do this correctly. */
> >   		if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
> >  @@ -501,6 +542,8 @@ void pidfs_exit(struct task_struct *tsk)
> >   		rcu_read_unlock();
> >   #endif
> >  
> >  +		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
> >  +		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
> >   		dput(dentry);
> >   	}
> >   }
> >  @@ -568,7 +611,8 @@ static struct inode *pidfs_alloc_inode(struct super_block *sb)
> >   	if (!pi)
> >   		return NULL;
> >  
> >  -	memset(&pi->exit_info, 0, sizeof(pi->exit_info));
> >  +	memset(&pi->__pei, 0, sizeof(pi->__pei));
> >  +	pi->exit_info = NULL;
> >  
> >   	return &pi->vfs_inode;
> >   }
> >  diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> >  index e0abd0b18841..e5966f1a7743 100644
> >  --- a/include/uapi/linux/pidfd.h
> >  +++ b/include/uapi/linux/pidfd.h
> >  @@ -20,6 +20,7 @@
> >   #define PIDFD_INFO_PID			(1UL << 0) /* Always returned, even if not requested */
> >   #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
> >   #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
> >  +#define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
> >  
> >   #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
> >  
> >  @@ -86,7 +87,7 @@ struct pidfd_info {
> >   	__u32 sgid;
> >   	__u32 fsuid;
> >   	__u32 fsgid;
> >  -	__u32 spare0[1];
> >  +	__s32 exit_code;
> >   };
> >  
> >   #define PIDFS_IOCTL_MAGIC 0xFF
> >  
> >  --
> >  2.47.2
> >  
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file()
  2025-02-28 12:44 ` [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
@ 2025-03-02 13:09   ` Oleg Nesterov
  2025-03-02 15:59     ` Christian Brauner
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2025-03-02 13:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 02/28, Christian Brauner wrote:
>
> @@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
>  		return ERR_PTR(ret);
>
>  	pidfd_file = dentry_open(&path, flags, current_cred());
> +	/* Raise PIDFD_THREAD explicitly as dentry_open() strips it. */
                                            ^^^^^^^^^^^^^^^^^^^^^^^
Hmm, does it?

dentry_open(flags) just passes "flags" to alloc_empty_file()->init_file(),
and init_file(flags) does

	f->f_flags      = flags;

so it seems that

> @@ -2042,11 +2042,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>  	if (IS_ERR(pidfd_file))
>  		return PTR_ERR(pidfd_file);
>
> -	/*
> -	 * anon_inode_getfile() ignores everything outside of the
> -	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
> -	 */
> -	pidfd_file->f_flags |= (flags & PIDFD_THREAD);

we can just kill this outdated code?

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 05/10] pidfs: record exit code and cgroupid at exit
  2025-02-28 12:44 ` [PATCH RFC 05/10] pidfs: record exit code and cgroupid at exit Christian Brauner
@ 2025-03-02 15:19   ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2025-03-02 15:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 02/28, Christian Brauner wrote:
>
> +void pidfs_exit(struct task_struct *tsk)
> +{
> +	struct dentry *dentry;
> +
> +	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
> +	if (dentry) {
> +		struct inode *inode;
> +		struct pidfs_exit_info *exit_info;
> +#ifdef CONFIG_CGROUPS
> +		struct cgroup *cgrp;
> +#endif
> +		inode = d_inode(dentry);
> +		exit_info = &pidfs_i(inode)->exit_info;
> +
> +		/* TODO: Annoy Oleg to tell me how to do this correctly. */
> +		if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
> +			exit_info->exit_code = tsk->signal->group_exit_code;
> +		else
> +			exit_info->exit_code = tsk->exit_code;

I think you don't need to check SIGNAL_GROUP_EXIT,

		exit_info->exit_code = tsk->exit_code;

should be fine.

Yes, if SIGNAL_GROUP_EXIT is already set then signal->group_exit_code
can differ.

But this can only happen if the "current" thread exits on its own using
sys_exit() and it races with another thread which does sys_exit_group()
and sets SIGNAL_GROUP_EXIT.

In this case pidfs_exit() can miss SIGNAL_GROUP_EXIT anyway, but we don't
care.  This doesn't differ from the case when current exits, and then
another thread does sys_exit_group() or exec().

Just in case... If current exits because it was killed by sys_exit_group()
from another thread, current->exit_code will be correct, it will be equal
to signal->group_exit_code.

But I am not sure I understand the next patch. let me check...

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-02-28 12:44 ` [PATCH RFC 06/10] pidfs: allow to retrieve exit information Christian Brauner
  2025-03-02  2:40   ` Mike Yuan
@ 2025-03-02 15:53   ` Oleg Nesterov
  2025-03-02 16:29     ` Christian Brauner
  1 sibling, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2025-03-02 15:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 02/28, Christian Brauner wrote:
>
> Some tools like systemd's jounral need to retrieve the exit and cgroup
> information after a process has already been reaped.
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

But unless I am totally confused do_exit() calls pidfd_exit() even
before exit_notify(), the exiting task is not even zombie yet. It
will reaped only when it passes exit_notify() and its parent does
wait().

And what about the multi-threaded case? Suppose the main thread
does sys_exit(0) and it has alive sub-threads.

In this case pidfd_info() will report kinfo.exit_code = 0.
And this is probably fine if (file->f_flags & PIDFD_THREAD) != 0.

But what if this file was created without PIDFD_THREAD? If another
thread does exit_group(1) after that, the process's exit code is
1 << 8, but it can't be retrieved.

Finally, sys_execve(). Suppose we have a main thread L and a
sub-thread T.

T execs and kill the leader L. L exits and populates
pidfs_i(inode)->exit_info.

T calls exchange_tids() in de_thread() and becomes the new leader
with the same (old) pid.

Now, T is very much alive, but pidfs_i(inode)->exit_info != NULL.

Or I am totally confused?

> +	exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> +	if (exit_info) {
> +		/*
> +		 * TODO: Oleg, I didn't see a reason for putting
> +		 * retrieval of the exit status of a task behind some
> +		 * form of permission check.

Neither me.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file()
  2025-03-02 13:09   ` Oleg Nesterov
@ 2025-03-02 15:59     ` Christian Brauner
  2025-03-02 16:05       ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2025-03-02 15:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Sun, Mar 02, 2025 at 02:09:36PM +0100, Oleg Nesterov wrote:
> On 02/28, Christian Brauner wrote:
> >
> > @@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
> >  		return ERR_PTR(ret);
> >
> >  	pidfd_file = dentry_open(&path, flags, current_cred());
> > +	/* Raise PIDFD_THREAD explicitly as dentry_open() strips it. */
>                                             ^^^^^^^^^^^^^^^^^^^^^^^
> Hmm, does it?
> 
> dentry_open(flags) just passes "flags" to alloc_empty_file()->init_file(),
> and init_file(flags) does
> 
> 	f->f_flags      = flags;
> 
> so it seems that

dentry_open()
-> do_dentry_open()
   {
           f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
	           f->f_iocb_flags = iocb_flags(f);
   }

> 
> > @@ -2042,11 +2042,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
> >  	if (IS_ERR(pidfd_file))
> >  		return PTR_ERR(pidfd_file);
> >
> > -	/*
> > -	 * anon_inode_getfile() ignores everything outside of the
> > -	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
> > -	 */
> > -	pidfd_file->f_flags |= (flags & PIDFD_THREAD);
> 
> we can just kill this outdated code?

Unfortunately not.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file()
  2025-03-02 15:59     ` Christian Brauner
@ 2025-03-02 16:05       ` Oleg Nesterov
  2025-03-02 16:29         ` Christian Brauner
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2025-03-02 16:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/02, Christian Brauner wrote:
>
> On Sun, Mar 02, 2025 at 02:09:36PM +0100, Oleg Nesterov wrote:
> > On 02/28, Christian Brauner wrote:
> > >
> > > @@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
> > >  		return ERR_PTR(ret);
> > >
> > >  	pidfd_file = dentry_open(&path, flags, current_cred());
> > > +	/* Raise PIDFD_THREAD explicitly as dentry_open() strips it. */
> >                                             ^^^^^^^^^^^^^^^^^^^^^^^
> > Hmm, does it?
> >
> > dentry_open(flags) just passes "flags" to alloc_empty_file()->init_file(),
> > and init_file(flags) does
> >
> > 	f->f_flags      = flags;
> >
>
> dentry_open()
> -> do_dentry_open()
>    {
>            f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

Ah, indeed, thanks ;) so perhaps you can update the comment,
s/dentry_open/do_dentry_open/ to make it more clear?

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-02 15:53   ` Oleg Nesterov
@ 2025-03-02 16:29     ` Christian Brauner
  2025-03-02 17:21       ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2025-03-02 16:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Sun, Mar 02, 2025 at 04:53:46PM +0100, Oleg Nesterov wrote:
> On 02/28, Christian Brauner wrote:
> >
> > Some tools like systemd's jounral need to retrieve the exit and cgroup
> > information after a process has already been reaped.
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> But unless I am totally confused do_exit() calls pidfd_exit() even
> before exit_notify(), the exiting task is not even zombie yet. It
> will reaped only when it passes exit_notify() and its parent does
> wait().

The overall goal is that it's possible to retrieve exit status and
cgroupid even if the task has already been reaped.

It's intentionally placed before exit_notify(), i.e., before the task is
a zombie because exit_notify() wakes pidfd-pollers. Ideally, pidfd
pollers would be woken and then could use the PIDFD_GET_INFO ioctl to
retrieve the exit status.

It would however be fine to place it into exit_notify() if it's a better
fit there. If you have a preference let me know.

I don't see a reason why seeing the exit status before that would be an
issue. The only downside would be that some other task that just keeps
ioctl()ing in a loop would possible see the exit status before the
parent does. But I didn't think this would be a big issue.

> And what about the multi-threaded case? Suppose the main thread
> does sys_exit(0) and it has alive sub-threads.
> 
> In this case pidfd_info() will report kinfo.exit_code = 0.
> And this is probably fine if (file->f_flags & PIDFD_THREAD) != 0.

Yes.

> But what if this file was created without PIDFD_THREAD? If another
> thread does exit_group(1) after that, the process's exit code is
> 1 << 8, but it can't be retrieved.

Yes, I had raised that in an off-list discussion about this as well and
was unsure what the cleanest way of dealing with this would be.

My initial approach had been to not just have:

struct pidfs_exit_info {
        __u64 cgroupid;
        __s32 exit_code;
};

but to have:

struct pidfs_exit_info {
        __u64 cgroupid;
        __s32 exit_code;
	__u64 tg_cgroupid;
	__s32 tg_exit_code;
};

so that it would be possible to retrieve either depending on the type of
pidfd. Is that feasible?

> Finally, sys_execve(). Suppose we have a main thread L and a
> sub-thread T.
> 
> T execs and kill the leader L. L exits and populates
> pidfs_i(inode)->exit_info.
> 
> T calls exchange_tids() in de_thread() and becomes the new leader
> with the same (old) pid.
> 
> Now, T is very much alive, but pidfs_i(inode)->exit_info != NULL.

Yes, de_thread() is a good point. That nasty wrinkly I had really
ignored^wforgotten. We should not report an exit status in this case. I
think that's what you're agreeing with as well?

> Or I am totally confused?

No, you're right!

What's the best way of handling the de_thread() case? Would moving this
into exit_notify() be enough where we also handle
PIDFD_THREAD/~PIDFD_THREAD waking?

Thanks for the review!

> 
> 
> 
> > +	exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
> > +	if (exit_info) {
> > +		/*
> > +		 * TODO: Oleg, I didn't see a reason for putting
> > +		 * retrieval of the exit status of a task behind some
> > +		 * form of permission check.
> 
> Neither me.
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file()
  2025-03-02 16:05       ` Oleg Nesterov
@ 2025-03-02 16:29         ` Christian Brauner
  0 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-03-02 16:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Sun, Mar 02, 2025 at 05:05:23PM +0100, Oleg Nesterov wrote:
> On 03/02, Christian Brauner wrote:
> >
> > On Sun, Mar 02, 2025 at 02:09:36PM +0100, Oleg Nesterov wrote:
> > > On 02/28, Christian Brauner wrote:
> > > >
> > > > @@ -696,6 +696,10 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
> > > >  		return ERR_PTR(ret);
> > > >
> > > >  	pidfd_file = dentry_open(&path, flags, current_cred());
> > > > +	/* Raise PIDFD_THREAD explicitly as dentry_open() strips it. */
> > >                                             ^^^^^^^^^^^^^^^^^^^^^^^
> > > Hmm, does it?
> > >
> > > dentry_open(flags) just passes "flags" to alloc_empty_file()->init_file(),
> > > and init_file(flags) does
> > >
> > > 	f->f_flags      = flags;
> > >
> >
> > dentry_open()
> > -> do_dentry_open()
> >    {
> >            f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
> 
> Ah, indeed, thanks ;) so perhaps you can update the comment,
> s/dentry_open/do_dentry_open/ to make it more clear?

Will do!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-02 16:29     ` Christian Brauner
@ 2025-03-02 17:21       ` Oleg Nesterov
  2025-03-02 18:56         ` Christian Brauner
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2025-03-02 17:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On 03/02, Christian Brauner wrote:
>
> On Sun, Mar 02, 2025 at 04:53:46PM +0100, Oleg Nesterov wrote:
> > On 02/28, Christian Brauner wrote:
> > >
> > > Some tools like systemd's jounral need to retrieve the exit and cgroup
> > > information after a process has already been reaped.
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > But unless I am totally confused do_exit() calls pidfd_exit() even
> > before exit_notify(), the exiting task is not even zombie yet. It
> > will reaped only when it passes exit_notify() and its parent does
> > wait().
>
> The overall goal is that it's possible to retrieve exit status and
> cgroupid even if the task has already been reaped.

OK, please see below...

> It's intentionally placed before exit_notify(), i.e., before the task is
> a zombie because exit_notify() wakes pidfd-pollers. Ideally, pidfd
> pollers would be woken and then could use the PIDFD_GET_INFO ioctl to
> retrieve the exit status.

This was more a less clear to me. But this doesn't match the "the task has
already been reaped" goal above...

> It would however be fine to place it into exit_notify() if it's a better
> fit there. If you have a preference let me know.
>
> I don't see a reason why seeing the exit status before that would be an
> issue.

The problem is that it is not clear how can we do this correctly.
Especialy considering the problem with exec...

> > But what if this file was created without PIDFD_THREAD? If another
> > thread does exit_group(1) after that, the process's exit code is
> > 1 << 8, but it can't be retrieved.
>
> Yes, I had raised that in an off-list discussion about this as well and
> was unsure what the cleanest way of dealing with this would be.

I am not sure too, but again, please see below.

> > Now, T is very much alive, but pidfs_i(inode)->exit_info != NULL.

...

> What's the best way of handling the de_thread() case? Would moving this
> into exit_notify() be enough where we also handle
> PIDFD_THREAD/~PIDFD_THREAD waking?

I don't think that moving pidfd_exit() into exit_notify() can solve any
problem.

But what if we move pidfd_exit() into release_task() paths? Called when
the task is reaped by the parent/debugger, or if a sub-thread auto-reaps.

Can the users of pidfd_info(PIDFD_INFO_EXIT) rely on POLLHUP from
release_task() -> detach_pid() -> __change_pid(new => NULL) ?

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-02 17:21       ` Oleg Nesterov
@ 2025-03-02 18:56         ` Christian Brauner
  2025-03-02 20:24           ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2025-03-02 18:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

On Sun, Mar 02, 2025 at 06:21:49PM +0100, Oleg Nesterov wrote:
> On 03/02, Christian Brauner wrote:
> >
> > On Sun, Mar 02, 2025 at 04:53:46PM +0100, Oleg Nesterov wrote:
> > > On 02/28, Christian Brauner wrote:
> > > >
> > > > Some tools like systemd's jounral need to retrieve the exit and cgroup
> > > > information after a process has already been reaped.
> > >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > But unless I am totally confused do_exit() calls pidfd_exit() even
> > > before exit_notify(), the exiting task is not even zombie yet. It
> > > will reaped only when it passes exit_notify() and its parent does
> > > wait().
> >
> > The overall goal is that it's possible to retrieve exit status and
> > cgroupid even if the task has already been reaped.
> 
> OK, please see below...
> 
> > It's intentionally placed before exit_notify(), i.e., before the task is
> > a zombie because exit_notify() wakes pidfd-pollers. Ideally, pidfd
> > pollers would be woken and then could use the PIDFD_GET_INFO ioctl to
> > retrieve the exit status.
> 
> This was more a less clear to me. But this doesn't match the "the task has
> already been reaped" goal above...
> 
> > It would however be fine to place it into exit_notify() if it's a better
> > fit there. If you have a preference let me know.
> >
> > I don't see a reason why seeing the exit status before that would be an
> > issue.
> 
> The problem is that it is not clear how can we do this correctly.
> Especialy considering the problem with exec...
> 
> > > But what if this file was created without PIDFD_THREAD? If another
> > > thread does exit_group(1) after that, the process's exit code is
> > > 1 << 8, but it can't be retrieved.
> >
> > Yes, I had raised that in an off-list discussion about this as well and
> > was unsure what the cleanest way of dealing with this would be.
> 
> I am not sure too, but again, please see below.
> 
> > > Now, T is very much alive, but pidfs_i(inode)->exit_info != NULL.
> 
> ...
> 
> > What's the best way of handling the de_thread() case? Would moving this
> > into exit_notify() be enough where we also handle
> > PIDFD_THREAD/~PIDFD_THREAD waking?
> 
> I don't think that moving pidfd_exit() into exit_notify() can solve any
> problem.
> 
> But what if we move pidfd_exit() into release_task() paths? Called when
> the task is reaped by the parent/debugger, or if a sub-thread auto-reaps.
> 
> Can the users of pidfd_info(PIDFD_INFO_EXIT) rely on POLLHUP from
> release_task() -> detach_pid() -> __change_pid(new => NULL) ?

Ok, so:

release_task()
-> __exit_signal()
   -> detach_pid()
      -> __change_pid()

That sounds good. So could we do something like:

diff --git a/kernel/exit.c b/kernel/exit.c
index cae475e7858c..66bb5c53454f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -127,8 +127,10 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 {
        nr_threads--;
        detach_pid(p, PIDTYPE_PID);
+       pidfs_exit(p); // record exit information for individual thread
        if (group_dead) {
                detach_pid(p, PIDTYPE_TGID);
+               pidfs_exit(p); // record exit information for thread-group leader
                detach_pid(p, PIDTYPE_PGID);
                detach_pid(p, PIDTYPE_SID);

I know, as written this won't work but I'm just trying to get the idea
across of recording exit information for both the individual thread and
the thread-group leader in __unhash_process().

That should tackle both problems, i.e., recording exit information for
both thread and thread-group leader as well as exec?

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-02 18:56         ` Christian Brauner
@ 2025-03-02 20:24           ` Oleg Nesterov
  2025-03-03  9:06             ` Lennart Poettering
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2025-03-02 20:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan

Christian,

I am already sleeping. I'll try to reply right now, but quite possibly
I will need to correct myself tomorrow ;)

On 03/02, Christian Brauner wrote:
>
> Ok, so:
>
> release_task()
> -> __exit_signal()
>    -> detach_pid()
>       -> __change_pid()
>
> That sounds good. So could we do something like:

Yes, this is what I meant, except...

> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -127,8 +127,10 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
>  {
>         nr_threads--;
>         detach_pid(p, PIDTYPE_PID);
> +       pidfs_exit(p); // record exit information for individual thread

To me it would be better to do this in the caller, release_task().
But this is minor and I won't insist. Please see below.

>         if (group_dead) {
>                 detach_pid(p, PIDTYPE_TGID);
> +               pidfs_exit(p); // record exit information for thread-group leader

This looks pointless, task_pid(p) is the same.

> I know, as written this won't work but I'm just trying to get the idea
> across of recording exit information for both the individual thread and
> the thread-group leader in __unhash_process().
>
> That should tackle both problems, i.e., recording exit information for
> both thread and thread-group leader as well as exec?

This will fix the problem with mt-exec, but this won't help to discriminate
the leader-exit and the-whole-group-exit cases...

With this this (or something like this) change pidfd_info() can only report
the exit code of the already reaped thread/process, leader or not.

I mean... If the leader L exits using sys_exit() and it has the live sub-
threads, release_task(L) / __unhash_process(L) will be only called when
the last sub-thread exits and it (or debugger) does "goto repeat;" in
release_task() to finally reap the leader.

IOW. If someone does sys_pidfd_create(group-leader-pid, PIDFD_THREAD),
pidfd_info() won't report PIDFD_INFO_EXIT if the leader has exited using
sys_exit() before other threads.

But perhaps this is fine?

Let me repeat, I have no idea how and why people use pidfd ;)

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-02 20:24           ` Oleg Nesterov
@ 2025-03-03  9:06             ` Lennart Poettering
  2025-03-03 11:32               ` Christian Brauner
  0 siblings, 1 reply; 25+ messages in thread
From: Lennart Poettering @ 2025-03-03  9:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, linux-fsdevel, Jeff Layton, Daan De Meyer,
	Mike Yuan

On So, 02.03.25 21:24, Oleg Nesterov (oleg@redhat.com) wrote:

> This will fix the problem with mt-exec, but this won't help to discriminate
> the leader-exit and the-whole-group-exit cases...
>
> With this this (or something like this) change pidfd_info() can only report
> the exit code of the already reaped thread/process, leader or not.
>
> I mean... If the leader L exits using sys_exit() and it has the live sub-
> threads, release_task(L) / __unhash_process(L) will be only called when
> the last sub-thread exits and it (or debugger) does "goto repeat;" in
> release_task() to finally reap the leader.
>
> IOW. If someone does sys_pidfd_create(group-leader-pid, PIDFD_THREAD),
> pidfd_info() won't report PIDFD_INFO_EXIT if the leader has exited using
> sys_exit() before other threads.
>
> But perhaps this is fine?

I think this is fine, but I'd really like a way how userspace can
determine this state reliably. i.e. a zombie state where the exit
status is not available yet is a bit strange by classic UNIX
standards on some level, no?

But I guess that might not be a pidfd specific issue. i.e. I figure
classic waitid() with WNOHANG failing on a zombie process that is set
up like that is a bit weird too, no? Or how does that work there?
(pretty sure some userspace might not be expecting that...)

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information
  2025-03-03  9:06             ` Lennart Poettering
@ 2025-03-03 11:32               ` Christian Brauner
  0 siblings, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2025-03-03 11:32 UTC (permalink / raw)
  To: Lennart Poettering, Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Daan De Meyer, Mike Yuan

On Mon, Mar 03, 2025 at 10:06:31AM +0100, Lennart Poettering wrote:
> On So, 02.03.25 21:24, Oleg Nesterov (oleg@redhat.com) wrote:
> 
> > This will fix the problem with mt-exec, but this won't help to discriminate
> > the leader-exit and the-whole-group-exit cases...
> >
> > With this this (or something like this) change pidfd_info() can only report
> > the exit code of the already reaped thread/process, leader or not.

Yes, that's fine. I don't think we need to report exit status
information right after the task has exited. It's fine to only provide
it once it has been reaped and it makes things simpler afaict.

Pidfd polling allows waiting on either task exit or for a task to have
been reaped. So the contract for PIDFD_INFO_EXIT is simply that EPOLLHUP
must be observed before exit information can be retrieved.

This aligns with wait() as well, where reaping of a thread-group leader
that exited before the thread-group was empty is delayed until the
thread-group is empty.

I think that with PIDFD_INFO_EXIT autoreaping might actually become
usable because it means a parent can ignore SIGCHLD or set SA_NOCLDWAIT
and simply use pidfd polling and PIDFD_INFO_EXIT to get get status
information from its children. But the kernel will autocleanup right
away instead of delaying. If it's a subreaper there's probably some
wrinkle with grand-children that get reparented to it? But for the
non-subreaper case it should be very useful.

> > I mean... If the leader L exits using sys_exit() and it has the live sub-
> > threads, release_task(L) / __unhash_process(L) will be only called when
> > the last sub-thread exits and it (or debugger) does "goto repeat;" in
> > release_task() to finally reap the leader.
> >
> > IOW. If someone does sys_pidfd_create(group-leader-pid, PIDFD_THREAD),
> > pidfd_info() won't report PIDFD_INFO_EXIT if the leader has exited using
> > sys_exit() before other threads.
> >
> > But perhaps this is fine?
> 
> I think this is fine, but I'd really like a way how userspace can
> determine this state reliably. i.e. a zombie state where the exit
> status is not available yet is a bit strange by classic UNIX
> standards on some level, no?
> 
> But I guess that might not be a pidfd specific issue. i.e. I figure
> classic waitid() with WNOHANG failing on a zombie process that is set
> up like that is a bit weird too, no? Or how does that work there?
> (pretty sure some userspace might not be expecting that...)

Yes, how I read the code WNOHANG exhibits the same behavior (so does WNOWAIT):

        if (exit_state == EXIT_ZOMBIE) {
                /* we don't reap group leaders with subthreads */
                if (!delay_group_leader(p)) {
                        /*
                         * A zombie ptracee is only visible to its ptracer.
                         * Notification and reaping will be cascaded to the
                         * real parent when the ptracer detaches.
                         */
                        if (unlikely(ptrace) || likely(!p->ptrace))
                                return wait_task_zombie(wo, p);
                }

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-03-03 11:32 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-28 12:44 [PATCH RFC 00/10] pidfs: provide information after task has been reaped Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 01/10] pidfs: switch to copy_struct_to_user() Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 02/10] pidfd: rely on automatic cleanup in __pidfd_prepare() Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 03/10] pidfs: move setting flags into pidfs_alloc_file() Christian Brauner
2025-03-02 13:09   ` Oleg Nesterov
2025-03-02 15:59     ` Christian Brauner
2025-03-02 16:05       ` Oleg Nesterov
2025-03-02 16:29         ` Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 04/10] pidfs: add inode allocation Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 05/10] pidfs: record exit code and cgroupid at exit Christian Brauner
2025-03-02 15:19   ` Oleg Nesterov
2025-02-28 12:44 ` [PATCH RFC 06/10] pidfs: allow to retrieve exit information Christian Brauner
2025-03-02  2:40   ` Mike Yuan
2025-03-02 12:33     ` Christian Brauner
2025-03-02 15:53   ` Oleg Nesterov
2025-03-02 16:29     ` Christian Brauner
2025-03-02 17:21       ` Oleg Nesterov
2025-03-02 18:56         ` Christian Brauner
2025-03-02 20:24           ` Oleg Nesterov
2025-03-03  9:06             ` Lennart Poettering
2025-03-03 11:32               ` Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 07/10] selftests/pidfd: fix header inclusion Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 08/10] pidfs/selftests: ensure correct headers for ioctl handling Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 09/10] selftests/pidfd: move more defines to common header Christian Brauner
2025-02-28 12:44 ` [PATCH RFC 10/10] selftests/pidfd: add PIDFD_INFO_EXIT tests Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).