public inbox for linux-fsdevel@vger.kernel.org
* [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL
@ 2026-02-26 13:50 Christian Brauner
  2026-02-26 13:50 ` [PATCH v5 1/6] clone: add CLONE_AUTOREAP Christian Brauner
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:50 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add three new clone3() flags for pidfd-based process lifecycle
management.

=== CLONE_AUTOREAP ===

CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to the
existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD
which applies to all children of a given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child still
auto-reaps when it eventually exits. This is cleaner than forcing the
subreaper to receive SIGCHLD and then reap the child. If the parent
doesn't care, the subreaper won't care either. If a subreaper does care,
it would be easy enough to add a prctl() that either turns SIGCHLD back
on and turns auto-reaping off, or one that notifies the subreaper
whenever a child is reparented to it.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an invisible
zombie under a parent that never calls wait().

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.
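
As a rough userspace illustration of the CLONE_PIDFD | CLONE_AUTOREAP
combination described above, the sketch below spawns a child that exits
immediately, waits for the pidfd to become readable, and then checks
that waitpid() finds nothing to reap. The CLONE_AUTOREAP value is the
one proposed in this series; struct clone3_args_v0 and
spawn_autoreaped_child() are hypothetical names used only for this
example, and the fallback syscall number is the x86-64/asm-generic one.
On kernels without the flag clone3() is expected to fail with EINVAL,
which the helper simply passes back.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435	/* x86-64 and asm-generic architectures */
#endif
#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000
#endif
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP 0x400000000ULL	/* value proposed in this series */
#endif

/* Hypothetical local copy of the first 64 bytes of struct clone_args. */
struct clone3_args_v0 {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
};

/*
 * Returns 0 if the child exited without leaving a zombie, or a negative
 * errno value (e.g. -EINVAL when the kernel does not know the flag).
 */
static int spawn_autoreaped_child(void)
{
	int pidfd = -1;
	struct clone3_args_v0 args = {
		.flags		= CLONE_PIDFD | CLONE_AUTOREAP,
		.pidfd		= (uint64_t)(uintptr_t)&pidfd,
		.exit_signal	= 0,	/* must be zero with CLONE_AUTOREAP */
	};
	struct pollfd pfd;
	long pid;
	int ret;

	pid = syscall(SYS_clone3, &args, sizeof(args));
	if (pid < 0)
		return -errno;
	if (pid == 0)
		_exit(0);	/* child: exit right away */

	/* The pidfd becomes readable once the child has exited. */
	pfd.fd = pidfd;
	pfd.events = POLLIN;
	ret = poll(&pfd, 1, 5000);
	if (ret != 1) {
		ret = ret < 0 ? -errno : -ETIMEDOUT;
		close(pidfd);
		return ret;
	}
	close(pidfd);

	/* With autoreap there is nothing left for waitpid() to collect. */
	if (waitpid((pid_t)pid, NULL, WNOHANG) == -1 && errno == ECHILD)
		return 0;
	return -EBUSY;
}
```

A caller would treat -EINVAL or -ENOSYS as "not supported here" and fall
back to a regular fork() plus waitpid().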

=== CLONE_NNP ===

CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself, CLONE_NNP
allows the parent to impose no_new_privs on the child at creation
without affecting the parent's own privileges. CLONE_THREAD is rejected
because threads share credentials. CLONE_NNP is useful on its own for
any spawn-and-sandbox pattern but was specifically introduced to enable
unprivileged usage of CLONE_PIDFD_AUTOKILL.

=== CLONE_PIDFD_AUTOKILL ===

This flag ties a child's lifetime to the pidfd returned from clone3().
When the last reference to the struct file created by clone3() is closed
the kernel sends SIGKILL to the child. A pidfd obtained via pidfd_open()
for the same process does not keep the child alive and does not trigger
autokill - only the specific struct file from clone3() has this
property. This is useful for container runtimes, service managers, and
sandboxed subprocess execution - any scenario where the child must die
if the parent crashes or abandons the pidfd, or where the parent just
wants a throwaway helper process.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It
requires CLONE_PIDFD because the whole point is tying the child's
lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child
with no one to reap it would become a zombie - the primary use case is
the parent crashing or abandoning the pidfd so no one is around to call
waitpid(). CLONE_THREAD is rejected because autokill targets a process
not a thread.

If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being spawned.
If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the caller must
have CAP_SYS_ADMIN in its user namespace.
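
The combination rules for the three flags can be summarised in a small
userspace sketch. This is not the kernel code: check_lifecycle_flags()
and its has_cap_sys_admin parameter are hypothetical stand-ins for the
checks in copy_process() and for ns_capable(). The three new flag values
are the ones proposed in this series.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD		0x00001000ULL
#endif
#ifndef CLONE_PARENT
#define CLONE_PARENT		0x00008000ULL
#endif
#ifndef CLONE_THREAD
#define CLONE_THREAD		0x00010000ULL
#endif
/* Values proposed by this series. */
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP		0x400000000ULL
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL	0x800000000ULL
#endif
#ifndef CLONE_NNP
#define CLONE_NNP		0x1000000000ULL
#endif

/*
 * Hypothetical model of the flag validation described in the cover
 * letter. Returns 0 if the combination is accepted, -EINVAL for an
 * invalid combination and -EPERM for a missing privilege.
 */
static int check_lifecycle_flags(uint64_t flags, uint64_t exit_signal,
				 bool has_cap_sys_admin)
{
	if (flags & CLONE_AUTOREAP) {
		/* Autoreap is a process-level property. */
		if (flags & (CLONE_THREAD | CLONE_PARENT))
			return -EINVAL;
		/* No exit signal is delivered. */
		if (exit_signal)
			return -EINVAL;
	}

	if ((flags & CLONE_NNP) && (flags & CLONE_THREAD))
		return -EINVAL;

	if (flags & CLONE_PIDFD_AUTOKILL) {
		if (!(flags & CLONE_PIDFD) || !(flags & CLONE_AUTOREAP))
			return -EINVAL;
		if (flags & CLONE_THREAD)
			return -EINVAL;
		/* Unprivileged callers must also pass CLONE_NNP. */
		if (!(flags & CLONE_NNP) && !has_cap_sys_admin)
			return -EPERM;
	}

	return 0;
}
```

For example, an unprivileged caller passing CLONE_PIDFD | CLONE_AUTOREAP
| CLONE_PIDFD_AUTOKILL gets -EPERM in this model, and adding CLONE_NNP
makes the same call succeed.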

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v5:
- Split no_new_privs into separate CLONE_NNP flag instead of having
  CLONE_PIDFD_AUTOKILL implicitly set it.
- CLONE_PIDFD_AUTOKILL now requires either CLONE_NNP or CAP_SYS_ADMIN.
- Link to v4: https://patch.msgid.link/20260223-work-pidfs-autoreap-v4-0-e393c08c09d1@kernel.org

Changes in v4:
- Set no_new_privs on child when CLONE_PIDFD_AUTOKILL is used. This
  prevents the child from escalating privileges via setuid/setgid exec
  and eliminates the need for magical resets during credential changes.
  The parent retains full privileges.
- Replace autokill_pidfd pointer with PIDFD_AUTOKILL file flag checked
  in pidfs_file_release(). This eliminates the need for pointer
  comparison, stale pointer concerns, and WRITE_ONCE/READ_ONCE pairing
  (Oleg, Jann).
- Reject CLONE_AUTOREAP | CLONE_PARENT to prevent a CLONE_AUTOREAP
  child from creating silent zombies via clone(CLONE_PARENT) (Oleg).
- Link to v3: https://patch.msgid.link/20260217-work-pidfs-autoreap-v3-0-33a403c20111@kernel.org

Changes in v2:
- Add CLONE_PIDFD_AUTOKILL flag
- Decouple CLONE_AUTOREAP from CLONE_PIDFD: the autoreap mechanism has
  no dependency on pidfds. This allows fire-and-forget patterns where
  the parent does not need exit status.
- Link to v1: https://patch.msgid.link/20260216-work-pidfs-autoreap-v1-0-e63f663008f2@kernel.org

---
Christian Brauner (6):
      clone: add CLONE_AUTOREAP
      clone: add CLONE_NNP
      pidfd: add CLONE_PIDFD_AUTOKILL
      selftests/pidfd: add CLONE_AUTOREAP tests
      selftests/pidfd: add CLONE_NNP tests
      selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests

 fs/pidfs.c                                         |  38 +-
 include/linux/sched/signal.h                       |   1 +
 include/uapi/linux/pidfd.h                         |   1 +
 include/uapi/linux/sched.h                         |   3 +
 kernel/fork.c                                      |  49 +-
 kernel/ptrace.c                                    |   3 +-
 kernel/signal.c                                    |   4 +
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 900 +++++++++++++++++++++
 10 files changed, 991 insertions(+), 11 deletions(-)
---
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
change-id: 20260214-work-pidfs-autoreap-3ee677e240a8



* [PATCH v5 1/6] clone: add CLONE_AUTOREAP
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
@ 2026-02-26 13:50 ` Christian Brauner
  2026-02-26 13:51 ` [PATCH v5 2/6] clone: add CLONE_NNP Christian Brauner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:50 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add a new clone3() flag CLONE_AUTOREAP that makes a child process
auto-reap on exit without ever becoming a zombie. This is a per-process
property in contrast to the existing auto-reap mechanism via
SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a
given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child still
auto-reaps when it eventually exits.
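
To make the reparenting argument concrete, here is a deliberately
simplified userspace model of the decision, not the kernel
implementation. The struct and function names are hypothetical. The
ptrace exception mirrors the !tsk->ptrace check added to
do_notify_parent() below, and parent_model.ignores_sigchld stands in
for the existing SA_NOCLDWAIT / SIG_IGN mechanism.

```c
#include <stdbool.h>

enum exit_state_model {
	MODEL_EXIT_ZOMBIE,	/* stays around until someone waits */
	MODEL_EXIT_DEAD,	/* reaped immediately */
};

/* Hypothetical model types, not kernel structures. */
struct parent_model {
	bool ignores_sigchld;	/* SA_NOCLDWAIT or SIG_IGN for SIGCHLD */
};

struct child_model {
	bool ptraced;
	bool autoreap;		/* set once at clone time, lives on the child */
};

static enum exit_state_model
state_after_exit(const struct child_model *child,
		 const struct parent_model *parent)
{
	/* A traced task is not auto-reaped while it is being traced. */
	if (child->ptraced)
		return MODEL_EXIT_ZOMBIE;
	/* Per-child property: independent of whoever the parent is now. */
	if (child->autoreap)
		return MODEL_EXIT_DEAD;
	/* Parent-scoped property: depends on the current parent. */
	if (parent->ignores_sigchld)
		return MODEL_EXIT_DEAD;
	return MODEL_EXIT_ZOMBIE;
}
```

In this model an autoreap child ends up in MODEL_EXIT_DEAD whether it is
evaluated against its original parent or against a subreaper, while a
child that relied on the parent's SIGCHLD disposition becomes a zombie
as soon as it is reparented to a subreaper that handles SIGCHLD
normally.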

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern where the parent simply doesn't care about the child's exit
status. No exit signal is delivered so exit_signal must be zero.

CLONE_AUTOREAP is rejected in combination with CLONE_PARENT. If a
CLONE_AUTOREAP child were to clone(CLONE_PARENT) the new grandchild
would inherit exit_signal == 0 from the autoreap parent's group leader
but without signal->autoreap. This grandchild would become a zombie that
never sends a signal and is never autoreaped - confusing and arguably
broken behavior.

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

Link: https://github.com/uapi-group/kernel-features/issues/45
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/sched/signal.h |  1 +
 include/uapi/linux/sched.h   |  1 +
 kernel/fork.c                | 14 +++++++++++++-
 kernel/ptrace.c              |  3 ++-
 kernel/signal.c              |  4 ++++
 5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a22248aebcf9..f842c86b806f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -132,6 +132,7 @@ struct signal_struct {
 	 */
 	unsigned int		is_child_subreaper:1;
 	unsigned int		has_child_subreaper:1;
+	unsigned int		autoreap:1;
 
 #ifdef CONFIG_POSIX_TIMERS
 
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..8a22ea640817 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -36,6 +36,7 @@
 /* Flags for the clone3() syscall. */
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index e832da9d15a4..0dedf2999f0c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2028,6 +2028,15 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_AUTOREAP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_PARENT)
+			return ERR_PTR(-EINVAL);
+		if (args->exit_signal)
+			return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2435,6 +2444,8 @@ __latent_entropy struct task_struct *copy_process(
 			 */
 			p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper ||
 							 p->real_parent->signal->is_child_subreaper;
+			if (clone_flags & CLONE_AUTOREAP)
+				p->signal->autoreap = 1;
 			list_add_tail(&p->sibling, &p->real_parent->children);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
 			attach_pid(p, PIDTYPE_TGID);
@@ -2897,7 +2908,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
-	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
+	      CLONE_AUTOREAP))
 		return false;
 
 	/*
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 392ec2f75f01..68c17daef8d4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -549,7 +549,8 @@ static bool __ptrace_detach(struct task_struct *tracer, struct task_struct *p)
 	if (!dead && thread_group_empty(p)) {
 		if (!same_thread_group(p->real_parent, tracer))
 			dead = do_notify_parent(p, p->exit_signal);
-		else if (ignoring_children(tracer->sighand)) {
+		else if (ignoring_children(tracer->sighand) ||
+			 p->signal->autoreap) {
 			__wake_up_parent(p, tracer);
 			dead = true;
 		}
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..e61f39fa8c8a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2251,6 +2251,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
 			sig = 0;
 	}
+	if (!tsk->ptrace && tsk->signal->autoreap) {
+		autoreap = true;
+		sig = 0;
+	}
 	/*
 	 * Send with __send_signal as si_pid and si_uid are in the
 	 * parent's namespaces.

-- 
2.47.3



* [PATCH v5 2/6] clone: add CLONE_NNP
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
  2026-02-26 13:50 ` [PATCH v5 1/6] clone: add CLONE_AUTOREAP Christian Brauner
@ 2026-02-26 13:51 ` Christian Brauner
  2026-02-26 13:51 ` [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:51 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add a new clone3() flag CLONE_NNP that sets no_new_privs on the child
process at clone time. This is analogous to prctl(PR_SET_NO_NEW_PRIVS)
but applied at process creation rather than requiring a separate step
after the child starts running.

CLONE_NNP is rejected with CLONE_THREAD. It is conceptually much simpler
to force the whole thread group into NNP than to have individual threads
running around with NNP.
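
For comparison, this is roughly what the existing two-step approach
looks like from userspace: the child has to set no_new_privs on itself
after it has started running. The sketch uses only the existing
prctl(PR_SET_NO_NEW_PRIVS) and prctl(PR_GET_NO_NEW_PRIVS) interfaces;
child_sets_nnp() is a hypothetical helper name. With CLONE_NNP the
prctl() call in the child would no longer be needed.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that sets no_new_privs on itself and exits with the value
 * it reads back. Returns that value (1 when the bit was set), or -1 on
 * error.
 */
static int child_sets_nnp(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Step that CLONE_NNP would fold into process creation. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) != 0)
			_exit(255);
		_exit(prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL));
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	if (WEXITSTATUS(status) == 255)
		return -1;
	return WEXITSTATUS(status);
}
```

The parent's own no_new_privs state is untouched by this, which is the
same property CLONE_NNP preserves.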

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 8a22ea640817..7b1b87473ebb 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,6 +37,7 @@
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
+#define CLONE_NNP 0x1000000000ULL /* Set no_new_privs on child. */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index 0dedf2999f0c..a3202ee278d8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2037,6 +2037,11 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_NNP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2421,6 +2426,9 @@ __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);
 
+	if (clone_flags & CLONE_NNP)
+		task_set_no_new_privs(p);
+
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
@@ -2909,7 +2917,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
 	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
-	      CLONE_AUTOREAP))
+	      CLONE_AUTOREAP | CLONE_NNP))
 		return false;
 
 	/*

-- 
2.47.3



* [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
  2026-02-26 13:50 ` [PATCH v5 1/6] clone: add CLONE_AUTOREAP Christian Brauner
  2026-02-26 13:51 ` [PATCH v5 2/6] clone: add CLONE_NNP Christian Brauner
@ 2026-02-26 13:51 ` Christian Brauner
  2026-03-02 17:16   ` Jann Horn
  2026-02-26 13:51 ` [PATCH v5 4/6] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:51 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
lifetime to the pidfd returned from clone3(). When the last reference to
the struct file created by clone3() is closed the kernel sends SIGKILL
to the child. A pidfd obtained via pidfd_open() for the same process
does not keep the child alive and does not trigger autokill - only the
specific struct file from clone3() has this property.

This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no
one to reap it would become a zombie). CLONE_THREAD is rejected because
autokill targets a process not a thread.

The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
the struct file at clone3() time. The pidfs .release handler checks this
flag and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...)
only when it is set. Files from pidfd_open() or open_by_handle_at() are
distinct struct files that do not carry this flag. dup()/fork() share the
same struct file so they extend the child's lifetime until the last
reference drops.
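
The reference semantics can be pictured with a toy userspace model. It
is only an illustration under simplified assumptions: struct file_model
and the helpers below are hypothetical and merely mimic a refcounted
struct file with an autokill flag, and a boolean stands in for
delivering SIGKILL.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted struct file on pidfs. */
struct file_model {
	int refcount;
	bool autokill;		/* models the PIDFD_AUTOKILL file flag */
	bool *child_alive;	/* flipped to false in place of SIGKILL */
};

/* Models the struct file created by clone3() with autokill requested. */
static struct file_model clone3_file(bool *child_alive)
{
	struct file_model f = {
		.refcount = 1, .autokill = true, .child_alive = child_alive,
	};
	return f;
}

/* Models a separate struct file from pidfd_open(): no autokill flag. */
static struct file_model pidfd_open_file(bool *child_alive)
{
	struct file_model f = {
		.refcount = 1, .autokill = false, .child_alive = child_alive,
	};
	return f;
}

/* dup() or fork(): another reference to the same struct file. */
static void file_get(struct file_model *f)
{
	f->refcount++;
}

/* close(): the release step only runs when the last reference drops. */
static void file_put(struct file_model *f)
{
	if (--f->refcount > 0)
		return;
	if (f->autokill)
		*f->child_alive = false;
}
```

Closing the pidfd_open() file or one of two dup()ed descriptors leaves
the child alive in this model; only dropping the final reference to the
clone3() file kills it.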

CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
CLONE_NNP the child could escalate privileges via setuid/setgid exec
after being spawned, so the caller must have CAP_SYS_ADMIN in its user
namespace. With CLONE_NNP the child can never gain new privileges so
unprivileged usage is allowed. This is a deliberate departure from the
pdeath_signal model which is reset during secureexec and commit_creds()
rendering it useless for container runtimes that need to deprivilege
themselves.
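
For readers less familiar with the pdeath_signal model mentioned above,
the following sketch shows how it is requested today through the
existing prctl(PR_SET_PDEATHSIG) interface. The child has to opt in
itself after it starts, and the setting is read back here via
prctl(PR_GET_PDEATHSIG). child_reads_back_pdeathsig() is a hypothetical
helper name for this example.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that asks for SIGKILL on parent death and exits with the
 * signal number it reads back. Returns that number, or -1 on error.
 */
static int child_reads_back_pdeathsig(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		int sig = 0;

		/* The child opts in on its own behalf. */
		if (prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL) != 0)
			_exit(255);
		if (prctl(PR_GET_PDEATHSIG, &sig) != 0)
			_exit(255);
		_exit(sig);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	if (WEXITSTATUS(status) == 255)
		return -1;
	return WEXITSTATUS(status);
}
```

As noted above, this setting is reset during secureexec and
commit_creds(), whereas the autokill property is attached to the struct
file held by the parent.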

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c                 | 38 ++++++++++++++++++++++++++++++++------
 include/uapi/linux/pidfd.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 29 ++++++++++++++++++++++++++---
 4 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 318253344b5c..a8d1bca0395d 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -8,6 +8,8 @@
 #include <linux/mount.h>
 #include <linux/pid.h>
 #include <linux/pidfs.h>
+#include <linux/sched/signal.h>
+#include <linux/signal.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -637,7 +639,28 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return open_namespace(ns_common);
 }
 
+static int pidfs_file_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+
+	if (!(file->f_flags & PIDFD_AUTOKILL))
+		return 0;
+
+	guard(rcu)();
+	task = pid_task(pid, PIDTYPE_TGID);
+	if (!task)
+		return 0;
+
+	/* Not available for kthreads or user workers for now. */
+	if (WARN_ON_ONCE(task->flags & (PF_KTHREAD | PF_USER_WORKER)))
+		return 0;
+	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
+	return 0;
+}
+
 static const struct file_operations pidfs_file_operations = {
+	.release	= pidfs_file_release,
 	.poll		= pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= pidfd_show_fdinfo,
@@ -1093,11 +1116,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	int ret;
 
 	/*
-	 * Ensure that PIDFD_STALE can be passed as a flag without
-	 * overloading other uapi pidfd flags.
+	 * Ensure that internal pidfd flags don't overlap with each
+	 * other or with uapi pidfd flags.
 	 */
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_THREAD);
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_NONBLOCK);
+	BUILD_BUG_ON(hweight32(PIDFD_THREAD | PIDFD_NONBLOCK |
+				PIDFD_STALE | PIDFD_AUTOKILL) != 4);
 
 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
 	if (ret < 0)
@@ -1108,9 +1131,12 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	flags &= ~PIDFD_STALE;
 	flags |= O_RDWR;
 	pidfd_file = dentry_open(&path, flags, current_cred());
-	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
+	/*
+	 * Raise PIDFD_THREAD and PIDFD_AUTOKILL explicitly as
+	 * do_dentry_open() strips O_EXCL and O_TRUNC.
+	 */
 	if (!IS_ERR(pidfd_file))
-		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
+		pidfd_file->f_flags |= (flags & (PIDFD_THREAD | PIDFD_AUTOKILL));
 
 	return pidfd_file;
 }
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index ea9a6811fc76..9281956a9f32 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -13,6 +13,7 @@
 #ifdef __KERNEL__
 #include <linux/sched.h>
 #define PIDFD_STALE CLONE_PIDFD
+#define PIDFD_AUTOKILL O_TRUNC
 #endif
 
 /* Flags for pidfd_send_signal(). */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 7b1b87473ebb..0aafb4652afc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,6 +37,7 @@
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL /* Kill child when clone pidfd closes. */
 #define CLONE_NNP 0x1000000000ULL /* Set no_new_privs on child. */
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index a3202ee278d8..0f4944ce378d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2042,6 +2042,24 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_PIDFD_AUTOKILL) {
+		if (!(clone_flags & CLONE_PIDFD))
+			return ERR_PTR(-EINVAL);
+		if (!(clone_flags & CLONE_AUTOREAP))
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		/*
+		 * Without CLONE_NNP the child could escalate privileges
+		 * after being spawned, so require CAP_SYS_ADMIN.
+		 * With CLONE_NNP the child can't gain new privileges,
+		 * so allow unprivileged usage.
+		 */
+		if (!(clone_flags & CLONE_NNP) &&
+		    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2264,13 +2282,18 @@ __latent_entropy struct task_struct *copy_process(
 	 * if the fd table isn't shared).
 	 */
 	if (clone_flags & CLONE_PIDFD) {
-		int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
+		unsigned flags = PIDFD_STALE;
+
+		if (clone_flags & CLONE_THREAD)
+			flags |= PIDFD_THREAD;
+		if (clone_flags & CLONE_PIDFD_AUTOKILL)
+			flags |= PIDFD_AUTOKILL;
 
 		/*
 		 * Note that no task has been attached to @pid yet indicate
 		 * that via CLONE_PIDFD.
 		 */
-		retval = pidfd_prepare(pid, flags | PIDFD_STALE, &pidfile);
+		retval = pidfd_prepare(pid, flags, &pidfile);
 		if (retval < 0)
 			goto bad_fork_free_pid;
 		pidfd = retval;
@@ -2917,7 +2940,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
 	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
-	      CLONE_AUTOREAP | CLONE_NNP))
+	      CLONE_AUTOREAP | CLONE_NNP | CLONE_PIDFD_AUTOKILL))
 		return false;
 
 	/*

-- 
2.47.3



* [PATCH v5 4/6] selftests/pidfd: add CLONE_AUTOREAP tests
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (2 preceding siblings ...)
  2026-02-26 13:51 ` [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
@ 2026-02-26 13:51 ` Christian Brauner
  2026-02-26 13:51 ` [PATCH v5 5/6] selftests/pidfd: add CLONE_NNP tests Christian Brauner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:51 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add tests for the new CLONE_AUTOREAP clone3() flag:

- autoreap_without_pidfd: CLONE_AUTOREAP without CLONE_PIDFD works
  (fire-and-forget)
- autoreap_rejects_exit_signal: CLONE_AUTOREAP with non-zero
  exit_signal fails
- autoreap_rejects_parent: CLONE_AUTOREAP with CLONE_PARENT fails
- autoreap_rejects_thread: CLONE_AUTOREAP with CLONE_THREAD fails
- autoreap_basic: child exits, pidfd poll works, PIDFD_GET_INFO returns
  correct exit code, waitpid() returns -ECHILD
- autoreap_signaled: child killed by signal, exit info correct via pidfd
- autoreap_reparent: autoreap grandchild reparented to subreaper still
  auto-reaps
- autoreap_multithreaded: autoreap process with sub-threads auto-reaps
  after last thread exits
- autoreap_no_inherit: grandchild forked without CLONE_AUTOREAP becomes
  a regular zombie

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 496 +++++++++++++++++++++
 3 files changed, 498 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 144e7ff65d6a..4cd8ec7fd349 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -12,3 +12,4 @@ pidfd_info_test
 pidfd_exec_helper
 pidfd_xattr_test
 pidfd_setattr_test
+pidfd_autoreap_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 764a8f9ecefa..4211f91e9af8 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -4,7 +4,7 @@ CFLAGS += -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES) -pthread -Wall
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
 	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test \
-	pidfd_xattr_test pidfd_setattr_test
+	pidfd_xattr_test pidfd_setattr_test pidfd_autoreap_test
 
 TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper
 
diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
new file mode 100644
index 000000000000..e230d2fe4a64
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -0,0 +1,496 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2026 Christian Brauner <brauner@kernel.org>
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <poll.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "kselftest_harness.h"
+
+#ifndef CLONE_AUTOREAP
+#define CLONE_AUTOREAP 0x400000000ULL
+#endif
+
+static pid_t create_autoreap_child(int *pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_AUTOREAP,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(pidfd),
+	};
+
+	return sys_clone3(&args, sizeof(args));
+}
+
+/*
+ * Test that CLONE_AUTOREAP works without CLONE_PIDFD (fire-and-forget).
+ */
+TEST(autoreap_without_pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_AUTOREAP,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+	int ret;
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(0);
+
+	/*
+	 * Give the child a moment to exit and be autoreaped.
+	 * Then verify no zombie remains.
+	 */
+	usleep(200000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with a non-zero exit_signal fails.
+ */
+TEST(autoreap_rejects_exit_signal)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_AUTOREAP,
+		.exit_signal	= SIGCHLD,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with CLONE_PARENT fails.
+ */
+TEST(autoreap_rejects_parent)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_AUTOREAP | CLONE_PARENT,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with CLONE_THREAD fails.
+ */
+TEST(autoreap_rejects_thread)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_AUTOREAP | CLONE_THREAD |
+				  CLONE_SIGHAND | CLONE_VM,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Basic test: create an autoreap child, let it exit, verify:
+ * - pidfd becomes readable (poll returns POLLIN)
+ * - PIDFD_GET_INFO returns the correct exit code
+ * - waitpid() returns -1/ECHILD (no zombie)
+ */
+TEST(autoreap_basic)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(42);
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Wait for the child to exit via pidfd poll. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Verify exit info via PIDFD_GET_INFO. */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	/*
+	 * exit_code is in waitpid format: for _exit(42),
+	 * WIFEXITED is true and WEXITSTATUS is 42.
+	 */
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 42);
+
+	/* Verify no zombie: waitpid should fail with ECHILD. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+/*
+ * Test that an autoreap child killed by a signal reports
+ * the correct exit info.
+ */
+TEST(autoreap_signaled)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Kill the child. */
+	ret = sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
+	ASSERT_EQ(ret, 0);
+
+	/* Wait for exit via pidfd. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Verify signal info. */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+
+	/* No zombie. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+/*
+ * Test autoreap survives reparenting: middle process creates an
+ * autoreap grandchild, then exits. The grandchild gets reparented
+ * to us (the grandparent, which is a subreaper). When the grandchild
+ * exits, it should still be autoreaped - no zombie under us.
+ */
+TEST(autoreap_reparent)
+{
+	int ipc_sockets[2], ret;
+	int pidfd = -1;
+	struct pollfd pfd;
+	pid_t mid_pid, grandchild_pid;
+	char buf[32] = {};
+
+	/* Make ourselves a subreaper so reparented children come to us. */
+	ret = prctl(PR_SET_CHILD_SUBREAPER, 1);
+	ASSERT_EQ(ret, 0);
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	mid_pid = fork();
+	ASSERT_GE(mid_pid, 0);
+
+	if (mid_pid == 0) {
+		/* Middle child: create an autoreap grandchild. */
+		int gc_pidfd = -1;
+
+		close(ipc_sockets[0]);
+
+		grandchild_pid = create_autoreap_child(&gc_pidfd);
+		if (grandchild_pid < 0) {
+			write_nointr(ipc_sockets[1], "E", 1);
+			close(ipc_sockets[1]);
+			_exit(1);
+		}
+
+		if (grandchild_pid == 0) {
+			/* Grandchild: wait for signal to exit. */
+			close(ipc_sockets[1]);
+			if (gc_pidfd >= 0)
+				close(gc_pidfd);
+			pause();
+			_exit(0);
+		}
+
+		/* Send grandchild PID to grandparent. */
+		snprintf(buf, sizeof(buf), "%d", grandchild_pid);
+		write_nointr(ipc_sockets[1], buf, strlen(buf));
+		close(ipc_sockets[1]);
+		if (gc_pidfd >= 0)
+			close(gc_pidfd);
+
+		/* Middle child exits, grandchild gets reparented. */
+		_exit(0);
+	}
+
+	close(ipc_sockets[1]);
+
+	/* Read grandchild's PID. */
+	ret = read_nointr(ipc_sockets[0], buf, sizeof(buf) - 1);
+	close(ipc_sockets[0]);
+	ASSERT_GT(ret, 0);
+
+	if (buf[0] == 'E') {
+		waitpid(mid_pid, NULL, 0);
+		prctl(PR_SET_CHILD_SUBREAPER, 0);
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	}
+
+	grandchild_pid = atoi(buf);
+	ASSERT_GT(grandchild_pid, 0);
+
+	/* Wait for the middle child to exit. */
+	ret = waitpid(mid_pid, NULL, 0);
+	ASSERT_EQ(ret, mid_pid);
+
+	/*
+	 * Now the grandchild is reparented to us (subreaper).
+	 * Open a pidfd for the grandchild and kill it.
+	 */
+	pidfd = sys_pidfd_open(grandchild_pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	ret = sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
+	ASSERT_EQ(ret, 0);
+
+	/* Wait for it to exit via pidfd poll. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/*
+	 * The grandchild should have been autoreaped even though
+	 * we (the new parent) haven't set SA_NOCLDWAIT.
+	 * waitpid should return -1/ECHILD.
+	 */
+	ret = waitpid(grandchild_pid, NULL, WNOHANG);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ECHILD);
+
+	close(pidfd);
+
+	/* Clean up subreaper status. */
+	prctl(PR_SET_CHILD_SUBREAPER, 0);
+}
+
+static int thread_sock_fd;
+
+static void *thread_func(void *arg)
+{
+	/* Signal parent we're running. */
+	write_nointr(thread_sock_fd, "1", 1);
+
+	/* Give main thread time to call _exit() first. */
+	usleep(200000);
+
+	return NULL;
+}
+
+/*
+ * Test that an autoreap child with multiple threads is properly
+ * autoreaped only after all threads have exited.
+ */
+TEST(autoreap_multithreaded)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int ipc_sockets[2], ret;
+	int pidfd = -1;
+	struct pollfd pfd;
+	pid_t pid;
+	char c;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL) {
+		close(ipc_sockets[0]);
+		close(ipc_sockets[1]);
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	}
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pthread_t thread;
+
+		close(ipc_sockets[0]);
+
+		/*
+		 * Create a sub-thread that outlives the main thread.
+		 * The thread signals readiness, then sleeps.
+		 * The main thread waits briefly, then calls _exit().
+		 */
+		thread_sock_fd = ipc_sockets[1];
+		pthread_create(&thread, NULL, thread_func, NULL);
+		pthread_detach(thread);
+
+		/* Wait for thread to be running. */
+		usleep(100000);
+
+		/* Main thread exits; sub-thread is still alive. */
+		_exit(99);
+	}
+
+	close(ipc_sockets[1]);
+
+	/* Wait for the sub-thread to signal readiness. */
+	ret = read_nointr(ipc_sockets[0], &c, 1);
+	close(ipc_sockets[0]);
+	ASSERT_EQ(ret, 1);
+
+	/* Wait for the process to fully exit via pidfd poll. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Verify exit info. */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 99);
+
+	/* No zombie. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+/*
+ * Test that autoreap is NOT inherited by grandchildren.
+ */
+TEST(autoreap_no_inherit)
+{
+	int ipc_sockets[2], ret;
+	int pidfd = -1;
+	pid_t pid;
+	char buf[2] = {};
+	struct pollfd pfd;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL) {
+		close(ipc_sockets[0]);
+		close(ipc_sockets[1]);
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	}
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pid_t gc;
+		int status;
+
+		close(ipc_sockets[0]);
+
+		/* Autoreap child forks a grandchild (without autoreap). */
+		gc = fork();
+		if (gc < 0) {
+			write_nointr(ipc_sockets[1], "E", 1);
+			_exit(1);
+		}
+		if (gc == 0) {
+			/* Grandchild: exit immediately. */
+			close(ipc_sockets[1]);
+			_exit(77);
+		}
+
+		/*
+		 * The grandchild should become a regular zombie
+		 * since it was NOT created with CLONE_AUTOREAP.
+		 * Wait for it to verify.
+		 */
+		ret = waitpid(gc, &status, 0);
+		if (ret == gc && WIFEXITED(status) &&
+		    WEXITSTATUS(status) == 77) {
+			write_nointr(ipc_sockets[1], "P", 1);
+		} else {
+			write_nointr(ipc_sockets[1], "F", 1);
+		}
+		close(ipc_sockets[1]);
+		_exit(0);
+	}
+
+	close(ipc_sockets[1]);
+
+	ret = read_nointr(ipc_sockets[0], buf, 1);
+	close(ipc_sockets[0]);
+	ASSERT_EQ(ret, 1);
+
+	/*
+	 * 'P' means the autoreap child was able to waitpid() its
+	 * grandchild (correct - grandchild should be a normal zombie,
+	 * not autoreaped).
+	 */
+	ASSERT_EQ(buf[0], 'P');
+
+	/* Wait for the autoreap child to exit. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	/* Autoreap child itself should be autoreaped. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3
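
For comparison, the "no zombie" check used throughout these tests can be
sketched against the existing parent-wide mechanism (SIG_IGN for SIGCHLD),
since CLONE_AUTOREAP itself is only proposed in this series. This is a
minimal sketch under stated assumptions: the helper name
legacy_autoreap_demo() and the pipe used as a start gate are illustrative
and not part of the selftest. It returns 1 when pidfd_open() is not
available.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch).
 * Returns 0 if the child's exit was seen via pidfd poll and waitpid()
 * then failed with ECHILD, 1 if pidfd_open() is unsupported, -1 on error.
 */
static int legacy_autoreap_demo(void)
{
	struct sigaction ign = { .sa_handler = SIG_IGN }, old;
	struct pollfd pfd;
	int gate[2], pidfd, saved, ret = -1;
	pid_t pid;
	char c;

	/* Parent-wide auto-reap: affects every child of this process. */
	if (sigaction(SIGCHLD, &ign, &old) < 0)
		return -1;
	if (pipe(gate) < 0)
		goto out;

	pid = fork();
	if (pid < 0) {
		close(gate[0]);
		close(gate[1]);
		goto out;
	}
	if (pid == 0) {
		close(gate[1]);
		/* Block until the parent closes its write end (EOF). */
		if (read(gate[0], &c, 1) < 0)
			_exit(1);
		_exit(42);
	}
	close(gate[0]);

	/* Open the pidfd while the child is guaranteed to be alive. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	saved = errno;
	close(gate[1]);
	if (pidfd < 0) {
		/* With SIGCHLD ignored this blocks until the child is gone. */
		waitpid(pid, NULL, 0);
		ret = (saved == ENOSYS) ? 1 : -1;
		goto out;
	}

	pfd.fd = pidfd;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, 5000) == 1 && (pfd.revents & POLLIN)) {
		errno = 0;
		/* No zombie is left behind, so there is nothing to wait for. */
		if (waitpid(pid, NULL, WNOHANG) == -1 && errno == ECHILD)
			ret = 0;
	}
	close(pidfd);
out:
	sigaction(SIGCHLD, &old, NULL);
	return ret;
}
```

Unlike CLONE_AUTOREAP, this changes the disposition for all children of the
caller, which is the limitation the cover letter describes.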



* [PATCH v5 5/6] selftests/pidfd: add CLONE_NNP tests
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (3 preceding siblings ...)
  2026-02-26 13:51 ` [PATCH v5 4/6] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
@ 2026-02-26 13:51 ` Christian Brauner
  2026-02-26 13:51 ` [PATCH v5 6/6] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
  2026-02-28 12:49 ` [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Oleg Nesterov
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:51 UTC (permalink / raw)
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add tests for the new CLONE_NNP flag:

- nnp_sets_no_new_privs: Verify a child created with CLONE_NNP has
  no_new_privs set while the parent does not.

- nnp_rejects_thread: Verify CLONE_NNP | CLONE_THREAD is rejected
  with -EINVAL since threads share credentials.

- autoreap_no_new_privs_unset: Verify a plain CLONE_AUTOREAP child
  does not get no_new_privs.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 126 +++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
index e230d2fe4a64..5fb11230fb07 100644
--- a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -26,6 +26,10 @@
 #define CLONE_AUTOREAP 0x400000000ULL
 #endif
 
+#ifndef CLONE_NNP
+#define CLONE_NNP 0x1000000000ULL
+#endif
+
 static pid_t create_autoreap_child(int *pidfd)
 {
 	struct __clone_args args = {
@@ -493,4 +497,126 @@ TEST(autoreap_no_inherit)
 	close(pidfd);
 }
 
+/*
+ * Test that CLONE_NNP sets no_new_privs on the child.
+ * The child checks via prctl(PR_GET_NO_NEW_PRIVS) and reports back.
+ * The parent must NOT have no_new_privs set afterwards.
+ */
+TEST(nnp_sets_no_new_privs)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_AUTOREAP | CLONE_NNP,
+		.exit_signal	= 0,
+	};
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	/* Ensure parent does not already have no_new_privs. */
+	ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Parent already has no_new_privs set, cannot run test");
+	}
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_NNP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/*
+		 * Child: check no_new_privs. Exit 0 if set, 1 if not.
+		 */
+		ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+		_exit(ret == 1 ? 0 : 1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Parent must still NOT have no_new_privs. */
+	ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Parent got no_new_privs after creating CLONE_NNP child");
+	}
+
+	/* Wait for child to exit. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	/* Verify child exited with 0 (no_new_privs was set). */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0) {
+		TH_LOG("Child did not have no_new_privs set");
+	}
+
+	close(pidfd);
+}
+
+/*
+ * Test that CLONE_NNP with CLONE_THREAD fails with EINVAL.
+ */
+TEST(nnp_rejects_thread)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_NNP | CLONE_THREAD |
+				  CLONE_SIGHAND | CLONE_VM,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that a plain CLONE_AUTOREAP child does NOT get no_new_privs.
+ * Only CLONE_NNP should set it.
+ */
+TEST(autoreap_no_new_privs_unset)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/*
+		 * Child: check no_new_privs. Exit 0 if NOT set, 1 if set.
+		 */
+		ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
+		_exit(ret == 0 ? 0 : 1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0) {
+		TH_LOG("Plain autoreap child unexpectedly has no_new_privs");
+	}
+
+	close(pidfd);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3
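
The property checked by nnp_sets_no_new_privs (child has no_new_privs,
parent does not) can be sketched with interfaces that exist today. The
sketch below assumes prctl(PR_SET_NO_NEW_PRIVS) called by the child after
fork() as a stand-in for CLONE_NNP, which is only proposed in this series.
The helper name nnp_is_per_task() is hypothetical.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch).
 * Returns 0 if a forked child could set no_new_privs on itself while the
 * parent's own value stayed unchanged, -1 otherwise.
 */
static int nnp_is_per_task(void)
{
	int before, after, status;
	pid_t pid;

	before = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
	if (before < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: set the bit, then read it back. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
			_exit(2);
		_exit(prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1 ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* The child's change must not be visible in the parent. */
	after = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
	return after == before ? 0 : -1;
}
```

The difference with CLONE_NNP is that the bit would already be set when the
child starts running, so there is no window in which the child runs without
it.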



* [PATCH v5 6/6] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (4 preceding siblings ...)
  2026-02-26 13:51 ` [PATCH v5 5/6] selftests/pidfd: add CLONE_NNP tests Christian Brauner
@ 2026-02-26 13:51 ` Christian Brauner
  2026-02-28 12:49 ` [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Oleg Nesterov
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:51 UTC (permalink / raw)
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add tests for CLONE_PIDFD_AUTOKILL:

- autokill_basic: Verify closing the clone3 pidfd kills the child.
- autokill_requires_pidfd: Verify AUTOKILL without CLONE_PIDFD fails.
- autokill_requires_autoreap: Verify AUTOKILL without CLONE_AUTOREAP
  fails.
- autokill_rejects_thread: Verify AUTOKILL with CLONE_THREAD fails.
- autokill_pidfd_open_no_effect: Verify only the clone3 pidfd triggers
  autokill, not pidfd_open().
- autokill_requires_cap_sys_admin: Verify AUTOKILL without CLONE_NNP
  fails with -EPERM for an unprivileged caller.
- autokill_without_nnp_with_cap: Verify AUTOKILL without CLONE_NNP
  succeeds with CAP_SYS_ADMIN.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 278 +++++++++++++++++++++
 1 file changed, 278 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
index 5fb11230fb07..36adee6c424e 100644
--- a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -26,10 +26,37 @@
 #define CLONE_AUTOREAP 0x400000000ULL
 #endif
 
+#ifndef CLONE_PIDFD_AUTOKILL
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL
+#endif
+
 #ifndef CLONE_NNP
 #define CLONE_NNP 0x1000000000ULL
 #endif
 
+#ifndef _LINUX_CAPABILITY_VERSION_3
+#define _LINUX_CAPABILITY_VERSION_3 0x20080522
+#endif
+
+struct cap_header {
+	__u32 version;
+	int pid;
+};
+
+struct cap_data {
+	__u32 effective;
+	__u32 permitted;
+	__u32 inheritable;
+};
+
+static int drop_all_caps(void)
+{
+	struct cap_header hdr = { .version = _LINUX_CAPABILITY_VERSION_3 };
+	struct cap_data data[2] = {};
+
+	return syscall(__NR_capset, &hdr, data);
+}
+
 static pid_t create_autoreap_child(int *pidfd)
 {
 	struct __clone_args args = {
@@ -619,4 +646,255 @@ TEST(autoreap_no_new_privs_unset)
 	close(pidfd);
 }
 
+/*
+ * Helper: create a child with CLONE_PIDFD | CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP | CLONE_NNP.
+ */
+static pid_t create_autokill_child(int *pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+				  CLONE_AUTOREAP | CLONE_NNP,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(pidfd),
+	};
+
+	return sys_clone3(&args, sizeof(args));
+}
+
+/*
+ * Basic autokill test: child blocks in pause(), parent closes the
+ * clone3 pidfd, child should be killed and autoreaped.
+ */
+TEST(autokill_basic)
+{
+	int pidfd = -1, pollfd_fd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autokill_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/*
+	 * Open a second pidfd via pidfd_open() so we can observe the
+	 * child's death after closing the clone3 pidfd.
+	 */
+	pollfd_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pollfd_fd, 0);
+
+	/* Close the clone3 pidfd — this should trigger autokill. */
+	close(pidfd);
+
+	/* Wait for the child to die via the pidfd_open'd fd. */
+	pfd.fd = pollfd_fd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Child should be autoreaped — no zombie. */
+	usleep(100000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pollfd_fd);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL without CLONE_PIDFD must fail with EINVAL.
+ */
+TEST(autokill_requires_pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL without CLONE_AUTOREAP must fail with EINVAL.
+ */
+TEST(autokill_requires_autoreap)
+{
+	int pidfd = -1;
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(&pidfd),
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL with CLONE_THREAD must fail with EINVAL.
+ */
+TEST(autokill_rejects_thread)
+{
+	int pidfd = -1;
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+				  CLONE_AUTOREAP | CLONE_THREAD |
+				  CLONE_SIGHAND | CLONE_VM,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(&pidfd),
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that only the clone3 pidfd triggers autokill, not pidfd_open().
+ * Close the pidfd_open'd fd first — child should survive.
+ * Then close the clone3 pidfd — child should be killed and autoreaped.
+ */
+TEST(autokill_pidfd_open_no_effect)
+{
+	int pidfd = -1, open_fd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autokill_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Open a second pidfd via pidfd_open(). */
+	open_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(open_fd, 0);
+
+	/*
+	 * Close the pidfd_open'd fd — child should survive because
+	 * only the clone3 pidfd has autokill.
+	 */
+	close(open_fd);
+	usleep(200000);
+
+	/* Verify child is still alive by polling the clone3 pidfd. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Child died after closing pidfd_open fd — should still be alive");
+	}
+
+	/* Open another observation fd before triggering autokill. */
+	open_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(open_fd, 0);
+
+	/* Now close the clone3 pidfd — this triggers autokill. */
+	close(pidfd);
+
+	pfd.fd = open_fd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Child should be autoreaped — no zombie. */
+	usleep(100000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(open_fd);
+}
+
+/*
+ * Test that CLONE_PIDFD_AUTOKILL without CLONE_NNP fails with EPERM
+ * for an unprivileged caller.
+ */
+TEST(autokill_requires_cap_sys_admin)
+{
+	int pidfd = -1, ret;
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+				  CLONE_AUTOREAP,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(&pidfd),
+	};
+	pid_t pid;
+
+	/* Drop all capabilities so we lack CAP_SYS_ADMIN. */
+	ret = drop_all_caps();
+	ASSERT_EQ(ret, 0);
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EPERM);
+}
+
+/*
+ * Test that CLONE_PIDFD_AUTOKILL without CLONE_NNP succeeds with
+ * CAP_SYS_ADMIN.
+ */
+TEST(autokill_without_nnp_with_cap)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+				  CLONE_AUTOREAP,
+		.exit_signal	= 0,
+	};
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	if (geteuid() != 0)
+		SKIP(return, "Need root/CAP_SYS_ADMIN");
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(0);
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Wait for child to exit. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+
+	close(pidfd);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3
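
The capability drop performed by drop_all_caps() above can also be
sketched with the uapi structures from <linux/capability.h> instead of
locally defined ones. This is a sketch under assumptions: it runs the drop
in a forked child so the caller keeps its capabilities, and reads the sets
back with capget() to confirm they are empty. The helper name
drop_caps_in_child() is hypothetical.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch).
 * Returns 0 if a forked child dropped all capabilities and capget()
 * confirmed empty effective and permitted sets, -1 otherwise.
 */
static int drop_caps_in_child(void)
{
	int status;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		struct __user_cap_header_struct hdr = {
			.version = _LINUX_CAPABILITY_VERSION_3,
			.pid = 0,	/* 0 means the calling task */
		};
		/* Version 3 uses two 32-bit words per capability set. */
		struct __user_cap_data_struct data[2] = { { 0 } };

		/* All-zero sets are always a subset of the current ones. */
		if (syscall(SYS_capset, &hdr, data) != 0)
			_exit(2);

		if (syscall(SYS_capget, &hdr, data) != 0)
			_exit(3);

		_exit((data[0].effective | data[1].effective |
		       data[0].permitted | data[1].permitted) == 0 ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```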



* Re: [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL
  2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (5 preceding siblings ...)
  2026-02-26 13:51 ` [PATCH v5 6/6] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
@ 2026-02-28 12:49 ` Oleg Nesterov
  2026-03-02 10:28   ` Christian Brauner
  6 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2026-02-28 12:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On 02/26, Christian Brauner wrote:
>
> Christian Brauner (6):
>       clone: add CLONE_AUTOREAP
>       clone: add CLONE_NNP
>       pidfd: add CLONE_PIDFD_AUTOKILL

Well, I still think copy_process should deny
"CLONE_PARENT && current->signal->autoreap", in this case the
new child will have .exit_signal == 0 without signal->autoreap.
But this is really minor.

FWIW, I see no technical problems in 1-3, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>



* Re: [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL
  2026-02-28 12:49 ` [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Oleg Nesterov
@ 2026-03-02 10:28   ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2026-03-02 10:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Sat, Feb 28, 2026 at 01:49:56PM +0100, Oleg Nesterov wrote:
> On 02/26, Christian Brauner wrote:
> >
> > Christian Brauner (6):
> >       clone: add CLONE_AUTOREAP
> >       clone: add CLONE_NNP
> >       pidfd: add CLONE_PIDFD_AUTOKILL
> 
> Well, I still think copy_process should deny
> "CLONE_PARENT && current->signal->autoreap", in this case the
> new child will have .exit_signal == 0 without signal->autoreap.
> But this is really minor.

Sorry, my bad. I've added:

if ((clone_flags & CLONE_PARENT) && current->signal->autoreap)
    return ERR_PTR(-EINVAL);


* Re: [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-26 13:51 ` [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
@ 2026-03-02 17:16   ` Jann Horn
  0 siblings, 0 replies; 10+ messages in thread
From: Jann Horn @ 2026-03-02 17:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Thu, Feb 26, 2026 at 2:51 PM Christian Brauner <brauner@kernel.org> wrote:
> Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> lifetime to the pidfd returned from clone3(). When the last reference to
> the struct file created by clone3() is closed the kernel sends SIGKILL
> to the child. A pidfd obtained via pidfd_open() for the same process
> does not keep the child alive and does not trigger autokill - only the
> specific struct file from clone3() has this property.
>
> This is useful for container runtimes, service managers, and sandboxed
> subprocess execution - any scenario where the child must die if the
> parent crashes or abandons the pidfd.
>
> CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
> lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no
> one to reap it would become a zombie). CLONE_THREAD is rejected because
> autokill targets a process not a thread.
>
> The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
> the struct file at clone3() time. The pidfs .release handler checks this
> flag and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...)
> only when it is set. Files from pidfd_open() or open_by_handle_at() are
> distinct struct files that do not carry this flag. dup()/fork() share the
> same struct file so they extend the child's lifetime until the last
> reference drops.
>
> CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
> CLONE_NNP the child could escalate privileges via setuid/setgid exec
> after being spawned, so the caller must have CAP_SYS_ADMIN in its user
> namespace. With CLONE_NNP the child can never gain new privileges so
> unprivileged usage is allowed. This is a deliberate departure from the
> pdeath_signal model which is reset during secureexec and commit_creds()
> rendering it useless for container runtimes that need to deprivilege
> themselves.
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a3202ee278d8..0f4944ce378d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2042,6 +2042,24 @@ __latent_entropy struct task_struct *copy_process(
>                         return ERR_PTR(-EINVAL);
>         }
>
> +       if (clone_flags & CLONE_PIDFD_AUTOKILL) {
> +               if (!(clone_flags & CLONE_PIDFD))
> +                       return ERR_PTR(-EINVAL);
> +               if (!(clone_flags & CLONE_AUTOREAP))
> +                       return ERR_PTR(-EINVAL);
> +               if (clone_flags & CLONE_THREAD)
> +                       return ERR_PTR(-EINVAL);
> +               /*
> +                * Without CLONE_NNP the child could escalate privileges
> +                * after being spawned, so require CAP_SYS_ADMIN.
> +                * With CLONE_NNP the child can't gain new privileges,
> +                * so allow unprivileged usage.
> +                */
> +               if (!(clone_flags & CLONE_NNP) &&
> +                   !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> +                       return ERR_PTR(-EPERM);
> +       }

That security model looks good to me.
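
The checks in the quoted hunk can be modelled in userspace as a pure
function, which makes the resulting errno for each flag combination easy
to see. This is only a sketch and not kernel code: the MODEL_ constants
and model_autokill_check() are hypothetical names, the values for the
three new flags are taken from the fallback defines in the selftests of
this series, and the has_cap_sys_admin parameter stands in for the
ns_capable() call.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing clone flags (values as in the Linux uapi headers). */
#define MODEL_CLONE_PIDFD		0x00001000ULL
#define MODEL_CLONE_THREAD		0x00010000ULL
/* Proposed flags (values from the selftest fallback defines). */
#define MODEL_CLONE_AUTOREAP		0x400000000ULL
#define MODEL_CLONE_PIDFD_AUTOKILL	0x800000000ULL
#define MODEL_CLONE_NNP			0x1000000000ULL

/*
 * Userspace model of the CLONE_PIDFD_AUTOKILL validation.
 * Returns 0 if the combination is accepted, or a negative errno.
 */
static int model_autokill_check(uint64_t flags, bool has_cap_sys_admin)
{
	if (!(flags & MODEL_CLONE_PIDFD_AUTOKILL))
		return 0;

	/* The lifetime is tied to the pidfd, so one must be requested. */
	if (!(flags & MODEL_CLONE_PIDFD))
		return -EINVAL;

	/* A killed child with nobody to reap it would become a zombie. */
	if (!(flags & MODEL_CLONE_AUTOREAP))
		return -EINVAL;

	/* Autokill targets a process, not a thread. */
	if (flags & MODEL_CLONE_THREAD)
		return -EINVAL;

	/* Without CLONE_NNP the caller needs CAP_SYS_ADMIN. */
	if (!(flags & MODEL_CLONE_NNP) && !has_cap_sys_admin)
		return -EPERM;

	return 0;
}
```

The structural checks run before the capability check, so an unprivileged
caller that omits CLONE_PIDFD gets -EINVAL rather than -EPERM. This matches
what autokill_requires_pidfd and autokill_requires_cap_sys_admin expect.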


end of thread, other threads:[~2026-03-02 17:17 UTC | newest]

Thread overview: 10+ messages
2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
2026-02-26 13:50 ` [PATCH v5 1/6] clone: add CLONE_AUTOREAP Christian Brauner
2026-02-26 13:51 ` [PATCH v5 2/6] clone: add CLONE_NNP Christian Brauner
2026-02-26 13:51 ` [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
2026-03-02 17:16   ` Jann Horn
2026-02-26 13:51 ` [PATCH v5 4/6] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
2026-02-26 13:51 ` [PATCH v5 5/6] selftests/pidfd: add CLONE_NNP tests Christian Brauner
2026-02-26 13:51 ` [PATCH v5 6/6] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
2026-02-28 12:49 ` [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Oleg Nesterov
2026-03-02 10:28   ` Christian Brauner
