public inbox for linux-kernel@vger.kernel.org
* [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL
@ 2026-02-17 22:35 Christian Brauner
  2026-02-17 22:35 ` [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP Christian Brauner
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-17 22:35 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add two new clone3() flags for pidfd-based process lifecycle management.

CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to the
existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD
which applies to all children of a given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and returns
autoreap=true causing exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives reparenting: if
the original parent exits and the child is reparented to a subreaper or
init the child still auto-reaps when it eventually exits. This is
cleaner than forcing the subreaper to receive SIGCHLD and reap the
child itself. If the parent doesn't care, the subreaper won't care
either. If a subreaper does care, it would be easy enough to add a
prctl() that either turns SIGCHLD back on and turns off auto-reaping, or
one that notifies the subreaper whenever a child is reparented to it.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
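
A rough user-space sketch of the CLONE_PIDFD combination follows. It
assumes the flag value proposed in this series (0x400000000ULL) and uses a
made-up function name. Retrieving the exit status via PIDFD_GET_INFO is
left out to keep the sketch independent of the installed header version.
On kernels without the flag clone3() is expected to fail with EINVAL, which
the sketch reports as "unsupported".

```c
#include <errno.h>
#include <linux/sched.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP 0x400000000ULL	/* value proposed in this series */
#endif

/*
 * Hypothetical helper: returns 0 if the child's exit was observed via
 * the pidfd and no zombie was left behind, 1 if the kernel rejected
 * the request (flag unsupported or clone3() unavailable), -1 otherwise.
 */
static int autoreap_with_pidfd(void)
{
	struct clone_args args;
	struct pollfd pfd;
	int pidfd = -1;
	long pid;
	int ret;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_PIDFD | CLONE_AUTOREAP;
	args.exit_signal = 0;	/* must be zero with CLONE_AUTOREAP */
	args.pidfd = (uint64_t)(uintptr_t)&pidfd;

	pid = syscall(SYS_clone3, &args, sizeof(args));
	if (pid < 0) {
		if (errno == EINVAL || errno == ENOSYS || errno == EPERM)
			return 1;
		return -1;
	}
	if (pid == 0)
		_exit(42);

	/* The pidfd becomes readable once the child has exited. */
	pfd.fd = pidfd;
	pfd.events = POLLIN;
	ret = poll(&pfd, 1, 5000);
	if (ret != 1) {
		close(pidfd);
		return -1;
	}

	/* No zombie: there is nothing left for waitpid() to collect. */
	ret = waitpid((pid_t)pid, NULL, WNOHANG);
	close(pidfd);
	return (ret < 0 && errno == ECHILD) ? 0 : -1;
}
```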

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

CLONE_PIDFD_AUTOKILL ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by clone3()
is closed the kernel sends SIGKILL to the child. A pidfd obtained via
pidfd_open() for the same process does not keep the child alive and does
not trigger autokill - only the specific struct file from clone3() has
this property. This is useful for container runtimes, service managers,
and sandboxed subprocess execution - any scenario where the child must
die if the parent crashes or abandons the pidfd.
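
The intended usage might look roughly like the sketch below. It assumes
the flag values proposed in this series and uses a made-up function name.
A second pidfd from pidfd_open() is used only to observe the exit, since
closing it has no autokill effect. On kernels without the flags clone3()
is expected to fail with EINVAL, reported here as "unsupported".

```c
#include <errno.h>
#include <linux/sched.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP 0x400000000ULL		/* value proposed in this series */
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL 0x800000000ULL	/* value proposed in this series */
#endif

/*
 * Hypothetical helper: returns 0 if closing the clone3() pidfd killed
 * the child, 1 if the kernel rejected the request, -1 otherwise.
 */
static int autokill_on_close(void)
{
	struct clone_args args;
	struct pollfd pfd;
	int clone_pidfd = -1, watch_fd, ret;
	long pid;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_PIDFD | CLONE_AUTOREAP | CLONE_PIDFD_AUTOKILL;
	args.exit_signal = 0;
	args.pidfd = (uint64_t)(uintptr_t)&clone_pidfd;

	pid = syscall(SYS_clone3, &args, sizeof(args));
	if (pid < 0) {
		if (errno == EINVAL || errno == ENOSYS || errno == EPERM)
			return 1;
		return -1;
	}
	if (pid == 0) {
		/* Child: block until killed. */
		for (;;)
			pause();
	}

	/* Distinct struct file: closing it later does not trigger autokill. */
	watch_fd = (int)syscall(SYS_pidfd_open, (pid_t)pid, 0);
	if (watch_fd < 0) {
		close(clone_pidfd);	/* still kills the child */
		return -1;
	}

	/* Dropping the last reference to the clone3() pidfd sends SIGKILL. */
	close(clone_pidfd);

	pfd.fd = watch_fd;
	pfd.events = POLLIN;
	ret = poll(&pfd, 1, 5000);
	close(watch_fd);
	return ret == 1 ? 0 : -1;
}
```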

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It
requires CLONE_PIDFD because the whole point is tying the child's
lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child
with no one to reap it would become a zombie - the primary use case is
the parent crashing or abandoning the pidfd so no one is around to call
waitpid().

The clone3 pidfd is identified by storing a pointer to the struct file in
signal_struct.autokill_pidfd. The pidfs .release handler compares the
file being closed against this pointer and sends SIGKILL only on match.
dup()/fork() share the same struct file so they extend the child's
lifetime until the last reference drops.
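
The distinction between "same struct file" and "same underlying object"
can be observed from user space with any file type. The sketch below is
only an analogy using a regular temporary file and made-up function names:
a dup()ed descriptor shares the open file description (and therefore the
file offset), while a separate open() of the same path creates a new one.
This is the same relationship as between a dup() of the clone3() pidfd and
a pidfd from pidfd_open().

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a seek through fd a is visible through fd b, i.e. both
 * refer to the same open file description, 0 if not, -1 on error.
 */
static int shares_description(int a, int b)
{
	off_t pos;

	if (lseek(a, 0, SEEK_SET) < 0 || lseek(b, 0, SEEK_SET) < 0)
		return -1;
	if (lseek(a, 3, SEEK_SET) != 3)
		return -1;
	pos = lseek(b, 0, SEEK_CUR);
	if (pos < 0)
		return -1;
	return pos == 3;
}

/* Fills in both results; returns 0 on success, -1 on error. */
static int demo_shared_description(int *dup_shared, int *reopen_shared)
{
	char path[] = "/tmp/ofd-demo-XXXXXX";
	int fd, dup_fd, reopened_fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	dup_fd = dup(fd);			/* same struct file */
	reopened_fd = open(path, O_RDONLY);	/* new struct file */
	unlink(path);

	if (dup_fd >= 0 && reopened_fd >= 0) {
		*dup_shared = shares_description(fd, dup_fd);
		*reopen_shared = shares_description(fd, reopened_fd);
		if (*dup_shared >= 0 && *reopen_shared >= 0)
			ret = 0;
	}

	if (dup_fd >= 0)
		close(dup_fd);
	if (reopened_fd >= 0)
		close(reopened_fd);
	close(fd);
	return ret;
}
```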

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Add CLONE_PIDFD_AUTOKILL flag
- Decouple CLONE_AUTOREAP from CLONE_PIDFD: the autoreap mechanism has
  no dependency on pidfds. This allows fire-and-forget patterns where
  the parent does not need exit status.
- Link to v1: https://patch.msgid.link/20260216-work-pidfs-autoreap-v1-0-e63f663008f2@kernel.org

---
Christian Brauner (4):
      clone: add CLONE_AUTOREAP
      pidfd: add CLONE_PIDFD_AUTOKILL
      selftests/pidfd: add CLONE_AUTOREAP tests
      selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests

 fs/pidfs.c                                         |  16 +
 include/linux/sched/signal.h                       |   4 +
 include/uapi/linux/sched.h                         |   2 +
 kernel/fork.c                                      |  28 +-
 kernel/ptrace.c                                    |   3 +-
 kernel/signal.c                                    |   4 +
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 676 +++++++++++++++++++++
 9 files changed, 732 insertions(+), 4 deletions(-)
---
base-commit: 9702969978695d9a699a1f34771580cdbb153b33
change-id: 20260214-work-pidfs-autoreap-3ee677e240a8



* [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP
  2026-02-17 22:35 [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
@ 2026-02-17 22:35 ` Christian Brauner
  2026-02-18 11:25   ` Oleg Nesterov
  2026-02-17 22:35 ` [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2026-02-17 22:35 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add a new clone3() flag CLONE_AUTOREAP that makes a child process
auto-reap on exit without ever becoming a zombie. This is a per-process
property in contrast to the existing auto-reap mechanism via
SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a
given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and returns
autoreap=true causing exit_notify() to transition the task directly to
EXIT_DEAD. Since the flag lives on the child it survives reparenting: if
the original parent exits and the child is reparented to a subreaper or
init the child still auto-reaps when it eventually exits.
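
The decision described above can be summarised with a deliberately
simplified user-space model. This is not kernel code: the struct, enum and
function names are invented for illustration, and details such as thread
group handling are ignored. It only captures that the existing path depends
on the parent's SIGCHLD disposition while the new path depends on a flag
stored on the child, and that neither applies while the task is ptraced.

```c
#include <stdbool.h>

/* Hypothetical inputs to the model; names are not from the kernel. */
struct exit_inputs {
	bool exit_signal_is_sigchld;	/* child notifies its parent with SIGCHLD */
	bool parent_ignores_sigchld;	/* parent has SIG_IGN or SA_NOCLDWAIT set */
	bool child_autoreap;		/* per-child flag, set via CLONE_AUTOREAP */
	bool child_ptraced;		/* a tracer is attached */
};

enum model_exit_state {
	MODEL_EXIT_ZOMBIE,	/* waits to be reaped by the parent */
	MODEL_EXIT_DEAD,	/* released immediately, never a zombie */
};

static enum model_exit_state model_exit_state(const struct exit_inputs *in)
{
	bool autoreap = false;

	/* Existing, parent-scoped mechanism. */
	if (!in->child_ptraced && in->exit_signal_is_sigchld &&
	    in->parent_ignores_sigchld)
		autoreap = true;

	/* New, child-scoped mechanism: independent of who the parent is. */
	if (!in->child_ptraced && in->child_autoreap)
		autoreap = true;

	return autoreap ? MODEL_EXIT_DEAD : MODEL_EXIT_ZOMBIE;
}
```

Since the second check never looks at the parent, reparenting to a
subreaper or init does not change the outcome in this model.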

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern where the parent simply doesn't care about the child's exit
status. No exit signal is delivered so exit_signal must be zero.

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

Link: https://github.com/uapi-group/kernel-features/issues/45
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/sched/signal.h |  1 +
 include/uapi/linux/sched.h   |  1 +
 kernel/fork.c                | 14 +++++++++++++-
 kernel/ptrace.c              |  3 ++-
 kernel/signal.c              |  4 ++++
 5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a22248aebcf9..f842c86b806f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -132,6 +132,7 @@ struct signal_struct {
 	 */
 	unsigned int		is_child_subreaper:1;
 	unsigned int		has_child_subreaper:1;
+	unsigned int		autoreap:1;
 
 #ifdef CONFIG_POSIX_TIMERS
 
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..8a22ea640817 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -36,6 +36,7 @@
 /* Flags for the clone3() syscall. */
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index e832da9d15a4..bc27dc10c309 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2028,6 +2028,13 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_AUTOREAP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		if (args->exit_signal)
+			return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2374,6 +2381,8 @@ __latent_entropy struct task_struct *copy_process(
 		p->parent_exec_id = current->parent_exec_id;
 		if (clone_flags & CLONE_THREAD)
 			p->exit_signal = -1;
+		else if (clone_flags & CLONE_AUTOREAP)
+			p->exit_signal = 0;
 		else
 			p->exit_signal = current->group_leader->exit_signal;
 	} else {
@@ -2435,6 +2444,8 @@ __latent_entropy struct task_struct *copy_process(
 			 */
 			p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper ||
 							 p->real_parent->signal->is_child_subreaper;
+			if (clone_flags & CLONE_AUTOREAP)
+				p->signal->autoreap = 1;
 			list_add_tail(&p->sibling, &p->real_parent->children);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
 			attach_pid(p, PIDTYPE_TGID);
@@ -2897,7 +2908,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
-	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
+	      CLONE_AUTOREAP))
 		return false;
 
 	/*
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 392ec2f75f01..68c17daef8d4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -549,7 +549,8 @@ static bool __ptrace_detach(struct task_struct *tracer, struct task_struct *p)
 	if (!dead && thread_group_empty(p)) {
 		if (!same_thread_group(p->real_parent, tracer))
 			dead = do_notify_parent(p, p->exit_signal);
-		else if (ignoring_children(tracer->sighand)) {
+		else if (ignoring_children(tracer->sighand) ||
+			 p->signal->autoreap) {
 			__wake_up_parent(p, tracer);
 			dead = true;
 		}
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..e61f39fa8c8a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2251,6 +2251,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
 			sig = 0;
 	}
+	if (!tsk->ptrace && tsk->signal->autoreap) {
+		autoreap = true;
+		sig = 0;
+	}
 	/*
 	 * Send with __send_signal as si_pid and si_uid are in the
 	 * parent's namespaces.

-- 
2.47.3



* [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 22:35 [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
  2026-02-17 22:35 ` [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP Christian Brauner
@ 2026-02-17 22:35 ` Christian Brauner
  2026-02-17 23:17   ` Linus Torvalds
                     ` (2 more replies)
  2026-02-17 22:35 ` [PATCH RFC v3 3/4] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-17 22:35 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
lifetime to the pidfd returned from clone3(). When the last reference to
the struct file created by clone3() is closed the kernel sends SIGKILL
to the child. A pidfd obtained via pidfd_open() for the same process
does not keep the child alive and does not trigger autokill - only the
specific struct file from clone3() has this property.

This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no
one to reap it would become a zombie). CLONE_THREAD is rejected because
autokill targets a process not a thread.
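
The flag rules for both new flags can be restated as a small stand-alone
function. This is a user-space restatement of the checks added to
copy_process() in this series, not the kernel code itself. The EX_ prefixed
names and the function name are invented for the example. The first two
values are the existing UAPI values of CLONE_PIDFD and CLONE_THREAD, the
last two are the values proposed here.

```c
#include <errno.h>
#include <stdint.h>

#define EX_CLONE_PIDFD		0x00001000ULL	/* existing CLONE_PIDFD */
#define EX_CLONE_THREAD		0x00010000ULL	/* existing CLONE_THREAD */
#define EX_CLONE_AUTOREAP	0x400000000ULL	/* proposed in this series */
#define EX_CLONE_PIDFD_AUTOKILL	0x800000000ULL	/* proposed in this series */

/* Returns 0 if the combination is acceptable, -EINVAL otherwise. */
static int check_lifecycle_flags(uint64_t flags, uint64_t exit_signal)
{
	if (flags & EX_CLONE_AUTOREAP) {
		/* Autoreap is a per-process property, not per-thread. */
		if (flags & EX_CLONE_THREAD)
			return -EINVAL;
		/* No exit signal is delivered, so none may be requested. */
		if (exit_signal)
			return -EINVAL;
	}

	if (flags & EX_CLONE_PIDFD_AUTOKILL) {
		/* The lifetime is tied to the pidfd, so there must be one. */
		if (!(flags & EX_CLONE_PIDFD))
			return -EINVAL;
		/* A killed child with no one to reap it would stay a zombie. */
		if (!(flags & EX_CLONE_AUTOREAP))
			return -EINVAL;
		if (flags & EX_CLONE_THREAD)
			return -EINVAL;
	}

	return 0;
}
```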

The clone3 pidfd is identified by storing a pointer to the struct file in
signal_struct.autokill_pidfd. The pidfs .release handler compares the
file being closed against this pointer and sends SIGKILL via
do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID) only on
match. Files from pidfd_open() or open_by_handle_at() are distinct
struct files and will never match. dup()/fork() share the same struct
file so they extend the child's lifetime until the last reference drops.

Unlike pdeath_signal, autokill is not disarmed on exec or on credential
changes that cross privilege boundaries. Disarming it there would defeat
the purpose of the flag.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c                   | 16 ++++++++++++++++
 include/linux/sched/signal.h |  3 +++
 include/uapi/linux/sched.h   |  1 +
 kernel/fork.c                | 16 ++++++++++++++--
 4 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 318253344b5c..b3891b2097eb 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -8,6 +8,8 @@
 #include <linux/mount.h>
 #include <linux/pid.h>
 #include <linux/pidfs.h>
+#include <linux/sched/signal.h>
+#include <linux/signal.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -637,7 +639,21 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return open_namespace(ns_common);
 }
 
+static int pidfs_file_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+
+	guard(rcu)();
+	task = pid_task(pid, PIDTYPE_TGID);
+	if (task && READ_ONCE(task->signal->autokill_pidfd) == file)
+		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
+
+	return 0;
+}
+
 static const struct file_operations pidfs_file_operations = {
+	.release	= pidfs_file_release,
 	.poll		= pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= pidfd_show_fdinfo,
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index f842c86b806f..85a3de5c4030 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -134,6 +134,9 @@ struct signal_struct {
 	unsigned int		has_child_subreaper:1;
 	unsigned int		autoreap:1;
 
+	/* pidfd that triggers SIGKILL on close, or NULL */
+	const struct file	*autokill_pidfd;
+
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 8a22ea640817..b1aea8a86e2f 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -37,6 +37,7 @@
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP 0x400000000ULL /* Auto-reap child on exit. */
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL /* Kill child when clone pidfd closes. */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index bc27dc10c309..7bcdba54c9a0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2035,6 +2035,15 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	if (clone_flags & CLONE_PIDFD_AUTOKILL) {
+		if (!(clone_flags & CLONE_PIDFD))
+			return ERR_PTR(-EINVAL);
+		if (!(clone_flags & CLONE_AUTOREAP))
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -2470,8 +2479,11 @@ __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
-	if (pidfile)
+	if (pidfile) {
+		if (clone_flags & CLONE_PIDFD_AUTOKILL)
+			p->signal->autokill_pidfd = pidfile;
 		fd_install(pidfd, pidfile);
+	}
 
 	proc_fork_connector(p);
 	sched_post_fork(p);
@@ -2909,7 +2921,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
 	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
-	      CLONE_AUTOREAP))
+	      CLONE_AUTOREAP | CLONE_PIDFD_AUTOKILL))
 		return false;
 
 	/*

-- 
2.47.3



* [PATCH RFC v3 3/4] selftests/pidfd: add CLONE_AUTOREAP tests
  2026-02-17 22:35 [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
  2026-02-17 22:35 ` [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP Christian Brauner
  2026-02-17 22:35 ` [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
@ 2026-02-17 22:35 ` Christian Brauner
  2026-02-17 22:35 ` [PATCH RFC v3 4/4] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
  2026-02-17 22:46 ` [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
  4 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-17 22:35 UTC
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add tests for the new CLONE_AUTOREAP clone3() flag:

- autoreap_without_pidfd: CLONE_AUTOREAP without CLONE_PIDFD works
  (fire-and-forget) and leaves no zombie
- autoreap_rejects_exit_signal: CLONE_AUTOREAP with non-zero
  exit_signal fails
- autoreap_rejects_thread: CLONE_AUTOREAP with CLONE_THREAD fails
- autoreap_basic: child exits, pidfd poll works, PIDFD_GET_INFO returns
  correct exit code, waitpid() fails with ECHILD
- autoreap_signaled: child killed by signal, exit info correct via pidfd
- autoreap_reparent: autoreap grandchild reparented to subreaper still
  auto-reaps
- autoreap_multithreaded: autoreap process with sub-threads auto-reaps
  after last thread exits
- autoreap_no_inherit: grandchild forked without CLONE_AUTOREAP becomes
  a regular zombie

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 489 +++++++++++++++++++++
 3 files changed, 491 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 144e7ff65d6a..4cd8ec7fd349 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -12,3 +12,4 @@ pidfd_info_test
 pidfd_exec_helper
 pidfd_xattr_test
 pidfd_setattr_test
+pidfd_autoreap_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 764a8f9ecefa..4211f91e9af8 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -4,7 +4,7 @@ CFLAGS += -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES) -pthread -Wall
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
 	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test \
-	pidfd_xattr_test pidfd_setattr_test
+	pidfd_xattr_test pidfd_setattr_test pidfd_autoreap_test
 
 TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper
 
diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
new file mode 100644
index 000000000000..3c0c45359473
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -0,0 +1,489 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2026 Christian Brauner <brauner@kernel.org>
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/types.h>
+#include <poll.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "kselftest_harness.h"
+
+#ifndef CLONE_AUTOREAP
+#define CLONE_AUTOREAP 0x400000000ULL
+#endif
+
+static pid_t create_autoreap_child(int *pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_AUTOREAP,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(pidfd),
+	};
+
+	return sys_clone3(&args, sizeof(args));
+}
+
+/*
+ * Test that CLONE_AUTOREAP works without CLONE_PIDFD (fire-and-forget).
+ */
+TEST(autoreap_without_pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_AUTOREAP,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+	int ret;
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(0);
+
+	/*
+	 * Give the child a moment to exit and be autoreaped.
+	 * Then verify no zombie remains.
+	 */
+	usleep(200000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with a non-zero exit_signal fails.
+ */
+TEST(autoreap_rejects_exit_signal)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_AUTOREAP,
+		.exit_signal	= SIGCHLD,
+	};
+	int pidfd = -1;
+	pid_t pid;
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that CLONE_AUTOREAP with CLONE_THREAD fails.
+ */
+TEST(autoreap_rejects_thread)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_AUTOREAP |
+				  CLONE_THREAD | CLONE_SIGHAND |
+				  CLONE_VM,
+		.exit_signal	= 0,
+	};
+	int pidfd = -1;
+	pid_t pid;
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Basic test: create an autoreap child, let it exit, verify:
+ * - pidfd becomes readable (poll returns POLLIN)
+ * - PIDFD_GET_INFO returns the correct exit code
+ * - waitpid() returns -1/ECHILD (no zombie)
+ */
+TEST(autoreap_basic)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(42);
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Wait for the child to exit via pidfd poll. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Verify exit info via PIDFD_GET_INFO. */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	/*
+	 * exit_code is in waitpid format: for _exit(42),
+	 * WIFEXITED is true and WEXITSTATUS is 42.
+	 */
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 42);
+
+	/* Verify no zombie: waitpid should fail with ECHILD. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+/*
+ * Test that an autoreap child killed by a signal reports
+ * the correct exit info.
+ */
+TEST(autoreap_signaled)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int pidfd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Kill the child. */
+	ret = sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
+	ASSERT_EQ(ret, 0);
+
+	/* Wait for exit via pidfd. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Verify signal info. */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
+
+	/* No zombie. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+/*
+ * Test autoreap survives reparenting: middle process creates an
+ * autoreap grandchild, then exits. The grandchild gets reparented
+ * to us (the grandparent, which is a subreaper). When the grandchild
+ * exits, it should still be autoreaped - no zombie under us.
+ */
+TEST(autoreap_reparent)
+{
+	int ipc_sockets[2], ret;
+	int pidfd = -1;
+	struct pollfd pfd;
+	pid_t mid_pid, grandchild_pid;
+	char buf[32] = {};
+
+	/* Make ourselves a subreaper so reparented children come to us. */
+	ret = prctl(PR_SET_CHILD_SUBREAPER, 1);
+	ASSERT_EQ(ret, 0);
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	mid_pid = fork();
+	ASSERT_GE(mid_pid, 0);
+
+	if (mid_pid == 0) {
+		/* Middle child: create an autoreap grandchild. */
+		int gc_pidfd = -1;
+
+		close(ipc_sockets[0]);
+
+		grandchild_pid = create_autoreap_child(&gc_pidfd);
+		if (grandchild_pid < 0) {
+			write_nointr(ipc_sockets[1], "E", 1);
+			close(ipc_sockets[1]);
+			_exit(1);
+		}
+
+		if (grandchild_pid == 0) {
+			/* Grandchild: wait for signal to exit. */
+			close(ipc_sockets[1]);
+			if (gc_pidfd >= 0)
+				close(gc_pidfd);
+			pause();
+			_exit(0);
+		}
+
+		/* Send grandchild PID to grandparent. */
+		snprintf(buf, sizeof(buf), "%d", grandchild_pid);
+		write_nointr(ipc_sockets[1], buf, strlen(buf));
+		close(ipc_sockets[1]);
+		if (gc_pidfd >= 0)
+			close(gc_pidfd);
+
+		/* Middle child exits, grandchild gets reparented. */
+		_exit(0);
+	}
+
+	close(ipc_sockets[1]);
+
+	/* Read grandchild's PID. */
+	ret = read_nointr(ipc_sockets[0], buf, sizeof(buf) - 1);
+	close(ipc_sockets[0]);
+	ASSERT_GT(ret, 0);
+
+	if (buf[0] == 'E') {
+		waitpid(mid_pid, NULL, 0);
+		prctl(PR_SET_CHILD_SUBREAPER, 0);
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	}
+
+	grandchild_pid = atoi(buf);
+	ASSERT_GT(grandchild_pid, 0);
+
+	/* Wait for the middle child to exit. */
+	ret = waitpid(mid_pid, NULL, 0);
+	ASSERT_EQ(ret, mid_pid);
+
+	/*
+	 * Now the grandchild is reparented to us (subreaper).
+	 * Open a pidfd for the grandchild and kill it.
+	 */
+	pidfd = sys_pidfd_open(grandchild_pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	ret = sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
+	ASSERT_EQ(ret, 0);
+
+	/* Wait for it to exit via pidfd poll. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/*
+	 * The grandchild should have been autoreaped even though
+	 * we (the new parent) haven't set SA_NOCLDWAIT.
+	 * waitpid should return -1/ECHILD.
+	 */
+	ret = waitpid(grandchild_pid, NULL, WNOHANG);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ECHILD);
+
+	close(pidfd);
+
+	/* Clean up subreaper status. */
+	prctl(PR_SET_CHILD_SUBREAPER, 0);
+}
+
+static int thread_sock_fd;
+
+static void *thread_func(void *arg)
+{
+	/* Signal parent we're running. */
+	write_nointr(thread_sock_fd, "1", 1);
+
+	/* Give main thread time to call _exit() first. */
+	usleep(200000);
+
+	return NULL;
+}
+
+/*
+ * Test that an autoreap child with multiple threads is properly
+ * autoreaped only after all threads have exited.
+ */
+TEST(autoreap_multithreaded)
+{
+	struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };
+	int ipc_sockets[2], ret;
+	int pidfd = -1;
+	struct pollfd pfd;
+	pid_t pid;
+	char c;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL) {
+		close(ipc_sockets[0]);
+		close(ipc_sockets[1]);
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	}
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pthread_t thread;
+
+		close(ipc_sockets[0]);
+
+		/*
+		 * Create a sub-thread that outlives the main thread.
+		 * The thread signals readiness, then sleeps.
+		 * The main thread waits briefly, then calls _exit().
+		 */
+		thread_sock_fd = ipc_sockets[1];
+		pthread_create(&thread, NULL, thread_func, NULL);
+		pthread_detach(thread);
+
+		/* Wait for thread to be running. */
+		usleep(100000);
+
+		/* Main thread exits; sub-thread is still alive. */
+		_exit(99);
+	}
+
+	close(ipc_sockets[1]);
+
+	/* Wait for the sub-thread to signal readiness. */
+	ret = read_nointr(ipc_sockets[0], &c, 1);
+	close(ipc_sockets[0]);
+	ASSERT_EQ(ret, 1);
+
+	/* Wait for the process to fully exit via pidfd poll. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Verify exit info. */
+	ret = ioctl(pidfd, PIDFD_GET_INFO, &info);
+	ASSERT_EQ(ret, 0);
+	ASSERT_TRUE(info.mask & PIDFD_INFO_EXIT);
+	ASSERT_TRUE(WIFEXITED(info.exit_code));
+	ASSERT_EQ(WEXITSTATUS(info.exit_code), 99);
+
+	/* No zombie. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+/*
+ * Test that autoreap is NOT inherited by grandchildren.
+ */
+TEST(autoreap_no_inherit)
+{
+	int ipc_sockets[2], ret;
+	int pidfd = -1;
+	pid_t pid;
+	char buf[2] = {};
+	struct pollfd pfd;
+
+	ret = socketpair(AF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	pid = create_autoreap_child(&pidfd);
+	if (pid < 0 && errno == EINVAL) {
+		close(ipc_sockets[0]);
+		close(ipc_sockets[1]);
+		SKIP(return, "CLONE_AUTOREAP not supported");
+	}
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pid_t gc;
+		int status;
+
+		close(ipc_sockets[0]);
+
+		/* Autoreap child forks a grandchild (without autoreap). */
+		gc = fork();
+		if (gc < 0) {
+			write_nointr(ipc_sockets[1], "E", 1);
+			_exit(1);
+		}
+		if (gc == 0) {
+			/* Grandchild: exit immediately. */
+			close(ipc_sockets[1]);
+			_exit(77);
+		}
+
+		/*
+		 * The grandchild should become a regular zombie
+		 * since it was NOT created with CLONE_AUTOREAP.
+		 * Wait for it to verify.
+		 */
+		ret = waitpid(gc, &status, 0);
+		if (ret == gc && WIFEXITED(status) &&
+		    WEXITSTATUS(status) == 77) {
+			write_nointr(ipc_sockets[1], "P", 1);
+		} else {
+			write_nointr(ipc_sockets[1], "F", 1);
+		}
+		close(ipc_sockets[1]);
+		_exit(0);
+	}
+
+	close(ipc_sockets[1]);
+
+	ret = read_nointr(ipc_sockets[0], buf, 1);
+	close(ipc_sockets[0]);
+	ASSERT_EQ(ret, 1);
+
+	/*
+	 * 'P' means the autoreap child was able to waitpid() its
+	 * grandchild (correct - grandchild should be a normal zombie,
+	 * not autoreaped).
+	 */
+	ASSERT_EQ(buf[0], 'P');
+
+	/* Wait for the autoreap child to exit. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+
+	/* Autoreap child itself should be autoreaped. */
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pidfd);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC v3 4/4] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
  2026-02-17 22:35 [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (2 preceding siblings ...)
  2026-02-17 22:35 ` [PATCH RFC v3 3/4] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
@ 2026-02-17 22:35 ` Christian Brauner
  2026-02-17 22:46 ` [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
  4 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-17 22:35 UTC (permalink / raw)
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add tests for the new CLONE_PIDFD_AUTOKILL clone3() flag:

- autokill_basic: child blocks in pause(), parent closes clone3 pidfd,
  child is killed and autoreaped
- autokill_requires_pidfd: CLONE_PIDFD_AUTOKILL without CLONE_PIDFD
  fails with EINVAL
- autokill_requires_autoreap: CLONE_PIDFD_AUTOKILL without
  CLONE_AUTOREAP fails with EINVAL
- autokill_rejects_thread: CLONE_PIDFD_AUTOKILL with CLONE_THREAD fails
  with EINVAL
- autokill_pidfd_open_no_effect: closing a pidfd_open() fd does not kill
  the child, closing the clone3 pidfd does

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 187 +++++++++++++++++++++
 1 file changed, 187 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
index 3c0c45359473..a1dc4f075fc3 100644
--- a/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_autoreap_test.c
@@ -28,6 +28,10 @@
 #define CLONE_AUTOREAP 0x400000000ULL
 #endif
 
+#ifndef CLONE_PIDFD_AUTOKILL
+#define CLONE_PIDFD_AUTOKILL 0x800000000ULL
+#endif
+
 static pid_t create_autoreap_child(int *pidfd)
 {
 	struct __clone_args args = {
@@ -486,4 +490,187 @@ TEST(autoreap_no_inherit)
 	close(pidfd);
 }
 
+/*
+ * Helper: create a child with CLONE_PIDFD | CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP.
+ */
+static pid_t create_autokill_child(int *pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+				  CLONE_AUTOREAP,
+		.exit_signal	= 0,
+		.pidfd		= ptr_to_u64(pidfd),
+	};
+
+	return sys_clone3(&args, sizeof(args));
+}
+
+/*
+ * Basic autokill test: child blocks in pause(), parent closes the
+ * clone3 pidfd, child should be killed and autoreaped.
+ */
+TEST(autokill_basic)
+{
+	int pidfd = -1, pollfd_fd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autokill_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/*
+	 * Open a second pidfd via pidfd_open() so we can observe the
+	 * child's death after closing the clone3 pidfd.
+	 */
+	pollfd_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pollfd_fd, 0);
+
+	/* Close the clone3 pidfd — this should trigger autokill. */
+	close(pidfd);
+
+	/* Wait for the child to die via the pidfd_open'd fd. */
+	pfd.fd = pollfd_fd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Child should be autoreaped — no zombie. */
+	usleep(100000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(pollfd_fd);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL without CLONE_PIDFD must fail with EINVAL.
+ */
+TEST(autokill_requires_pidfd)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP,
+		.exit_signal	= 0,
+	};
+	pid_t pid;
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL without CLONE_AUTOREAP must fail with EINVAL.
+ */
+TEST(autokill_requires_autoreap)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL,
+		.exit_signal	= SIGCHLD,
+	};
+	int pidfd = -1;
+	pid_t pid;
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * CLONE_PIDFD_AUTOKILL with CLONE_THREAD must fail with EINVAL.
+ */
+TEST(autokill_rejects_thread)
+{
+	struct __clone_args args = {
+		.flags		= CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
+				  CLONE_AUTOREAP | CLONE_THREAD |
+				  CLONE_SIGHAND | CLONE_VM,
+		.exit_signal	= 0,
+	};
+	int pidfd = -1;
+	pid_t pid;
+
+	args.pidfd = ptr_to_u64(&pidfd);
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_EQ(pid, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test that only the clone3 pidfd triggers autokill, not pidfd_open().
+ * Close the pidfd_open'd fd first — child should survive.
+ * Then close the clone3 pidfd — child should be killed and autoreaped.
+ */
+TEST(autokill_pidfd_open_no_effect)
+{
+	int pidfd = -1, open_fd = -1, ret;
+	struct pollfd pfd;
+	pid_t pid;
+
+	pid = create_autokill_child(&pidfd);
+	if (pid < 0 && errno == EINVAL)
+		SKIP(return, "CLONE_PIDFD_AUTOKILL not supported");
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		pause();
+		_exit(1);
+	}
+
+	ASSERT_GE(pidfd, 0);
+
+	/* Open a second pidfd via pidfd_open(). */
+	open_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(open_fd, 0);
+
+	/*
+	 * Close the pidfd_open'd fd — child should survive because
+	 * only the clone3 pidfd has autokill.
+	 */
+	close(open_fd);
+	usleep(200000);
+
+	/* Verify child is still alive by polling the clone3 pidfd. */
+	pfd.fd = pidfd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 0);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("Child died after closing pidfd_open fd — should still be alive");
+	}
+
+	/* Open another observation fd before triggering autokill. */
+	open_fd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(open_fd, 0);
+
+	/* Now close the clone3 pidfd — this triggers autokill. */
+	close(pidfd);
+
+	pfd.fd = open_fd;
+	pfd.events = POLLIN;
+	ret = poll(&pfd, 1, 5000);
+	ASSERT_EQ(ret, 1);
+	ASSERT_TRUE(pfd.revents & POLLIN);
+
+	/* Child should be autoreaped — no zombie. */
+	usleep(100000);
+	ret = waitpid(pid, NULL, WNOHANG);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ECHILD);
+
+	close(open_fd);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL
  2026-02-17 22:35 [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
                   ` (3 preceding siblings ...)
  2026-02-17 22:35 ` [PATCH RFC v3 4/4] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
@ 2026-02-17 22:46 ` Christian Brauner
  4 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-17 22:46 UTC (permalink / raw)
  To: Oleg Nesterov, Jann Horn, Andy Lutomirski, Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-fsdevel

> CLONE_PIDFD_AUTOKILL ties a child's lifetime to the pidfd returned from
> clone3(). When the last reference to the struct file created by clone3()
> is closed the kernel sends SIGKILL to the child.

So this is for me one of the most useful features that I've been
pondering for a long time but always put off. Its usefulness is
intimately tied to the fact that the kill-on-close contract cannot be
flouted no matter what gets executed (FreeBSD has the same behavior for
pdfork()).

If the parent says to SIGKILL the child once the fd is closed then that
isn't reset by a privileged exec or a credential change. This is
in contrast to related mechanisms such as pdeath_signal which gets reset
by all kinds of crap but then can be set again and it's just cumbersome
and not super useful. Not even signal delivery is guaranteed as
permissions are checked for that as well.

My ideal model for kill-on-close is to just ruthlessly enforce that the
kernel murders anything once the file is released. But I would really
like to get some thoughts on this.
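
For illustration, the kill-on-close contract as seen from userspace might
look roughly like the sketch below. It assumes the flag values used by the
selftests in this series (they are not in any released header), and
spawn_autokill_child() is a hypothetical helper name, not part of the
patches.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_PIDFD */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: values copied from the selftests in this RFC series. */
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP 0x400000000ULL
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL 0x800000000ULL
#endif

/*
 * Hypothetical helper: create a child whose lifetime is tied to *pidfd.
 * Returns the child's pid in the parent, or -1 with errno set (EINVAL on
 * kernels that do not know the proposed flags). The child never returns.
 */
static pid_t spawn_autokill_child(int *pidfd)
{
	struct clone_args args;
	pid_t pid;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL | CLONE_AUTOREAP;
	args.pidfd = (uint64_t)(uintptr_t)pidfd;
	args.exit_signal = 0;	/* CLONE_AUTOREAP requires exit_signal == 0 */

	pid = (pid_t)syscall(SYS_clone3, &args, sizeof(args));
	if (pid == 0) {
		/* Child: block until the pidfd is closed and we get SIGKILL. */
		pause();
		_exit(1);
	}
	return pid;
}
```

With the series applied, close(pidfd) in the parent (or the parent dying
with the fd open) would be what kills the child. On kernels without it,
clone3() is expected to fail with EINVAL, which is what the selftests use
to skip.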

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 22:35 ` [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
@ 2026-02-17 23:17   ` Linus Torvalds
  2026-02-17 23:38     ` Jann Horn
  2026-02-17 23:43   ` Jann Horn
  2026-02-18 11:50   ` Oleg Nesterov
  2 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2026-02-17 23:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Jann Horn, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Tue, 17 Feb 2026 at 14:36, Christian Brauner <brauner@kernel.org> wrote:
>
> Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> lifetime to the pidfd returned from clone3(). When the last reference to
> the struct file created by clone3() is closed the kernel sends SIGKILL
> to the child.

Did I read this right? You can now basically kill suid binaries that
you started but don't have rights to kill any other way.

If I'm right, this is completely broken. Please explain.

              Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 23:17   ` Linus Torvalds
@ 2026-02-17 23:38     ` Jann Horn
  2026-02-17 23:44       ` Linus Torvalds
  2026-02-18 10:21       ` Christian Brauner
  0 siblings, 2 replies; 19+ messages in thread
From: Jann Horn @ 2026-02-17 23:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Wed, Feb 18, 2026 at 12:18 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, 17 Feb 2026 at 14:36, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> > lifetime to the pidfd returned from clone3(). When the last reference to
> > the struct file created by clone3() is closed the kernel sends SIGKILL
> > to the child.
>
> Did I read this right? You can now basically kill suid binaries that
> you started but don't have rights to kill any other way.
>
> If I'm right, this is completely broken. Please explain.

You can already send SIGHUP to such binaries through things like job
control, right?
Do we know if there are setuid binaries out there that change their
ruid and suid to prevent being killable via kill_ok_by_cred(), then
set SIGHUP to SIG_IGN to not be killable via job control, and then do
some work that shouldn't be interrupted?

Also, on a Linux system with systemd, I believe a normal user, when
running in the context of a user session (but not when running in the
context of a system service), can already SIGKILL anything they launch
by launching it in a systemd user service, then doing something like
"echo 1 > /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/app.slice/<servicename>.scope/cgroup.kill"
because systemd delegates cgroups for anything a user runs to that
user; and cgroup.kill goes through the codepath
cgroup_kill_write -> cgroup_kill -> __cgroup_kill -> send_sig(SIGKILL,
task, 0) -> send_sig_info -> do_send_sig_info
which, as far as I know, bypasses the normal signal sending permission
checks. (For comparison, group_send_sig_info() first calls
check_kill_permission(), then do_send_sig_info().)

I agree that this would be a change to the security model, but I'm not
sure if it would be that big a change. I guess an alternative might be
to instead gate the clone() flag on a `task_no_new_privs(current) ||
ns_capable()` check like in seccomp, but that might be too restrictive
for the usecases Christian has in mind...
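
As background for the seccomp-style gate mentioned above: no_new_privs is
a one-way per-task bit that userspace sets via prctl(). A minimal sketch
of opting in and reading it back follows. The helper names are
hypothetical, and whether CLONE_PIDFD_AUTOKILL would ever check this bit
is exactly what is still open in this thread.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set no_new_privs for the calling task. The bit
 * cannot be cleared again and is inherited across fork() and execve().
 * Returns 0 on success, -1 on error.
 */
static int enable_no_new_privs(void)
{
	return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Hypothetical helper: returns 1 if no_new_privs is set, 0 if not. */
static int has_no_new_privs(void)
{
	return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Once the bit is set, execve() of a setuid binary no longer grants extra
privileges, which is why such a gate would sidestep the "kill a setuid
binary you started" concern.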

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 22:35 ` [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
  2026-02-17 23:17   ` Linus Torvalds
@ 2026-02-17 23:43   ` Jann Horn
  2026-02-18 10:00     ` Christian Brauner
  2026-02-18 11:50   ` Oleg Nesterov
  2 siblings, 1 reply; 19+ messages in thread
From: Jann Horn @ 2026-02-17 23:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Tue, Feb 17, 2026 at 11:36 PM Christian Brauner <brauner@kernel.org> wrote:
> Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> lifetime to the pidfd returned from clone3(). When the last reference to
> the struct file created by clone3() is closed the kernel sends SIGKILL
> to the child. A pidfd obtained via pidfd_open() for the same process
> does not keep the child alive and does not trigger autokill - only the
> specific struct file from clone3() has this property.
>
> This is useful for container runtimes, service managers, and sandboxed
> subprocess execution - any scenario where the child must die if the
> parent crashes or abandons the pidfd.

Idle thought, feel free to ignore:
In those scenarios, I guess what you'd ideally want would be a way to
kill the entire process hierarchy, not just the one process that was
spawned? Unless the process is anyway PID 1 of its own pid namespace.
But that would probably be more invasive and kind of an orthogonal
feature...

[...]
> +static int pidfs_file_release(struct inode *inode, struct file *file)
> +{
> +       struct pid *pid = inode->i_private;
> +       struct task_struct *task;
> +
> +       guard(rcu)();
> +       task = pid_task(pid, PIDTYPE_TGID);
> +       if (task && READ_ONCE(task->signal->autokill_pidfd) == file)

Can you maybe also clear out the task->signal->autokill_pidfd pointer
here? It should be fine in practice either way, but theoretically,
with the current code, this equality check could wrongly match if the
actual autokill file has been released and a new pidfd file has been
reallocated at the same address... Of course, at worst that would kill
a task that has already been killed, so it wouldn't be particularly
bad, but still it's ugly.

> +               do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
> +
> +       return 0;
> +}
[...]
> @@ -2470,8 +2479,11 @@ __latent_entropy struct task_struct *copy_process(
>         syscall_tracepoint_update(p);
>         write_unlock_irq(&tasklist_lock);
>
> -       if (pidfile)
> +       if (pidfile) {
> +               if (clone_flags & CLONE_PIDFD_AUTOKILL)
> +                       p->signal->autokill_pidfd = pidfile;

WRITE_ONCE() to match the READ_ONCE() in pidfs_file_release()?

>                 fd_install(pidfd, pidfile);
> +       }
>
>         proc_fork_connector(p);
>         sched_post_fork(p);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 23:38     ` Jann Horn
@ 2026-02-17 23:44       ` Linus Torvalds
  2026-02-18  8:18         ` Christian Brauner
  2026-02-18 13:29         ` Theodore Tso
  2026-02-18 10:21       ` Christian Brauner
  1 sibling, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2026-02-17 23:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Tue, 17 Feb 2026 at 15:38, Jann Horn <jannh@google.com> wrote:
>
> You can already send SIGHUP to such binaries through things like job
> control, right?

But at least those can be blocked, and people can disassociate
themselves from a tty if they care etc.

This seems like it can't be blocked any way, although I guess you can
just do the double fork dance to distance yourself from your parent.

> Also, on a Linux system with systemd, I believe a normal user, when
> running in the context of a user session (but not when running in the
> context of a system service), can already SIGKILL anything they launch
> by launching it in a systemd user service, then doing something [...]

Ugh. But at least it's not the kernel that does it, and we have rules
for sending signals.

> I agree that this would be a change to the security model, but I'm not
> sure if it would be that big a change.

I would expect most normal binaries to expect to be killed with ^C etc
anyway, so in that sense this is indeed likely not a big deal. But at
least those are well-known and traditional ways of getting signals
that people kind of expect.

But it does seem to violate all the normal 'kill()' checks, and it
smells horribly bad.

            Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 23:44       ` Linus Torvalds
@ 2026-02-18  8:18         ` Christian Brauner
  2026-02-18 14:00           ` Theodore Tso
  2026-02-18 13:29         ` Theodore Tso
  1 sibling, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2026-02-18  8:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Tue, Feb 17, 2026 at 03:44:52PM -0800, Linus Torvalds wrote:
> On Tue, 17 Feb 2026 at 15:38, Jann Horn <jannh@google.com> wrote:
> >
> > You can already send SIGHUP to such binaries through things like job
> > control, right?
> 
> But at least those can be blocked, and people can disassociate
> themselves from a tty if they care etc.
> 
> This seems like it can't be blocked any way, although I guess you can
> just do the double fork dance to distance yourself from your parent.
> 
> > Also, on a Linux system with systemd, I believe a normal user, when
> > running in the context of a user session (but not when running in the
> > context of a system service), can already SIGKILL anything they launch
> > by launching it in a systemd user service, then doing something [...]
> 
> Ugh. But at least it's not the kernel that does it, and we have rules
> for sending signals.
> 
> > I agree that this would be a change to the security model, but I'm not
> > sure if it would be that big a change.
> 
> I would expect most normal binaries to expect to be killed with ^C etc
> anyway, so in that sense this is indeed likely not a big deal. But at
> least those are well-known and traditional ways of getting signals
> that people kind of expect.

I think you missed the message that I sent as a reply right away.

I'm very aware that as written this will allow users to kill setuid
binaries. I explicitly wrote the first RFC so autokill isn't reset during
bprm->secureexec nor during commit_creds() - in contrast to pdeath
signal. I'm very aware of all of this and am calling it out in the
commit message as well.

The kill-on-close contract cannot be flouted no matter what gets
executed very much in contrast to pdeath_signal which is annoying
because it magically gets unset and then userspace needs to know when it
got unset and then needs to reset it again.

My ideal model for kill-on-close is to just ruthlessly enforce that the
kernel murders anything once the file is released. I would value input
under what circumstances we could make this work without having the
kernel magically unset it under magical circumstances that are
completely opaque to userspace.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 23:43   ` Jann Horn
@ 2026-02-18 10:00     ` Christian Brauner
  0 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-18 10:00 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Wed, Feb 18, 2026 at 12:43:59AM +0100, Jann Horn wrote:
> On Tue, Feb 17, 2026 at 11:36 PM Christian Brauner <brauner@kernel.org> wrote:
> > Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> > lifetime to the pidfd returned from clone3(). When the last reference to
> > the struct file created by clone3() is closed the kernel sends SIGKILL
> > to the child. A pidfd obtained via pidfd_open() for the same process
> > does not keep the child alive and does not trigger autokill - only the
> > specific struct file from clone3() has this property.
> >
> > This is useful for container runtimes, service managers, and sandboxed
> > subprocess execution - any scenario where the child must die if the
> > parent crashes or abandons the pidfd.
> 
> Idle thought, feel free to ignore:
> In those scenarios, I guess what you'd ideally want would be a way to
> kill the entire process hierarchy, not just the one process that was
> spawned? Unless the process is anyway PID 1 of its own pid namespace.
> But that would probably be more invasive and kind of an orthogonal
> feature...

It's something that I have as an exploration item on a ToDo. :)

> 
> [...]
> > +static int pidfs_file_release(struct inode *inode, struct file *file)
> > +{
> > +       struct pid *pid = inode->i_private;
> > +       struct task_struct *task;
> > +
> > +       guard(rcu)();
> > +       task = pid_task(pid, PIDTYPE_TGID);
> > +       if (task && READ_ONCE(task->signal->autokill_pidfd) == file)
> 
> Can you maybe also clear out the task->signal->autokill_pidfd pointer
> here? It should be fine in practice either way, but theoretically,

Yes, of course.

> with the current code, this equality check could wrongly match if the
> actual autokill file has been released and a new pidfd file has been
> reallocated at the same address... Of course, at worst that would kill
> a task that has already been killed, so it wouldn't be particularly
> bad, but still it's ugly.
> 
> > +               do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
> > +
> > +       return 0;
> > +}
> [...]
> > @@ -2470,8 +2479,11 @@ __latent_entropy struct task_struct *copy_process(
> >         syscall_tracepoint_update(p);
> >         write_unlock_irq(&tasklist_lock);
> >
> > -       if (pidfile)
> > +       if (pidfile) {
> > +               if (clone_flags & CLONE_PIDFD_AUTOKILL)
> > +                       p->signal->autokill_pidfd = pidfile;
> 
> WRITE_ONCE() to match the READ_ONCE() in pidfs_file_release()?

Agreed.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 23:38     ` Jann Horn
  2026-02-17 23:44       ` Linus Torvalds
@ 2026-02-18 10:21       ` Christian Brauner
  1 sibling, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-18 10:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linus Torvalds, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Wed, Feb 18, 2026 at 12:38:02AM +0100, Jann Horn wrote:
> On Wed, Feb 18, 2026 at 12:18 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Tue, 17 Feb 2026 at 14:36, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> > > lifetime to the pidfd returned from clone3(). When the last reference to
> > > the struct file created by clone3() is closed the kernel sends SIGKILL
> > > to the child.
> >
> > Did I read this right? You can now basically kill suid binaries that
> > you started but don't have rights to kill any other way.
> >
> > If I'm right, this is completely broken. Please explain.
> 
> You can already send SIGHUP to such binaries through things like job
> control, right?
> Do we know if there are setuid binaries out there that change their
> ruid and suid to prevent being killable via kill_ok_by_cred(), then
> set SIGHUP to SIG_IGN to not be killable via job control, and then do
> some work that shouldn't be interrupted?
> 
> Also, on a Linux system with systemd, I believe a normal user, when
> running in the context of a user session (but not when running in the
> context of a system service), can already SIGKILL anything they launch
> by launching it in a systemd user service, then doing something like
> "echo 1 > /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/app.slice/<servicename>.scope/cgroup.kill"
> because systemd delegates cgroups for anything a user runs to that
> user; and cgroup.kill goes through the codepath
> cgroup_kill_write -> cgroup_kill -> __cgroup_kill -> send_sig(SIGKILL,
> task, 0) -> send_sig_info -> do_send_sig_info
> which, as far as I know, bypasses the normal signal sending permission
> checks. (For comparison, group_send_sig_info() first calls
> check_kill_permission(), then do_send_sig_info().)
> 
> I agree that this would be a change to the security model, but I'm not
> sure if it would be that big a change. I guess an alternative might be
> to instead gate the clone() flag on a `task_no_new_privs(current) ||
> ns_capable()` check like in seccomp, but that might be too restrictive
> for the usecases Christian has in mind...

So I'm going to briefly reiterate what I wrote in my other replies because
I really don't want to give anyone the impression that I don't understand
that this is a change in the security model. It's what I explicitly
wanted to discuss:

  I'm very aware that as written this will allow users to kill setuid
  binaries. I explicitly wrote the first RFC so autokill isn't reset during
  bprm->secureexec nor during commit_creds() - in contrast to pdeath
  signal.

I did indeed think of simply using the seccomp model. I have a long
document about all of the different implications for all of this.

Ideally we'd not have to use the seccomp model but if we have to I'm
fine with it. There are two problems I would want to avoid though. Right
now pdeath_signal is reset on _any_ set*id() transition via
commit_creds(). Which makes it really useless.

For example, if you set up a container the child sets pdeath_signal so it
gets auto-killed when the container setup process dies. But as soon as
the child uses set*id() calls to become privileged over the container's
namespaces pdeath_signal magically gets reset. So all container runtimes
have this annoying code in some form:

static int do_start(void *data) /* container workload that gets setup */
{

<snip>

        /* This prctl must be before the synchro, so if the parent dies before
         * we set the parent death signal, we will detect its death with the
         * synchro right after, otherwise we have a window where the parent can
         * exit before we set the pdeath signal leading to a unsupervized
         * container.
         */
        ret = lxc_set_death_signal(SIGKILL, handler->monitor_pid, status_fd);
        if (ret < 0) {
                SYSERROR("Failed to set PR_SET_PDEATHSIG to SIGKILL");
                goto out_warn_father;
        }

<snip>

        /* If we are in a new user namespace, become root there to have
         * privilege over our namespace.
         */
        if (!list_empty(&handler->conf->id_map)) {

<snip>

                /* Drop groups only after we switched to a valid gid in the new
                 * user namespace.
                 */
                if (!lxc_drop_groups() &&
                    (handler->am_root || errno != EPERM))
                        goto out_warn_father;

                if (!lxc_switch_uid_gid(nsuid, nsgid))
                        goto out_warn_father;

                ret = prctl(PR_SET_DUMPABLE, prctl_arg(1), prctl_arg(0),
                            prctl_arg(0), prctl_arg(0));
                if (ret < 0)
                        goto out_warn_father;

                /* set{g,u}id() clears deathsignal */
                ret = lxc_set_death_signal(SIGKILL, handler->monitor_pid, status_fd);
                if (ret < 0) {
                        SYSERROR("Failed to set PR_SET_PDEATHSIG to SIGKILL");
                        goto out_warn_father;
                }

<snip>

I can't stress enough how useless this often makes pdeath_signal. Let alone
that the child must set it so there's always a race with the parent
dying while the child is setting it. And obviously it isn't just
containers. It's anything that deprivileges itself including some
services.

If we require the seccomp task_no_new_privs() thing I really really
would like to not have to reset autokill during commit_creds().

Because then it is at least consistent for task_no_new_privs() without
magic resets.

TL;DR as long as we can come up with a model where there are no magical
resets of the property by the kernel this is useful.
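
To make the pdeath_signal dance concrete, the pattern in the excerpt
above boils down to something like the following. This is only a sketch:
set_pdeath_signal() and get_pdeath_signal() are hypothetical helpers and
not LXC's actual lxc_set_death_signal() implementation.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: arm PR_SET_PDEATHSIG and detect a parent that died
 * before prctl() ran. Returns 0 on success, -1 if prctl() failed or if we
 * have already been reparented (in which case no signal will ever come).
 */
static int set_pdeath_signal(int sig, pid_t expected_parent)
{
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0, 0, 0) < 0)
		return -1;

	/* Close the race: the parent may have exited before we armed it. */
	if (getppid() != expected_parent)
		return -1;

	return 0;
}

/* Hypothetical helper: returns the currently armed signal, 0 if none. */
static int get_pdeath_signal(void)
{
	int sig = -1;

	if (prctl(PR_GET_PDEATHSIG, &sig, 0, 0, 0) < 0)
		return -1;
	return sig;
}
```

A caller has to invoke set_pdeath_signal() once right after fork() and
then again after every set*id() transition, because the kernel clears the
setting in between. get_pdeath_signal() returning 0 after such a
transition is the "magic reset" described above.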

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP
  2026-02-17 22:35 ` [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP Christian Brauner
@ 2026-02-18 11:25   ` Oleg Nesterov
  2026-02-18 13:30     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2026-02-18 11:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On 02/17, Christian Brauner wrote:
>
> CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
> monitor the child's exit via poll() and retrieve exit status via
> PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
> pattern where the parent simply doesn't care about the child's exit
> status. No exit signal is delivered so exit_signal must be zero.
                                         ^^^^^^^^^^^^^^^^^^^^^^^^

Well, it has no effect if signal->autoreap is true. Probably makes
sense to enforce this rule anyway... but see below.

> @@ -2028,6 +2028,13 @@ __latent_entropy struct task_struct *copy_process(
>  			return ERR_PTR(-EINVAL);
>  	}
>
> +	if (clone_flags & CLONE_AUTOREAP) {
> +		if (clone_flags & CLONE_THREAD)
> +			return ERR_PTR(-EINVAL);
> +		if (args->exit_signal)
> +			return ERR_PTR(-EINVAL);
> +	}
> +
>  	/*
>  	 * Force any signals received before this point to be delivered
>  	 * before the fork happens.  Collect up signals sent to multiple
> @@ -2374,6 +2381,8 @@ __latent_entropy struct task_struct *copy_process(
>  		p->parent_exec_id = current->parent_exec_id;
>  		if (clone_flags & CLONE_THREAD)
>  			p->exit_signal = -1;
> +		else if (clone_flags & CLONE_AUTOREAP)
> +			p->exit_signal = 0;

So this is only needed for the CLONE_PARENT|CLONE_AUTOREAP case. Do we
really need to support this case? Not that I see anything wrong, but let
me ask anyway.

OTOH, what if a CLONE_AUTOREAP'ed child does clone(CLONE_PARENT) ?
in this case args->exit_signal is ignored, so the new child will run
with exit_signal == 0 but without signal->autoreap. This means it will
become a zombie without sending a signal. Again, I see nothing really
wrong, just this looks a bit confusing to me.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 22:35 ` [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
  2026-02-17 23:17   ` Linus Torvalds
  2026-02-17 23:43   ` Jann Horn
@ 2026-02-18 11:50   ` Oleg Nesterov
  2026-02-18 13:31     ` Christian Brauner
  2 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2026-02-18 11:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On 02/17, Christian Brauner wrote:
>
> @@ -2470,8 +2479,11 @@ __latent_entropy struct task_struct *copy_process(
>  	syscall_tracepoint_update(p);
>  	write_unlock_irq(&tasklist_lock);
>  
> -	if (pidfile)
> +	if (pidfile) {
> +		if (clone_flags & CLONE_PIDFD_AUTOKILL)
> +			p->signal->autokill_pidfd = pidfile;
>  		fd_install(pidfd, pidfile);

Just curious... Instead of adding signal->autokill_pidfd, can't we
add another "not fcntl" PIDFD_AUTOKILL flag that lives in ->f_flags ?

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-17 23:44       ` Linus Torvalds
  2026-02-18  8:18         ` Christian Brauner
@ 2026-02-18 13:29         ` Theodore Tso
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Tso @ 2026-02-18 13:29 UTC (permalink / raw)
  Cc: Jann Horn, Christian Brauner, Oleg Nesterov, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-fsdevel

On Tue, Feb 17, 2026 at 03:44:52PM -0800, Linus Torvalds wrote:
> On Tue, 17 Feb 2026 at 15:38, Jann Horn <jannh@google.com> wrote:
> >
> > You can already send SIGHUP to such binaries through things like job
> > control, right?
> 
> But at least those can be blocked, and people can disassociate
> themselves from a tty if they care etc.

Does CLONE_PIDFD_AUTOKILL need to send a SIGKILL?  Could it be
something that could be trapped/blocked, like SIGHUP or SIGTERM?  Or
maybe we could send SIGHUP, wait 30 seconds (+/- a random delay) and,
if it hasn't exited, send SIGTERM, then wait another 30 seconds (+/- a
random delay) and, if it still hasn't exited, send SIGKILL.  That's
still a change in the security model, but it's less likely to cause
problems if the goal is to try to catch a setuid program while it is in
the middle of editing some critical file such as /etc/sudo.conf or
/etc/passwd or some such.

I bet we'll still see some zero days coming out of this, but we can at
least reduce the likelihood of a security breach.
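
As a userspace illustration only (this is not how the kernel would
implement it), the SIGHUP/SIGTERM/SIGKILL ladder could be sketched
roughly as below. The names sk_wait_exit(), sk_escalate() and
sk_spawn_stubborn() are made up for this sketch, the 30 second grace
period becomes a parameter, the random jitter is left out, and the
final SIGKILL step simply blocks until the child is gone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Poll in 10ms steps until the child is reaped or timeout_ms elapses.
 * Returns 1 if reaped (*status filled in), 0 on timeout, -1 on error.
 */
static int sk_wait_exit(pid_t pid, long timeout_ms, int *status)
{
	struct timespec step = { 0, 10 * 1000 * 1000 };
	long waited;

	for (waited = 0; ; waited += 10) {
		pid_t r = waitpid(pid, status, WNOHANG);

		if (r == pid)
			return 1;
		if (r < 0)
			return -1;
		if (waited >= timeout_ms)
			return 0;
		nanosleep(&step, NULL);
	}
}

/*
 * Hypothetical escalation: SIGHUP, then SIGTERM, then SIGKILL, with
 * grace_ms between steps. Returns the last signal sent before the
 * child was reaped, or -1 on error.
 */
static int sk_escalate(pid_t pid, long grace_ms, int *status)
{
	static const int soft[] = { SIGHUP, SIGTERM };
	size_t i;

	assert(pid > 0);
	for (i = 0; i < sizeof(soft) / sizeof(soft[0]); i++) {
		int r;

		if (kill(pid, soft[i]) < 0)
			return -1;
		r = sk_wait_exit(pid, grace_ms, status);
		if (r != 0)
			return r > 0 ? soft[i] : -1;
	}
	/* SIGKILL cannot be caught or blocked, so wait without a timeout. */
	if (kill(pid, SIGKILL) < 0)
		return -1;
	return waitpid(pid, status, 0) == pid ? SIGKILL : -1;
}

/*
 * Demo helper: fork a child that ignores SIGHUP and SIGTERM. The pipe
 * makes sure the dispositions are in place before the parent continues.
 */
static pid_t sk_spawn_stubborn(void)
{
	int ready[2];
	char c;
	pid_t pid;

	if (pipe(ready) < 0)
		return -1;
	pid = fork();
	if (pid == 0) {
		signal(SIGHUP, SIG_IGN);
		signal(SIGTERM, SIG_IGN);
		close(ready[0]);
		if (write(ready[1], "x", 1) != 1)
			_exit(1);
		for (;;)
			pause();
	}
	close(ready[1]);
	if (pid > 0 && read(ready[0], &c, 1) != 1) {
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		pid = -1;
	}
	close(ready[0]);
	return pid;
}
```

Calling sk_escalate(sk_spawn_stubborn(), 50, &status) walks the whole
ladder, because the child ignores the two catchable signals. A process
that handles SIGHUP or SIGTERM and exits would stop the ladder early.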

							- Ted
							


* Re: [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP
  2026-02-18 11:25   ` Oleg Nesterov
@ 2026-02-18 13:30     ` Christian Brauner
  0 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-18 13:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Wed, Feb 18, 2026 at 12:25:26PM +0100, Oleg Nesterov wrote:
> On 02/17, Christian Brauner wrote:
> >
> > CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
> > monitor the child's exit via poll() and retrieve exit status via
> > PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
> > pattern where the parent simply doesn't care about the child's exit
> > status. No exit signal is delivered so exit_signal must be zero.
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Well, it has no effect if signal->autoreap is true. Probably makes
> sense to enforce this rule anyway... but see below.
> 
> > @@ -2028,6 +2028,13 @@ __latent_entropy struct task_struct *copy_process(
> >  			return ERR_PTR(-EINVAL);
> >  	}
> >
> > +	if (clone_flags & CLONE_AUTOREAP) {
> > +		if (clone_flags & CLONE_THREAD)
> > +			return ERR_PTR(-EINVAL);
> > +		if (args->exit_signal)
> > +			return ERR_PTR(-EINVAL);
> > +	}
> > +
> >  	/*
> >  	 * Force any signals received before this point to be delivered
> >  	 * before the fork happens.  Collect up signals sent to multiple
> > @@ -2374,6 +2381,8 @@ __latent_entropy struct task_struct *copy_process(
> >  		p->parent_exec_id = current->parent_exec_id;
> >  		if (clone_flags & CLONE_THREAD)
> >  			p->exit_signal = -1;
> > +		else if (clone_flags & CLONE_AUTOREAP)
> > +			p->exit_signal = 0;
> 
> So this is only needed for the CLONE_PARENT|CLONE_AUTOREAP case. Do we
> really need to support this case? Not that I see anything wrong, but let
> me ask anyway.
> 
> OTOH, what if a CLONE_AUTOREAP'ed child does clone(CLONE_PARENT) ?
> in this case args->exit_signal is ignored, so the new child will run
> with exit_signal == 0 but without signal->autoreap. This means it will
> become a zombie without sending a signal. Again, I see nothing really
> wrong, just this looks a bit confusing to me.

Good point, I think it makes sense not to allow CLONE_PARENT with this.
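
A rough userspace model of what the tightened check could look like,
assuming CLONE_PARENT is rejected alongside the two existing conditions
from the patch. The SK_-prefixed names and struct sk_clone_args are
hypothetical, and the SK_CLONE_AUTOREAP bit is a placeholder since the
real value is not shown in this thread. This only models the
CLONE_PARENT|CLONE_AUTOREAP combination. It says nothing about a
CLONE_AUTOREAP'ed child later calling clone(CLONE_PARENT).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit values as CLONE_PARENT and CLONE_THREAD in the clone UAPI. */
#define SK_CLONE_PARENT   0x00008000ULL
#define SK_CLONE_THREAD   0x00010000ULL
/* Placeholder bit: the value used by the series is an assumption here. */
#define SK_CLONE_AUTOREAP (1ULL << 40)

/* Hypothetical stand-in for the two clone3() fields the check looks at. */
struct sk_clone_args {
	uint64_t flags;
	uint64_t exit_signal;
};

/* Returns 0 if the combination is accepted, -EINVAL otherwise. */
static int sk_check_autoreap(const struct sk_clone_args *args)
{
	assert(args != NULL);

	if (!(args->flags & SK_CLONE_AUTOREAP))
		return 0;
	/* Threads never become zombies of their own, as in the patch. */
	if (args->flags & SK_CLONE_THREAD)
		return -EINVAL;
	/* Assumed new rule: no CLONE_PARENT together with autoreap. */
	if (args->flags & SK_CLONE_PARENT)
		return -EINVAL;
	/* No exit signal is delivered, so exit_signal must be zero. */
	if (args->exit_signal)
		return -EINVAL;
	return 0;
}
```

With CLONE_PARENT rejected up front, the "else if (clone_flags &
CLONE_AUTOREAP) p->exit_signal = 0;" branch would presumably no longer
be needed, since exit_signal is already forced to zero by the check.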


* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-18 11:50   ` Oleg Nesterov
@ 2026-02-18 13:31     ` Christian Brauner
  0 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2026-02-18 13:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-fsdevel

On Wed, Feb 18, 2026 at 12:50:41PM +0100, Oleg Nesterov wrote:
> On 02/17, Christian Brauner wrote:
> >
> > @@ -2470,8 +2479,11 @@ __latent_entropy struct task_struct *copy_process(
> >  	syscall_tracepoint_update(p);
> >  	write_unlock_irq(&tasklist_lock);
> >  
> > -	if (pidfile)
> > +	if (pidfile) {
> > +		if (clone_flags & CLONE_PIDFD_AUTOKILL)
> > +			p->signal->autokill_pidfd = pidfile;
> >  		fd_install(pidfd, pidfile);
> 
> Just curious... Instead of adding signal->autokill_pidfd, can't we
> add another "not fcntl" PIDFD_AUTOKILL flag that lives in ->f_flags ?

This is a version I had as well and yes, that works too!


* Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL
  2026-02-18  8:18         ` Christian Brauner
@ 2026-02-18 14:00           ` Theodore Tso
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Tso @ 2026-02-18 14:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Jann Horn, Oleg Nesterov, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-fsdevel

On Wed, Feb 18, 2026 at 09:18:49AM +0100, Christian Brauner wrote:
> The kill-on-close contract cannot be flaunted no matter what gets
> executed very much in contrast to pdeath_signal which is annoying
> because it magically gets unset and then userspace needs to know when it
> got unset and then needs to reset it again.

I think you mean "violated", not "flaunted", above.

If a process can do the double-fork dance to avoid getting killed, is
that a problem with your use case?

What if we give the process time to exit before we bring down the
hammer, as I suggested in another message on this thread?

> My ideal model for kill-on-close is to just ruthlessly enforce that the
> kernel murders anything once the file is released. I would value input
> under what circumstances we could make this work without having the
> kernel magically unset it under magical circumstances that are
> completely opaque to userspace.

I don't think this proposal would fly, but what if an exec of a setuid
binary fails with an error if the AUTOKILL flag is set?   :-)

       	     	     	      	  	   	- Ted


Thread overview: 19+ messages
2026-02-17 22:35 [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
2026-02-17 22:35 ` [PATCH RFC v3 1/4] clone: add CLONE_AUTOREAP Christian Brauner
2026-02-18 11:25   ` Oleg Nesterov
2026-02-18 13:30     ` Christian Brauner
2026-02-17 22:35 ` [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
2026-02-17 23:17   ` Linus Torvalds
2026-02-17 23:38     ` Jann Horn
2026-02-17 23:44       ` Linus Torvalds
2026-02-18  8:18         ` Christian Brauner
2026-02-18 14:00           ` Theodore Tso
2026-02-18 13:29         ` Theodore Tso
2026-02-18 10:21       ` Christian Brauner
2026-02-17 23:43   ` Jann Horn
2026-02-18 10:00     ` Christian Brauner
2026-02-18 11:50   ` Oleg Nesterov
2026-02-18 13:31     ` Christian Brauner
2026-02-17 22:35 ` [PATCH RFC v3 3/4] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
2026-02-17 22:35 ` [PATCH RFC v3 4/4] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
2026-02-17 22:46 ` [PATCH RFC v3 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL Christian Brauner
