public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL
@ 2026-02-26 13:50 Christian Brauner
  2026-02-26 13:50 ` [PATCH v5 1/6] clone: add CLONE_AUTOREAP Christian Brauner
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Christian Brauner @ 2026-02-26 13:50 UTC (permalink / raw)
  To: Oleg Nesterov, Jann Horn
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-fsdevel, Christian Brauner

Add three new clone3() flags for pidfd-based process lifecycle
management.

=== CLONE_AUTOREAP ===

CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to the
existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD
which applies to all children of a given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child still
auto-reaps when it eventually exits. This is cleaner than forcing the
subreaper to get SIGCHLD and then reaping it. If the parent doesn't care
the subreaper won't care. If there's a subreaper that would care it
would be easy enough to add a prctl() that either just turns back on
SIGCHLD and turns off auto-reaping or a prctl() that just notifies the
subreaper whenever a child is reparented to it.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an invisible
zombie under a parent that never calls wait().

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

=== CLONE_NNP ===

CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself, CLONE_NNP
allows the parent to impose no_new_privs on the child at creation
without affecting the parent's own privileges. CLONE_THREAD is rejected
because threads share credentials. CLONE_NNP is useful on its own for
any spawn-and-sandbox pattern but was specifically introduced to enable
unprivileged usage of CLONE_PIDFD_AUTOKILL.

=== CLONE_PIDFD_AUTOKILL ===

This flag ties a child's lifetime to the pidfd returned from clone3().
When the last reference to the struct file created by clone3() is closed
the kernel sends SIGKILL to the child. A pidfd obtained via pidfd_open()
for the same process does not keep the child alive and does not trigger
autokill - only the specific struct file from clone3() has this
property. This is useful for container runtimes, service managers, and
sandboxed subprocess execution - any scenario where the child must die
if the parent crashes or abandons the pidfd or just wants a throwaway
helper process.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It
requires CLONE_PIDFD because the whole point is tying the child's
lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child
with no one to reap it would become a zombie - the primary use case is
the parent crashing or abandoning the pidfd so no one is around to call
waitpid(). CLONE_THREAD is rejected because autokill targets a process
not a thread.

If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being spawned.
If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the caller must
have have CAP_SYS_ADMIN in its user namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v5:
- Split no_new_privs into separate CLONE_NNP flag instead of having
  CLONE_PIDFD_AUTOKILL implicitly set it.
- CLONE_PIDFD_AUTOKILL now requires either CLONE_NNP or CAP_SYS_ADMIN.
- Link to v4: https://patch.msgid.link/20260223-work-pidfs-autoreap-v4-0-e393c08c09d1@kernel.org

Changes in v4:
- Set no_new_privs on child when CLONE_PIDFD_AUTOKILL is used. This
  prevents the child from escalating privileges via setuid/setgid exec
  and eliminates the need for magical resets during credential changes.
  The parent retains full privileges.
- Replace autokill_pidfd pointer with PIDFD_AUTOKILL file flag checked
  in pidfs_file_release(). This eliminates the need for pointer
  comparison, stale pointer concerns, and WRITE_ONCE/READ_ONCE pairing
  (Oleg, Jann).
- Reject CLONE_AUTOREAP | CLONE_PARENT to prevent a CLONE_AUTOREAP
  child from creating silent zombies via clone(CLONE_PARENT) (Oleg).
- Link to v3: https://patch.msgid.link/20260217-work-pidfs-autoreap-v3-0-33a403c20111@kernel.org

Changes in v2:
- Add CLONE_PIDFD_AUTOKILL flag
- Decouple CLONE_AUTOREAP from CLONE_PIDFD: the autoreap mechanism has
  no dependency on pidfds. This allows fire-and-forget patterns where
  the parent does not need exit status.
- Link to v1: https://patch.msgid.link/20260216-work-pidfs-autoreap-v1-0-e63f663008f2@kernel.org

---
Christian Brauner (6):
      clone: add CLONE_AUTOREAP
      clone: add CLONE_NNP
      pidfd: add CLONE_PIDFD_AUTOKILL
      selftests/pidfd: add CLONE_AUTOREAP tests
      selftests/pidfd: add CLONE_NNP tests
      selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests

 fs/pidfs.c                                         |  38 +-
 include/linux/sched/signal.h                       |   1 +
 include/uapi/linux/pidfd.h                         |   1 +
 include/uapi/linux/sched.h                         |   3 +
 kernel/fork.c                                      |  49 +-
 kernel/ptrace.c                                    |   3 +-
 kernel/signal.c                                    |   4 +
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c  | 900 +++++++++++++++++++++
 10 files changed, 991 insertions(+), 11 deletions(-)
---
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
change-id: 20260214-work-pidfs-autoreap-3ee677e240a8


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-03-02 17:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-26 13:50 [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Christian Brauner
2026-02-26 13:50 ` [PATCH v5 1/6] clone: add CLONE_AUTOREAP Christian Brauner
2026-02-26 13:51 ` [PATCH v5 2/6] clone: add CLONE_NNP Christian Brauner
2026-02-26 13:51 ` [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL Christian Brauner
2026-03-02 17:16   ` Jann Horn
2026-02-26 13:51 ` [PATCH v5 4/6] selftests/pidfd: add CLONE_AUTOREAP tests Christian Brauner
2026-02-26 13:51 ` [PATCH v5 5/6] selftests/pidfd: add CLONE_NNP tests Christian Brauner
2026-02-26 13:51 ` [PATCH v5 6/6] selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests Christian Brauner
2026-02-28 12:49 ` [PATCH v5 0/6] pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL Oleg Nesterov
2026-03-02 10:28   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox