linux-fsdevel.vger.kernel.org archive mirror
* [GIT PULL for v6.16] vfs pidfs
@ 2025-05-23 12:40 Christian Brauner
  2025-05-23 12:42 ` [GIT PULL for v6.16] vfs coredump Christian Brauner
  2025-05-26 18:36 ` [GIT PULL for v6.16] vfs pidfs pr-tracker-bot
  0 siblings, 2 replies; 4+ messages in thread
From: Christian Brauner @ 2025-05-23 12:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

This pull request serves as the base for the coredump pull request I'm
sending separately:

https://lore.kernel.org/20250523-vfs-coredump-66643655f2fe@brauner

The reason is simply that this branch has been in -next for such a long
time that clearly delineating both topics by moving things around would
have caused more churn for little gain since the relationship is tight
enough that it fits into both categories.

/* Summary */

This contains pidfs updates for this cycle:

Features:

- Allow to hand out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
  socket option.

  SO_PEERPIDFD is a socket option that retrieves a pidfd for
  the process that called connect() or listen(). This is heavily used to
  safely authenticate clients in userspace avoiding security bugs due to
  pid recycling races (dbus, polkit, systemd, etc.).

  SO_PEERPIDFD currently doesn't support handing out pidfds if the
  sk->sk_peer_pid thread-group leader has already been reaped. In this
  case it currently returns EINVAL. Userspace still wants to get a pidfd
  for a reaped process to have a stable handle it can pass on. This is
  especially useful now that it is possible to retrieve exit information
  through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag.

  Another summary has been provided by David Rheinsberg:

  > A pidfd can outlive the task it refers to, and thus user-space must
  > already be prepared that the task underlying a pidfd is gone at the time
  > they get their hands on the pidfd. For instance, resolving the pidfd to
  > a PID via the fdinfo must be prepared to read `-1`.
  >
  > Despite user-space knowing that a pidfd might be stale, several kernel
  > APIs currently add another layer that checks for this. In particular,
  > SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
  > but returns a stale pidfd if the task is reaped immediately after the
  > respective alive-check.
  >
  > This has the unfortunate effect that user-space now has two ways to
  > check for the exact same scenario: A syscall might return
  > EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
  > particular reason to distinguish both cases. This also propagates
  > through user-space APIs, which pass on pidfds. They must be prepared to
  > pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
  > stale pidfd from the kernel.
  > Userspace must already deal with a pidfd referring to a reaped task as
  > the task may exit and get reaped at any time while there are still many
  > pidfds referring to it.

  In order to allow handing out reaped pidfds SO_PEERPIDFD needs to ensure
  that PIDFD_INFO_EXIT information is available whenever a pidfd for a
  reaped task is created by SO_PEERPIDFD. The uapi promises that reaped
  pidfds are only handed out if it is guaranteed that the caller sees the
  exit information:

  TEST_F(pidfd_info, success_reaped)
  {
          struct pidfd_info info = {
                  .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
          };

          /*
           * Process has already been reaped and PIDFD_INFO_EXIT been set.
           * Verify that we can retrieve the exit status of the process.
           */
          ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
          ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
          ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
          ASSERT_TRUE(WIFEXITED(info.exit_code));
          ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
  }

  To hand out pidfds for reaped processes we thus allocate a pidfs entry
  for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is
  stashed and drop it when the socket is destroyed. This guarantees that
  exit information will always be recorded for the sk->sk_peer_pid task
  and we can hand out pidfds for reaped processes.
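
For illustration only, and not part of this series: a minimal sketch of
how a server might fetch the peer's pidfd from a connected AF_UNIX
socket. The helper name peer_pidfd() is hypothetical, and the fallback
value 77 for SO_PEERPIDFD is the asm-generic one, which some
architectures override.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* asm-generic value; some architectures differ */
#endif

/*
 * Hypothetical helper: return a pidfd for the peer of a connected
 * AF_UNIX socket, or a negative errno value on failure.
 */
static int peer_pidfd(int sock)
{
        int pidfd = -1;
        socklen_t len = sizeof(pidfd);

        if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
                return -errno;

        /* The caller owns the descriptor and must close() it. */
        return pidfd;
}
```

With the change described above, a caller would expect a pidfd back
even if the peer has already been reaped, where it previously got
EINVAL. Kernels without SO_PEERPIDFD fail the getsockopt() call instead.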

- Hand a pidfd to the coredump usermode helper process.

  Give userspace a way to instruct the kernel to install a pidfd for the
  crashing process into the process started as a usermode helper. There
  are still tricky race windows that userspace cannot close easily, or
  sometimes cannot close at all. There are various workarounds, like
  looking at the start time of a process to make sure that the usermode
  helper process was started after the crashing process, but they are all
  very brittle and fraught with peril.

  The crashed-but-not-reaped process can be killed by userspace before
  coredump processing programs like systemd-coredump have had time to
  manually open a PIDFD from the PID the kernel provides them, which means
  they can be tricked into reading from an arbitrary process, and they run
  with full privileges as they are usermode helper processes.

  Even if that specific race window didn't exist, it would still be
  safest and cleanest to let the kernel provide the pidfd directly instead
  of requiring userspace to do it manually. In parallel with this commit we
  already have systemd adding support for this in [1].

  When the usermode helper process is forked we install a pidfd as file
  descriptor three in the usermode helper's file descriptor table so
  it's available to the exec'd program.

  Since usermode helpers are either children of the system_unbound_wq
  workqueue or kthreadd we know that the file descriptor table is empty
  and can thus always use three as the file descriptor number.

  Note that we'll install a pidfd for the thread-group leader even if a
  subthread is calling do_coredump(). We know that task linkage hasn't
  been removed yet and even if this @current isn't the actual thread-group
  leader we know that the thread-group leader cannot be reaped until
  @current has exited.
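
As a rough sketch of the userspace side, assuming a helper that wants
to fall back gracefully on kernels that do not install the pidfd: the
helper can at least check that descriptor three is open before using
it. The names COREDUMP_PIDFD and fd_is_open() are hypothetical, and the
check only shows that the descriptor exists, not that it is a pidfd.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Descriptor number used for the pidfd, as described above. */
#define COREDUMP_PIDFD 3 /* hypothetical name */

/* Return 1 if fd refers to an open file description, 0 otherwise. */
static int fd_is_open(int fd)
{
        /* F_GETFD fails with EBADF only if fd is not an open descriptor. */
        return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}
```

A helper could call fd_is_open(COREDUMP_PIDFD) early in main() and only
then pass the descriptor to pidfd-aware code.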

- Allow userspace to tell a task that was not found apart from finding
  the wrong task when creating a pidfd.

  We currently report EINVAL whenever a struct pid has no task attached
  anymore thereby conflating two concepts:

  (1) The task has already been reaped.
  (2) The caller requested a pidfd for a thread-group leader but the pid
      actually references a struct pid that isn't used as a thread-group
      leader.

  This is causing issues for non-threaded workloads, which expect ESRCH
  to be reported, not EINVAL.

  So allow userspace to reliably distinguish between (1) and (2).
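
A hedged sketch of how a caller might tell the two cases apart after a
failed pidfd creation. It assumes that case (1) is reported as ESRCH
and case (2) as ENOENT, which is inferred from the commit title "pidfs:
ensure consistent ENOENT/ESRCH reporting" and is not spelled out in
this mail. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

enum pidfd_open_failure {
        FAIL_TASK_REAPED,       /* (1) the task has already been reaped */
        FAIL_NOT_TG_LEADER,     /* (2) pid is not a thread-group leader */
        FAIL_OTHER,             /* anything else, e.g. invalid flags */
};

/* Map an errno value from a failed pidfd creation to one of the cases. */
static enum pidfd_open_failure classify_pidfd_error(int err)
{
        switch (err) {
        case ESRCH:             /* assumed mapping for case (1) */
                return FAIL_TASK_REAPED;
        case ENOENT:            /* assumed mapping for case (2) */
                return FAIL_NOT_TG_LEADER;
        default:
                return FAIL_OTHER;
        }
}
```

A caller would pass errno right after the failing call, for example
classify_pidfd_error(errno) after pidfd_open() returned -1.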

- Make it possible to detect when a pidfs entry would outlive the struct
  pid it pinned.

- Add a range of new selftests.

Cleanups:

- Remove unneeded NULL check from pidfd_prepare() for passed struct pid.

- Avoid pointless reference count bump during release_task().

Fixes:

- Various fixes to the pidfd and coredump selftests.

- Fix error handling for replace_fd() when spawning coredump usermode helper.

/* Testing */

gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3)

No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

No known conflicts.

The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:

  Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.16-rc1.pidfs

for you to fetch changes up to db56723ceaec87aa5cf871e623f464934b266228:

  pidfs: detect refcount bugs (2025-05-06 13:59:00 +0200)

Please consider pulling these changes from the signed vfs-6.16-rc1.pidfs tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.16-rc1.pidfs

----------------------------------------------------------------
Christian Brauner (20):
      selftests/pidfd: adapt to recent changes
      pidfd: remove unneeded NULL check from pidfd_prepare()
      pidfd: improve uapi when task isn't found
      selftest/pidfd: add test for thread-group leader pidfd open for thread
      Merge patch series "pidfd: improve uapi when task isn't found"
      exit: move wake_up_all() pidfd waiters into __unhash_process()
      pidfs: ensure consistent ENOENT/ESRCH reporting
      Merge patch series "pidfs: ensure consistent ENOENT/ESRCH reporting"
      net, pidfd: report EINVAL for ESRCH
      pidfs: register pid in pidfs
      net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
      pidfs: get rid of __pidfd_prepare()
      net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
      Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid"
      Merge patch series "selftests: coredump: Some bug fixes"
      pidfs: move O_RDWR into pidfs_alloc_file()
      coredump: fix error handling for replace_fd()
      coredump: hand a pidfd to the usermode coredump helper
      Merge patch series "coredump: hand a pidfd to the usermode coredump helper"
      pidfs: detect refcount bugs

Nam Cao (3):
      selftests: coredump: Properly initialize pointer
      selftests: coredump: Fix test failure for slow machines
      selftests: coredump: Raise timeout to 2 minutes

Oleg Nesterov (1):
      release_task: kill the no longer needed get/put_pid(thread_pid)

 fs/coredump.c                                     | 65 +++++++++++++++--
 fs/pidfs.c                                        | 82 ++++++++++++++++++---
 include/linux/coredump.h                          |  1 +
 include/linux/pid.h                               |  2 +-
 include/linux/pidfs.h                             |  3 +
 include/uapi/linux/pidfd.h                        |  2 +-
 kernel/exit.c                                     | 10 ++-
 kernel/fork.c                                     | 88 ++++++++++-------------
 kernel/pid.c                                      |  6 +-
 net/core/sock.c                                   | 12 +++-
 net/unix/af_unix.c                                | 85 +++++++++++++++++++---
 tools/testing/selftests/coredump/stackdump_test.c | 10 ++-
 tools/testing/selftests/pidfd/pidfd_info_test.c   | 13 ++--
 13 files changed, 286 insertions(+), 93 deletions(-)


* [GIT PULL for v6.16] vfs coredump
  2025-05-23 12:40 [GIT PULL for v6.16] vfs pidfs Christian Brauner
@ 2025-05-23 12:42 ` Christian Brauner
  2025-05-26 18:36   ` pr-tracker-bot
  2025-05-26 18:36 ` [GIT PULL for v6.16] vfs pidfs pr-tracker-bot
  1 sibling, 1 reply; 4+ messages in thread
From: Christian Brauner @ 2025-05-23 12:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

This pull request is based on the pidfs pull request, which I'm sending
out separately as:

https://lore.kernel.org/20250523-vfs-pidfs-aa1d59a1e9b3@brauner

The reason is simply that the work here is sufficiently standalone that
it deserves a separate pull request but it builds on the work in the
pidfs pull request because it uses the ability for SO_PEERPIDFD to hand
out pidfds for reaped tasks.

I'm appending the full stat which includes the pidfd work just for
completeness' sake.

/* Summary */

This adds support for sending coredumps over an AF_UNIX socket. It also
makes (implicit) use of the new SO_PEERPIDFD ability to hand out pidfds
for reaped peer tasks.

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safe way to
handle them instead of relying on super privileged coredumping helpers.

This will also be significantly more lightweight since the kernel
doesn't have to do a fork()+exec() for each crashing process to spawn a
usermode helper. Instead the kernel just connects to the AF_UNIX socket
and userspace can process it concurrently however it sees fit. Support
for userspace is incoming starting with systemd-coredump.

There's more work coming in that direction next cycle. The rest below
goes into some details and background.

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example the core_pattern shown causes the kernel to spawn
systemd-coredump as a usermode helper. There's various conceptual
consequences of this (non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not this is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is a weird hybrid upcall,
  which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.
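
To make the three modes concrete, here is a small sketch that
classifies a core_pattern string by its first character, following the
"|" and "@" prefixes described above. The enum and function names are
hypothetical and this is not the kernel's parser, which also validates
the rest of the string.

```c
#include <assert.h>
#include <stddef.h>

enum core_mode {
        CORE_MODE_FILE,         /* (1) dump into a file */
        CORE_MODE_PIPE,         /* (2) "|": pipe to a usermode helper */
        CORE_MODE_SOCKET,       /* (3) "@": AF_UNIX coredump socket */
};

/* Classify a core_pattern value by its leading character. */
static enum core_mode core_pattern_mode(const char *pattern)
{
        assert(pattern != NULL);

        switch (pattern[0]) {
        case '|':
                return CORE_MODE_PIPE;
        case '@':
                return CORE_MODE_SOCKET;
        default:
                return CORE_MODE_FILE;
        }
}
```

For the two examples in this mail, the systemd-coredump pattern maps to
CORE_MODE_PIPE and "@/path/to/coredump.socket" maps to CORE_MODE_SOCKET.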

The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket:

- The coredump server uses SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump. That is a huge attack vector right now.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX socket directly.

- The pidfd for the crashing task will contain information about how the
  task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
  information.

  If the coredump server gets a new coredump client connection the kernel
  guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.
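
As a sketch of how a coredump server might act on these flags: the
function below turns a coredump_mask value into a storage decision. The
bit values in the fallback definitions are assumptions and should come
from <linux/pidfd.h> where available; the enum and function names are
hypothetical, and defaulting to restricted access when neither USER nor
ROOT is set is a conservative choice of this sketch, not something the
kernel mandates.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values; prefer the definitions from <linux/pidfd.h>. */
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED        (1U << 0)
#define PIDFD_COREDUMP_SKIP     (1U << 1)
#define PIDFD_COREDUMP_USER     (1U << 2)
#define PIDFD_COREDUMP_ROOT     (1U << 3)
#endif

enum dump_policy {
        DUMP_NONE,              /* the task skipped coredumping */
        DUMP_REGULAR,           /* no special care needed */
        DUMP_RESTRICTED,        /* limit access to privileged users */
};

/* Decide how to store a coredump based on the reported mask. */
static enum dump_policy dump_policy_for(uint32_t coredump_mask)
{
        if (coredump_mask & PIDFD_COREDUMP_SKIP)
                return DUMP_NONE;
        if ((coredump_mask & PIDFD_COREDUMP_USER) &&
            !(coredump_mask & PIDFD_COREDUMP_ROOT))
                return DUMP_REGULAR;
        /* ROOT set, or nothing recognizable: be conservative. */
        return DUMP_RESTRICTED;
}
```

The mask itself would come from a PIDFD_GET_INFO call with
PIDFD_INFO_COREDUMP requested on the pidfd obtained via SO_PEERPIDFD.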

/* Testing */

gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3)

No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

No known conflicts.

The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:

  Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.16-rc1.coredump

for you to fetch changes up to 4e83ae6ec87dddac070ba349d3b839589b1bb957:

  mips, net: ensure that SOCK_COREDUMP is defined (2025-05-23 11:02:16 +0200)

Please consider pulling these changes from the signed vfs-6.16-rc1.coredump tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.16-rc1.coredump

----------------------------------------------------------------
Christian Brauner (30):
      selftests/pidfd: adapt to recent changes
      pidfd: remove unneeded NULL check from pidfd_prepare()
      pidfd: improve uapi when task isn't found
      selftest/pidfd: add test for thread-group leader pidfd open for thread
      Merge patch series "pidfd: improve uapi when task isn't found"
      exit: move wake_up_all() pidfd waiters into __unhash_process()
      pidfs: ensure consistent ENOENT/ESRCH reporting
      Merge patch series "pidfs: ensure consistent ENOENT/ESRCH reporting"
      net, pidfd: report EINVAL for ESRCH
      pidfs: register pid in pidfs
      net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
      pidfs: get rid of __pidfd_prepare()
      net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
      Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid"
      Merge patch series "selftests: coredump: Some bug fixes"
      pidfs: move O_RDWR into pidfs_alloc_file()
      coredump: fix error handling for replace_fd()
      coredump: hand a pidfd to the usermode coredump helper
      Merge patch series "coredump: hand a pidfd to the usermode coredump helper"
      coredump: massage format_corename()
      coredump: massage do_coredump()
      coredump: reflow dump helpers a little
      coredump: add coredump socket
      pidfs, coredump: add PIDFD_INFO_COREDUMP
      coredump: show supported coredump modes
      coredump: validate socket name as it is written
      selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
      selftests/coredump: add tests for AF_UNIX coredumps
      Merge patch series "coredump: add coredump socket"
      mips, net: ensure that SOCK_COREDUMP is defined

Nam Cao (3):
      selftests: coredump: Properly initialize pointer
      selftests: coredump: Fix test failure for slow machines
      selftests: coredump: Raise timeout to 2 minutes

Oleg Nesterov (1):
      release_task: kill the no longer needed get/put_pid(thread_pid)

 arch/mips/include/asm/socket.h                    |   9 -
 fs/coredump.c                                     | 461 ++++++++++++++++-----
 fs/pidfs.c                                        | 137 ++++++-
 include/linux/coredump.h                          |   1 +
 include/linux/net.h                               |   4 +-
 include/linux/pid.h                               |   2 +-
 include/linux/pidfs.h                             |   8 +
 include/uapi/linux/pidfd.h                        |  18 +-
 kernel/exit.c                                     |  10 +-
 kernel/fork.c                                     |  88 ++--
 kernel/pid.c                                      |   5 -
 net/core/sock.c                                   |  12 +-
 net/unix/af_unix.c                                | 137 +++++--
 tools/testing/selftests/coredump/stackdump_test.c | 477 +++++++++++++++++++++-
 tools/testing/selftests/pidfd/pidfd.h             |  22 +
 tools/testing/selftests/pidfd/pidfd_info_test.c   |  13 +-
 16 files changed, 1198 insertions(+), 206 deletions(-)


* Re: [GIT PULL for v6.16] vfs pidfs
  2025-05-23 12:40 [GIT PULL for v6.16] vfs pidfs Christian Brauner
  2025-05-23 12:42 ` [GIT PULL for v6.16] vfs coredump Christian Brauner
@ 2025-05-26 18:36 ` pr-tracker-bot
  1 sibling, 0 replies; 4+ messages in thread
From: pr-tracker-bot @ 2025-05-26 18:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel

The pull request you sent on Fri, 23 May 2025 14:40:47 +0200:

> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.16-rc1.pidfs

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/7d7a103d299eb5b95d67873c5ea7db419eaaebc0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [GIT PULL for v6.16] vfs coredump
  2025-05-23 12:42 ` [GIT PULL for v6.16] vfs coredump Christian Brauner
@ 2025-05-26 18:36   ` pr-tracker-bot
  0 siblings, 0 replies; 4+ messages in thread
From: pr-tracker-bot @ 2025-05-26 18:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel

The pull request you sent on Fri, 23 May 2025 14:42:11 +0200:

> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.16-rc1.coredump

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c5bfc48d5472fc60abafb510668d7bc3b5ecb401

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

