From: Christian Brauner <brauner@kernel.org>
To: Jann Horn <jannh@google.com>
Cc: "Eric Dumazet" <edumazet@google.com>,
"Kuniyuki Iwashima" <kuniyu@amazon.com>,
"Oleg Nesterov" <oleg@redhat.com>,
linux-fsdevel@vger.kernel.org,
"David S. Miller" <davem@davemloft.net>,
"Alexander Viro" <viro@zeniv.linux.org.uk>,
"Daan De Meyer" <daan.j.demeyer@gmail.com>,
"David Rheinsberg" <david@readahead.eu>,
"Jakub Kicinski" <kuba@kernel.org>, "Jan Kara" <jack@suse.cz>,
"Lennart Poettering" <lennart@poettering.net>,
"Luca Boccassi" <bluca@debian.org>, "Mike Yuan" <me@yhndnzj.com>,
"Paolo Abeni" <pabeni@redhat.com>,
"Simon Horman" <horms@kernel.org>,
"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
"Alexander Mikhalitsyn" <alexander@mihalicyn.com>
Subject: Re: [PATCH RFC v3 04/10] coredump: add coredump socket
Date: Mon, 5 May 2025 16:46:23 +0200
Message-ID: <20250505-gedrillt-luchs-8ee39d639078@brauner>
In-Reply-To: <CAG48ez2PNFmaMCg9u7febjDgYytxi5eB-261sZBHrfBcTgavfA@mail.gmail.com>
On Mon, May 05, 2025 at 02:55:18PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> > spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There are probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > connected to the read-end of the pipe. All other file descriptors are
> > closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > already caused bugs because userspace assumed that this cannot happen
> > (Whether or not this is a sane assumption is irrelevant.).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > it is not a child of any userspace process and specifically not a
> > child of PID 1. It cannot be waited upon and runs as a weird hybrid
> > upcall, which is difficult for userspace to control correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> > necessitates all kinds of weird privilege dropping exercises in
> > userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an abstract AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> > @linuxafsk/coredump.socket
> >
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> >
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> >
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> >
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are actually coredumping are allowed to connect to the initial coredump
> > socket.
> >
> > - The coredump server should use SO_PEERPIDFD to get a stable handle on
> > the connected crashing task. The retrieved pidfd will provide a stable
> > reference even if the crashing task gets SIGKILLed while generating
> > the coredump.
> >
> > - By setting core_pipe_limit non-zero userspace can guarantee that the
> > crashing task cannot be reaped behind its back and thus process all
> > necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> > detect whether /proc/<pid> still refers to the same process.
> >
> > The core_pipe_limit isn't used to rate-limit connections to the
> > socket. This can simply be done via AF_UNIX socket directly.
> >
> > - The pidfd for the crashing task will contain information on how the
> > task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
> > PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
> > information.
> >
> > If the coredump server gets a new coredump client connection the kernel
> > guarantees that PIDFD_INFO_COREDUMP information is available.
> > Currently the following information is provided in the new
> > @coredump_mask extension to struct pidfd_info:
> >
> > * PIDFD_COREDUMPED is raised if the task did actually coredump.
> > * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
> > undumpable).
> > * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
> > doesn't need special care by the coredump server.
> > * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
> > treated as sensitive and the coredump server should restrict access
> > to the generated coredump to sufficiently privileged users.
> >
> > - Since unix_stream_connect() runs bpf programs during connect it's
> > possible to even redirect or multiplex coredumps to other sockets.
>
> Or change the userspace protocol used for containers such that the
> init-namespace coredumping helper forwards the FD it accept()ed into a
> container via SCM_RIGHTS...
Yeah, that would also work.
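Fwiw, a rough userspace sketch of that forwarding step, just to illustrate the idea. The helper names (send_fd(), recv_fd(), forward_and_close()) are made up for this example and the container-side protocol (a single dummy byte carrying one fd) is purely an assumption, nothing in this series defines it:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send @fd over the AF_UNIX socket @sock using SCM_RIGHTS. */
int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd >= 0);

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(). Returns the new fd or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/*
 * Init-namespace helper side: pass the accept()ed coredump connection
 * to the container's server and drop the local reference.
 */
int forward_and_close(int container_sock, int client_fd)
{
	int ret = send_fd(container_sock, client_fd);

	close(client_fd);
	return ret;
}
```

The container's server would then call recv_fd() and read the coredump from the returned fd exactly as if it had accept()ed the connection itself.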
>
> > - The coredump server should mark itself as non-dumpable.
> > To capture coredumps for the coredump server itself a bpf program
> > should be run at connect to redirect it to another socket in
> > userspace. This can be useful for debugging crashing coredump servers.
> >
> > - A container coredump server in a separate network namespace can simply
> > bind to linuxafsk/coredump.socket and systemd-coredump forwards
> > coredumps to the container.
> >
> > - Fwiw, one idea is to handle coredumps via per-user/session coredump
> > servers that run with that user's privileges.
> >
> > The coredump server listens on the coredump socket and accepts a
> > new coredump connection. It then retrieves SO_PEERPIDFD for the
> > client, inspects uid/gid and hands the accepted client to the user's
> > own coredump handler which runs with the user's privileges only.
>
> (Though that would only be okay if it's not done for suid dumping cases.)
Yes, I had considered adding a comment about only doing that when
PIDFD_COREDUMP_ROOT isn't set and wondered if anyone would comment on
it. :)
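Roughly, the dispatch in the coredump server could look like the sketch below. Note the assumptions: the bit values of the PIDFD_COREDUMP_* flags are guessed here for illustration (the real ones come from the pidfd uapi header added later in this series), and enum dump_route and route_coredump() are hypothetical names. The mask itself would come from the coredump_mask field retrieved via PIDFD_GET_INFO with PIDFD_INFO_COREDUMP:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values for illustration only; use the uapi header's. */
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED    (1U << 0)
#define PIDFD_COREDUMP_SKIP (1U << 1)
#define PIDFD_COREDUMP_USER (1U << 2)
#define PIDFD_COREDUMP_ROOT (1U << 3)
#endif

static_assert((PIDFD_COREDUMP_USER & PIDFD_COREDUMP_ROOT) == 0,
	      "USER and ROOT must be distinct bits");

/* Hypothetical routing decision made by the coredump server. */
enum dump_route {
	DUMP_DROP,       /* nothing to process */
	DUMP_PRIVILEGED, /* keep in the privileged server */
	DUMP_PER_USER,   /* may be handed to the user's own handler */
};

enum dump_route route_coredump(uint32_t coredump_mask)
{
	/* The task skipped coredumping, e.g. because it is undumpable. */
	if (coredump_mask & PIDFD_COREDUMP_SKIP)
		return DUMP_DROP;

	/* Sensitive dump (e.g. suid): never hand it to a user handler. */
	if (coredump_mask & PIDFD_COREDUMP_ROOT)
		return DUMP_PRIVILEGED;

	if (coredump_mask & PIDFD_COREDUMP_USER)
		return DUMP_PER_USER;

	/* Unknown combination: stay on the conservative side. */
	return DUMP_PRIVILEGED;
}
```

Checking ROOT before USER means a mask that somehow carries both bits is still treated as sensitive.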
>
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > handle them instead of relying on super privileged coredumping helpers.
> >
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermodehelper is required for each crashing process. The
> > coredump server in userspace can just keep a worker pool.
>
> I mean, if coredumping is a performance bottleneck, something is
> probably seriously wrong with the system... I don't think we need to
> optimize for execution speed in this area.
>
> > This is easy to test:
> >
> > (a) coredump processing (we're using socat):
> >
> > > cat coredump_socket.sh
> > #!/bin/bash
> >
> > set -x
> >
> > sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
> > sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> >
> > (b) trigger a coredump:
> >
> > user1@localhost:~/data/scripts$ cat crash.c
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > int main(int argc, char *argv[])
> > {
> > fprintf(stderr, "%u\n", (1 / 0));
> > _exit(0);
> > }
>
> This looks pretty neat overall!
>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 107 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 1779299b8c61..c60f86c473ad 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -44,7 +44,11 @@
> > #include <linux/sysctl.h>
> > #include <linux/elf.h>
> > #include <linux/pidfs.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +#include <net/net_namespace.h>
> > #include <uapi/linux/pidfd.h>
> > +#include <uapi/linux/un.h>
> >
> > #include <linux/uaccess.h>
> > #include <asm/mmu_context.h>
> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> > enum coredump_type_t {
> > COREDUMP_FILE = 1,
> > COREDUMP_PIPE = 2,
> > + COREDUMP_SOCK = 3,
> > };
> >
> > struct core_name {
> > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> > cn->corename = NULL;
> > if (*pat_ptr == '|')
> > cn->core_type = COREDUMP_PIPE;
> > + else if (*pat_ptr == '@')
> > + cn->core_type = COREDUMP_SOCK;
> > else
> > cn->core_type = COREDUMP_FILE;
> > if (expand_corename(cn, core_name_size))
> > return -ENOMEM;
> > cn->corename[0] = '\0';
> >
> > - if (cn->core_type == COREDUMP_PIPE) {
> > + switch (cn->core_type) {
> > + case COREDUMP_PIPE: {
> > int argvs = sizeof(core_pattern) / 2;
> > (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> > if (!(*argv))
> > @@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> > ++pat_ptr;
> > if (!(*pat_ptr))
> > return -ENOMEM;
> > + break;
> > + }
> > + case COREDUMP_SOCK: {
> > + err = cn_printf(cn, "%s", pat_ptr);
> > + if (err)
> > + return err;
> > +
> > + /*
> > + * We can potentially allow this to be changed later but
> > + * I currently see no reason to.
> > + */
> > + if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
> > + return -EINVAL;
> > +
> > + /*
> > + * Currently no need to parse any other options.
> > + * Relevant information can be retrieved from the peer
> > + * pidfd retrievable via SO_PEERPIDFD by the receiver or
> > + * via /proc/<pid>, using the SO_PEERPIDFD to guard
> > + * against pid recycling when opening /proc/<pid>.
> > + */
> > + return 0;
> > + }
> > + default:
> > + WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
> > + break;
> > }
> >
> > /* Repeat as long as we have more pattern to process and more output
>
> I think the core_uses_pid logic at the end of this function needs to
> be adjusted to also exclude COREDUMP_SOCK?
Thanks! Fixed.
>
> > @@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_UNIX
> > +struct sockaddr_un coredump_unix_socket = {
> > + .sun_family = AF_UNIX,
> > + .sun_path = "\0linuxafsk/coredump.socket",
> > +};
>
> Nit: Please make that static and const.
Done.
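Fwiw, the userspace server has to bind() with the same address length the kernel passes to connect, i.e. the leading NUL of the abstract name counts and a trailing NUL does not. A minimal sketch of computing that in userspace, where abstract_sockaddr() and COREDUMP_SOCKET_NAME are hypothetical names used only for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Abstract name used by this series, without the leading NUL byte. */
#define COREDUMP_SOCKET_NAME "linuxafsk/coredump.socket"

/*
 * Fill @addr with the abstract AF_UNIX address for @name and return
 * the address length to pass to bind() or connect().
 */
socklen_t abstract_sockaddr(struct sockaddr_un *addr, const char *name)
{
	size_t len = strlen(name);

	/* One byte of sun_path is taken by the leading NUL. */
	assert(len > 0 && len < sizeof(addr->sun_path));

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0', which marks the address as abstract. */
	memcpy(addr->sun_path + 1, name, len);

	/* Header + leading NUL + name, without a trailing NUL. */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);
}
```

A server would call abstract_sockaddr(&addr, COREDUMP_SOCKET_NAME) and hand the returned length to bind(). Passing sizeof(addr) instead would bind a different abstract name, since trailing NUL bytes are part of an abstract address.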
>
> > +/* Without trailing NUL byte. */
> > +#define COREDUMP_UNIX_SOCKET_ADDR_SIZE \
> > + (offsetof(struct sockaddr_un, sun_path) + \
> > + sizeof("\0linuxafsk/coredump.socket") - 1)
> > +#endif
> > +
> > void do_coredump(const kernel_siginfo_t *siginfo)
> > {
> > struct core_state core_state;
> > @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > }
> > break;
> > }
> > + case COREDUMP_SOCK: {
> > + struct file *file __free(fput) = NULL;
> > +#ifdef CONFIG_UNIX
> > + struct socket *socket;
> > +
> > + /*
> > + * It is possible that the userspace process which is
> > + * supposed to handle the coredump and is listening on
> > + * the AF_UNIX socket coredumps. Userspace should just
> > + * mark itself non dumpable.
> > + */
> > +
> > + retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > + if (retval < 0)
> > + goto close_fail;
> > +
> > + file = sock_alloc_file(socket, 0, NULL);
> > + if (IS_ERR(file)) {
> > + sock_release(socket);
> > + retval = PTR_ERR(file);
> > + goto close_fail;
> > + }
> > +
> > + retval = kernel_connect(socket,
> > + (struct sockaddr *)(&coredump_unix_socket),
> > + COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
> > + if (retval)
> > + goto close_fail;
> > +
> > + cprm.limit = RLIM_INFINITY;
> > +#endif
>
> The non-CONFIG_UNIX case here should probably bail out?
It will bail out later on !cprm.file where it'll report that @ support
is disabled but I think...
>
> > + cprm.file = no_free_ptr(file);
> > + break;
> > + }
> > default:
> > WARN_ON_ONCE(true);
> > retval = -EINVAL;
> > @@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > * have this set to NULL.
> > */
> > if (!cprm.file) {
> > - coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > + if (cn.core_type == COREDUMP_PIPE)
> > + coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > + else
> > + coredump_report_failure("Core dump to @%s disabled", cn.corename);
>
> Are you actually truncating the initial "@" off of cn.corename, or is
> this going to print two "@" characters?
... that bailing out earlier is nicer than stripping the @ off
pointlessly.
>
> > goto close_fail;
> > }
> > if (!dump_vma_snapshot(&cprm))
> > @@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > file_end_write(cprm.file);
> > free_vma_snapshot(&cprm);
> > }
> > - if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> > - wait_for_dump_helpers(cprm.file);
> > +
> > + if (core_pipe_limit) {
> > + switch (cn.core_type) {
> > + case COREDUMP_PIPE:
> > + wait_for_dump_helpers(cprm.file);
> > + break;
> > + case COREDUMP_SOCK: {
> > + char buf[1];
> > + /*
> > + * We use a simple read to wait for the coredump
> > + * processing to finish. Either the socket is
> > + * closed or we get sent unexpected data. In
> > + * both cases, we're done.
> > + */
> > + __kernel_read(cprm.file, buf, 1, NULL);
> > + break;
> > + }
> > + default:
> > + break;
> > + }
> > + }
> > +
> > close_fail:
> > if (cprm.file)
> > filp_close(cprm.file, NULL);