[PATCH v4 00/11] coredump: add coredump socket

linux-fsdevel.vger.kernel.org archive mirror

* [PATCH v4 00/11] coredump: add coredump socket
@ 2025-05-07 16:13 Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 01/11] coredump: massage format_corname() Christian Brauner
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there, but processing coredumps in this way can be
considered adventurous, especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There are various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not this is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is a weird hybrid
  upcall, which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an abstract AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @linuxafsk/coredump.socket

The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps.

The coredump socket uses the fixed address "linuxafsk/coredump.socket"
for now.
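
As a rough illustration of how the first character of core_pattern
selects between the three modes, here is a minimal userspace sketch. The
enum and function names are hypothetical and are not taken from the
series; the kernel additionally validates the socket name, which this
sketch does not attempt.

```c
/* Hypothetical userspace mirror of the three modes; not the kernel's enum. */
enum core_mode {
	CORE_MODE_FILE,		/* (1) plain file name or path */
	CORE_MODE_PIPE,		/* (2) "|" followed by a helper binary */
	CORE_MODE_SOCKET,	/* (3) "@" followed by the abstract socket name */
};

/* Classify a core_pattern string by its first character only. */
static enum core_mode classify_core_pattern(const char *pattern)
{
	switch (pattern[0]) {
	case '|':
		return CORE_MODE_PIPE;
	case '@':
		return CORE_MODE_SOCKET;
	default:
		return CORE_MODE_FILE;
	}
}
```

With this sketch the systemd-coredump pattern shown earlier classifies as
the pipe mode and "@linuxafsk/coredump.socket" as the socket mode.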

The coredump socket is located in the initial network namespace. To bind
the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
user namespace. Listening and reading can happen from whatever
unprivileged context is necessary to safely process coredumps.
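
For readers unfamiliar with abstract AF_UNIX addresses, the following is
a minimal sketch of how a server might build the address before calling
bind(). The helper name abstract_unix_addr() is made up for this
example. It assumes the usual abstract socket convention: sun_path
starts with a NUL byte and the name is delimited by the address length,
not by a terminator.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Fill @addr with the abstract address for @name and return the length
 * to pass to bind() or connect(), or 0 if @name is empty or too long.
 */
static socklen_t abstract_unix_addr(struct sockaddr_un *addr, const char *name)
{
	size_t len = strlen(name);

	/* One byte of sun_path is taken by the leading NUL. */
	if (len == 0 || len > sizeof(addr->sun_path) - 1)
		return 0;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0', which marks the address as abstract. */
	memcpy(addr->sun_path + 1, name, len);

	/* Abstract names are length-delimited, not NUL-terminated. */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);
}
```

A server would then pass the returned length to bind() on a SOCK_STREAM
socket and call listen(). Passing sizeof(struct sockaddr_un) instead
would bind a different, zero-padded name.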

When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- When a coredump connection is initiated the kernel uses the socket
  cookie as the coredump cookie and stores it in the pidfd. The receiver
  can now easily authenticate that the connection is coming from the
  kernel.

  Unless the coredump server expects to handle connections from
  non-crashing tasks it can validate that the connection has been made
  from a crashing task:

     fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
     getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);

     struct pidfd_info info = {
             .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
     };

     ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
     /* Refuse connections that aren't from a crashing task. */
     if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
             close(fd_coredump);

     /*
      * Make sure that the coredump cookie matches the connection cookie.
      * If they don't it's not the coredump connection from the kernel.
      * We'll get another connection request in a bit.
      */
     getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
     if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
             close(fd_coredump);

  The kernel guarantees that by the time the connection is made the
  coredump info is available.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via the AF_UNIX socket directly.

- The pidfd for the crashing task will contain information about how the
  task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
  information.

  If the coredump server gets a new coredump client connection the kernel
  guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.

- Since unix_stream_connect() runs bpf programs during connect it's
  possible to even redirect or multiplex coredumps to other sockets.

- The coredump server should mark itself as non-dumpable.
  To capture coredumps for the coredump server itself a bpf program
  should be run at connect to redirect it to another socket in
  userspace. This can be useful for debugging crashing coredump servers.

- A container coredump server in a separate network namespace can simply
  bind to linuxafsk/coredump.socket and systemd-coredump forwards
  coredumps to the container.

- Fwiw, one idea is to handle coredumps via per-user/session coredump
  servers that run with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler which runs with the user's privileges only.
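
The @coredump_mask flags listed in the PIDFD_INFO_COREDUMP item above
could drive a simple storage policy in the coredump server. The sketch
below is one possible mapping, not something the series prescribes. The
bit values are assumptions made so the example compiles on its own (the
text above names the flags but does not give their values), and the
enum dump_policy and policy_for_mask() names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed bit values, for illustration only. On a kernel that carries
 * this series, use the definitions from <linux/pidfd.h> instead.
 */
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED	(1U << 0)
#define PIDFD_COREDUMP_SKIP	(1U << 1)
#define PIDFD_COREDUMP_USER	(1U << 2)
#define PIDFD_COREDUMP_ROOT	(1U << 3)
#endif

/* Hypothetical server-side decisions. */
enum dump_policy {
	DUMP_REJECT,		/* close the connection, store nothing */
	DUMP_STORE_USER,	/* regular coredump, owner may read it */
	DUMP_STORE_ROOT_ONLY,	/* sensitive, privileged readers only */
};

static enum dump_policy policy_for_mask(uint32_t coredump_mask)
{
	/* No coredump was generated, or it was skipped. */
	if (!(coredump_mask & PIDFD_COREDUMPED))
		return DUMP_REJECT;
	if (coredump_mask & PIDFD_COREDUMP_SKIP)
		return DUMP_REJECT;
	/* Check the sensitive flag first so it always wins. */
	if (coredump_mask & PIDFD_COREDUMP_ROOT)
		return DUMP_STORE_ROOT_ONLY;
	if (coredump_mask & PIDFD_COREDUMP_USER)
		return DUMP_STORE_USER;
	/* Unknown combination: err on the side of restricting access. */
	return DUMP_STORE_ROOT_ONLY;
}
```

Defaulting to the restrictive policy for unrecognized combinations is a
design choice of this sketch, so that flags added later do not silently
widen access.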

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.

This is easy to test:

(a) coredump processing (we're using socat):

    > cat coredump_socket.sh
    #!/bin/bash

    set -x

    sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
    sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc

(b) trigger a coredump:

    user1@localhost:~/data/scripts$ cat crash.c
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            fprintf(stderr, "%u\n", (1 / 0));
            _exit(0);
    }

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v4:
- Expose the coredump socket cookie through the pidfd. This allows the
  coredump server to easily recognize coredump socket connections.
- Link to v3: https://lore.kernel.org/20250505-work-coredump-socket-v3-0-e1832f0e1eae@kernel.org

Changes in v3:
- Use an abstract unix socket.
- Add documentation.
- Add selftests.
- Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org

Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org

---
Christian Brauner (11):
      coredump: massage format_corname()
      coredump: massage do_coredump()
      coredump: reflow dump helpers a little
      net: reserve prefix
      coredump: add coredump socket
      coredump: validate socket name as it is written
      coredump: show supported coredump modes
      pidfs, coredump: add PIDFD_INFO_COREDUMP
      pidfs, coredump: allow to verify coredump connection
      selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
      selftests/coredump: add tests for AF_UNIX coredumps

 fs/coredump.c                                     | 378 ++++++++++++++++------
 fs/pidfs.c                                        |  73 +++++
 include/linux/net.h                               |   1 +
 include/linux/pidfs.h                             |   4 +
 include/uapi/linux/pidfd.h                        |  17 +
 include/uapi/linux/un.h                           |   2 +
 net/unix/af_unix.c                                |  46 ++-
 tools/testing/selftests/coredump/stackdump_test.c | 273 +++++++++++++++-
 tools/testing/selftests/pidfd/pidfd.h             |  23 ++
 9 files changed, 721 insertions(+), 96 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c



* [PATCH v4 01/11] coredump: massage format_corname()
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 02/11] coredump: massage do_coredump() Christian Brauner
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index d740a0411266..281320ea351f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -76,9 +76,15 @@ static char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 
+enum coredump_type_t {
+	COREDUMP_FILE = 1,
+	COREDUMP_PIPE = 2,
+};
+
 struct core_name {
 	char *corename;
 	int used, size;
+	enum coredump_type_t core_type;
 };
 
 static int expand_corename(struct core_name *cn, int size)
@@ -218,18 +224,21 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 {
 	const struct cred *cred = current_cred();
 	const char *pat_ptr = core_pattern;
-	int ispipe = (*pat_ptr == '|');
 	bool was_space = false;
 	int pid_in_pattern = 0;
 	int err = 0;
 
 	cn->used = 0;
 	cn->corename = NULL;
+	if (*pat_ptr == '|')
+		cn->core_type = COREDUMP_PIPE;
+	else
+		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (ispipe) {
+	if (cn->core_type == COREDUMP_PIPE) {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,7 +256,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		 * Split on spaces before doing template expansion so that
 		 * %e and %E don't get split if they have spaces in them
 		 */
-		if (ispipe) {
+		if (cn->core_type == COREDUMP_PIPE) {
 			if (isspace(*pat_ptr)) {
 				if (cn->used != 0)
 					was_space = true;
@@ -353,7 +362,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 				 * Installing a pidfd only makes sense if
 				 * we actually spawn a usermode helper.
 				 */
-				if (!ispipe)
+				if (cn->core_type != COREDUMP_PIPE)
 					break;
 
 				/*
@@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	 * If core_pattern does not include a %p (as is the default)
 	 * and core_uses_pid is set, then .%pid will be appended to
 	 * the filename. Do not do this for piped commands. */
-	if (!ispipe && !pid_in_pattern && core_uses_pid) {
+	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
 		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
 		if (err)
 			return err;
 	}
-	return ispipe;
+	return 0;
 }
 
 static int zap_process(struct signal_struct *signal, int exit_code)
@@ -583,7 +592,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	const struct cred *old_cred;
 	struct cred *cred;
 	int retval = 0;
-	int ispipe;
 	size_t *argv = NULL;
 	int argc = 0;
 	/* require nonrelative corefile path and be extra careful */
@@ -632,19 +640,18 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm, &argv, &argc);
+	retval = format_corename(&cn, &cprm, &argv, &argc);
+	if (retval < 0) {
+		coredump_report_failure("format_corename failed, aborting core");
+		goto fail_unlock;
+	}
 
-	if (ispipe) {
+	if (cn.core_type == COREDUMP_PIPE) {
 		int argi;
 		int dump_count;
 		char **helper_argv;
 		struct subprocess_info *sub_info;
 
-		if (ispipe < 0) {
-			coredump_report_failure("format_corename failed, aborting core");
-			goto fail_unlock;
-		}
-
 		if (cprm.limit == 1) {
 			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
 			 *
@@ -695,7 +702,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			coredump_report_failure("|%s pipe failed", cn.corename);
 			goto close_fail;
 		}
-	} else {
+	} else if (cn.core_type == COREDUMP_FILE) {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -823,13 +830,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if (ispipe && core_pipe_limit)
+	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
 		wait_for_dump_helpers(cprm.file);
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
 fail_dropcount:
-	if (ispipe)
+	if (cn.core_type == COREDUMP_PIPE)
 		atomic_dec(&core_dump_count);
 fail_unlock:
 	kfree(argv);

-- 
2.47.2



* [PATCH v4 02/11] coredump: massage do_coredump()
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 01/11] coredump: massage format_corname() Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 03/11] coredump: reflow dump helpers a little Christian Brauner
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 122 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 57 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 281320ea351f..41491dbfafdf 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -646,63 +646,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		goto fail_unlock;
 	}
 
-	if (cn.core_type == COREDUMP_PIPE) {
-		int argi;
-		int dump_count;
-		char **helper_argv;
-		struct subprocess_info *sub_info;
-
-		if (cprm.limit == 1) {
-			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
-			 *
-			 * Normally core limits are irrelevant to pipes, since
-			 * we're not writing to the file system, but we use
-			 * cprm.limit of 1 here as a special value, this is a
-			 * consistent way to catch recursive crashes.
-			 * We can still crash if the core_pattern binary sets
-			 * RLIM_CORE = !1, but it runs as root, and can do
-			 * lots of stupid things.
-			 *
-			 * Note that we use task_tgid_vnr here to grab the pid
-			 * of the process group leader.  That way we get the
-			 * right pid if a thread in a multi-threaded
-			 * core_pattern process dies.
-			 */
-			coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
-			goto fail_unlock;
-		}
-		cprm.limit = RLIM_INFINITY;
-
-		dump_count = atomic_inc_return(&core_dump_count);
-		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
-			coredump_report_failure("over core_pipe_limit, skipping core dump");
-			goto fail_dropcount;
-		}
-
-		helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
-					    GFP_KERNEL);
-		if (!helper_argv) {
-			coredump_report_failure("%s failed to allocate memory", __func__);
-			goto fail_dropcount;
-		}
-		for (argi = 0; argi < argc; argi++)
-			helper_argv[argi] = cn.corename + argv[argi];
-		helper_argv[argi] = NULL;
-
-		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_coredump_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
-
-		kfree(helper_argv);
-		if (retval) {
-			coredump_report_failure("|%s pipe failed", cn.corename);
-			goto close_fail;
-		}
-	} else if (cn.core_type == COREDUMP_FILE) {
+	switch (cn.core_type) {
+	case COREDUMP_FILE: {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -796,6 +741,69 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		if (do_truncate(idmap, cprm.file->f_path.dentry,
 				0, 0, cprm.file))
 			goto close_fail;
+		break;
+	}
+	case COREDUMP_PIPE: {
+		int argi;
+		int dump_count;
+		char **helper_argv;
+		struct subprocess_info *sub_info;
+
+		if (cprm.limit == 1) {
+			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
+			 *
+			 * Normally core limits are irrelevant to pipes, since
+			 * we're not writing to the file system, but we use
+			 * cprm.limit of 1 here as a special value, this is a
+			 * consistent way to catch recursive crashes.
+			 * We can still crash if the core_pattern binary sets
+			 * RLIM_CORE = !1, but it runs as root, and can do
+			 * lots of stupid things.
+			 *
+			 * Note that we use task_tgid_vnr here to grab the pid
+			 * of the process group leader.  That way we get the
+			 * right pid if a thread in a multi-threaded
+			 * core_pattern process dies.
+			 */
+			coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
+			goto fail_unlock;
+		}
+		cprm.limit = RLIM_INFINITY;
+
+		dump_count = atomic_inc_return(&core_dump_count);
+		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+			coredump_report_failure("over core_pipe_limit, skipping core dump");
+			goto fail_dropcount;
+		}
+
+		helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
+					    GFP_KERNEL);
+		if (!helper_argv) {
+			coredump_report_failure("%s failed to allocate memory", __func__);
+			goto fail_dropcount;
+		}
+		for (argi = 0; argi < argc; argi++)
+			helper_argv[argi] = cn.corename + argv[argi];
+		helper_argv[argi] = NULL;
+
+		retval = -ENOMEM;
+		sub_info = call_usermodehelper_setup(helper_argv[0],
+						helper_argv, NULL, GFP_KERNEL,
+						umh_coredump_setup, NULL, &cprm);
+		if (sub_info)
+			retval = call_usermodehelper_exec(sub_info,
+							  UMH_WAIT_EXEC);
+
+		kfree(helper_argv);
+		if (retval) {
+			coredump_report_failure("|%s pipe failed", cn.corename);
+			goto close_fail;
+		}
+		break;
+	}
+	default:
+		WARN_ON_ONCE(true);
+		goto close_fail;
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */

-- 
2.47.2



* [PATCH v4 03/11] coredump: reflow dump helpers a little
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 01/11] coredump: massage format_corname() Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 02/11] coredump: massage do_coredump() Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 04/11] net: reserve prefix Christian Brauner
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

They look rather messy right now.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 41491dbfafdf..b2eda7b176e4 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -867,10 +867,9 @@ static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	struct file *file = cprm->file;
 	loff_t pos = file->f_pos;
 	ssize_t n;
+
 	if (cprm->written + nr > cprm->limit)
 		return 0;
-
-
 	if (dump_interrupted())
 		return 0;
 	n = __kernel_write(file, addr, nr, &pos);
@@ -887,20 +886,21 @@ static int __dump_skip(struct coredump_params *cprm, size_t nr)
 {
 	static char zeroes[PAGE_SIZE];
 	struct file *file = cprm->file;
+
 	if (file->f_mode & FMODE_LSEEK) {
-		if (dump_interrupted() ||
-		    vfs_llseek(file, nr, SEEK_CUR) < 0)
+		if (dump_interrupted() || vfs_llseek(file, nr, SEEK_CUR) < 0)
 			return 0;
 		cprm->pos += nr;
 		return 1;
-	} else {
-		while (nr > PAGE_SIZE) {
-			if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
-				return 0;
-			nr -= PAGE_SIZE;
-		}
-		return __dump_emit(cprm, zeroes, nr);
 	}
+
+	while (nr > PAGE_SIZE) {
+		if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
+			return 0;
+		nr -= PAGE_SIZE;
+	}
+
+	return __dump_emit(cprm, zeroes, nr);
 }
 
 int dump_emit(struct coredump_params *cprm, const void *addr, int nr)

-- 
2.47.2



* [PATCH v4 04/11] net: reserve prefix
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (2 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 03/11] coredump: reflow dump helpers a little Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 22:45   ` Kuniyuki Iwashima
  2025-05-07 16:13 ` [PATCH v4 05/11] coredump: add coredump socket Christian Brauner
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
CAP_NET_ADMIN in the owning user namespace of the network namespace to
bind it. This will be used in next patches to support the coredump
socket but is a generally useful concept.

The collision risk is so low that we can just start using it. Userspace
must already be prepared to retry if a given abstract address isn't
usable anyway.
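
To make the length handling of the prefix check easier to follow, here
is a userspace sketch that mirrors the comparison done in
unix_may_bind_name() below. It is an approximation: the names
match_reserved_prefix() and enum prefix_match are hypothetical, and the
sketch only inspects sun_path, whereas the kernel code compares from the
start of struct sockaddr_un.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

#define RESERVED_PREFIX "linuxafsk/"	/* value of UNIX_SOCKET_NAMESPACE */

enum prefix_match {
	NOT_RESERVED,		/* ordinary name, no extra checks */
	RESERVED_BARE_PREFIX,	/* exactly the prefix, bind is refused */
	RESERVED_NAME,		/* prefix plus a name, needs CAP_NET_ADMIN */
};

static enum prefix_match match_reserved_prefix(const struct sockaddr_un *addr,
					       size_t addr_len)
{
	/*
	 * sizeof() counts the literal's terminating NUL. That extra byte
	 * accounts for the leading NUL of the abstract address.
	 */
	const size_t prefix_len = offsetof(struct sockaddr_un, sun_path) +
				  sizeof(RESERVED_PREFIX);

	if (addr_len < prefix_len)
		return NOT_RESERVED;
	if (addr->sun_path[0] != '\0' ||
	    memcmp(addr->sun_path + 1, RESERVED_PREFIX,
		   strlen(RESERVED_PREFIX)) != 0)
		return NOT_RESERVED;

	return addr_len == prefix_len ? RESERVED_BARE_PREFIX : RESERVED_NAME;
}
```

Only abstract addresses can match, since a pathname socket never has a
NUL as the first byte of sun_path.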

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/uapi/linux/un.h |  2 ++
 net/unix/af_unix.c      | 39 +++++++++++++++++++++++++++++++++++----
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/un.h b/include/uapi/linux/un.h
index 0ad59dc8b686..bbd5ad508dfa 100644
--- a/include/uapi/linux/un.h
+++ b/include/uapi/linux/un.h
@@ -5,6 +5,8 @@
 #include <linux/socket.h>
 
 #define UNIX_PATH_MAX	108
+/* reserved AF_UNIX socket namespace. */
+#define UNIX_SOCKET_NAMESPACE "linuxafsk/"
 
 struct sockaddr_un {
 	__kernel_sa_family_t sun_family; /* AF_UNIX */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 472f8aa9ea15..148d008862e7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,6 +114,13 @@ static atomic_long_t unix_nr_socks;
 static struct hlist_head bsd_socket_buckets[UNIX_HASH_SIZE / 2];
 static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2];
 
+static const struct sockaddr_un linuxafsk_addr = {
+	.sun_family = AF_UNIX,
+	.sun_path = "\0"UNIX_SOCKET_NAMESPACE,
+};
+
+#define UNIX_SOCKET_NAMESPACE_ADDR_LEN (offsetof(struct sockaddr_un, sun_path) + sizeof(UNIX_SOCKET_NAMESPACE))
+
 /* SMP locking strategy:
  *    hash table is protected with spinlock.
  *    each socket state is protected by separate spinlock.
@@ -436,6 +443,30 @@ static struct sock *__unix_find_socket_byname(struct net *net,
 	return NULL;
 }
 
+static int unix_may_bind_name(struct net *net, struct sockaddr_un *sunname,
+			      int len, unsigned int hash)
+{
+	struct sock *s;
+
+	s = __unix_find_socket_byname(net, sunname, len, hash);
+	if (s)
+		return -EADDRINUSE;
+
+	/*
+	 * Check whether this is our reserved prefix and if so ensure
+	 * that only privileged processes can bind it.
+	 */
+	if (UNIX_SOCKET_NAMESPACE_ADDR_LEN <= len &&
+	    !memcmp(&linuxafsk_addr, sunname, UNIX_SOCKET_NAMESPACE_ADDR_LEN)) {
+		/* Don't bind the namespace itself. */
+		if (UNIX_SOCKET_NAMESPACE_ADDR_LEN == len)
+			return -ECONNREFUSED;
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+			return -ECONNREFUSED;
+	}
+	return 0;
+}
+
 static inline struct sock *unix_find_socket_byname(struct net *net,
 						   struct sockaddr_un *sunname,
 						   int len, unsigned int hash)
@@ -1258,10 +1289,10 @@ static int unix_autobind(struct sock *sk)
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
 	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash)) {
+	if (unix_may_bind_name(net, addr->name, addr->len, new_hash)) {
 		unix_table_double_unlock(net, old_hash, new_hash);
 
-		/* __unix_find_socket_byname() may take long time if many names
+		/* unix_may_bind_name() may take long time if many names
 		 * are already in use.
 		 */
 		cond_resched();
@@ -1379,7 +1410,8 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
 	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash))
+	err = unix_may_bind_name(net, addr->name, addr->len, new_hash);
+	if (err)
 		goto out_spin;
 
 	__unix_set_addr_hash(net, sk, addr, new_hash);
@@ -1389,7 +1421,6 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 
 out_spin:
 	unix_table_double_unlock(net, old_hash, new_hash);
-	err = -EADDRINUSE;
 out_mutex:
 	mutex_unlock(&u->bindlock);
 out:

-- 
2.47.2



* Re: [PATCH v4 04/11] net: reserve prefix
  2025-05-07 16:13 ` [PATCH v4 04/11] net: reserve prefix Christian Brauner
@ 2025-05-07 22:45   ` Kuniyuki Iwashima
  2025-05-08  6:16     ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-07 22:45 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Wed, 07 May 2025 18:13:37 +0200
> Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> CAP_NET_ADMIN in the owning user namespace of the network namespace to
> bind it. This will be used in next patches to support the coredump
> socket but is a generally useful concept.

I really think we shouldn't reserve an address; it should be
configurable by users via core_pattern, as with the other
coredump types.

AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
dying, the user can't start a new coredump listener until it's
fully cleaned up, which adds an unnecessary drawback.

The semantics should be the same as for the other types: the job
of the coredump service is to prepare a target (file, process, socket)
that can receive data and to set its name in core_pattern.

Also, the abstract socket is namespaced by design and there is
no point in enforcing the same restriction on non-initial netns.


> 
> The collision risk is so low that we can just start using it. Userspace
> must already be prepared to retry if a given abstract address isn't
> usable anyway.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  include/uapi/linux/un.h |  2 ++
>  net/unix/af_unix.c      | 39 +++++++++++++++++++++++++++++++++++----
>  2 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/un.h b/include/uapi/linux/un.h
> index 0ad59dc8b686..bbd5ad508dfa 100644
> --- a/include/uapi/linux/un.h
> +++ b/include/uapi/linux/un.h
> @@ -5,6 +5,8 @@
>  #include <linux/socket.h>
>  
>  #define UNIX_PATH_MAX	108
> +/* reserved AF_UNIX socket namespace. */
> +#define UNIX_SOCKET_NAMESPACE "linuxafsk/"
>  
>  struct sockaddr_un {
>  	__kernel_sa_family_t sun_family; /* AF_UNIX */
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 472f8aa9ea15..148d008862e7 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -114,6 +114,13 @@ static atomic_long_t unix_nr_socks;
>  static struct hlist_head bsd_socket_buckets[UNIX_HASH_SIZE / 2];
>  static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2];
>  
> +static const struct sockaddr_un linuxafsk_addr = {
> +	.sun_family = AF_UNIX,
> +	.sun_path = "\0"UNIX_SOCKET_NAMESPACE,
> +};
> +
> +#define UNIX_SOCKET_NAMESPACE_ADDR_LEN (offsetof(struct sockaddr_un, sun_path) + sizeof(UNIX_SOCKET_NAMESPACE))
> +
>  /* SMP locking strategy:
>   *    hash table is protected with spinlock.
>   *    each socket state is protected by separate spinlock.
> @@ -436,6 +443,30 @@ static struct sock *__unix_find_socket_byname(struct net *net,
>  	return NULL;
>  }
>  
> +static int unix_may_bind_name(struct net *net, struct sockaddr_un *sunname,
> +			      int len, unsigned int hash)
> +{
> +	struct sock *s;
> +
> +	s = __unix_find_socket_byname(net, sunname, len, hash);
> +	if (s)
> +		return -EADDRINUSE;
> +
> +	/*
> +	 * Check whether this is our reserved prefix and if so ensure
> +	 * that only privileged processes can bind it.
> +	 */
> +	if (UNIX_SOCKET_NAMESPACE_ADDR_LEN <= len &&
> +	    !memcmp(&linuxafsk_addr, sunname, UNIX_SOCKET_NAMESPACE_ADDR_LEN)) {
> +		/* Don't bind the namespace itself. */
> +		if (UNIX_SOCKET_NAMESPACE_ADDR_LEN == len)
> +			return -ECONNREFUSED;
> +		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
> +			return -ECONNREFUSED;
> +	}
> +	return 0;
> +}
> +
>  static inline struct sock *unix_find_socket_byname(struct net *net,
>  						   struct sockaddr_un *sunname,
>  						   int len, unsigned int hash)
> @@ -1258,10 +1289,10 @@ static int unix_autobind(struct sock *sk)
>  	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
>  	unix_table_double_lock(net, old_hash, new_hash);
>  
> -	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash)) {
> +	if (unix_may_bind_name(net, addr->name, addr->len, new_hash)) {
>  		unix_table_double_unlock(net, old_hash, new_hash);
>  
> -		/* __unix_find_socket_byname() may take long time if many names
> +		/* unix_may_bind_name() may take long time if many names
>  		 * are already in use.
>  		 */
>  		cond_resched();
> @@ -1379,7 +1410,8 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
>  	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
>  	unix_table_double_lock(net, old_hash, new_hash);
>  
> -	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash))
> +	err = unix_may_bind_name(net, addr->name, addr->len, new_hash);
> +	if (err)
>  		goto out_spin;
>  
>  	__unix_set_addr_hash(net, sk, addr, new_hash);
> @@ -1389,7 +1421,6 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
>  
>  out_spin:
>  	unix_table_double_unlock(net, old_hash, new_hash);
> -	err = -EADDRINUSE;
>  out_mutex:
>  	mutex_unlock(&u->bindlock);
>  out:
> 
> -- 
> 2.47.2

* Re: [PATCH v4 04/11] net: reserve prefix
  2025-05-07 22:45   ` Kuniyuki Iwashima
@ 2025-05-08  6:16     ` Christian Brauner
  2025-05-08 21:47       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-05-08  6:16 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Wed, 07 May 2025 18:13:37 +0200
> > Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> > CAP_NET_ADMIN in the owning user namespace of the network namespace to
> > bind it. This will be used in next patches to support the coredump
> > socket but is a generally useful concept.
> 
> I really think we shouldn't reserve address and it should be
> configurable by users via core_pattern as with the other
> coredump types.
> 
> AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
> dying, user can't start the new coredump listener until it's
> fully cleaned up, which adds unnecessary drawback.

This really doesn't matter.

> The semantic should be same with other types, and the todo
> for the coredump service is prepare file (file, process, socket)
> that can receive data and set its name to core_pattern.

We need to perform a capability check during bind() for the host's
coredump socket. Otherwise if the coredump server crashes an
unprivileged attacker can simply bind the address and receive all
coredumps from suid binaries.

This is also a problem for legitimate coredump server updates. To change
the coredump address the coredump server must first set up a new socket
and then update core_pattern and then shut down the old coredump socket.

Now an unprivileged attacker can rebind the old coredump socket address
but there's still a crashing task that got scheduled out after it copied
the old coredump server address but before it connected to the coredump
server. The new server is now up and the old server's address has been
reused by the attacker. Now the crashing task gets scheduled back in and
connects to the unprivileged attacker and forwards its suid dump to the
attacker.

The name of the socket needs to be protected. This can be done by prefix
but the simplest way is what I did in my earlier version: just use a
well-known name. A configurable name really doesn't matter and all it
adds is potential for subtle bugs. I want the coredump code I have to
maintain to have as few moving parts as possible.

I'm happy to drop the patch to reserve the prefix as that seems to
bother you. But the coredump socket name won't be configurable. It'd be
good if we could just compromise here. Without the capability check on
bind we can just throw this all out as that's never going to be safe.
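
For readers following along, the prefix-based protection discussed above
can be sketched in userspace terms. This is only an illustration of the
length and prefix comparison, not the kernel code; the names
classify_unix_name(), enum name_class and RESERVED_PREFIX_ADDR_LEN are
made up for this sketch. In the patch, a reserved name additionally
requires CAP_NET_ADMIN and the bare namespace itself cannot be bound.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

#define RESERVED_PREFIX "linuxafsk/"
/* Leading NUL of the abstract name plus the prefix bytes (no trailing NUL). */
#define RESERVED_PREFIX_ADDR_LEN \
	(offsetof(struct sockaddr_un, sun_path) + 1 + sizeof(RESERVED_PREFIX) - 1)

/* Hypothetical classification, for illustration only. */
enum name_class { NAME_UNRESERVED, NAME_RESERVED, NAME_IS_NAMESPACE };

static enum name_class classify_unix_name(const struct sockaddr_un *addr,
					  socklen_t len)
{
	/* Too short to contain the reserved prefix at all. */
	if (len < RESERVED_PREFIX_ADDR_LEN)
		return NAME_UNRESERVED;
	/* Must be an abstract name (first byte NUL) starting with the prefix. */
	if (addr->sun_family != AF_UNIX || addr->sun_path[0] != '\0' ||
	    memcmp(addr->sun_path + 1, RESERVED_PREFIX,
		   sizeof(RESERVED_PREFIX) - 1) != 0)
		return NAME_UNRESERVED;
	/* Exactly the prefix: this is the namespace itself. */
	if (len == RESERVED_PREFIX_ADDR_LEN)
		return NAME_IS_NAMESPACE;
	return NAME_RESERVED;
}
```

A bind() implementation following the patch would reject
NAME_IS_NAMESPACE outright and allow NAME_RESERVED only for a
sufficiently privileged caller.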

* Re: [PATCH v4 04/11] net: reserve prefix
  2025-05-08  6:16     ` Christian Brauner
@ 2025-05-08 21:47       ` Kuniyuki Iwashima
  2025-05-09  5:54         ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-08 21:47 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Thu, 8 May 2025 08:16:29 +0200
> On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Wed, 07 May 2025 18:13:37 +0200
> > > Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> > > CAP_NET_ADMIN in the owning user namespace of the network namespace to
> > > bind it. This will be used in next patches to support the coredump
> > > socket but is a generally useful concept.
> > 
> > I really think we shouldn't reserve address and it should be
> > configurable by users via core_pattern as with the other
> > coredump types.
> > 
> > AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
> > dying, user can't start the new coredump listener until it's
> > fully cleaned up, which adds unnecessary drawback.
> 
> This really doesn't matter.
> 
> > The semantic should be same with other types, and the todo
> > for the coredump service is prepare file (file, process, socket)
> > that can receive data and set its name to core_pattern.
> 
> We need to perform a capability check during bind() for the host's
> coredump socket. Otherwise if the coredump server crashes an
> unprivileged attacker can simply bind the address and receive all
> coredumps from suid binaries.

As I mentioned in the previous thread, this can be better
handled by BPF LSM with more fine-grained rules.

1. register a socket with its name in a BPF map
2. check if the destination socket is registered at connect

Even when LSM is not available, the cgroup BPF prog can make
connect() fail if the destination name is not registered
in the map.

> 
> This is also a problem for legitimate coredump server updates. To change
> the coredump address the coredump server must first setup a new socket
> and then update core_pattern and then shutdown the old coredump socket.

So, for completeness, the server should set up a cgroup BPF
prog to route the request for the old name to the new one.

Here, the bpf map above can be reused to check if the socket
name is registered in the map or route to another socket in
the map.

Then, the unprivileged issue below and the non-dumpable issue
mentioned in the cover letter can also be resolved.

The server is expected to have CAP_SYS_ADMIN, so BPF should
play a role.


> 
> Now an unprivileged attacker can rebind the old coredump socket address
> but there's still a crashing task that got scheduled out after it copied
> the old coredump server address but before it connected to the coredump
> server. The new server is now up and the old server's address has been
> reused by the attacker. Now the crashing task gets scheduled back in and
> connects to the unprivileged attacker and forwards its suid dump to the
> attacker.
> 
> The name of the socket needs to be protected. This can be done by prefix
> but the simplest way is what I did in my earlier version and to just use
> a well-known name. The name really doesn't matter and all it adds is
> potential for subtle bugs. I want the coredump code I have to maintain
> to have as little moving parts as possible.
> 
> I'm happy to drop the patch to reserve the prefix as that seems to
> bother you. But the coredump socket name won't be configurable. It'd be
> good if we could just compromise here. Without the capability check on
> bind we can just throw this all out as that's never going to be safe.

* Re: [PATCH v4 04/11] net: reserve prefix
  2025-05-08 21:47       ` Kuniyuki Iwashima
@ 2025-05-09  5:54         ` Christian Brauner
  2025-05-09  8:07           ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-05-09  5:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On Thu, May 08, 2025 at 02:47:45PM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Thu, 8 May 2025 08:16:29 +0200
> > On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Wed, 07 May 2025 18:13:37 +0200
> > > > Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> > > > CAP_NET_ADMIN in the owning user namespace of the network namespace to
> > > > bind it. This will be used in next patches to support the coredump
> > > > socket but is a generally useful concept.
> > > 
> > > I really think we shouldn't reserve address and it should be
> > > configurable by users via core_pattern as with the other
> > > coredump types.
> > > 
> > > AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
> > > dying, user can't start the new coredump listener until it's
> > > fully cleaned up, which adds unnecessary drawback.
> > 
> > This really doesn't matter.
> > 
> > > The semantic should be same with other types, and the todo
> > > for the coredump service is prepare file (file, process, socket)
> > > that can receive data and set its name to core_pattern.
> > 
> > We need to perform a capability check during bind() for the host's
> > coredump socket. Otherwise if the coredump server crashes an
> > unprivileged attacker can simply bind the address and receive all
> > coredumps from suid binaries.
> 
> As I mentioned in the previous thread, this can be better
> handled by BPF LSM with more fine-grained rule.
> 
> 1. register a socket with its name to BPF map
> 2. check if the destination socket is registered at connect
> 
> Even when LSM is not available, the cgroup BPF prog can make
> connect() fail if the destination name is not registered
> in the map.
> 
> > 
> > This is also a problem for legitimate coredump server updates. To change
> > the coredump address the coredump server must first setup a new socket
> > and then update core_pattern and then shutdown the old coredump socket.
> 
> So, for completeness, the server should set up a cgroup BPF
> prog to route the request for the old name to the new one.
> 
> Here, the bpf map above can be reused to check if the socket
> name is registered in the map or route to another socket in
> the map.
> 
> Then, the unprivileged issue below and the non-dumpable issue
> mentioned in the cover letter can also be resolved.
> 
> The server is expected to have CAP_SYS_ADMIN, so BPF should
> play a role.

This has been explained by multiple people over the course of this
thread already. It is simply not acceptable for basic kernel
functionality to be unsafe without the use of additional separate
subsystems. It is not ok to require bpf for a core kernel api to be
safely usable. It's irrelevant whether that's for security or cgroup
hooks. None of which we can require.

I won't even get this past Linus for that matter because he will rightly
NAK that hard and probably ask me whether I've paid any attention to
basic kernel development requirements in the last 10 years. Let alone
for coredumping which handles crashing suid binaries. I understand the
urge to outsource this problem to userspace but that's not ok.

Coredumping is a core kernel service and all options have to be safely
usable by themselves. In fact, that goes for any kernel API and
especially VFS apis.

Using AF_UNIX sockets will be a major step forward in both simplicity
and security. We've compromised on every front so far. It's not too much
to ask for a basic permission check on a single well-known address
that's exposed as a kernel-level service.

* Re: [PATCH v4 04/11] net: reserve prefix
  2025-05-09  5:54         ` Christian Brauner
@ 2025-05-09  8:07           ` Daniel Borkmann
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2025-05-09  8:07 UTC (permalink / raw)
  To: Christian Brauner, Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On 5/9/25 7:54 AM, Christian Brauner wrote:
> On Thu, May 08, 2025 at 02:47:45PM -0700, Kuniyuki Iwashima wrote:
>> From: Christian Brauner <brauner@kernel.org>
>> Date: Thu, 8 May 2025 08:16:29 +0200
>>> On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
>>>> From: Christian Brauner <brauner@kernel.org>
>>>> Date: Wed, 07 May 2025 18:13:37 +0200
>>>>> Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
>>>>> CAP_NET_ADMIN in the owning user namespace of the network namespace to
>>>>> bind it. This will be used in next patches to support the coredump
>>>>> socket but is a generally useful concept.
>>>>
>>>> I really think we shouldn't reserve address and it should be
>>>> configurable by users via core_pattern as with the other
>>>> coredump types.
>>>>
>>>> AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
>>>> dying, user can't start the new coredump listener until it's
>>>> fully cleaned up, which adds unnecessary drawback.
>>>
>>> This really doesn't matter.
>>>
>>>> The semantic should be same with other types, and the todo
>>>> for the coredump service is prepare file (file, process, socket)
>>>> that can receive data and set its name to core_pattern.
>>>
>>> We need to perform a capability check during bind() for the host's
>>> coredump socket. Otherwise if the coredump server crashes an
>>> unprivileged attacker can simply bind the address and receive all
>>> coredumps from suid binaries.
>>
>> As I mentioned in the previous thread, this can be better
>> handled by BPF LSM with more fine-grained rule.
>>
>> 1. register a socket with its name to BPF map
>> 2. check if the destination socket is registered at connect
>>
>> Even when LSM is not available, the cgroup BPF prog can make
>> connect() fail if the destination name is not registered
>> in the map.
>>
>>> This is also a problem for legitimate coredump server updates. To change
>>> the coredump address the coredump server must first setup a new socket
>>> and then update core_pattern and then shutdown the old coredump socket.
>>
>> So, for completeness, the server should set up a cgroup BPF
>> prog to route the request for the old name to the new one.
>>
>> Here, the bpf map above can be reused to check if the socket
>> name is registered in the map or route to another socket in
>> the map.
>>
>> Then, the unprivileged issue below and the non-dumpable issue
>> mentioned in the cover letter can also be resolved.
>>
>> The server is expected to have CAP_SYS_ADMIN, so BPF should
>> play a role.
> 
> This has been explained by multiple people over the course of this
> thread already. It is simply not acceptable for basic kernel
> functionality to be unsafe without the use of additional separate
> subsystems. It is not ok to require bpf for a core kernel api to be
> safely usable. It's irrelevant whether that's for security or cgroup
> hooks. None of which we can require.

As much as I like BPF, I agree with Christian here that we should
not additionally rely on other subsystems, which might even be compiled
out in some cases where coredumps are needed (e.g. embedded).

* [PATCH v4 05/11] coredump: add coredump socket
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (3 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 04/11] net: reserve prefix Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 06/11] coredump: validate socket name as it is written Christian Brauner
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There are various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not this is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is a weird hybrid
  upcall, which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an abstract AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @linuxafsk/coredump.socket

The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps.

The coredump socket uses the fixed address "linuxafsk/coredump.socket"
for now.
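
As a rough userspace illustration of how the first character of
core_pattern selects the mode, consider the sketch below. The enum and
the function classify_core_pattern() are hypothetical names used only
here; the kernel does the equivalent inside format_corename().

```c
#include <string.h>

/* Illustrative names; they mirror the three modes described above. */
enum core_mode {
	CORE_MODE_FILE,
	CORE_MODE_PIPE,
	CORE_MODE_SOCK,
	CORE_MODE_INVALID,
};

#define COREDUMP_SOCK_PATTERN "@linuxafsk/coredump.socket"

static enum core_mode classify_core_pattern(const char *pattern)
{
	switch (pattern[0]) {
	case '|':
		/* Pipe to a usermode helper. */
		return CORE_MODE_PIPE;
	case '@':
		/* Only the single fixed socket address is accepted for now. */
		if (strcmp(pattern, COREDUMP_SOCK_PATTERN) != 0)
			return CORE_MODE_INVALID;
		return CORE_MODE_SOCK;
	default:
		/* Anything else is treated as a file name template. */
		return CORE_MODE_FILE;
	}
}
```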

The coredump socket is located in the initial network namespace. To bind
the coredump socket userspace must hold CAP_NET_ADMIN in the initial
user namespace. Listening and reading can happen from whatever
unprivileged context is necessary to safely process coredumps.

When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via the AF_UNIX socket directly.

- The pidfd for the crashing task will contain information about how the
  task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
  information.

  If the coredump server gets a new coredump client connection the kernel
  guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.

- Since unix_stream_connect() runs bpf programs during connect it's
  possible to even redirect or multiplex coredumps to other sockets.

- The coredump server should mark itself as non-dumpable.
  To capture coredumps for the coredump server itself a bpf program
  should be run at connect to redirect it to another socket in
  userspace. This can be useful for debugging crashing coredump servers.

- A container coredump server in a separate network namespace can simply
  bind to linuxafsk/coredump.socket and systemd-coredump forwards
  coredumps to the container.

- Fwiw, one idea is to handle coredumps via per-user/session coredump
  servers that run with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler which runs with the user's privileges only.

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.
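
A minimal sketch of the server side, assuming the fixed address from
this series, could look as follows. The helper names
(coredump_socket_addr(), coredump_server_listen(),
coredump_server_accept()) are invented for this example, error handling
is reduced to returning -1, and the SO_PEERPIDFD fallback value is the
asm-generic one and may differ on some architectures.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* asm-generic value; assumption for older headers */
#endif

#define COREDUMP_SOCKET_NAME "linuxafsk/coredump.socket"

/* Fill in the abstract address: leading NUL byte, no trailing NUL. */
static socklen_t coredump_socket_addr(struct sockaddr_un *addr)
{
	size_t name_len = strlen(COREDUMP_SOCKET_NAME);

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path + 1, COREDUMP_SOCKET_NAME, name_len);
	return offsetof(struct sockaddr_un, sun_path) + 1 + name_len;
}

/* Bind and listen on the coredump socket; returns the fd or -1. */
static int coredump_server_listen(void)
{
	struct sockaddr_un addr;
	socklen_t len = coredump_socket_addr(&addr);
	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&addr, len) < 0 ||
	    listen(fd, 128) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Accept one crashing task. On success returns the connection fd (the
 * coredump is read from it) and stores a pidfd for the peer in *pidfd.
 */
static int coredump_server_accept(int listen_fd, int *pidfd)
{
	socklen_t optlen = sizeof(*pidfd);
	int conn = accept(listen_fd, NULL, NULL);

	if (conn < 0)
		return -1;
	if (getsockopt(conn, SOL_SOCKET, SO_PEERPIDFD, pidfd, &optlen) < 0) {
		close(conn);
		return -1;
	}
	return conn;
}
```

With core_pipe_limit set, closing the accepted connection is what lets
the crashing task finish exiting, so a server would inspect
/proc/<pid> before calling close().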

This is easy to test:

(a) coredump processing (we're using socat):

    > cat coredump_socket.sh
    #!/bin/bash

    set -x

    sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
    sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc

(b) trigger a coredump:

    user1@localhost:~/data/scripts$ cat crash.c
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            fprintf(stderr, "%u\n", (1 / 0));
            _exit(0);
    }

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 128 insertions(+), 8 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index b2eda7b176e4..d61e15d855d2 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -44,7 +44,11 @@
 #include <linux/sysctl.h>
 #include <linux/elf.h>
 #include <linux/pidfs.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+#include <net/net_namespace.h>
 #include <uapi/linux/pidfd.h>
+#include <uapi/linux/un.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 enum coredump_type_t {
 	COREDUMP_FILE = 1,
 	COREDUMP_PIPE = 2,
+	COREDUMP_SOCK = 3,
 };
 
 struct core_name {
@@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	cn->corename = NULL;
 	if (*pat_ptr == '|')
 		cn->core_type = COREDUMP_PIPE;
+	else if (*pat_ptr == '@')
+		cn->core_type = COREDUMP_SOCK;
 	else
 		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (cn->core_type == COREDUMP_PIPE) {
+	switch (cn->core_type) {
+	case COREDUMP_PIPE: {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,6 +255,34 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		++pat_ptr;
 		if (!(*pat_ptr))
 			return -ENOMEM;
+		break;
+	}
+	case COREDUMP_SOCK: {
+		err = cn_printf(cn, "%s", pat_ptr);
+		if (err)
+			return err;
+
+		/*
+		 * We can potentially allow this to be changed later but
+		 * I currently see no reason to.
+		 */
+		if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
+			return -EINVAL;
+
+		/*
+		 * Currently no need to parse any other options.
+		 * Relevant information can be retrieved from the peer
+		 * pidfd retrievable via SO_PEERPIDFD by the receiver or
+		 * via /proc/<pid>, using the SO_PEERPIDFD to guard
+		 * against pid recycling when opening /proc/<pid>.
+		 */
+		return 0;
+	}
+	case COREDUMP_FILE:
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		return -EINVAL;
 	}
 
 	/* Repeat as long as we have more pattern to process and more output
@@ -393,11 +429,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	 * If core_pattern does not include a %p (as is the default)
 	 * and core_uses_pid is set, then .%pid will be appended to
 	 * the filename. Do not do this for piped commands. */
-	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
-		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
-		if (err)
-			return err;
+	if (!pid_in_pattern && core_uses_pid) {
+		switch (cn->core_type) {
+		case COREDUMP_FILE:
+			return cn_printf(cn, ".%d", task_tgid_vnr(current));
+		case COREDUMP_PIPE:
+			break;
+		case COREDUMP_SOCK:
+			break;
+		default:
+			WARN_ON_ONCE(true);
+			return -EINVAL;
+		}
 	}
+
 	return 0;
 }
 
@@ -583,6 +628,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
 	return 0;
 }
 
+#ifdef CONFIG_UNIX
+static const struct sockaddr_un coredump_unix_socket = {
+	.sun_family = AF_UNIX,
+	.sun_path = "\0linuxafsk/coredump.socket",
+};
+/* Without trailing NUL byte. */
+#define COREDUMP_UNIX_SOCKET_ADDR_SIZE            \
+	(offsetof(struct sockaddr_un, sun_path) + \
+	 sizeof("\0linuxafsk/coredump.socket") - 1)
+#endif
+
 void do_coredump(const kernel_siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -801,6 +857,45 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		}
 		break;
 	}
+	case COREDUMP_SOCK: {
+#ifdef CONFIG_UNIX
+		struct file *file __free(fput) = NULL;
+		struct socket *socket;
+
+		/*
+		 * It is possible that the userspace process which is
+		 * supposed to handle the coredump and is listening on
+		 * the AF_UNIX socket coredumps. Userspace should just
+		 * mark itself non dumpable.
+		 */
+
+		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
+		if (retval < 0)
+			goto close_fail;
+
+		file = sock_alloc_file(socket, 0, NULL);
+		if (IS_ERR(file)) {
+			sock_release(socket);
+			goto close_fail;
+		}
+
+		retval = kernel_connect(socket,
+					(struct sockaddr *)(&coredump_unix_socket),
+					COREDUMP_UNIX_SOCKET_ADDR_SIZE, O_NONBLOCK);
+		if (retval) {
+			if (retval == -EAGAIN)
+				coredump_report_failure("Skipping as coredump socket connection %s couldn't complete immediately", cn.corename);
+			goto close_fail;
+		}
+
+		cprm.limit = RLIM_INFINITY;
+		cprm.file = no_free_ptr(file);
+#else
+		coredump_report_failure("Core dump socket support %s disabled", cn.corename);
+		goto close_fail;
+#endif
+		break;
+	}
 	default:
 		WARN_ON_ONCE(true);
 		goto close_fail;
@@ -838,8 +933,33 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
-		wait_for_dump_helpers(cprm.file);
+
+	/*
+	 * When core_pipe_limit is set we wait for the coredump server
+	 * or usermodehelper to finish before exiting so it can e.g.,
+	 * inspect /proc/<pid>.
+	 */
+	if (core_pipe_limit) {
+		switch (cn.core_type) {
+		case COREDUMP_PIPE:
+			wait_for_dump_helpers(cprm.file);
+			break;
+		case COREDUMP_SOCK: {
+			char buf[1];
+			/*
+			 * We use a simple read to wait for the coredump
+			 * processing to finish. Either the socket is
+			 * closed or we get sent unexpected data. In
+			 * both cases, we're done.
+			 */
+			__kernel_read(cprm.file, buf, 1, NULL);
+			break;
+		}
+		default:
+			break;
+		}
+	}
+
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
@@ -1069,7 +1189,7 @@ EXPORT_SYMBOL(dump_align);
 void validate_coredump_safety(void)
 {
 	if (suid_dumpable == SUID_DUMP_ROOT &&
-	    core_pattern[0] != '/' && core_pattern[0] != '|') {
+	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
 
 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
 			"pipe handler or fully qualified core dump path required. "

-- 
2.47.2


* [PATCH v4 06/11] coredump: validate socket name as it is written
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (4 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 05/11] coredump: add coredump socket Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 07/11] coredump: show supported coredump modes Christian Brauner
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

In contrast to other parameters written into
/proc/sys/kernel/core_pattern, which are never rejected, we can validate
the new AF_UNIX socket name when it is written. This is obviously racy
as hell but it's always been that way.
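
The save, write, validate and roll back sequence can be sketched in
plain C as below. This is a simplified userspace model, not the sysctl
handler itself; core_pattern_buf, PATTERN_MAX and write_pattern() are
names made up for the illustration.

```c
#include <errno.h>
#include <string.h>

#define PATTERN_MAX 128
#define COREDUMP_SOCK_PATTERN "@linuxafsk/coredump.socket"

/* Stand-in for the global core_pattern buffer. */
static char core_pattern_buf[PATTERN_MAX] = "core";

/* Store new_pattern, restoring the previous value if validation fails. */
static int write_pattern(const char *new_pattern)
{
	char old[PATTERN_MAX];

	if (strlen(new_pattern) >= PATTERN_MAX)
		return -EINVAL;

	strcpy(old, core_pattern_buf);         /* save the old value */
	strcpy(core_pattern_buf, new_pattern); /* perform the write */

	/* A socket pattern must name the one supported address. */
	if (core_pattern_buf[0] == '@' &&
	    strcmp(core_pattern_buf, COREDUMP_SOCK_PATTERN) != 0) {
		strcpy(core_pattern_buf, old); /* roll back */
		return -EINVAL;
	}
	return 0;
}
```

Between the write and the rollback another reader can observe the
invalid value, which is the race the commit message refers to.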

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index d61e15d855d2..0f00f77be988 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1200,10 +1200,21 @@ void validate_coredump_safety(void)
 static int proc_dostring_coredump(const struct ctl_table *table, int write,
 		  void *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
+	int error;
+	ssize_t retval;
+	char old_core_pattern[CORENAME_MAX_SIZE];
 
-	if (!error)
-		validate_coredump_safety();
+	retval = strscpy(old_core_pattern, core_pattern, CORENAME_MAX_SIZE);
+
+	error = proc_dostring(table, write, buffer, lenp, ppos);
+	if (error)
+		return error;
+	if (core_pattern[0] == '@' && strcmp(core_pattern, "@linuxafsk/coredump.socket")) {
+		strscpy(core_pattern, old_core_pattern, retval + 1);
+		return -EINVAL;
+	}
+
+	validate_coredump_safety();
 	return error;
 }
 

-- 
2.47.2



* [PATCH v4 07/11] coredump: show supported coredump modes
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (5 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 06/11] coredump: validate socket name as it is written Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 08/11] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Allow userspace to discover what coredump modes are supported.
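
As a rough sketch of how userspace might consume this, the helper below checks whether a mode appears as a full line in the newline-separated text of the new sysctl. The function name core_mode_supported() is hypothetical, and reading the file (presumably exposed next to core_pattern) is left to the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if @mode appears as a complete line in @modes, the
 * newline-separated text read from the core_modes sysctl.
 */
bool core_mode_supported(const char *modes, const char *mode)
{
	size_t mode_len;

	assert(modes != NULL && mode != NULL);
	mode_len = strlen(mode);

	while (*modes) {
		/* Length of the current line, excluding the newline. */
		size_t line_len = strcspn(modes, "\n");

		if (line_len == mode_len && memcmp(modes, mode, mode_len) == 0)
			return true;

		modes += line_len;
		if (*modes == '\n')
			modes++;
	}
	return false;
}
```

Matching whole lines rather than substrings avoids treating "pip" or "coredump.socket" as a supported mode.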

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index 0f00f77be988..e1e6f02e0ed7 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1220,6 +1220,12 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
 
 static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
 static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
+static char core_modes[] = {
+	"file\npipe"
+#ifdef CONFIG_UNIX
+	"\nlinuxafsk/coredump.socket"
+#endif
+};
 
 static const struct ctl_table coredump_sysctls[] = {
 	{
@@ -1263,6 +1269,13 @@ static const struct ctl_table coredump_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "core_modes",
+		.data		= core_modes,
+		.maxlen		= sizeof(core_modes) - 1,
+		.mode		= 0444,
+		.proc_handler	= proc_dostring,
+	},
 };
 
 static int __init init_fs_coredump_sysctls(void)

-- 
2.47.2



* [PATCH v4 08/11] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (6 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 07/11] coredump: show supported coredump modes Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 09/11] pidfs, coredump: allow to verify coredump connection Christian Brauner
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Let userspace know that the task coredumped and whether it was dumped as
root or as a regular user. The latter is needed so that access permissions
to the executable are correctly handled.

I don't think this requires any additional privilege checks. The
missing exposure of the dumpability attribute of a given task is an
issue we should fix given that we already expose whether a task is
coredumping or not.
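
To illustrate how a coredump server might act on @coredump_mask, here is a minimal sketch. The bit values are copied from the uapi hunk in this patch; the enum and the function core_access_policy() are hypothetical, and falling back to root-only access when neither USER nor ROOT is set is an assumption of this sketch, not something the patch mandates:

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the uapi hunk in this patch. */
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED	(1U << 0)
#define PIDFD_COREDUMP_SKIP	(1U << 1)
#define PIDFD_COREDUMP_USER	(1U << 2)
#define PIDFD_COREDUMP_ROOT	(1U << 3)
#endif

/* Hypothetical policy outcome for the stored core file. */
enum core_access { CORE_NONE, CORE_USER, CORE_ROOT_ONLY };

enum core_access core_access_policy(uint32_t coredump_mask)
{
	/* No coredump happened, or generation was skipped. */
	if (!(coredump_mask & PIDFD_COREDUMPED) ||
	    (coredump_mask & PIDFD_COREDUMP_SKIP))
		return CORE_NONE;

	/* Dumped as the user and not flagged as sensitive. */
	if ((coredump_mask & PIDFD_COREDUMP_USER) &&
	    !(coredump_mask & PIDFD_COREDUMP_ROOT))
		return CORE_USER;

	/* Dumped as root, or unclear: treat as sensitive (assumption). */
	return CORE_ROOT_ONLY;
}
```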

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c              | 36 ++++++++++++++++++++++++++++++
 fs/pidfs.c                 | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pidfs.h      |  3 +++
 include/uapi/linux/pidfd.h | 16 ++++++++++++++
 4 files changed, 110 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e1e6f02e0ed7..ddff1854988f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -46,7 +46,9 @@
 #include <linux/pidfs.h>
 #include <linux/net.h>
 #include <linux/socket.h>
+#include <net/af_unix.h>
 #include <net/net_namespace.h>
+#include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 #include <uapi/linux/un.h>
 
@@ -599,6 +601,8 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
 		if (IS_ERR(pidfs_file))
 			return PTR_ERR(pidfs_file);
 
+		pidfs_coredump(cp);
+
 		/*
 		 * Usermode helpers are childen of either
 		 * system_unbound_wq or of kthreadd. So we know that
@@ -879,15 +883,47 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		}
 
+		/*
+		 * Set the thread-group leader pid which is used for the
+		 * peer credentials during connect() below. Then
+		 * immediately register it in pidfs...
+		 */
+		cprm.pid = task_tgid(current);
+		retval = pidfs_register_pid(cprm.pid);
+		if (retval) {
+			sock_release(socket);
+			goto close_fail;
+		}
+
+		/*
+		 * ... and set the coredump information so userspace
+		 * has it available after connect()...
+		 */
+		pidfs_coredump(&cprm);
+
+		/*
+		 * ... On connect() the peer credentials are recorded
+		 * and @cprm.pid registered in pidfs...
+		 */
 		retval = kernel_connect(socket,
 					(struct sockaddr *)(&coredump_unix_socket),
 					COREDUMP_UNIX_SOCKET_ADDR_SIZE, O_NONBLOCK);
+
+		/*
+		 * ... So we can safely put our pidfs reference now...
+		 */
+		pidfs_put_pid(cprm.pid);
+
 		if (retval) {
 			if (retval == -EAGAIN)
 				coredump_report_failure("Skipping as coredump socket connection %s couldn't complete immediately", cn.corename);
 			goto close_fail;
 		}
 
+		/* ... and validate that @sk_peer_pid matches @cprm.pid. */
+		if (WARN_ON_ONCE(unix_peer(socket->sk)->sk_peer_pid != cprm.pid))
+			goto close_fail;
+
 		cprm.limit = RLIM_INFINITY;
 		cprm.file = no_free_ptr(file);
 #else
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 3b39e471840b..8c4d83fb115b 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -20,6 +20,7 @@
 #include <linux/time_namespace.h>
 #include <linux/utsname.h>
 #include <net/net_namespace.h>
+#include <linux/coredump.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -33,6 +34,7 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
 struct pidfs_exit_info {
 	__u64 cgroupid;
 	__s32 exit_code;
+	__u32 coredump_mask;
 };
 
 struct pidfs_inode {
@@ -240,6 +242,22 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 	return false;
 }
 
+static __u32 pidfs_coredump_mask(unsigned long mm_flags)
+{
+	switch (__get_dumpable(mm_flags)) {
+	case SUID_DUMP_USER:
+		return PIDFD_COREDUMP_USER;
+	case SUID_DUMP_ROOT:
+		return PIDFD_COREDUMP_ROOT;
+	case SUID_DUMP_DISABLE:
+		return PIDFD_COREDUMP_SKIP;
+	default:
+		WARN_ON_ONCE(true);
+	}
+
+	return 0;
+}
+
 static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
@@ -280,6 +298,11 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 		}
 	}
 
+	if (mask & PIDFD_INFO_COREDUMP) {
+		kinfo.mask |= PIDFD_INFO_COREDUMP;
+		kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
+	}
+
 	task = get_pid_task(pid, PIDTYPE_PID);
 	if (!task) {
 		/*
@@ -296,6 +319,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 	if (!c)
 		return -ESRCH;
 
+	if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
+		task_lock(task);
+		if (task->mm)
+			kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
+		task_unlock(task);
+	}
+
 	/* Unconditionally return identifiers and credentials, the rest only on request */
 
 	user_ns = current_user_ns();
@@ -559,6 +589,31 @@ void pidfs_exit(struct task_struct *tsk)
 	}
 }
 
+void pidfs_coredump(const struct coredump_params *cprm)
+{
+	struct pid *pid = cprm->pid;
+	struct pidfs_exit_info *exit_info;
+	struct dentry *dentry;
+	struct inode *inode;
+	__u32 coredump_mask = 0;
+
+	dentry = stashed_dentry_get(&pid->stashed);
+	if (WARN_ON_ONCE(!dentry))
+		return;
+
+	inode = d_inode(dentry);
+	exit_info = &pidfs_i(inode)->__pei;
+	/* Note how we were coredumped. */
+	coredump_mask = pidfs_coredump_mask(cprm->mm_flags);
+	/* Note that we actually did coredump. */
+	coredump_mask |= PIDFD_COREDUMPED;
+	/* If coredumping is set to skip we should never end up here. */
+	VFS_WARN_ON_ONCE(coredump_mask & PIDFD_COREDUMP_SKIP);
+	smp_store_release(&exit_info->coredump_mask, coredump_mask);
+	/* Fwiw, this cannot be the last reference. */
+	dput(dentry);
+}
+
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
 /*
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 2676890c4d0d..f7729b9371bc 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -2,11 +2,14 @@
 #ifndef _LINUX_PID_FS_H
 #define _LINUX_PID_FS_H
 
+struct coredump_params;
+
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
 void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
 void pidfs_exit(struct task_struct *tsk);
+void pidfs_coredump(const struct coredump_params *cprm);
 extern const struct dentry_operations pidfs_dentry_operations;
 int pidfs_register_pid(struct pid *pid);
 void pidfs_get_pid(struct pid *pid);
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 8c1511edd0e9..84ac709f560c 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -25,9 +25,23 @@
 #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Only returned if requested. */
+#define PIDFD_INFO_COREDUMP		(1UL << 4) /* Only returned if requested. */
 
 #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
 
+/*
+ * Values for @coredump_mask in pidfd_info.
+ * Only valid if PIDFD_INFO_COREDUMP is set in @mask.
+ *
+ * Note, the @PIDFD_COREDUMP_ROOT flag indicates that the generated
+ * coredump should be treated as sensitive and access should only be
+ * granted to privileged users.
+ */
+#define PIDFD_COREDUMPED	(1U << 0) /* Did crash and... */
+#define PIDFD_COREDUMP_SKIP	(1U << 1) /* coredump generation was skipped. */
+#define PIDFD_COREDUMP_USER	(1U << 2) /* coredump was done as the user. */
+#define PIDFD_COREDUMP_ROOT	(1U << 3) /* coredump was done as root. */
+
 /*
  * The concept of process and threads in userland and the kernel is a confusing
  * one - within the kernel every thread is a 'task' with its own individual PID,
@@ -92,6 +106,8 @@ struct pidfd_info {
 	__u32 fsuid;
 	__u32 fsgid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u32 __spare1;
 };
 
 #define PIDFS_IOCTL_MAGIC 0xFF

-- 
2.47.2



* [PATCH v4 09/11] pidfs, coredump: allow to verify coredump connection
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (7 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 08/11] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 18:34   ` Mickaël Salaün
  2025-05-07 16:13 ` [PATCH v4 10/11] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 11/11] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
  10 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

When a coredump connection is initiated, use the socket cookie as the
coredump cookie and store it in the pidfd. The receiver can now easily
authenticate that the connection is coming from the kernel.

Unless the coredump server expects to handle connections from
non-crashing tasks, it can validate that the connection has been made
from a crashing task:

   fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
   getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);

   struct pidfd_info info = {
           .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
   };

   ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
   /* Refuse connections that aren't from a crashing task. */
   if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
           close(fd_coredump);

   /*
    * Make sure that the coredump cookie matches the connection cookie.
    * If they don't it's not the coredump connection from the kernel.
    * We'll get another connection request in a bit.
    */
   getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
   if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
           close(fd_coredump);

The kernel guarantees that by the time the connection is made the
coredump info is available.
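
The two checks can also be folded into one pure decision function, which is easier to unit test than the socket calls themselves. This is only a sketch: coredump_connection_trusted() is a hypothetical name, and the flag values are copied from the uapi hunks of this series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the uapi hunks in this series. */
#ifndef PIDFD_INFO_COREDUMP
#define PIDFD_INFO_COREDUMP	(1UL << 4)
#endif
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED	(1U << 0)
#endif

/*
 * Decide whether an accepted connection is the kernel's coredump
 * connection. @info_mask, @coredump_mask and @coredump_cookie come
 * from PIDFD_GET_INFO on the peer pidfd; @peer_cookie comes from
 * SO_COOKIE on the accepted socket.
 */
bool coredump_connection_trusted(uint64_t info_mask, uint32_t coredump_mask,
				 uint64_t coredump_cookie, uint64_t peer_cookie)
{
	/* The kernel did not report coredump information at all. */
	if (!(info_mask & PIDFD_INFO_COREDUMP))
		return false;
	/* The peer is not a crashing task. */
	if (!(coredump_mask & PIDFD_COREDUMPED))
		return false;
	/* No cookie recorded means no kernel coredump connection. */
	if (coredump_cookie == 0)
		return false;
	return coredump_cookie == peer_cookie;
}
```

A server would close the accepted socket and keep listening whenever this returns false.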

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c              |  3 ++-
 fs/pidfs.c                 | 20 +++++++++++++++++++-
 include/linux/net.h        |  1 +
 include/linux/pidfs.h      |  1 +
 include/uapi/linux/pidfd.h |  1 +
 net/unix/af_unix.c         |  7 +++++++
 6 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index ddff1854988f..cfb7a3459785 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -907,7 +907,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		 */
 		retval = kernel_connect(socket,
 					(struct sockaddr *)(&coredump_unix_socket),
-					COREDUMP_UNIX_SOCKET_ADDR_SIZE, O_NONBLOCK);
+					COREDUMP_UNIX_SOCKET_ADDR_SIZE, O_NONBLOCK |
+					SOCK_COREDUMP);
 
 		/*
 		 * ... So we can safely put our pidfs reference now...
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 8c4d83fb115b..7ff1e7923f19 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -35,6 +35,7 @@ struct pidfs_exit_info {
 	__u64 cgroupid;
 	__s32 exit_code;
 	__u32 coredump_mask;
+	__u64 coredump_cookie;
 };
 
 struct pidfs_inode {
@@ -300,6 +301,7 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 
 	if (mask & PIDFD_INFO_COREDUMP) {
 		kinfo.mask |= PIDFD_INFO_COREDUMP;
+		kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
 		kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
 	}
 
@@ -321,8 +323,10 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 
 	if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
 		task_lock(task);
-		if (task->mm)
+		if (task->mm) {
+			kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
 			kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
+		}
 		task_unlock(task);
 	}
 
@@ -589,6 +593,20 @@ void pidfs_exit(struct task_struct *tsk)
 	}
 }
 
+void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie)
+{
+	struct pidfs_exit_info *exit_info;
+	struct dentry *dentry = pid->stashed;
+	struct inode *inode;
+
+	if (WARN_ON_ONCE(!dentry))
+		return;
+
+	inode = d_inode(dentry);
+	exit_info = &pidfs_i(inode)->__pei;
+	smp_store_release(&exit_info->coredump_cookie, coredump_cookie);
+}
+
 void pidfs_coredump(const struct coredump_params *cprm)
 {
 	struct pid *pid = cprm->pid;
diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..005f1e52e7f1 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -81,6 +81,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
+#define SOCK_COREDUMP	O_TRUNC
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index f7729b9371bc..5875037be272 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -10,6 +10,7 @@ void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
 void pidfs_exit(struct task_struct *tsk);
 void pidfs_coredump(const struct coredump_params *cprm);
+void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie);
 extern const struct dentry_operations pidfs_dentry_operations;
 int pidfs_register_pid(struct pid *pid);
 void pidfs_get_pid(struct pid *pid);
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 84ac709f560c..f46819a02d23 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -108,6 +108,7 @@ struct pidfd_info {
 	__s32 exit_code;
 	__u32 coredump_mask;
 	__u32 __spare1;
+	__u64 coredump_cookie;
 };
 
 #define PIDFS_IOCTL_MAGIC 0xFF
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 148d008862e7..45e7a6363939 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -101,6 +101,7 @@
 #include <linux/string.h>
 #include <linux/uaccess.h>
 #include <linux/pidfs.h>
+#include <linux/sock_diag.h>
 #include <net/af_unix.h>
 #include <net/net_namespace.h>
 #include <net/scm.h>
@@ -771,6 +772,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
 
 struct unix_peercred {
 	struct pid *peer_pid;
+	u64 cookie;
 	const struct cred *peer_cred;
 };
 
@@ -806,6 +808,8 @@ static void drop_peercred(struct unix_peercred *peercred)
 static inline void init_peercred(struct sock *sk,
 				 const struct unix_peercred *peercred)
 {
+	if (peercred->cookie)
+		pidfs_coredump_cookie(peercred->peer_pid, peercred->cookie);
 	sk->sk_peer_pid = peercred->peer_pid;
 	sk->sk_peer_cred = peercred->peer_cred;
 }
@@ -1717,6 +1721,9 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	unix_peer(newsk)	= sk;
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
+	/* Prepare a new socket cookie for the receiver. */
+	if (flags & SOCK_COREDUMP)
+		peercred.cookie = sock_gen_cookie(newsk);
 	init_peercred(newsk, &peercred);
 	newu = unix_sk(newsk);
 	newu->listener = other;

-- 
2.47.2



* Re: [PATCH v4 09/11] pidfs, coredump: allow to verify coredump connection
  2025-05-07 16:13 ` [PATCH v4 09/11] pidfs, coredump: allow to verify coredump connection Christian Brauner
@ 2025-05-07 18:34   ` Mickaël Salaün
  0 siblings, 0 replies; 18+ messages in thread
From: Mickaël Salaün @ 2025-05-07 18:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kuniyuki Iwashima, linux-fsdevel, Jann Horn, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn, linux-security-module

On Wed, May 07, 2025 at 06:13:42PM +0200, Christian Brauner wrote:
> When a coredump connection is initiated, use the socket cookie as the
> coredump cookie and store it in the pidfd. The receiver can now easily
> authenticate that the connection is coming from the kernel.
> 
> Unless the coredump server expects to handle connections from
> non-crashing tasks, it can validate that the connection has been made
> from a crashing task:
> 
>    fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
>    getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);
> 
>    struct pidfd_info info = {
>            .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
>    };
> 
>    ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
>    /* Refuse connections that aren't from a crashing task. */
>    if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
>            close(fd_coredump);
> 
>    /*
>     * Make sure that the coredump cookie matches the connection cookie.
>     * If they don't it's not the coredump connection from the kernel.
>     * We'll get another connection request in a bit.
>     */
>    getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
>    if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
>            close(fd_coredump);
> 
> The kernel guarantees that by the time the connection is made the
> coredump info is available.

Nice approach to tie the coredump socket to the coredumped pidfd!
This indeed removes the previous race condition.

I guess a socket's cookie is never zero?


* [PATCH v4 10/11] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (8 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 09/11] pidfs, coredump: allow to verify coredump connection Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  2025-05-07 16:13 ` [PATCH v4 11/11] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Add PIDFD_INFO_COREDUMP infrastructure so we can use it in tests.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 55bcf81a2b9a..887c74007086 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -131,6 +131,26 @@
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
 #endif
 
+#ifndef PIDFD_INFO_COREDUMP
+#define PIDFD_INFO_COREDUMP	(1UL << 4)
+#endif
+
+#ifndef PIDFD_COREDUMPED
+#define PIDFD_COREDUMPED	(1U << 0) /* Did crash and... */
+#endif
+
+#ifndef PIDFD_COREDUMP_SKIP
+#define PIDFD_COREDUMP_SKIP	(1U << 1) /* coredump generation was skipped. */
+#endif
+
+#ifndef PIDFD_COREDUMP_USER
+#define PIDFD_COREDUMP_USER	(1U << 2) /* coredump was done as the user. */
+#endif
+
+#ifndef PIDFD_COREDUMP_ROOT
+#define PIDFD_COREDUMP_ROOT	(1U << 3) /* coredump was done as root. */
+#endif
+
 #ifndef PIDFD_THREAD
 #define PIDFD_THREAD O_EXCL
 #endif
@@ -150,6 +170,9 @@ struct pidfd_info {
 	__u32 fsuid;
 	__u32 fsgid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u32 __spare1;
+	__u64 coredump_cookie;
 };
 
 /*

-- 
2.47.2



* [PATCH v4 11/11] selftests/coredump: add tests for AF_UNIX coredumps
  2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
                   ` (9 preceding siblings ...)
  2025-05-07 16:13 ` [PATCH v4 10/11] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
@ 2025-05-07 16:13 ` Christian Brauner
  10 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-05-07 16:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, linux-fsdevel, Jann Horn
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, Christian Brauner, Alexander Mikhalitsyn

Add a simple test for generating coredumps via AF_UNIX sockets.
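
The tests bind abstract AF_UNIX addresses, whose length is computed by hand as offsetof(struct sockaddr_un, sun_path) plus the name plus one leading NUL byte. As a sketch of that arithmetic, the hypothetical helper abstract_unix_addr() below builds such an address at runtime and returns the length to pass to bind() or connect():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Fill @addr with the abstract AF_UNIX address @name (given without
 * the leading NUL) and return the address length for bind()/connect().
 * Returns 0 if @name does not fit into sun_path.
 */
socklen_t abstract_unix_addr(struct sockaddr_un *addr, const char *name)
{
	size_t name_len;

	assert(addr != NULL && name != NULL);
	name_len = strlen(name);
	if (name_len + 1 > sizeof(addr->sun_path))
		return 0;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	/* sun_path[0] stays NUL, which marks the address as abstract. */
	memcpy(addr->sun_path + 1, name, name_len);

	/* Family header + leading NUL + name, with no trailing NUL. */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + name_len);
}
```

For "linuxafsk/coredump.socket" this yields the same length as the sizeof()-based expression in the test, because sizeof() on the string literal counts its trailing NUL where this helper counts the leading one.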

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/coredump/stackdump_test.c | 273 +++++++++++++++++++++-
 1 file changed, 272 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index fe3c728cd6be..a86f4ba0a367 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -5,10 +5,15 @@
 #include <linux/limits.h>
 #include <pthread.h>
 #include <string.h>
+#include <sys/mount.h>
 #include <sys/resource.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/un.h>
 #include <unistd.h>
 
 #include "../kselftest_harness.h"
+#include "../pidfd/pidfd.h"
 
 #define STACKDUMP_FILE "stack_values"
 #define STACKDUMP_SCRIPT "stackdump"
@@ -35,6 +40,7 @@ static void crashing_child(void)
 FIXTURE(coredump)
 {
 	char original_core_pattern[256];
+	pid_t pid_coredump_server;
 };
 
 FIXTURE_SETUP(coredump)
@@ -44,6 +50,7 @@ FIXTURE_SETUP(coredump)
 	char *dir;
 	int ret;
 
+	self->pid_coredump_server = -ESRCH;
 	file = fopen("/proc/sys/kernel/core_pattern", "r");
 	ASSERT_NE(NULL, file);
 
@@ -61,10 +68,15 @@ FIXTURE_TEARDOWN(coredump)
 {
 	const char *reason;
 	FILE *file;
-	int ret;
+	int ret, status;
 
 	unlink(STACKDUMP_FILE);
 
+	if (self->pid_coredump_server > 0) {
+		kill(self->pid_coredump_server, SIGTERM);
+		waitpid(self->pid_coredump_server, &status, 0);
+	}
+
 	file = fopen("/proc/sys/kernel/core_pattern", "w");
 	if (!file) {
 		reason = "Unable to open core_pattern";
@@ -154,4 +166,263 @@ TEST_F_TIMEOUT(coredump, stackdump, 120)
 	fclose(file);
 }
 
+TEST_F(coredump, socket)
+{
+	int fd, pidfd, ret, status;
+	FILE *file;
+	pid_t pid, pid_coredump_server;
+	struct stat st;
+	char core_file[PATH_MAX];
+	struct pidfd_info info = {};
+	int ipc_sockets[2];
+	char c;
+
+	ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+	ASSERT_EQ(mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), 0);
+	ASSERT_EQ(mount(NULL, "/tmp", "tmpfs", 0, NULL), 0);
+
+	file = fopen("/proc/sys/kernel/core_pattern", "w");
+	ASSERT_NE(NULL, file);
+
+	ret = fprintf(file, "@linuxafsk/coredump.socket");
+	ASSERT_EQ(ret, strlen("@linuxafsk/coredump.socket"));
+	ASSERT_EQ(fclose(file), 0);
+
+	ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	pid_coredump_server = fork();
+	ASSERT_GE(pid_coredump_server, 0);
+	if (pid_coredump_server == 0) {
+		int fd_socket, fd_coredump, fd_peer_pidfd, fd_core_file;
+		__u64 peer_cookie;
+		socklen_t fd_peer_pidfd_len, peer_cookie_len;
+		static const struct sockaddr_un coredump_sk = {
+			.sun_family = AF_UNIX,
+			.sun_path = "\0linuxafsk/coredump.socket",
+		};
+		static const size_t coredump_sk_len =
+			offsetof(struct sockaddr_un, sun_path) +
+			sizeof("linuxafsk/coredump.socket"); /* +1 for leading NUL */
+
+		close(ipc_sockets[0]);
+
+		fd_socket = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+		if (fd_socket < 0)
+			_exit(EXIT_FAILURE);
+
+		ret = bind(fd_socket, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to bind coredump socket\n");
+			close(fd_socket);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		ret = listen(fd_socket, 1);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to listen on coredump socket\n");
+			close(fd_socket);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
+			close(fd_socket);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		close(ipc_sockets[1]);
+
+		fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
+		if (fd_coredump < 0) {
+			fprintf(stderr, "Failed to accept coredump socket connection\n");
+			close(fd_socket);
+			_exit(EXIT_FAILURE);
+		}
+
+		peer_cookie_len = sizeof(peer_cookie);
+		ret = getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE,
+				 &peer_cookie, &peer_cookie_len);
+		if (ret < 0) {
+			fprintf(stderr, "%m - Failed to retrieve cookie for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_socket);
+			_exit(EXIT_FAILURE);
+		}
+
+		fd_peer_pidfd_len = sizeof(fd_peer_pidfd);
+		ret = getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD,
+				 &fd_peer_pidfd, &fd_peer_pidfd_len);
+		if (ret < 0) {
+			fprintf(stderr, "%m - Failed to retrieve peer pidfd for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_socket);
+			_exit(EXIT_FAILURE);
+		}
+
+		memset(&info, 0, sizeof(info));
+		info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+		ret = ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to retrieve pidfd info from peer pidfd for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_socket);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!(info.mask & PIDFD_INFO_COREDUMP)) {
+			fprintf(stderr, "Missing coredump information from coredumping task\n");
+			close(fd_coredump);
+			close(fd_socket);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
+			fprintf(stderr, "Received connection from non-coredumping task\n");
+			close(fd_coredump);
+			close(fd_socket);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!info.coredump_cookie) {
+			fprintf(stderr, "Missing coredump cookie\n");
+			close(fd_coredump);
+			close(fd_socket);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (info.coredump_cookie != peer_cookie) {
+			fprintf(stderr, "Mismatching coredump cookies\n");
+			close(fd_coredump);
+			close(fd_socket);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		fd_core_file = creat("/tmp/coredump.file", 0644);
+		if (fd_core_file < 0) {
+			fprintf(stderr, "Failed to create coredump file\n");
+			close(fd_coredump);
+			close(fd_socket);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		for (;;) {
+			char buffer[4096];
+			ssize_t bytes_read, bytes_write;
+
+			bytes_read = read(fd_coredump, buffer, sizeof(buffer));
+			if (bytes_read < 0) {
+				close(fd_coredump);
+				close(fd_socket);
+				close(fd_peer_pidfd);
+				close(fd_core_file);
+				_exit(EXIT_FAILURE);
+			}
+
+			if (bytes_read == 0)
+				break;
+
+			bytes_write = write(fd_core_file, buffer, bytes_read);
+			if (bytes_read != bytes_write) {
+				close(fd_coredump);
+				close(fd_socket);
+				close(fd_peer_pidfd);
+				close(fd_core_file);
+				_exit(EXIT_FAILURE);
+			}
+		}
+
+		close(fd_coredump);
+		close(fd_socket);
+		close(fd_peer_pidfd);
+		close(fd_core_file);
+		_exit(EXIT_SUCCESS);
+	}
+	self->pid_coredump_server = pid_coredump_server;
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		crashing_child();
+
+	pidfd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFSIGNALED(status));
+	ASSERT_TRUE(WCOREDUMP(status));
+
+	info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+	ASSERT_EQ(ioctl(pidfd, PIDFD_GET_INFO, &info), 0);
+	ASSERT_GT((info.mask & PIDFD_INFO_COREDUMP), 0);
+	ASSERT_GT((info.coredump_mask & PIDFD_COREDUMPED), 0);
+
+	waitpid(pid_coredump_server, &status, 0);
+	self->pid_coredump_server = -ESRCH;
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	ASSERT_EQ(stat("/tmp/coredump.file", &st), 0);
+	ASSERT_GT(st.st_size, 0);
+	/*
+	 * We should somehow validate the produced core file.
+	 * For now just allow for visual inspection
+	 */
+	system("file /tmp/coredump.file");
+}
+
+TEST_F(coredump, socket_econnrefused)
+{
+	int fd_socket;
+	static const struct sockaddr_un linuxafsk = {
+		.sun_family = AF_UNIX,
+		.sun_path = "\0linuxafsk/",
+	};
+	static const size_t linuxafsk_len =
+		offsetof(struct sockaddr_un, sun_path) +
+		sizeof("linuxafsk/"); /* terminating NUL counts the leading NUL */
+
+	fd_socket = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+	ASSERT_GE(fd_socket, 0);
+
+	ASSERT_NE(bind(fd_socket, (const struct sockaddr *)&linuxafsk, linuxafsk_len), 0);
+	ASSERT_EQ(errno, ECONNREFUSED);
+	EXPECT_EQ(close(fd_socket), 0);
+}
+
+TEST_F(coredump, socket_econnrefused_privilege)
+{
+	int fd_socket;
+	static const struct sockaddr_un linuxafsk = {
+		.sun_family = AF_UNIX,
+		.sun_path = "\0linuxafsk/nope",
+	};
+	static const size_t linuxafsk_len =
+		offsetof(struct sockaddr_un, sun_path) +
+		sizeof("linuxafsk/nope"); /* terminating NUL counts the leading NUL */
+
+	ASSERT_EQ(seteuid(1234), 0);
+
+	fd_socket = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+	ASSERT_GE(fd_socket, 0);
+
+	ASSERT_NE(bind(fd_socket, (const struct sockaddr *)&linuxafsk, linuxafsk_len), 0);
+	ASSERT_EQ(errno, ECONNREFUSED);
+	EXPECT_EQ(close(fd_socket), 0);
+
+	ASSERT_EQ(seteuid(0), 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2

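The first socket test above leaves core file validation as an open item
and falls back to running file(1) for visual inspection. One possible
check, shown here only as a sketch and not as part of the patch, is to
read the ELF header and confirm that e_type is ET_CORE. The helper name
looks_like_elf_core() is hypothetical, and the sketch assumes the core
file has the same byte order as the machine running the test.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: returns 1 if the file at
 * @path starts with an ELF header whose e_type is ET_CORE, 0 if it does
 * not, and -1 if the file cannot be opened or is too short.
 */
static int looks_like_elf_core(const char *path)
{
	unsigned char hdr[sizeof(Elf64_Ehdr)];
	Elf64_Half e_type;
	size_t n;
	FILE *f;

	assert(path != NULL);

	f = fopen(path, "rb");
	if (!f)
		return -1;
	n = fread(hdr, 1, sizeof(hdr), f);
	fclose(f);

	/* e_ident and e_type sit at the same offsets in ELF32 and ELF64. */
	if (n < EI_NIDENT + sizeof(e_type))
		return -1;
	if (memcmp(hdr, ELFMAG, SELFMAG) != 0)
		return 0;

	/* Assumption: the core file uses the host's byte order. */
	memcpy(&e_type, hdr + EI_NIDENT, sizeof(e_type));
	return e_type == ET_CORE;
}
```

In the selftest this could be used along the lines of
ASSERT_EQ(looks_like_elf_core("/tmp/coredump.file"), 1) next to the
existing st_size check. It only inspects the header, so it would not
catch a truncated or otherwise damaged core.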


end of thread (newest message: 2025-05-09  8:07 UTC)

Thread overview: 18+ messages
2025-05-07 16:13 [PATCH v4 00/11] coredump: add coredump socket Christian Brauner
2025-05-07 16:13 ` [PATCH v4 01/11] coredump: massage format_corname() Christian Brauner
2025-05-07 16:13 ` [PATCH v4 02/11] coredump: massage do_coredump() Christian Brauner
2025-05-07 16:13 ` [PATCH v4 03/11] coredump: reflow dump helpers a little Christian Brauner
2025-05-07 16:13 ` [PATCH v4 04/11] net: reserve prefix Christian Brauner
2025-05-07 22:45   ` Kuniyuki Iwashima
2025-05-08  6:16     ` Christian Brauner
2025-05-08 21:47       ` Kuniyuki Iwashima
2025-05-09  5:54         ` Christian Brauner
2025-05-09  8:07           ` Daniel Borkmann
2025-05-07 16:13 ` [PATCH v4 05/11] coredump: add coredump socket Christian Brauner
2025-05-07 16:13 ` [PATCH v4 06/11] coredump: validate socket name as it is written Christian Brauner
2025-05-07 16:13 ` [PATCH v4 07/11] coredump: show supported coredump modes Christian Brauner
2025-05-07 16:13 ` [PATCH v4 08/11] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
2025-05-07 16:13 ` [PATCH v4 09/11] pidfs, coredump: allow to verify coredump connection Christian Brauner
2025-05-07 18:34   ` Mickaël Salaün
2025-05-07 16:13 ` [PATCH v4 10/11] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
2025-05-07 16:13 ` [PATCH v4 11/11] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
