linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC v3 00/10] coredump: add coredump socket
@ 2025-05-05 11:13 Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 01/10] coredump: massage format_corname() Christian Brauner
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous, especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There are various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not that is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is in a weird hybrid
  upcall state which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an abstract AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @linuxafsk/coredump.socket

The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps.

The coredump socket uses the fixed address "linuxafsk/coredump.socket"
for now.

The coredump socket is located in the initial network namespace. To bind
the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
user namespace. Listening and reading can happen from whatever
unprivileged context is necessary to safely process coredumps.

When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket. For now only tasks that
are actually coredumping are allowed to connect to the initial coredump
socket.

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. That can be done on the AF_UNIX socket directly.

- The pidfd for the crashing task will contain information about how the
  task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
  information.

  If the coredump server gets a new coredump client connection the
  kernel guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.

- Since unix_stream_connect() runs bpf programs during connect, it's
  possible to even redirect or multiplex coredumps to other sockets.

- The coredump server should mark itself as non-dumpable.
  To capture coredumps for the coredump server itself a bpf program
  should be run at connect to redirect it to another socket in
  userspace. This can be useful for debugging crashing coredump servers.

- A container coredump server in a separate network namespace can simply
  bind to linuxafsk/coredump.socket and systemd-coredump forwards
  coredumps to the container.

- Fwiw, one idea is to handle coredumps via per-user/session coredump
  servers that run with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler which runs with the user's privileges only.

The new coredump socket allows userspace to process coredumps without
relying on usermode helpers and provides a safer alternative to
super-privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermode helper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.

This is easy to test:

(a) coredump processing (we're using socat):

    > cat coredump_socket.sh
    #!/bin/bash

    set -x

    sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
    sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc

(b) trigger a coredump:

    user1@localhost:~/data/scripts$ cat crash.c
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            /* dereference NULL to reliably raise SIGSEGV and dump core;
             * a constant 1 / 0 may be constant-folded or merely warn */
            fprintf(stderr, "%u\n", *(volatile unsigned int *)0);
            _exit(0);
    }

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v3:
- Use an abstract unix socket.
- Add documentation.
- Add selftests.
- Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org

Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org

---
Christian Brauner (10):
      coredump: massage format_corname()
      coredump: massage do_coredump()
      net: reserve prefix
      coredump: add coredump socket
      coredump: validate socket name as it is written
      coredump: show supported coredump modes
      pidfs, coredump: add PIDFD_INFO_COREDUMP
      net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
      selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
      selftests/coredump: add tests for AF_UNIX coredumps

 fs/coredump.c                                     | 358 +++++++++++++++++-----
 fs/pidfs.c                                        |  68 ++++
 include/linux/coredump.h                          |  12 +
 include/linux/pidfs.h                             |   4 +
 include/uapi/linux/pidfd.h                        |  16 +
 include/uapi/linux/un.h                           |   2 +
 net/unix/af_unix.c                                |  64 +++-
 tools/testing/selftests/coredump/stackdump_test.c |  71 ++++-
 tools/testing/selftests/pidfd/pidfd.h             |  22 ++
 9 files changed, 528 insertions(+), 89 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c



* [PATCH RFC v3 01/10] coredump: massage format_corname()
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 02/10] coredump: massage do_coredump() Christian Brauner
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index d740a0411266..281320ea351f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -76,9 +76,15 @@ static char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 
+enum coredump_type_t {
+	COREDUMP_FILE = 1,
+	COREDUMP_PIPE = 2,
+};
+
 struct core_name {
 	char *corename;
 	int used, size;
+	enum coredump_type_t core_type;
 };
 
 static int expand_corename(struct core_name *cn, int size)
@@ -218,18 +224,21 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 {
 	const struct cred *cred = current_cred();
 	const char *pat_ptr = core_pattern;
-	int ispipe = (*pat_ptr == '|');
 	bool was_space = false;
 	int pid_in_pattern = 0;
 	int err = 0;
 
 	cn->used = 0;
 	cn->corename = NULL;
+	if (*pat_ptr == '|')
+		cn->core_type = COREDUMP_PIPE;
+	else
+		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (ispipe) {
+	if (cn->core_type == COREDUMP_PIPE) {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,7 +256,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		 * Split on spaces before doing template expansion so that
 		 * %e and %E don't get split if they have spaces in them
 		 */
-		if (ispipe) {
+		if (cn->core_type == COREDUMP_PIPE) {
 			if (isspace(*pat_ptr)) {
 				if (cn->used != 0)
 					was_space = true;
@@ -353,7 +362,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 				 * Installing a pidfd only makes sense if
 				 * we actually spawn a usermode helper.
 				 */
-				if (!ispipe)
+				if (cn->core_type != COREDUMP_PIPE)
 					break;
 
 				/*
@@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	 * If core_pattern does not include a %p (as is the default)
 	 * and core_uses_pid is set, then .%pid will be appended to
 	 * the filename. Do not do this for piped commands. */
-	if (!ispipe && !pid_in_pattern && core_uses_pid) {
+	if (cn->core_type != COREDUMP_PIPE && !pid_in_pattern && core_uses_pid) {
 		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
 		if (err)
 			return err;
 	}
-	return ispipe;
+	return 0;
 }
 
 static int zap_process(struct signal_struct *signal, int exit_code)
@@ -583,7 +592,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	const struct cred *old_cred;
 	struct cred *cred;
 	int retval = 0;
-	int ispipe;
 	size_t *argv = NULL;
 	int argc = 0;
 	/* require nonrelative corefile path and be extra careful */
@@ -632,19 +640,18 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm, &argv, &argc);
+	retval = format_corename(&cn, &cprm, &argv, &argc);
+	if (retval < 0) {
+		coredump_report_failure("format_corename failed, aborting core");
+		goto fail_unlock;
+	}
 
-	if (ispipe) {
+	if (cn.core_type == COREDUMP_PIPE) {
 		int argi;
 		int dump_count;
 		char **helper_argv;
 		struct subprocess_info *sub_info;
 
-		if (ispipe < 0) {
-			coredump_report_failure("format_corename failed, aborting core");
-			goto fail_unlock;
-		}
-
 		if (cprm.limit == 1) {
 			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
 			 *
@@ -695,7 +702,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			coredump_report_failure("|%s pipe failed", cn.corename);
 			goto close_fail;
 		}
-	} else {
+	} else if (cn.core_type == COREDUMP_FILE) {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -823,13 +830,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if (ispipe && core_pipe_limit)
+	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
 		wait_for_dump_helpers(cprm.file);
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
 fail_dropcount:
-	if (ispipe)
+	if (cn.core_type == COREDUMP_PIPE)
 		atomic_dec(&core_dump_count);
 fail_unlock:
 	kfree(argv);

-- 
2.47.2



* [PATCH RFC v3 02/10] coredump: massage do_coredump()
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 01/10] coredump: massage format_corname() Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 03/10] net: reserve prefix Christian Brauner
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 66 insertions(+), 57 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 281320ea351f..1779299b8c61 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -646,63 +646,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		goto fail_unlock;
 	}
 
-	if (cn.core_type == COREDUMP_PIPE) {
-		int argi;
-		int dump_count;
-		char **helper_argv;
-		struct subprocess_info *sub_info;
-
-		if (cprm.limit == 1) {
-			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
-			 *
-			 * Normally core limits are irrelevant to pipes, since
-			 * we're not writing to the file system, but we use
-			 * cprm.limit of 1 here as a special value, this is a
-			 * consistent way to catch recursive crashes.
-			 * We can still crash if the core_pattern binary sets
-			 * RLIM_CORE = !1, but it runs as root, and can do
-			 * lots of stupid things.
-			 *
-			 * Note that we use task_tgid_vnr here to grab the pid
-			 * of the process group leader.  That way we get the
-			 * right pid if a thread in a multi-threaded
-			 * core_pattern process dies.
-			 */
-			coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
-			goto fail_unlock;
-		}
-		cprm.limit = RLIM_INFINITY;
-
-		dump_count = atomic_inc_return(&core_dump_count);
-		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
-			coredump_report_failure("over core_pipe_limit, skipping core dump");
-			goto fail_dropcount;
-		}
-
-		helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
-					    GFP_KERNEL);
-		if (!helper_argv) {
-			coredump_report_failure("%s failed to allocate memory", __func__);
-			goto fail_dropcount;
-		}
-		for (argi = 0; argi < argc; argi++)
-			helper_argv[argi] = cn.corename + argv[argi];
-		helper_argv[argi] = NULL;
-
-		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_coredump_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
-
-		kfree(helper_argv);
-		if (retval) {
-			coredump_report_failure("|%s pipe failed", cn.corename);
-			goto close_fail;
-		}
-	} else if (cn.core_type == COREDUMP_FILE) {
+	switch (cn.core_type) {
+	case COREDUMP_FILE: {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -796,6 +741,70 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		if (do_truncate(idmap, cprm.file->f_path.dentry,
 				0, 0, cprm.file))
 			goto close_fail;
+		break;
+	}
+	case COREDUMP_PIPE: {
+		int argi;
+		int dump_count;
+		char **helper_argv;
+		struct subprocess_info *sub_info;
+
+		if (cprm.limit == 1) {
+			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
+			 *
+			 * Normally core limits are irrelevant to pipes, since
+			 * we're not writing to the file system, but we use
+			 * cprm.limit of 1 here as a special value, this is a
+			 * consistent way to catch recursive crashes.
+			 * We can still crash if the core_pattern binary sets
+			 * RLIM_CORE = !1, but it runs as root, and can do
+			 * lots of stupid things.
+			 *
+			 * Note that we use task_tgid_vnr here to grab the pid
+			 * of the process group leader.  That way we get the
+			 * right pid if a thread in a multi-threaded
+			 * core_pattern process dies.
+			 */
+			coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
+			goto fail_unlock;
+		}
+		cprm.limit = RLIM_INFINITY;
+
+		dump_count = atomic_inc_return(&core_dump_count);
+		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+			coredump_report_failure("over core_pipe_limit, skipping core dump");
+			goto fail_dropcount;
+		}
+
+		helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
+					    GFP_KERNEL);
+		if (!helper_argv) {
+			coredump_report_failure("%s failed to allocate memory", __func__);
+			goto fail_dropcount;
+		}
+		for (argi = 0; argi < argc; argi++)
+			helper_argv[argi] = cn.corename + argv[argi];
+		helper_argv[argi] = NULL;
+
+		retval = -ENOMEM;
+		sub_info = call_usermodehelper_setup(helper_argv[0],
+						helper_argv, NULL, GFP_KERNEL,
+						umh_coredump_setup, NULL, &cprm);
+		if (sub_info)
+			retval = call_usermodehelper_exec(sub_info,
+							  UMH_WAIT_EXEC);
+
+		kfree(helper_argv);
+		if (retval) {
+			coredump_report_failure("|%s pipe failed", cn.corename);
+			goto close_fail;
+		}
+		break;
+	}
+	default:
+		WARN_ON_ONCE(true);
+		retval = -EINVAL;
+		goto close_fail;
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */

-- 
2.47.2



* [PATCH RFC v3 03/10] net: reserve prefix
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 01/10] coredump: massage format_corname() Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 02/10] coredump: massage do_coredump() Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 04/10] coredump: add coredump socket Christian Brauner
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
CAP_NET_ADMIN in the owning user namespace of the network namespace to
bind it. This will be used in the next patches to support the coredump
socket but is a generally useful concept.

The collision risk is so low that we can just start using it. Userspace
must already be prepared to retry if a given abstract address isn't
usable anyway.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/uapi/linux/un.h |  2 ++
 net/unix/af_unix.c      | 45 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/un.h b/include/uapi/linux/un.h
index 0ad59dc8b686..bbd5ad508dfa 100644
--- a/include/uapi/linux/un.h
+++ b/include/uapi/linux/un.h
@@ -5,6 +5,8 @@
 #include <linux/socket.h>
 
 #define UNIX_PATH_MAX	108
+/* reserved AF_UNIX socket namespace. */
+#define UNIX_SOCKET_NAMESPACE "linuxafsk/"
 
 struct sockaddr_un {
 	__kernel_sa_family_t sun_family; /* AF_UNIX */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 472f8aa9ea15..edc2f143f401 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,6 +114,9 @@ static atomic_long_t unix_nr_socks;
 static struct hlist_head bsd_socket_buckets[UNIX_HASH_SIZE / 2];
 static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2];
 
+static struct sockaddr_un linuxafsk_addr;
+static size_t linuxafsk_addr_len;
+
 /* SMP locking strategy:
  *    hash table is protected with spinlock.
  *    each socket state is protected by separate spinlock.
@@ -436,6 +439,30 @@ static struct sock *__unix_find_socket_byname(struct net *net,
 	return NULL;
 }
 
+static int unix_may_bind_name(struct net *net, struct sockaddr_un *sunname,
+			      int len, unsigned int hash)
+{
+	struct sock *s;
+
+	s = __unix_find_socket_byname(net, sunname, len, hash);
+	if (s)
+		return -EADDRINUSE;
+
+	/*
+	 * Check whether this is our reserved prefix and if so ensure
+	 * that only privileged processes can bind it.
+	 */
+	if (linuxafsk_addr_len <= len &&
+	    !memcmp(&linuxafsk_addr, sunname, linuxafsk_addr_len)) {
+		/* Don't bind the namespace itself. */
+		if (linuxafsk_addr_len == len)
+			return -ECONNREFUSED;
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+			return -ECONNREFUSED;
+	}
+	return 0;
+}
+
 static inline struct sock *unix_find_socket_byname(struct net *net,
 						   struct sockaddr_un *sunname,
 						   int len, unsigned int hash)
@@ -1258,10 +1285,10 @@ static int unix_autobind(struct sock *sk)
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
 	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash)) {
+	if (unix_may_bind_name(net, addr->name, addr->len, new_hash)) {
 		unix_table_double_unlock(net, old_hash, new_hash);
 
-		/* __unix_find_socket_byname() may take long time if many names
+		/* unix_may_bind_name() may take long time if many names
 		 * are already in use.
 		 */
 		cond_resched();
@@ -1379,7 +1406,8 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
 	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash))
+	err = unix_may_bind_name(net, addr->name, addr->len, new_hash);
+	if (err)
 		goto out_spin;
 
 	__unix_set_addr_hash(net, sk, addr, new_hash);
@@ -1389,7 +1417,6 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 
 out_spin:
 	unix_table_double_unlock(net, old_hash, new_hash);
-	err = -EADDRINUSE;
 out_mutex:
 	mutex_unlock(&u->bindlock);
 out:
@@ -3841,6 +3868,16 @@ static int __init af_unix_init(void)
 
 	BUILD_BUG_ON(sizeof(struct unix_skb_parms) > sizeof_field(struct sk_buff, cb));
 
+	/*
+	 * We need a leading NUL byte for the abstract namespace. Just
+	 * use the trailing one given by sizeof().
+	 */
+	linuxafsk_addr_len = offsetof(struct sockaddr_un, sun_path) + sizeof(UNIX_SOCKET_NAMESPACE);
+	linuxafsk_addr.sun_family = AF_UNIX;
+	memcpy(linuxafsk_addr.sun_path + 1, UNIX_SOCKET_NAMESPACE, sizeof(UNIX_SOCKET_NAMESPACE) - 1);
+	/* Technically not needed, but let's be explicit. */
+	linuxafsk_addr.sun_path[0] = '\0';
+
 	for (i = 0; i < UNIX_HASH_SIZE / 2; i++) {
 		spin_lock_init(&bsd_socket_locks[i]);
 		INIT_HLIST_HEAD(&bsd_socket_buckets[i]);

-- 
2.47.2



* [PATCH RFC v3 04/10] coredump: add coredump socket
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (2 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 03/10] net: reserve prefix Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 12:55   ` Jann Horn
  2025-05-05 18:48   ` Kuniyuki Iwashima
  2025-05-05 11:13 ` [PATCH RFC v3 05/10] coredump: validate socket name as it is written Christian Brauner
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous, especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There are various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not that is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is in a weird hybrid
  upcall state which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an abstract AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @linuxafsk/coredump.socket

The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps.

The coredump socket uses the fixed address "linuxafsk/coredump.socket"
for now.

The coredump socket is located in the initial network namespace. To bind
the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
user namespace. Listening and reading can happen from whatever
unprivileged context is necessary to safely process coredumps.

When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket. For now only tasks that
are actually coredumping are allowed to connect to the initial coredump
socket.

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. That can be done on the AF_UNIX socket directly.

- The pidfd for the crashing task will contain information about how the
  task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
  information.

  If the coredump server gets a new coredump client connection the
  kernel guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.

- Since unix_stream_connect() runs bpf programs during connect, it's
  possible to even redirect or multiplex coredumps to other sockets.

- The coredump server should mark itself as non-dumpable.
  To capture coredumps for the coredump server itself a bpf program
  should be run at connect to redirect it to another socket in
  userspace. This can be useful for debugging crashing coredump servers.

- A container coredump server in a separate network namespace can simply
  bind to linuxafsk/coredump.socket and systemd-coredump forwards
  coredumps to the container.

- Fwiw, one idea is to handle coredumps via per-user/session coredump
  servers that run with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler which runs with the user's privileges only.

The new coredump socket allows userspace to process coredumps without
relying on usermode helpers and provides a safer alternative to
super-privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermode helper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.

This is easy to test:

(a) coredump processing (we're using socat):

    > cat coredump_socket.sh
    #!/bin/bash

    set -x

    sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
    sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc

(b) trigger a coredump:

    user1@localhost:~/data/scripts$ cat crash.c
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            fprintf(stderr, "%u\n", (1 / 0));
            _exit(0);
    }

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 107 insertions(+), 5 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 1779299b8c61..c60f86c473ad 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -44,7 +44,11 @@
 #include <linux/sysctl.h>
 #include <linux/elf.h>
 #include <linux/pidfs.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+#include <net/net_namespace.h>
 #include <uapi/linux/pidfd.h>
+#include <uapi/linux/un.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 enum coredump_type_t {
 	COREDUMP_FILE = 1,
 	COREDUMP_PIPE = 2,
+	COREDUMP_SOCK = 3,
 };
 
 struct core_name {
@@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	cn->corename = NULL;
 	if (*pat_ptr == '|')
 		cn->core_type = COREDUMP_PIPE;
+	else if (*pat_ptr == '@')
+		cn->core_type = COREDUMP_SOCK;
 	else
 		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (cn->core_type == COREDUMP_PIPE) {
+	switch (cn->core_type) {
+	case COREDUMP_PIPE: {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		++pat_ptr;
 		if (!(*pat_ptr))
 			return -ENOMEM;
+		break;
+	}
+	case COREDUMP_SOCK: {
+		err = cn_printf(cn, "%s", pat_ptr);
+		if (err)
+			return err;
+
+		/*
+		 * We can potentially allow this to be changed later but
+		 * I currently see no reason to.
+		 */
+		if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
+			return -EINVAL;
+
+		/*
+		 * Currently no need to parse any other options.
+		 * Relevant information can be retrieved from the peer
+		 * pidfd retrievable via SO_PEERPIDFD by the receiver or
+		 * via /proc/<pid>, using the SO_PEERPIDFD to guard
+		 * against pid recycling when opening /proc/<pid>.
+		 */
+		return 0;
+	}
+	default:
+		WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
+		break;
 	}
 
 	/* Repeat as long as we have more pattern to process and more output
@@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
 	return 0;
 }
 
+#ifdef CONFIG_UNIX
+struct sockaddr_un coredump_unix_socket = {
+	.sun_family = AF_UNIX,
+	.sun_path = "\0linuxafsk/coredump.socket",
+};
+/* Without trailing NUL byte. */
+#define COREDUMP_UNIX_SOCKET_ADDR_SIZE            \
+	(offsetof(struct sockaddr_un, sun_path) + \
+	 sizeof("\0linuxafsk/coredump.socket") - 1)
+#endif
+
 void do_coredump(const kernel_siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		}
 		break;
 	}
+	case COREDUMP_SOCK: {
+		struct file *file __free(fput) = NULL;
+#ifdef CONFIG_UNIX
+		struct socket *socket;
+
+		/*
+		 * It is possible that the userspace process which is
+		 * supposed to handle the coredump and is listening on
+		 * the AF_UNIX socket coredumps. Userspace should just
+		 * mark itself non dumpable.
+		 */
+
+		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
+		if (retval < 0)
+			goto close_fail;
+
+		file = sock_alloc_file(socket, 0, NULL);
+		if (IS_ERR(file)) {
+			sock_release(socket);
+			retval = PTR_ERR(file);
+			goto close_fail;
+		}
+
+		retval = kernel_connect(socket,
+					(struct sockaddr *)(&coredump_unix_socket),
+					COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
+		if (retval)
+			goto close_fail;
+
+		cprm.limit = RLIM_INFINITY;
+#endif
+		cprm.file = no_free_ptr(file);
+		break;
+	}
 	default:
 		WARN_ON_ONCE(true);
 		retval = -EINVAL;
@@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		 * have this set to NULL.
 		 */
 		if (!cprm.file) {
-			coredump_report_failure("Core dump to |%s disabled", cn.corename);
+			if (cn.core_type == COREDUMP_PIPE)
+				coredump_report_failure("Core dump to |%s disabled", cn.corename);
+			else
+				coredump_report_failure("Core dump to @%s disabled", cn.corename);
 			goto close_fail;
 		}
 		if (!dump_vma_snapshot(&cprm))
@@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
-		wait_for_dump_helpers(cprm.file);
+
+	if (core_pipe_limit) {
+		switch (cn.core_type) {
+		case COREDUMP_PIPE:
+			wait_for_dump_helpers(cprm.file);
+			break;
+		case COREDUMP_SOCK: {
+			char buf[1];
+			/*
+			 * We use a simple read to wait for the coredump
+			 * processing to finish. Either the socket is
+			 * closed or we get sent unexpected data. In
+			 * both cases, we're done.
+			 */
+			__kernel_read(cprm.file, buf, 1, NULL);
+			break;
+		}
+		default:
+			break;
+		}
+	}
+
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
@@ -1070,7 +1172,7 @@ EXPORT_SYMBOL(dump_align);
 void validate_coredump_safety(void)
 {
 	if (suid_dumpable == SUID_DUMP_ROOT &&
-	    core_pattern[0] != '/' && core_pattern[0] != '|') {
+	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
 
 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
 			"pipe handler or fully qualified core dump path required. "

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH RFC v3 05/10] coredump: validate socket name as it is written
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (3 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 04/10] coredump: add coredump socket Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 06/10] coredump: show supported coredump modes Christian Brauner
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

In contrast to other parameters written into
/proc/sys/kernel/core_pattern, which never fail, we can validate
enabling the new AF_UNIX support. This is obviously racy as hell but
it's always been that way.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index c60f86c473ad..d3f12ba0f150 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1183,10 +1183,21 @@ void validate_coredump_safety(void)
 static int proc_dostring_coredump(const struct ctl_table *table, int write,
 		  void *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
+	int error;
+	ssize_t retval;
+	char old_core_pattern[CORENAME_MAX_SIZE];
+
+	retval = strscpy(old_core_pattern, core_pattern, CORENAME_MAX_SIZE);
+
+	error = proc_dostring(table, write, buffer, lenp, ppos);
+	if (error)
+		return error;
+	if (core_pattern[0] == '@' && strcmp(core_pattern, "@linuxafsk/coredump.socket")) {
+		strscpy(core_pattern, old_core_pattern, retval + 1);
+		return -EINVAL;
+	}
 
-	if (!error)
-		validate_coredump_safety();
+	validate_coredump_safety();
 	return error;
 }
 

-- 
2.47.2



* [PATCH RFC v3 06/10] coredump: show supported coredump modes
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (4 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 05/10] coredump: validate socket name as it is written Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 07/10] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Allow userspace to discover what coredump modes are supported.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index d3f12ba0f150..37dc5aa3806f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1203,6 +1203,12 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
 
 static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
 static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
+static char core_modes[] = {
+	"file\npipe"
+#ifdef CONFIG_UNIX
+	"\nlinuxafsk/coredump.socket"
+#endif
+};
 
 static const struct ctl_table coredump_sysctls[] = {
 	{
@@ -1246,6 +1252,13 @@ static const struct ctl_table coredump_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "core_modes",
+		.data		= core_modes,
+		.maxlen		= sizeof(core_modes) - 1,
+		.mode		= 0444,
+		.proc_handler	= proc_dostring,
+	},
 };
 
 static int __init init_fs_coredump_sysctls(void)

-- 
2.47.2



* [PATCH RFC v3 07/10] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (5 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 06/10] coredump: show supported coredump modes Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket Christian Brauner
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Let userspace know that the task coredumped and whether it was dumped as
root or as regular user. The latter is needed so that access permissions
to the executable are correctly handled.

I don't think this requires any additional privilege checks. The
missing exposure of the dumpability attribute of a given task is an
issue we should fix given that we already expose whether a task is
coredumping or not.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c              | 35 +++++++++++++++++++++++++++++
 fs/pidfs.c                 | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pidfs.h      |  3 +++
 include/uapi/linux/pidfd.h | 16 ++++++++++++++
 4 files changed, 109 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index 37dc5aa3806f..a5f17aaeee32 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -46,7 +46,9 @@
 #include <linux/pidfs.h>
 #include <linux/net.h>
 #include <linux/socket.h>
+#include <net/af_unix.h>
 #include <net/net_namespace.h>
+#include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 #include <uapi/linux/un.h>
 
@@ -588,6 +590,8 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
 		if (IS_ERR(pidfs_file))
 			return PTR_ERR(pidfs_file);
 
+		pidfs_coredump(cp);
+
 		/*
 		 * Usermode helpers are children of either
 		 * system_unbound_wq or of kthreadd. So we know that
@@ -869,11 +873,42 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		}
 
+		/*
+		 * Set the thread-group leader pid which is used for the
+		 * peer credentials during connect() below. Then
+		 * immediately register it in pidfs...
+		 */
+		cprm.pid = task_tgid(current);
+		retval = pidfs_register_pid(cprm.pid);
+		if (retval) {
+			sock_release(socket);
+			goto close_fail;
+		}
+
+		/*
+		 * ... and set the coredump information so userspace
+		 * has it available after connect()...
+		 */
+		pidfs_coredump(&cprm);
+
+		/*
+		 * ... On connect() the peer credentials are recorded
+		 * and @cprm.pid registered in pidfs...
+		 */
 		retval = kernel_connect(socket,
 					(struct sockaddr *)(&coredump_unix_socket),
 					COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
+		/*
+		 * ... So we can safely put our pidfs reference now...
+		 */
+		pidfs_put_pid(cprm.pid);
 		if (retval)
 			goto close_fail;
+		/* ... and validate that @sk_peer_pid matches @cprm.pid. */
+		if (WARN_ON_ONCE(unix_peer(socket->sk)->sk_peer_pid != cprm.pid)) {
+			retval = -ESRCH;
+			goto close_fail;
+		}
 
 		cprm.limit = RLIM_INFINITY;
 #endif
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 3b39e471840b..8c4d83fb115b 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -20,6 +20,7 @@
 #include <linux/time_namespace.h>
 #include <linux/utsname.h>
 #include <net/net_namespace.h>
+#include <linux/coredump.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -33,6 +34,7 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
 struct pidfs_exit_info {
 	__u64 cgroupid;
 	__s32 exit_code;
+	__u32 coredump_mask;
 };
 
 struct pidfs_inode {
@@ -240,6 +242,22 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 	return false;
 }
 
+static __u32 pidfs_coredump_mask(unsigned long mm_flags)
+{
+	switch (__get_dumpable(mm_flags)) {
+	case SUID_DUMP_USER:
+		return PIDFD_COREDUMP_USER;
+	case SUID_DUMP_ROOT:
+		return PIDFD_COREDUMP_ROOT;
+	case SUID_DUMP_DISABLE:
+		return PIDFD_COREDUMP_SKIP;
+	default:
+		WARN_ON_ONCE(true);
+	}
+
+	return 0;
+}
+
 static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
@@ -280,6 +298,11 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 		}
 	}
 
+	if (mask & PIDFD_INFO_COREDUMP) {
+		kinfo.mask |= PIDFD_INFO_COREDUMP;
+		kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
+	}
+
 	task = get_pid_task(pid, PIDTYPE_PID);
 	if (!task) {
 		/*
@@ -296,6 +319,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 	if (!c)
 		return -ESRCH;
 
+	if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
+		task_lock(task);
+		if (task->mm)
+			kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
+		task_unlock(task);
+	}
+
 	/* Unconditionally return identifiers and credentials, the rest only on request */
 
 	user_ns = current_user_ns();
@@ -559,6 +589,31 @@ void pidfs_exit(struct task_struct *tsk)
 	}
 }
 
+void pidfs_coredump(const struct coredump_params *cprm)
+{
+	struct pid *pid = cprm->pid;
+	struct pidfs_exit_info *exit_info;
+	struct dentry *dentry;
+	struct inode *inode;
+	__u32 coredump_mask = 0;
+
+	dentry = stashed_dentry_get(&pid->stashed);
+	if (WARN_ON_ONCE(!dentry))
+		return;
+
+	inode = d_inode(dentry);
+	exit_info = &pidfs_i(inode)->__pei;
+	/* Note how we were coredumped. */
+	coredump_mask = pidfs_coredump_mask(cprm->mm_flags);
+	/* Note that we actually did coredump. */
+	coredump_mask |= PIDFD_COREDUMPED;
+	/* If coredumping is set to skip we should never end up here. */
+	VFS_WARN_ON_ONCE(coredump_mask & PIDFD_COREDUMP_SKIP);
+	smp_store_release(&exit_info->coredump_mask, coredump_mask);
+	/* Fwiw, this cannot be the last reference. */
+	dput(dentry);
+}
+
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
 /*
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 2676890c4d0d..f7729b9371bc 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -2,11 +2,14 @@
 #ifndef _LINUX_PID_FS_H
 #define _LINUX_PID_FS_H
 
+struct coredump_params;
+
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
 void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
 void pidfs_exit(struct task_struct *tsk);
+void pidfs_coredump(const struct coredump_params *cprm);
 extern const struct dentry_operations pidfs_dentry_operations;
 int pidfs_register_pid(struct pid *pid);
 void pidfs_get_pid(struct pid *pid);
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 8c1511edd0e9..84ac709f560c 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -25,9 +25,23 @@
 #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Only returned if requested. */
+#define PIDFD_INFO_COREDUMP		(1UL << 4) /* Only returned if requested. */
 
 #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
 
+/*
+ * Values for @coredump_mask in pidfd_info.
+ * Only valid if PIDFD_INFO_COREDUMP is set in @mask.
+ *
+ * Note, the @PIDFD_COREDUMP_ROOT flag indicates that the generated
+ * coredump should be treated as sensitive and access should only be
+ * granted to privileged users.
+ */
+#define PIDFD_COREDUMPED	(1U << 0) /* Did crash and... */
+#define PIDFD_COREDUMP_SKIP	(1U << 1) /* coredumping generation was skipped. */
+#define PIDFD_COREDUMP_USER	(1U << 2) /* coredump was done as the user. */
+#define PIDFD_COREDUMP_ROOT	(1U << 3) /* coredump was done as root. */
+
 /*
  * The concept of process and threads in userland and the kernel is a confusing
  * one - within the kernel every thread is a 'task' with its own individual PID,
@@ -92,6 +106,8 @@ struct pidfd_info {
 	__u32 fsuid;
 	__u32 fsgid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u32 __spare1;
 };
 
 #define PIDFS_IOCTL_MAGIC 0xFF

-- 
2.47.2



* [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (6 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 07/10] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 13:08   ` Jann Horn
  2025-05-05 11:13 ` [PATCH RFC v3 09/10] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Make sure that only tasks that actually coredumped may connect to the
coredump socket. This restriction may be loosened later in case
userspace processes would like to use it to generate their own
coredumps. Though it'd be wiser if userspace just exposed a separate
socket for that.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c            | 25 +++++++++++++++++++++++++
 fs/pidfs.c               | 13 +++++++++++++
 include/linux/coredump.h | 12 ++++++++++++
 include/linux/pidfs.h    |  1 +
 net/unix/af_unix.c       | 19 +++++++++++++------
 5 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a5f17aaeee32..8a9620ce4fd0 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -630,6 +630,31 @@ struct sockaddr_un coredump_unix_socket = {
 #define COREDUMP_UNIX_SOCKET_ADDR_SIZE            \
 	(offsetof(struct sockaddr_un, sun_path) + \
 	 sizeof("\0linuxafsk/coredump.socket") - 1)
+
+static inline bool is_coredump_socket(const struct sockaddr_un *sunname,
+				      unsigned int len)
+{
+	return (COREDUMP_UNIX_SOCKET_ADDR_SIZE == len) &&
+	       !memcmp(&coredump_unix_socket, sunname, len);
+}
+
+/*
+ * For the coredump socket in the initial network namespace we only
+ * allow actual coredumping processes to connect to it, i.e., the kernel
+ * itself.
+ */
+bool unix_use_coredump_socket(const struct net *net, const struct pid *pid,
+			      const struct sockaddr_un *sunname,
+			      unsigned int len)
+{
+	if (net != &init_net)
+		return true;
+
+	if (!is_coredump_socket(sunname, len))
+		return true;
+
+	return pidfs_pid_coredumped(pid);
+}
 #endif
 
 void do_coredump(const kernel_siginfo_t *siginfo)
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 8c4d83fb115b..e0a4c34ddc42 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -258,6 +258,19 @@ static __u32 pidfs_coredump_mask(unsigned long mm_flags)
 	return 0;
 }
 
+bool pidfs_pid_coredumped(const struct pid *pid)
+{
+	struct inode *inode;
+
+	if (!pid)
+		return false;
+
+	/* We expect the caller to hold a reference to @pid->stashed. */
+	VFS_WARN_ON_ONCE(!pid->stashed);
+	inode = d_inode(pid->stashed);
+	return (READ_ONCE(pidfs_i(inode)->__pei.coredump_mask) & PIDFD_COREDUMPED);
+}
+
 static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 76e41805b92d..2b2f0c016c16 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -7,6 +7,8 @@
 #include <linux/fs.h>
 #include <asm/siginfo.h>
 
+struct sockaddr_un;
+
 #ifdef CONFIG_COREDUMP
 struct core_vma_metadata {
 	unsigned long start, end;
@@ -44,6 +46,9 @@ extern int dump_align(struct coredump_params *cprm, int align);
 int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		    unsigned long len);
 extern void do_coredump(const kernel_siginfo_t *siginfo);
+bool unix_use_coredump_socket(const struct net *net, const struct pid *pid,
+			      const struct sockaddr_un *sunname,
+			      unsigned int len);
 
 /*
  * Logging for the coredump code, ratelimited.
@@ -64,6 +69,13 @@ extern void do_coredump(const kernel_siginfo_t *siginfo);
 
 #else
 static inline void do_coredump(const kernel_siginfo_t *siginfo) {}
+static inline bool unix_use_coredump_socket(const struct net *net,
+					    const struct pid *pid,
+					    const struct sockaddr_un *sunname,
+					    unsigned int len)
+{
+	return true;
+}
 
 #define coredump_report(...)
 #define coredump_report_failure(...)
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index f7729b9371bc..f9c287c0e0ff 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -14,5 +14,6 @@ extern const struct dentry_operations pidfs_dentry_operations;
 int pidfs_register_pid(struct pid *pid);
 void pidfs_get_pid(struct pid *pid);
 void pidfs_put_pid(struct pid *pid);
+bool pidfs_pid_coredumped(const struct pid *pid);
 
 #endif /* _LINUX_PID_FS_H */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index edc2f143f401..613cf9225381 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -101,6 +101,7 @@
 #include <linux/string.h>
 #include <linux/uaccess.h>
 #include <linux/pidfs.h>
+#include <linux/coredump.h>
 #include <net/af_unix.h>
 #include <net/net_namespace.h>
 #include <net/scm.h>
@@ -1217,6 +1218,7 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 }
 
 static struct sock *unix_find_abstract(struct net *net,
+				       const struct pid *peer_pid,
 				       struct sockaddr_un *sunaddr,
 				       int addr_len, int type)
 {
@@ -1224,6 +1226,9 @@ static struct sock *unix_find_abstract(struct net *net,
 	struct dentry *dentry;
 	struct sock *sk;
 
+	if (!unix_use_coredump_socket(net, peer_pid, sunaddr, addr_len))
+		return ERR_PTR(-ECONNREFUSED);
+
 	sk = unix_find_socket_byname(net, sunaddr, addr_len, hash);
 	if (!sk)
 		return ERR_PTR(-ECONNREFUSED);
@@ -1236,15 +1241,16 @@ static struct sock *unix_find_abstract(struct net *net,
 }
 
 static struct sock *unix_find_other(struct net *net,
-				    struct sockaddr_un *sunaddr,
-				    int addr_len, int type)
+				    const struct pid *peer_pid,
+				    struct sockaddr_un *sunaddr, int addr_len,
+				    int type)
 {
 	struct sock *sk;
 
 	if (sunaddr->sun_path[0])
 		sk = unix_find_bsd(sunaddr, addr_len, type);
 	else
-		sk = unix_find_abstract(net, sunaddr, addr_len, type);
+		sk = unix_find_abstract(net, peer_pid, sunaddr, addr_len, type);
 
 	return sk;
 }
@@ -1500,7 +1506,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
 		}
 
 restart:
-		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type);
+		other = unix_find_other(sock_net(sk), NULL, sunaddr, alen, sock->type);
 		if (IS_ERR(other)) {
 			err = PTR_ERR(other);
 			goto out;
@@ -1647,7 +1653,8 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 
 restart:
 	/*  Find listening sock. */
-	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type);
+	other = unix_find_other(net, peercred.peer_pid, sunaddr, addr_len,
+				sk->sk_type);
 	if (IS_ERR(other)) {
 		err = PTR_ERR(other);
 		goto out_free_skb;
@@ -2115,7 +2122,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 
 	if (msg->msg_namelen) {
 lookup:
-		other = unix_find_other(sock_net(sk), msg->msg_name,
+		other = unix_find_other(sock_net(sk), NULL, msg->msg_name,
 					msg->msg_namelen, sk->sk_type);
 		if (IS_ERR(other)) {
 			err = PTR_ERR(other);

-- 
2.47.2



* [PATCH RFC v3 09/10] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (7 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 11:13 ` [PATCH RFC v3 10/10] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Add PIDFD_INFO_COREDUMP infrastructure so we can use it in tests.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 55bcf81a2b9a..6e812c3afca4 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -131,6 +131,26 @@
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
 #endif
 
+#ifndef PIDFD_INFO_COREDUMP
+#define PIDFD_INFO_COREDUMP		(1UL << 4)
+#endif
+
+#ifndef PIDFD_COREDUMPED
+#define PIDFD_COREDUMPED	(1U << 0) /* Did crash and... */
+#endif
+
+#ifndef PIDFD_COREDUMP_SKIP
+#define PIDFD_COREDUMP_SKIP	(1U << 1) /* coredumping generation was skipped. */
+#endif
+
+#ifndef PIDFD_COREDUMP_USER
+#define PIDFD_COREDUMP_USER	(1U << 2) /* coredump was done as the user. */
+#endif
+
+#ifndef PIDFD_COREDUMP_ROOT
+#define PIDFD_COREDUMP_ROOT	(1U << 3) /* coredump was done as root. */
+#endif
+
 #ifndef PIDFD_THREAD
 #define PIDFD_THREAD O_EXCL
 #endif
@@ -150,6 +170,8 @@ struct pidfd_info {
 	__u32 fsuid;
 	__u32 fsgid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u32 __spare1;
 };
 
 /*

-- 
2.47.2



* [PATCH RFC v3 10/10] selftests/coredump: add tests for AF_UNIX coredumps
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (8 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 09/10] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
@ 2025-05-05 11:13 ` Christian Brauner
  2025-05-05 14:41 ` [PATCH RFC v3 00/10] coredump: add coredump socket Mickaël Salaün
  2025-05-05 18:33 ` Kuniyuki Iwashima
  11 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 11:13 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn
  Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Christian Brauner, Alexander Mikhalitsyn

Add a simple test for generating coredumps via AF_UNIX sockets.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/coredump/stackdump_test.c | 71 ++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index fe3c728cd6be..4d1d444ca3d4 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -5,10 +5,13 @@
 #include <linux/limits.h>
 #include <pthread.h>
 #include <string.h>
+#include <sys/mount.h>
 #include <sys/resource.h>
+#include <sys/stat.h>
 #include <unistd.h>
 
 #include "../kselftest_harness.h"
+#include "../pidfd/pidfd.h"
 
 #define STACKDUMP_FILE "stack_values"
 #define STACKDUMP_SCRIPT "stackdump"
@@ -35,6 +38,7 @@ static void crashing_child(void)
 FIXTURE(coredump)
 {
 	char original_core_pattern[256];
+	pid_t pid_socat;
 };
 
 FIXTURE_SETUP(coredump)
@@ -44,6 +48,7 @@ FIXTURE_SETUP(coredump)
 	char *dir;
 	int ret;
 
+	self->pid_socat = -ESRCH;
 	file = fopen("/proc/sys/kernel/core_pattern", "r");
 	ASSERT_NE(NULL, file);
 
@@ -61,10 +66,15 @@ FIXTURE_TEARDOWN(coredump)
 {
 	const char *reason;
 	FILE *file;
-	int ret;
+	int ret, status;
 
 	unlink(STACKDUMP_FILE);
 
+	if (self->pid_socat > 0) {
+		kill(self->pid_socat, SIGTERM);
+		waitpid(self->pid_socat, &status, 0);
+	}
+
 	file = fopen("/proc/sys/kernel/core_pattern", "w");
 	if (!file) {
 		reason = "Unable to open core_pattern";
@@ -154,4 +164,63 @@ TEST_F_TIMEOUT(coredump, stackdump, 120)
 	fclose(file);
 }
 
+TEST_F(coredump, socket)
+{
+	int fd, pidfd, ret, status;
+	FILE *file;
+	pid_t pid, pid_socat;
+	struct stat st;
+	char core_file[PATH_MAX];
+	struct pidfd_info info = {};
+
+	ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+	ASSERT_EQ(mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), 0);
+	ASSERT_EQ(mount(NULL, "/tmp", "tmpfs", 0, NULL), 0);
+
+	pid_socat = fork();
+	ASSERT_GE(pid_socat, 0);
+	if (pid_socat == 0) {
+		execlp("socat", "socat",
+		       "abstract-listen:linuxafsk/coredump.socket,fork",
+		       "FILE:/tmp/coredump_file,create,append,trunc",
+		       (char *)NULL);
+		_exit(EXIT_FAILURE);
+	}
+	self->pid_socat = pid_socat;
+
+	file = fopen("/proc/sys/kernel/core_pattern", "w");
+	ASSERT_NE(NULL, file);
+
+	ret = fprintf(file, "@linuxafsk/coredump.socket");
+	ASSERT_EQ(ret, strlen("@linuxafsk/coredump.socket"));
+	ASSERT_EQ(fclose(file), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		crashing_child();
+
+	pidfd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFSIGNALED(status));
+	ASSERT_TRUE(WCOREDUMP(status));
+
+	info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+	ASSERT_EQ(ioctl(pidfd, PIDFD_GET_INFO, &info), 0);
+	ASSERT_GT((info.mask & PIDFD_INFO_COREDUMP), 0);
+	ASSERT_GT((info.coredump_mask & PIDFD_COREDUMPED), 0);
+
+	ASSERT_EQ(kill(pid_socat, SIGTERM), 0);
+	waitpid(pid_socat, &status, 0);
+	self->pid_socat = -ESRCH;
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 143);
+
+	/* We should somehow validate the produced core file. */
+	ASSERT_EQ(stat("/tmp/coredump_file", &st), 0);
+	ASSERT_GT(st.st_size, 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 04/10] coredump: add coredump socket
  2025-05-05 11:13 ` [PATCH RFC v3 04/10] coredump: add coredump socket Christian Brauner
@ 2025-05-05 12:55   ` Jann Horn
  2025-05-05 13:06     ` Luca Boccassi
  2025-05-05 14:46     ` Christian Brauner
  2025-05-05 18:48   ` Kuniyuki Iwashima
  1 sibling, 2 replies; 44+ messages in thread
From: Jann Horn @ 2025-05-05 12:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn

On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> Coredumping currently supports two modes:
>
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
>
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
>
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
>
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
>
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
>
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There are various conceptual consequences of this
> (non-exhaustive list):
>
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (whether or not this is a sane assumption is irrelevant).
>
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall state which is difficult for userspace to control correctly.
>
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping exercises in
>   userspace to make this safe.
>
> - A new usermode helper has to be spawned for each crashing process.
>
> This series adds a new mode:
>
> (3) Dumping into an abstract AF_UNIX socket.
>
> Userspace can set /proc/sys/kernel/core_pattern to:
>
>         @linuxafsk/coredump.socket
>
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps.
>
> The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> for now.
>
> The coredump socket is located in the initial network namespace. To bind
> the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> user namespace. Listening and reading can happen from whatever
> unprivileged context is necessary to safely process coredumps.
>
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket. For now only tasks that
> are actually coredumping are allowed to connect to the initial coredump
> socket.
>
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
>   the connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
>
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind its back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
>
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via the AF_UNIX socket directly.
>
> - The pidfd for the crashing task will contain information how the task
>   coredumps. The PIDFD_GET_INFO ioctl gained a new flag
>   PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
>   information.
>
>   If the coredump server gets a new coredump client connection the kernel
>   guarantees that PIDFD_INFO_COREDUMP information is available.
>   Currently the following information is provided in the new
>   @coredump_mask extension to struct pidfd_info:
>
>   * PIDFD_COREDUMPED is raised if the task did actually coredump.
>   * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
>     undumpable).
>   * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
>     doesn't need special care by the coredump server.
>   * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
>     treated as sensitive and the coredump server should restrict access
>     to the generated coredump to sufficiently privileged users.
>
> - Since unix_stream_connect() runs bpf programs during connect it's
>   possible to even redirect or multiplex coredumps to other sockets.

Or change the userspace protocol used for containers such that the
init-namespace coredumping helper forwards the FD it accept()ed into a
container via SCM_RIGHTS...

> - The coredump server should mark itself as non-dumpable.
>   To capture coredumps for the coredump server itself a bpf program
>   should be run at connect to redirect it to another socket in
>   userspace. This can be useful for debugging crashing coredump servers.
>
> - A container coredump server in a separate network namespace can simply
>   bind to linuxafsk/coredump.socket and systemd-coredump forwards
>   coredumps to the container.
>
> - Fwiw, one idea is to handle coredumps via per-user/session coredump
>   servers that run with that user's privileges.
>
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the user's
>   own coredump handler which runs with the user's privileges only.

(Though that would only be okay if it's not done for suid dumping cases.)

> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
>
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.

I mean, if coredumping is a performance bottleneck, something is
probably seriously wrong with the system... I don't think we need to
optimize for execution speed in this area.

> This is easy to test:
>
> (a) coredump processing (we're using socat):
>
>     > cat coredump_socket.sh
>     #!/bin/bash
>
>     set -x
>
>     sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
>     sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
>
> (b) trigger a coredump:
>
>     user1@localhost:~/data/scripts$ cat crash.c
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(int argc, char *argv[])
>     {
>             fprintf(stderr, "%u\n", (1 / 0));
>             _exit(0);
>     }

This looks pretty neat overall!

> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 107 insertions(+), 5 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 1779299b8c61..c60f86c473ad 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -44,7 +44,11 @@
>  #include <linux/sysctl.h>
>  #include <linux/elf.h>
>  #include <linux/pidfs.h>
> +#include <linux/net.h>
> +#include <linux/socket.h>
> +#include <net/net_namespace.h>
>  #include <uapi/linux/pidfd.h>
> +#include <uapi/linux/un.h>
>
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
>  enum coredump_type_t {
>         COREDUMP_FILE = 1,
>         COREDUMP_PIPE = 2,
> +       COREDUMP_SOCK = 3,
>  };
>
>  struct core_name {
> @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>         cn->corename = NULL;
>         if (*pat_ptr == '|')
>                 cn->core_type = COREDUMP_PIPE;
> +       else if (*pat_ptr == '@')
> +               cn->core_type = COREDUMP_SOCK;
>         else
>                 cn->core_type = COREDUMP_FILE;
>         if (expand_corename(cn, core_name_size))
>                 return -ENOMEM;
>         cn->corename[0] = '\0';
>
> -       if (cn->core_type == COREDUMP_PIPE) {
> +       switch (cn->core_type) {
> +       case COREDUMP_PIPE: {
>                 int argvs = sizeof(core_pattern) / 2;
>                 (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
>                 if (!(*argv))
> @@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>                 ++pat_ptr;
>                 if (!(*pat_ptr))
>                         return -ENOMEM;
> +               break;
> +       }
> +       case COREDUMP_SOCK: {
> +               err = cn_printf(cn, "%s", pat_ptr);
> +               if (err)
> +                       return err;
> +
> +               /*
> +                * We can potentially allow this to be changed later but
> +                * I currently see no reason to.
> +                */
> +               if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
> +                       return -EINVAL;
> +
> +               /*
> +                * Currently no need to parse any other options.
> +                * Relevant information can be retrieved from the peer
> +                * pidfd retrievable via SO_PEERPIDFD by the receiver or
> +                * via /proc/<pid>, using the SO_PEERPIDFD to guard
> +                * against pid recycling when opening /proc/<pid>.
> +                */
> +               return 0;
> +       }
> +       default:
> +               WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
> +               break;
>         }
>
>         /* Repeat as long as we have more pattern to process and more output

I think the core_uses_pid logic at the end of this function needs to
be adjusted to also exclude COREDUMP_SOCK?

> @@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
>         return 0;
>  }
>
> +#ifdef CONFIG_UNIX
> +struct sockaddr_un coredump_unix_socket = {
> +       .sun_family = AF_UNIX,
> +       .sun_path = "\0linuxafsk/coredump.socket",
> +};

Nit: Please make that static and const.

> +/* Without trailing NUL byte. */
> +#define COREDUMP_UNIX_SOCKET_ADDR_SIZE            \
> +       (offsetof(struct sockaddr_un, sun_path) + \
> +        sizeof("\0linuxafsk/coredump.socket") - 1)
> +#endif
> +
>  void do_coredump(const kernel_siginfo_t *siginfo)
>  {
>         struct core_state core_state;
> @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 }
>                 break;
>         }
> +       case COREDUMP_SOCK: {
> +               struct file *file __free(fput) = NULL;
> +#ifdef CONFIG_UNIX
> +               struct socket *socket;
> +
> +               /*
> +                * It is possible that the userspace process which is
> +                * supposed to handle the coredump and is listening on
> +                * the AF_UNIX socket coredumps. Userspace should just
> +                * mark itself non dumpable.
> +                */
> +
> +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> +               if (retval < 0)
> +                       goto close_fail;
> +
> +               file = sock_alloc_file(socket, 0, NULL);
> +               if (IS_ERR(file)) {
> +                       sock_release(socket);
> +                       retval = PTR_ERR(file);
> +                       goto close_fail;
> +               }
> +
> +               retval = kernel_connect(socket,
> +                                       (struct sockaddr *)(&coredump_unix_socket),
> +                                       COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
> +               if (retval)
> +                       goto close_fail;
> +
> +               cprm.limit = RLIM_INFINITY;
> +#endif

The non-CONFIG_UNIX case here should probably bail out?

> +               cprm.file = no_free_ptr(file);
> +               break;
> +       }
>         default:
>                 WARN_ON_ONCE(true);
>                 retval = -EINVAL;
> @@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                  * have this set to NULL.
>                  */
>                 if (!cprm.file) {
> -                       coredump_report_failure("Core dump to |%s disabled", cn.corename);
> +                       if (cn.core_type == COREDUMP_PIPE)
> +                               coredump_report_failure("Core dump to |%s disabled", cn.corename);
> +                       else
> +                               coredump_report_failure("Core dump to @%s disabled", cn.corename);

Are you actually truncating the initial "@" off of cn.corename, or is
this going to print two "@" characters?

>                         goto close_fail;
>                 }
>                 if (!dump_vma_snapshot(&cprm))
> @@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 file_end_write(cprm.file);
>                 free_vma_snapshot(&cprm);
>         }
> -       if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> -               wait_for_dump_helpers(cprm.file);
> +
> +       if (core_pipe_limit) {
> +               switch (cn.core_type) {
> +               case COREDUMP_PIPE:
> +                       wait_for_dump_helpers(cprm.file);
> +                       break;
> +               case COREDUMP_SOCK: {
> +                       char buf[1];
> +                       /*
> +                        * We use a simple read to wait for the coredump
> +                        * processing to finish. Either the socket is
> +                        * closed or we get sent unexpected data. In
> +                        * both cases, we're done.
> +                        */
> +                       __kernel_read(cprm.file, buf, 1, NULL);
> +                       break;
> +               }
> +               default:
> +                       break;
> +               }
> +       }
> +
>  close_fail:
>         if (cprm.file)
>                 filp_close(cprm.file, NULL);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 04/10] coredump: add coredump socket
  2025-05-05 12:55   ` Jann Horn
@ 2025-05-05 13:06     ` Luca Boccassi
  2025-05-05 14:46     ` Christian Brauner
  1 sibling, 0 replies; 44+ messages in thread
From: Luca Boccassi @ 2025-05-05 13:06 UTC (permalink / raw)
  To: Jann Horn; +Cc: Christian Brauner, linux-fsdevel, linux-kernel, netdev

On Mon, 5 May 2025 at 13:55, Jann Horn <jannh@google.com> wrote:
>
> On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > handle them instead of relying on super privileged coredumping helpers.
> >
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermodehelper is required for each crashing process. The
> > coredump server in userspace can just keep a worker pool.
>
> I mean, if coredumping is a performance bottleneck, something is
> probably seriously wrong with the system... I don't think we need to
> optimize for execution speed in this area.

It can be a bottleneck because it blocks restarting processes until it
completes, which directly impacts SLAs if the crashed process is
relevant for those purposes, and it often is; we have several use
cases where that's the case. The usermode helper costs hundreds of ms
in latency, so any optimization in that area is very welcome and worth
mentioning IMHO.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 11:13 ` [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket Christian Brauner
@ 2025-05-05 13:08   ` Jann Horn
  2025-05-05 14:06     ` Christian Brauner
  0 siblings, 1 reply; 44+ messages in thread
From: Jann Horn @ 2025-05-05 13:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn

On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> Make sure that only tasks that actually coredumped may connect to the
> coredump socket. This restriction may be loosened later in case
> userspace processes would like to use it to generate their own
> coredumps. Though it'd be wiser if userspace just exposed a separate
> socket for that.

This implementation kinda feels a bit fragile to me... I wonder if we
could instead have a flag inside the af_unix client socket that says
"this is a special client socket for coredumping".

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 13:08   ` Jann Horn
@ 2025-05-05 14:06     ` Christian Brauner
  2025-05-05 18:40       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 14:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn

On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > Make sure that only tasks that actually coredumped may connect to the
> > coredump socket. This restriction may be loosened later in case
> > userspace processes would like to use it to generate their own
> > coredumps. Though it'd be wiser if userspace just exposed a separate
> > socket for that.
> 
> This implementation kinda feels a bit fragile to me... I wonder if we
> could instead have a flag inside the af_unix client socket that says
> "this is a special client socket for coredumping".

Should be easily doable with a sock_flag().

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (9 preceding siblings ...)
  2025-05-05 11:13 ` [PATCH RFC v3 10/10] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
@ 2025-05-05 14:41 ` Mickaël Salaün
  2025-05-05 14:56   ` Christian Brauner
  2025-05-05 14:59   ` Jann Horn
  2025-05-05 18:33 ` Kuniyuki Iwashima
  11 siblings, 2 replies; 44+ messages in thread
From: Mickaël Salaün @ 2025-05-05 14:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn, linux-security-module

On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There are various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (whether or not this is a sane assumption is irrelevant).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall state which is difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping exercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an abstract AF_UNIX socket.
> 
> Userspace can set /proc/sys/kernel/core_pattern to:
> 
>         @linuxafsk/coredump.socket
> 
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps.
> 
> The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> for now.
> 
> The coredump socket is located in the initial network namespace. To bind
> the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> user namespace. Listening and reading can happen from whatever
> unprivileged context is necessary to safely process coredumps.
> 
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket. For now only tasks that
> are actually coredumping are allowed to connect to the initial coredump
> socket.

I think we should avoid using abstract UNIX sockets, especially for new
interfaces, because it is hard to properly control such access.  Can we
create new dedicated AF_UNIX protocols instead?  One could be used by a
privileged process in the initial namespace to create a socket to
collect coredumps, and the other could be dedicated to coredumped
processes.  Such a (coredump collector) file descriptor or new (proxy)
socketpair ones could be passed to containers.

> 
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
>   the connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
> 
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind its back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
> 
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via the AF_UNIX socket directly.
> 
> - The pidfd for the crashing task will contain information how the task
>   coredumps. The PIDFD_GET_INFO ioctl gained a new flag
>   PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
>   information.
> 
>   If the coredump server gets a new coredump client connection the kernel
>   guarantees that PIDFD_INFO_COREDUMP information is available.
>   Currently the following information is provided in the new
>   @coredump_mask extension to struct pidfd_info:
> 
>   * PIDFD_COREDUMPED is raised if the task did actually coredump.
>   * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
>     undumpable).
>   * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
>     doesn't need special care by the coredump server.
>   * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
>     treated as sensitive and the coredump server should restrict access
>     to the generated coredump to sufficiently privileged users.
> 
> - Since unix_stream_connect() runs bpf programs during connect it's
>   possible to even redirect or multiplex coredumps to other sockets.
> 
> - The coredump server should mark itself as non-dumpable.
>   To capture coredumps for the coredump server itself a bpf program
>   should be run at connect to redirect it to another socket in
>   userspace. This can be useful for debugging crashing coredump servers.
> 
> - A container coredump server in a separate network namespace can simply
>   bind to linuxafsk/coredump.socket and systemd-coredump forwards
>   coredumps to the container.
> 
> - Fwiw, one idea is to handle coredumps via per-user/session coredump
>   servers that run with that user's privileges.
> 
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the user's
>   own coredump handler which runs with the user's privileges only.
> 
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
> 
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.
> 
> This is easy to test:
> 
> (a) coredump processing (we're using socat):
> 
>     > cat coredump_socket.sh
>     #!/bin/bash
> 
>     set -x
> 
>     sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
>     sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> 
> (b) trigger a coredump:
> 
>     user1@localhost:~/data/scripts$ cat crash.c
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     int main(int argc, char *argv[])
>     {
>             fprintf(stderr, "%u\n", (1 / 0));
>             _exit(0);
>     }
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Changes in v3:
> - Use an abstract unix socket.
> - Add documentation.
> - Add selftests.
> - Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org
> 
> Changes in v2:
> - Expose dumpability via PIDFD_GET_INFO.
> - Place COREDUMP_SOCK handling under CONFIG_UNIX.
> - Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org
> 
> ---
> Christian Brauner (10):
>       coredump: massage format_corname()
>       coredump: massage do_coredump()
>       net: reserve prefix
>       coredump: add coredump socket
>       coredump: validate socket name as it is written
>       coredump: show supported coredump modes
>       pidfs, coredump: add PIDFD_INFO_COREDUMP
>       net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
>       selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
>       selftests/coredump: add tests for AF_UNIX coredumps
> 
>  fs/coredump.c                                     | 358 +++++++++++++++++-----
>  fs/pidfs.c                                        |  68 ++++
>  include/linux/coredump.h                          |  12 +
>  include/linux/pidfs.h                             |   4 +
>  include/uapi/linux/pidfd.h                        |  16 +
>  include/uapi/linux/un.h                           |   2 +
>  net/unix/af_unix.c                                |  64 +++-
>  tools/testing/selftests/coredump/stackdump_test.c |  71 ++++-
>  tools/testing/selftests/pidfd/pidfd.h             |  22 ++
>  9 files changed, 528 insertions(+), 89 deletions(-)
> ---
> base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
> change-id: 20250429-work-coredump-socket-87cc0f17729c
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 04/10] coredump: add coredump socket
  2025-05-05 12:55   ` Jann Horn
  2025-05-05 13:06     ` Luca Boccassi
@ 2025-05-05 14:46     ` Christian Brauner
  1 sibling, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 14:46 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
	Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
	Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn

On Mon, May 05, 2025 at 02:55:18PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (whether or not this is a sane assumption is irrelevant).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is stuck in a weird
> >   hybrid upcall state which is difficult for userspace to control
> >   correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an abstract AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> >         @linuxafsk/coredump.socket
> >
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> >
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> >
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> >
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are actually coredumping are allowed to connect to the initial coredump
> > socket.
> >
> > - The coredump server should use SO_PEERPIDFD to get a stable handle on
> >   the connected crashing task. The retrieved pidfd will provide a stable
> >   reference even if the crashing task gets SIGKILLed while generating
> >   the coredump.
> >
> > - By setting core_pipe_limit non-zero userspace can guarantee that the
> >   crashing task cannot be reaped behind its back and thus process all
> >   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> >   detect whether /proc/<pid> still refers to the same process.
> >
> >   The core_pipe_limit isn't used to rate-limit connections to the
> >   socket. This can simply be done via the AF_UNIX socket directly.
> >
> > - The pidfd for the crashing task will contain information how the task
> >   coredumps. The PIDFD_GET_INFO ioctl gained a new flag
> >   PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
> >   information.
> >
> >   If the coredump server gets a new client connection the kernel
> >   guarantees that PIDFD_INFO_COREDUMP information is available.
> >   Currently the following information is provided in the new
> >   @coredump_mask extension to struct pidfd_info:
> >
> >   * PIDFD_COREDUMPED is raised if the task did actually coredump.
> >   * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
> >     undumpable).
> >   * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
> >     doesn't need special care by the coredump server.
> >   * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
> >     treated as sensitive and the coredump server should restrict access
> >     to the generated coredump to sufficiently privileged users.
> >
> > - Since unix_stream_connect() runs bpf programs during connect it's
> >   possible to even redirect or multiplex coredumps to other sockets.
> 
> Or change the userspace protocol used for containers such that the
> init-namespace coredumping helper forwards the FD it accept()ed into a
> container via SCM_RIGHTS...

Yeah, that would also work.

> 
> > - The coredump server should mark itself as non-dumpable.
> >   To capture coredumps for the coredump server itself a bpf program
> >   should be run at connect to redirect it to another socket in
> >   userspace. This can be useful for debugging crashing coredump servers.
> >
> > - A container coredump server in a separate network namespace can simply
> >   bind to linuxafsk/coredump.socket and systemd-coredump forwards
> >   coredumps to the container.
> >
> > - Fwiw, one idea is to handle coredumps via per-user/session coredump
> >   servers that run with that user's privileges.
> >
> >   The coredump server listens on the coredump socket and accepts a
> >   new coredump connection. It then retrieves SO_PEERPIDFD for the
> >   client, inspects uid/gid and hands the accepted client to the user's
> >   own coredump handler which runs with the user's privileges only.
> 
> (Though that would only be okay if it's not done for suid dumping cases.)

Yes, I had considered adding a comment about only doing that when
PIDFD_COREDUMP_ROOT isn't set and wondered if anyone would comment on
it. :)

> 
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > handle them instead of relying on super privileged coredumping helpers.
> >
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermode helper is required for each crashing process. The
> > coredump server in userspace can just keep a worker pool.
> 
> I mean, if coredumping is a performance bottleneck, something is
> probably seriously wrong with the system... I don't think we need to
> optimize for execution speed in this area.
> 
> > This is easy to test:
> >
> > (a) coredump processing (we're using socat):
> >
> >     > cat coredump_socket.sh
> >     #!/bin/bash
> >
> >     set -x
> >
> >     sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
> >     sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> >
> > (b) trigger a coredump:
> >
> >     user1@localhost:~/data/scripts$ cat crash.c
> >     #include <stdio.h>
> >     #include <unistd.h>
> >
> >     int main(int argc, char *argv[])
> >     {
> >             fprintf(stderr, "%u\n", (1 / 0));
> >             _exit(0);
> >     }
> 
> This looks pretty neat overall!
> 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 107 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 1779299b8c61..c60f86c473ad 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -44,7 +44,11 @@
> >  #include <linux/sysctl.h>
> >  #include <linux/elf.h>
> >  #include <linux/pidfs.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +#include <net/net_namespace.h>
> >  #include <uapi/linux/pidfd.h>
> > +#include <uapi/linux/un.h>
> >
> >  #include <linux/uaccess.h>
> >  #include <asm/mmu_context.h>
> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> >  enum coredump_type_t {
> >         COREDUMP_FILE = 1,
> >         COREDUMP_PIPE = 2,
> > +       COREDUMP_SOCK = 3,
> >  };
> >
> >  struct core_name {
> > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >         cn->corename = NULL;
> >         if (*pat_ptr == '|')
> >                 cn->core_type = COREDUMP_PIPE;
> > +       else if (*pat_ptr == '@')
> > +               cn->core_type = COREDUMP_SOCK;
> >         else
> >                 cn->core_type = COREDUMP_FILE;
> >         if (expand_corename(cn, core_name_size))
> >                 return -ENOMEM;
> >         cn->corename[0] = '\0';
> >
> > -       if (cn->core_type == COREDUMP_PIPE) {
> > +       switch (cn->core_type) {
> > +       case COREDUMP_PIPE: {
> >                 int argvs = sizeof(core_pattern) / 2;
> >                 (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> >                 if (!(*argv))
> > @@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >                 ++pat_ptr;
> >                 if (!(*pat_ptr))
> >                         return -ENOMEM;
> > +               break;
> > +       }
> > +       case COREDUMP_SOCK: {
> > +               err = cn_printf(cn, "%s", pat_ptr);
> > +               if (err)
> > +                       return err;
> > +
> > +               /*
> > +                * We can potentially allow this to be changed later but
> > +                * I currently see no reason to.
> > +                */
> > +               if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
> > +                       return -EINVAL;
> > +
> > +               /*
> > +                * Currently no need to parse any other options.
> > +                * Relevant information can be retrieved from the peer
> > +                * pidfd retrievable via SO_PEERPIDFD by the receiver or
> > +                * via /proc/<pid>, using the SO_PEERPIDFD to guard
> > +                * against pid recycling when opening /proc/<pid>.
> > +                */
> > +               return 0;
> > +       }
> > +       default:
> > +               WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
> > +               break;
> >         }
> >
> >         /* Repeat as long as we have more pattern to process and more output
> 
> I think the core_uses_pid logic at the end of this function needs to
> be adjusted to also exclude COREDUMP_SOCK?

Thanks! Fixed.

> 
> > @@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
> >         return 0;
> >  }
> >
> > +#ifdef CONFIG_UNIX
> > +struct sockaddr_un coredump_unix_socket = {
> > +       .sun_family = AF_UNIX,
> > +       .sun_path = "\0linuxafsk/coredump.socket",
> > +};
> 
> Nit: Please make that static and const.

Done.

> 
> > +/* Without trailing NUL byte. */
> > +#define COREDUMP_UNIX_SOCKET_ADDR_SIZE            \
> > +       (offsetof(struct sockaddr_un, sun_path) + \
> > +        sizeof("\0linuxafsk/coredump.socket") - 1)
> > +#endif
> > +
> >  void do_coredump(const kernel_siginfo_t *siginfo)
> >  {
> >         struct core_state core_state;
> > @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >                 }
> >                 break;
> >         }
> > +       case COREDUMP_SOCK: {
> > +               struct file *file __free(fput) = NULL;
> > +#ifdef CONFIG_UNIX
> > +               struct socket *socket;
> > +
> > +               /*
> > +                * It is possible that the userspace process which is
> > +                * supposed to handle the coredump and is listening on
> > +                * the AF_UNIX socket coredumps. Userspace should just
> > +                * mark itself non dumpable.
> > +                */
> > +
> > +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > +               if (retval < 0)
> > +                       goto close_fail;
> > +
> > +               file = sock_alloc_file(socket, 0, NULL);
> > +               if (IS_ERR(file)) {
> > +                       sock_release(socket);
> > +                       retval = PTR_ERR(file);
> > +                       goto close_fail;
> > +               }
> > +
> > +               retval = kernel_connect(socket,
> > +                                       (struct sockaddr *)(&coredump_unix_socket),
> > +                                       COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
> > +               if (retval)
> > +                       goto close_fail;
> > +
> > +               cprm.limit = RLIM_INFINITY;
> > +#endif
> 
> The non-CONFIG_UNIX case here should probably bail out?

It will bail out later on !cprm.file where it'll report that @ support
is disabled but I think...

> 
> > +               cprm.file = no_free_ptr(file);
> > +               break;
> > +       }
> >         default:
> >                 WARN_ON_ONCE(true);
> >                 retval = -EINVAL;
> > @@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >                  * have this set to NULL.
> >                  */
> >                 if (!cprm.file) {
> > -                       coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > +                       if (cn.core_type == COREDUMP_PIPE)
> > +                               coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > +                       else
> > +                               coredump_report_failure("Core dump to @%s disabled", cn.corename);
> 
> Are you actually truncating the initial "@" off of cn.corename, or is
> this going to print two "@" characters?

... that bailing out earlier is nicer than stripping the @ off
pointlessly.

> 
> >                         goto close_fail;
> >                 }
> >                 if (!dump_vma_snapshot(&cprm))
> > @@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >                 file_end_write(cprm.file);
> >                 free_vma_snapshot(&cprm);
> >         }
> > -       if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> > -               wait_for_dump_helpers(cprm.file);
> > +
> > +       if (core_pipe_limit) {
> > +               switch (cn.core_type) {
> > +               case COREDUMP_PIPE:
> > +                       wait_for_dump_helpers(cprm.file);
> > +                       break;
> > +               case COREDUMP_SOCK: {
> > +                       char buf[1];
> > +                       /*
> > +                        * We use a simple read to wait for the coredump
> > +                        * processing to finish. Either the socket is
> > +                        * closed or we get sent unexpected data. In
> > +                        * both cases, we're done.
> > +                        */
> > +                       __kernel_read(cprm.file, buf, 1, NULL);
> > +                       break;
> > +               }
> > +               default:
> > +                       break;
> > +               }
> > +       }
> > +
> >  close_fail:
> >         if (cprm.file)
> >                 filp_close(cprm.file, NULL);


* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 14:41 ` [PATCH RFC v3 00/10] coredump: add coredump socket Mickaël Salaün
@ 2025-05-05 14:56   ` Christian Brauner
  2025-05-05 15:38     ` Mickaël Salaün
  2025-05-05 14:59   ` Jann Horn
  1 sibling, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2025-05-05 14:56 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn, linux-security-module

On Mon, May 05, 2025 at 04:41:28PM +0200, Mickaël Salaün wrote:
> On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > Coredumping currently supports two modes:
> > 
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> > 
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> > 
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > 
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > 
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> > 
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> > 
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> > 
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is stuck in a weird
> >   hybrid upcall state which is difficult for userspace to control
> >   correctly.
> > 
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> > 
> > - A new usermode helper has to be spawned for each crashing process.
> > 
> > This series adds a new mode:
> > 
> > (3) Dumping into an abstract AF_UNIX socket.
> > 
> > Userspace can set /proc/sys/kernel/core_pattern to:
> > 
> >         @linuxafsk/coredump.socket
> > 
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> > 
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> > 
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> > 
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are actually coredumping are allowed to connect to the initial coredump
> > socket.
> 
> I think we should avoid using abstract UNIX sockets, especially for new

Abstract unix sockets are at the core of a modern Linux system. During
boot alone about 100 are created on a modern system when I counted
during testing. Sorry, but this is a non-starter argument.


* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 14:41 ` [PATCH RFC v3 00/10] coredump: add coredump socket Mickaël Salaün
  2025-05-05 14:56   ` Christian Brauner
@ 2025-05-05 14:59   ` Jann Horn
  2025-05-05 15:39     ` Mickaël Salaün
  1 sibling, 1 reply; 44+ messages in thread
From: Jann Horn @ 2025-05-05 14:59 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov,
	linux-fsdevel, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn, linux-security-module

On Mon, May 5, 2025 at 4:41 PM Mickaël Salaün <mic@digikod.net> wrote:
> On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is stuck in a weird
> >   hybrid upcall state which is difficult for userspace to control
> >   correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an abstract AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> >         @linuxafsk/coredump.socket
> >
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> >
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> >
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> >
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are actually coredumping are allowed to connect to the initial coredump
> > socket.
>
> I think we should avoid using abstract UNIX sockets, especially for new
> interfaces, because it is hard to properly control such access.  Can we
> create new dedicated AF_UNIX protocols instead?  One could be used by a
> privileged process in the initial namespace to create a socket to
> collect coredumps, and the other could be dedicated to coredumped
> processes.  Such a (coredump collector) file descriptor or new (proxy)
> socketpair ones could be passed to containers.

I would agree with you if we were talking about designing a pure
userspace thing; but I think the limits that Christian added on bind()
and connect() to these special abstract names in this series
effectively make it behave as if they were dedicated AF_UNIX
protocols, and prevent things like random unprivileged userspace
processes bind()ing to them.


* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 14:56   ` Christian Brauner
@ 2025-05-05 15:38     ` Mickaël Salaün
  0 siblings, 0 replies; 44+ messages in thread
From: Mickaël Salaün @ 2025-05-05 15:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
	Jann Horn, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn, linux-security-module

On Mon, May 05, 2025 at 04:56:04PM +0200, Christian Brauner wrote:
> On Mon, May 05, 2025 at 04:41:28PM +0200, Mickaël Salaün wrote:
> > On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > > Coredumping currently supports two modes:
> > > 
> > > (1) Dumping directly into a file somewhere on the filesystem.
> > > (2) Dumping into a pipe connected to a usermode helper process
> > >     spawned as a child of the system_unbound_wq or kthreadd.
> > > 
> > > For simplicity I'm mostly ignoring (1). There's probably still some
> > > users of (1) out there but processing coredumps in this way can be
> > > considered adventurous especially in the face of set*id binaries.
> > > 
> > > The most common option should be (2) by now. It works by allowing
> > > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > > 
> > >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > > 
> > > The "|" at the beginning indicates to the kernel that a pipe must be
> > > used. The path following the pipe indicator is a path to a binary that
> > > will be spawned as a usermode helper process. Any additional parameters
> > > pass information about the task that is generating the coredump to the
> > > binary that processes the coredump.
> > > 
> > > In the example core_pattern shown above systemd-coredump is spawned as a
> > > usermode helper. There are various conceptual consequences of this
> > > (non-exhaustive list):
> > > 
> > > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > >   connected to the read-end of the pipe. All other file descriptors are
> > >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > >   already caused bugs because userspace assumed that this cannot happen
> > >   (Whether or not this is a sane assumption is irrelevant.).
> > > 
> > > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > >   it is not a child of any userspace process and specifically not a
> > >   child of PID 1. It cannot be waited upon and is stuck in a weird
> > >   hybrid upcall state which is difficult for userspace to control
> > >   correctly.
> > > 
> > > - systemd-coredump is spawned with full kernel privileges. This
> > >   necessitates all kinds of weird privilege dropping exercises in
> > >   userspace to make this safe.
> > > 
> > > - A new usermode helper has to be spawned for each crashing process.
> > > 
> > > This series adds a new mode:
> > > 
> > > (3) Dumping into an abstract AF_UNIX socket.
> > > 
> > > Userspace can set /proc/sys/kernel/core_pattern to:
> > > 
> > >         @linuxafsk/coredump.socket
> > > 
> > > The "@" at the beginning indicates to the kernel that the abstract
> > > AF_UNIX coredump socket will be used to process coredumps.
> > > 
> > > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > > for now.
> > > 
> > > The coredump socket is located in the initial network namespace. To bind
> > > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > > user namespace. Listening and reading can happen from whatever
> > > unprivileged context is necessary to safely process coredumps.
> > > 
> > > When a task coredumps it opens a client socket in the initial network
> > > namespace and connects to the coredump socket. For now only tasks that
> > > are actually coredumping are allowed to connect to the initial coredump
> > > socket.
> > 
> > I think we should avoid using abstract UNIX sockets, especially for new
> 
> Abstract unix sockets are at the core of a modern Linux system. During
> boot alone about 100 or so are created on a modern system when I counted
> during testing. Sorry, but this is a no-show argument.

These kinds of sockets being used does not mean they should be used for
new interfaces. :)

AFAIK, these socket types are currently only used for IPC, not between a
kernel interface and user space.  This patch series changes this
assumption.

Security policies already in place can block abstract connections, and
it might not be possible to differentiate between a kernel or user space
peer with the current configuration.  Please Cc the LSM mailing list for
such new interfaces.

You cut and ignored most of my reply, which explained my reasoning and
proposed an alternative:
> > interfaces, because it is hard to properly control such access.  Can we
> > create new dedicated AF_UNIX protocols instead?  One could be used by a
> > privileged process in the initial namespace to create a socket to
> > collect coredumps, and the other could be dedicated to coredumped
> > processes.  Such a (coredump collector) file descriptor or new (proxy)
> > socketpair ones could be passed to containers.

Only one new "protocol" would be required though (because the client
side is created by the kernel).  That would be a backward compatible
change, and such a socket type could easily be identified by other parts
of the kernel, including access control mechanisms.


* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 14:59   ` Jann Horn
@ 2025-05-05 15:39     ` Mickaël Salaün
  0 siblings, 0 replies; 44+ messages in thread
From: Mickaël Salaün @ 2025-05-05 15:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov,
	linux-fsdevel, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	Alexander Mikhalitsyn, linux-security-module

On Mon, May 05, 2025 at 04:59:41PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 4:41 PM Mickaël Salaün <mic@digikod.net> wrote:
> > On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > > Coredumping currently supports two modes:
> > >
> > > (1) Dumping directly into a file somewhere on the filesystem.
> > > (2) Dumping into a pipe connected to a usermode helper process
> > >     spawned as a child of the system_unbound_wq or kthreadd.
> > >
> > > For simplicity I'm mostly ignoring (1). There's probably still some
> > > users of (1) out there but processing coredumps in this way can be
> > > considered adventurous especially in the face of set*id binaries.
> > >
> > > The most common option should be (2) by now. It works by allowing
> > > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > >
> > >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > >
> > > The "|" at the beginning indicates to the kernel that a pipe must be
> > > used. The path following the pipe indicator is a path to a binary that
> > > will be spawned as a usermode helper process. Any additional parameters
> > > pass information about the task that is generating the coredump to the
> > > binary that processes the coredump.
> > >
> > > In the example core_pattern shown above systemd-coredump is spawned as a
> > > usermode helper. There are various conceptual consequences of this
> > > (non-exhaustive list):
> > >
> > > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > >   connected to the read-end of the pipe. All other file descriptors are
> > >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > >   already caused bugs because userspace assumed that this cannot happen
> > >   (Whether or not this is a sane assumption is irrelevant.).
> > >
> > > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > >   it is not a child of any userspace process and specifically not a
> > >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> > >   upcall state which is difficult for userspace to control correctly.
> > >
> > > - systemd-coredump is spawned with full kernel privileges. This
> > >   necessitates all kinds of weird privilege dropping excercises in
> > >   userspace to make this safe.
> > >
> > > - A new usermode helper has to be spawned for each crashing process.
> > >
> > > This series adds a new mode:
> > >
> > > (3) Dumping into an abstract AF_UNIX socket.
> > >
> > > Userspace can set /proc/sys/kernel/core_pattern to:
> > >
> > >         @linuxafsk/coredump_socket
> > >
> > > The "@" at the beginning indicates to the kernel that the abstract
> > > AF_UNIX coredump socket will be used to process coredumps.
> > >
> > > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > > for now.
> > >
> > > The coredump socket is located in the initial network namespace. To bind
> > > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > > user namespace. Listening and reading can happen from whatever
> > > unprivileged context is necessary to safely process coredumps.
> > >
> > > When a task coredumps it opens a client socket in the initial network
> > > namespace and connects to the coredump socket. For now only tasks that
> > > are actually coredumping are allowed to connect to the initial coredump
> > > socket.
> >
> > I think we should avoid using abstract UNIX sockets, especially for new
> > interfaces, because it is hard to properly control such access.  Can we
> > create new dedicated AF_UNIX protocols instead?  One could be used by a
> > privileged process in the initial namespace to create a socket to
> > collect coredumps, and the other could be dedicated to coredumped
> > processes.  Such a (coredump collector) file descriptor, or new (proxy)
> > socketpair ones, could be passed to containers.
> 
> I would agree with you if we were talking about designing a pure
> userspace thing; but I think the limits that Christian added on bind()
> and connect() to these special abstract names in this series
> effectively make it behave as if they were dedicated AF_UNIX
> protocols, and prevent things like random unprivileged userspace
> processes bind()ing to them.

OK, so why not create a proper protocol?  That should also simplify the
interface.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
                   ` (10 preceding siblings ...)
  2025-05-05 14:41 ` [PATCH RFC v3 00/10] coredump: add coredump socket Mickaël Salaün
@ 2025-05-05 18:33 ` Kuniyuki Iwashima
  2025-05-06  7:33   ` Christian Brauner
  11 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 18:33 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Mon, 05 May 2025 13:13:38 +0200
> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There are various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall state which is difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping exercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an abstract AF_UNIX socket.
> 
> Userspace can set /proc/sys/kernel/core_pattern to:
> 
>         @linuxafsk/coredump_socket
> 
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps.
> 
> The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> for now.

What's behind this decision from v2?

/proc/sys/kernel/core_pattern can only be set by the administrator,
and I don't see the point in having this limitation on the
AF_UNIX side.



> 
> The coredump socket is located in the initial network namespace.

I understand this is a reasonable decision to avoid complicated
path management in the mount ns while keeping connectivity from
any namespace.


> To bind
> the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> user namespace. Listening and reading can happen from whatever
> unprivileged context is necessary to safely process coredumps.
> 
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket. For now only tasks that
> are actually coredumping are allowed to connect to the initial coredump
> socket.

This can be controlled by BPF (cgroup sockops or LSM) if a user
really cares about spam clients.

I think how to set up coredump handling is userspace's responsibility.


> 
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
>   the connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
> 
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind its back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
> 
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via AF_UNIX socket directly.
> 
> - The pidfd for the crashing task will contain information how the task
>   coredumps. The PIDFD_GET_INFO ioctl gained a new flag
>   PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
>   information.
> 
>   If the coredump server gets a new coredump client connection the kernel
>   guarantees that PIDFD_INFO_COREDUMP information is available.
>   Currently the following information is provided in the new
>   @coredump_mask extension to struct pidfd_info:
> 
>   * PIDFD_COREDUMPED is raised if the task did actually coredump.
>   * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
>     undumpable).
>   * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
>     doesn't need special care by the coredump server.
>   * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
>     treated as sensitive and the coredump server should restrict access
>     to the generated coredump to sufficiently privileged users.
> 
> - Since unix_stream_connect() runs bpf programs during connect it's
>   possible to even redirect or multiplex coredumps to other sockets.

If the socket is in a cgroup, yes, and even if not, BPF LSM can
reject some requests.


> 
> - The coredump server should mark itself as non-dumpable.
>   To capture coredumps for the coredump server itself a bpf program
>   should be run at connect to redirect it to another socket in
>   userspace. This can be useful for debugging crashing coredump servers.
> 
> - A container coredump server in a separate network namespace can simply
>   bind to linuxafsk/coredump.socket and systemd-coredump forwards
>   coredumps to the container.

I think the name should also be configurable in a non-initial netns.


> 
> - Fwiw, one idea is to handle coredumps via per-user/session coredump
>   servers that run with that user's privileges.
> 
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the user's
>   own coredump handler which runs with the user's privileges only.
> 
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
> 
> This will also be significantly more lightweight since no fork()+exec()
> for the usermode helper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.
> 
> This is easy to test:
> 
> (a) coredump processing (we're using socat):
> 
>     > cat coredump_socket.sh
>     #!/bin/bash
> 
>     set -x
> 
>     sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
>     sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> 
> (b) trigger a coredump:
> 
>     user1@localhost:~/data/scripts$ cat crash.c
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     int main(int argc, char *argv[])
>     {
>             fprintf(stderr, "%u\n", (1 / 0));
>             _exit(0);
>     }

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 14:06     ` Christian Brauner
@ 2025-05-05 18:40       ` Kuniyuki Iwashima
  2025-05-05 19:10         ` Jann Horn
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 18:40 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Mon, 5 May 2025 16:06:40 +0200
> On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > Make sure that only tasks that actually coredumped may connect to the
> > > coredump socket. This restriction may be loosened later in case
> > > userspace processes would like to use it to generate their own
> > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > socket for that.
> > 
> > This implementation kinda feels a bit fragile to me... I wonder if we
> > could instead have a flag inside the af_unix client socket that says
> > "this is a special client socket for coredumping".
> 
> Should be easily doable with a sock_flag().

This restriction should be applied by BPF LSM.

It's hard to loosen such a default restriction, as someone might
argue that's unexpected and a regression.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 04/10] coredump: add coredump socket
  2025-05-05 11:13 ` [PATCH RFC v3 04/10] coredump: add coredump socket Christian Brauner
  2025-05-05 12:55   ` Jann Horn
@ 2025-05-05 18:48   ` Kuniyuki Iwashima
  2025-05-06  8:24     ` Christian Brauner
  1 sibling, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 18:48 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Mon, 05 May 2025 13:13:42 +0200
> @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		}
>  		break;
>  	}
> +	case COREDUMP_SOCK: {
> +		struct file *file __free(fput) = NULL;
> +#ifdef CONFIG_UNIX
> +		struct socket *socket;
> +
> +		/*
> +		 * It is possible that the userspace process which is
> +		 * supposed to handle the coredump and is listening on
> +		 * the AF_UNIX socket coredumps. Userspace should just
> +		 * mark itself non dumpable.
> +		 */
> +
> +		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> +		if (retval < 0)
> +			goto close_fail;
> +
> +		file = sock_alloc_file(socket, 0, NULL);
> +		if (IS_ERR(file)) {
> +			sock_release(socket);
> +			retval = PTR_ERR(file);
> +			goto close_fail;
> +		}
> +
> +		retval = kernel_connect(socket,
> +					(struct sockaddr *)(&coredump_unix_socket),
> +					COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);

This blocks forever if the listener's accept() queue is full.

I think we don't want that and should pass O_NONBLOCK.

Keeping the queue clean is userspace's responsibility, and we don't
need to care about a weird user.


> +		if (retval)
> +			goto close_fail;
> +
> +		cprm.limit = RLIM_INFINITY;
> +#endif
> +		cprm.file = no_free_ptr(file);
> +		break;
> +	}
>  	default:
>  		WARN_ON_ONCE(true);
>  		retval = -EINVAL;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 18:40       ` Kuniyuki Iwashima
@ 2025-05-05 19:10         ` Jann Horn
  2025-05-05 19:35           ` Kuniyuki Iwashima
  2025-05-06  8:06           ` Christian Brauner
  0 siblings, 2 replies; 44+ messages in thread
From: Jann Horn @ 2025-05-05 19:10 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: brauner, alexander, bluca, daan.j.demeyer, davem, david, edumazet,
	horms, jack, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Mon, 5 May 2025 16:06:40 +0200
> > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > Make sure that only tasks that actually coredumped may connect to the
> > > > coredump socket. This restriction may be loosened later in case
> > > > userspace processes would like to use it to generate their own
> > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > socket for that.
> > >
> > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > could instead have a flag inside the af_unix client socket that says
> > > "this is a special client socket for coredumping".
> >
> > Should be easily doable with a sock_flag().
>
> This restriction should be applied by BPF LSM.

I think we shouldn't allow random userspace processes to connect to
the core dump handling service and provide bogus inputs; that
unnecessarily increases the risk that a crafted coredump can be used
to exploit a bug in the service. So I think it makes sense to enforce
this restriction in the kernel.

My understanding is that BPF LSM creates fairly tight coupling between
userspace and the kernel implementation, and it is kind of unwieldy
for userspace. (I imagine the "man 5 core" manpage would get a bit
longer and describe more kernel implementation detail if you tried to
show how to write a BPF LSM that is capable of detecting unix domain
socket connections to a specific address that are not initiated by
core dumping.) I would like to keep it possible to implement core
userspace functionality in a best-practice way without needing eBPF.

> It's hard to loosen such a default restriction, as someone might
> argue that's unexpected and a regression.

If userspace wants to allow other processes to connect to the core
dumping service, that's easy to implement - userspace can listen on a
separate address that is not subject to these restrictions.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:10         ` Jann Horn
@ 2025-05-05 19:35           ` Kuniyuki Iwashima
  2025-05-05 19:44             ` Kuniyuki Iwashima
  2025-05-05 19:55             ` Jann Horn
  2025-05-06  8:06           ` Christian Brauner
  1 sibling, 2 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 19:35 UTC (permalink / raw)
  To: jannh
  Cc: alexander, bluca, brauner, daan.j.demeyer, davem, david, edumazet,
	horms, jack, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Jann Horn <jannh@google.com>
Date: Mon, 5 May 2025 21:10:28 +0200
> On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Mon, 5 May 2025 16:06:40 +0200
> > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > coredump socket. This restriction may be loosened later in case
> > > > > userspace processes would like to use it to generate their own
> > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > socket for that.
> > > >
> > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > could instead have a flag inside the af_unix client socket that says
> > > > "this is a special client socket for coredumping".
> > >
> > > Should be easily doable with a sock_flag().
> >
> > This restriction should be applied by BPF LSM.
> 
> I think we shouldn't allow random userspace processes to connect to
> the core dump handling service and provide bogus inputs; that
> unnecessarily increases the risk that a crafted coredump can be used
> to exploit a bug in the service. So I think it makes sense to enforce
> this restriction in the kernel.

It's already restricted by /proc/sys/kernel/core_pattern.
We don't need duplicated logic.

Even when the process holding the listener dies, you can
still avoid such a leak.

e.g.

1. Set up a listener
2. Put the socket into a bpf map
3. Attach LSM at connect()

Then, the LSM checks if the destination socket is

  * listening on the name specified in /proc/sys/kernel/core_pattern
  * exists in the associated BPF map

So, if the socket dies and a malicious user tries to hijack
the core_pattern name, the LSM still rejects such a connect().

Later, the admin can restart the program with different core_pattern.


> 
> My understanding is that BPF LSM creates fairly tight coupling between
> userspace and the kernel implementation, and it is kind of unwieldy
> for userspace. (I imagine the "man 5 core" manpage would get a bit
> longer and describe more kernel implementation detail if you tried to
> show how to write a BPF LSM that is capable of detecting unix domain
> socket connections to a specific address that are not initiated by
> core dumping.) I would like to keep it possible to implement core
> userspace functionality in a best-practice way without needing eBPF.

I think the untrusted user scenario is paranoia in most cases,
and the man page just says "if you really care, use BPF LSM".

If someone can listen on a name AND set it to core_pattern, most
likely something worse already happened.


> 
> > It's hard to loosen such a default restriction, as someone might
> > argue that's unexpected and a regression.
> 
> If userspace wants to allow other processes to connect to the core
> dumping service, that's easy to implement - userspace can listen on a
> separate address that is not subject to these restrictions.
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:35           ` Kuniyuki Iwashima
@ 2025-05-05 19:44             ` Kuniyuki Iwashima
  2025-05-05 19:55               ` Jann Horn
  2025-05-05 19:55             ` Jann Horn
  1 sibling, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 19:44 UTC (permalink / raw)
  To: kuniyu
  Cc: alexander, bluca, brauner, daan.j.demeyer, davem, david, edumazet,
	horms, jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Mon, 5 May 2025 12:35:50 -0700
> From: Jann Horn <jannh@google.com>
> Date: Mon, 5 May 2025 21:10:28 +0200
> > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > userspace processes would like to use it to generate their own
> > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > socket for that.
> > > > >
> > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > could instead have a flag inside the af_unix client socket that says
> > > > > "this is a special client socket for coredumping".
> > > >
> > > > Should be easily doable with a sock_flag().
> > >
> > > This restriction should be applied by BPF LSM.
> > 
> > I think we shouldn't allow random userspace processes to connect to
> > the core dump handling service and provide bogus inputs; that
> > unnecessarily increases the risk that a crafted coredump can be used
> > to exploit a bug in the service. So I think it makes sense to enforce
> > this restriction in the kernel.
> 
> It's already restricted by /proc/sys/kernel/core_pattern.
> We don't need duplicated logic.
> 
> Even when the process holding the listener dies, you can
> still avoid such a leak.
> 
> e.g.
> 
> 1. Set up a listener
> 2. Put the socket into a bpf map
> 3. Attach LSM at connect()
> 
> Then, the LSM checks if the destination socket is
> 
>   * listening on the name specified in /proc/sys/kernel/core_pattern
>   * exists in the associated BPF map

and LSM can check if the source socket is a kernel socket too.


> 
> So, if the socket dies and a malicious user tries to hijack
> the core_pattern name, the LSM still rejects such a connect().
> 
> Later, the admin can restart the program with different core_pattern.
> 
> 
> > 
> > My understanding is that BPF LSM creates fairly tight coupling between
> > userspace and the kernel implementation, and it is kind of unwieldy
> > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > longer and describe more kernel implementation detail if you tried to
> > show how to write a BPF LSM that is capable of detecting unix domain
> > socket connections to a specific address that are not initiated by
> > core dumping.) I would like to keep it possible to implement core
> > userspace functionality in a best-practice way without needing eBPF.
> 
> I think the untrusted user scenario is paranoia in most cases,
> and the man page just says "if you really care, use BPF LSM".
> 
> If someone can listen on a name AND set it to core_pattern, most
> likely something worse already happened.
> 
> 
> > 
> > > It's hard to loosen such a default restriction, as someone might
> > > argue that's unexpected and a regression.
> > 
> > If userspace wants to allow other processes to connect to the core
> > dumping service, that's easy to implement - userspace can listen on a
> > separate address that is not subject to these restrictions.
> > 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:35           ` Kuniyuki Iwashima
  2025-05-05 19:44             ` Kuniyuki Iwashima
@ 2025-05-05 19:55             ` Jann Horn
  2025-05-05 20:30               ` Kuniyuki Iwashima
  1 sibling, 1 reply; 44+ messages in thread
From: Jann Horn @ 2025-05-05 19:55 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, brauner, daan.j.demeyer, davem, david, edumazet,
	horms, jack, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On Mon, May 5, 2025 at 9:38 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Jann Horn <jannh@google.com>
> Date: Mon, 5 May 2025 21:10:28 +0200
> > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > userspace processes would like to use it to generate their own
> > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > socket for that.
> > > > >
> > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > could instead have a flag inside the af_unix client socket that says
> > > > > "this is a special client socket for coredumping".
> > > >
> > > > Should be easily doable with a sock_flag().
> > >
> > > This restriction should be applied by BPF LSM.
> >
> > I think we shouldn't allow random userspace processes to connect to
> > the core dump handling service and provide bogus inputs; that
> > unnecessarily increases the risk that a crafted coredump can be used
> > to exploit a bug in the service. So I think it makes sense to enforce
> > this restriction in the kernel.
>
> It's already restricted by /proc/sys/kernel/core_pattern.
> We don't need duplicated logic.

The core_pattern does not restrict which processes can call connect()
on the unix domain socket address.

> Even when the process holding the listener dies, you can
> still avoid such a leak.
>
> e.g.
>
> 1. Set up a listener
> 2. Put the socket into a bpf map
> 3. Attach LSM at connect()
>
> Then, the LSM checks if the destination socket is

Where does the LSM get the destination socket pointer from? The
socket_connect LSM hook happens before the point where the destination
socket is looked up. What you have in that hook is the unix socket
address structure.

>   * listening on the name specified in /proc/sys/kernel/core_pattern
>   * exists in the associated BPF map
>
> So, if the socket dies and a malicious user tries to hijack
> the core_pattern name, the LSM still rejects such a connect().

This patch is not about a malicious user binding to the core dumping
service's unix domain socket address, that is blocked in "[PATCH RFC
v3 03/10] net: reserve prefix". This patch is about preventing
userspace from connect()ing to the legitimate listening socket.

> Later, the admin can restart the program with different core_pattern.
>
>
> >
> > My understanding is that BPF LSM creates fairly tight coupling between
> > userspace and the kernel implementation, and it is kind of unwieldy
> > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > longer and describe more kernel implementation detail if you tried to
> > show how to write a BPF LSM that is capable of detecting unix domain
> > socket connections to a specific address that are not initiated by
> > core dumping.) I would like to keep it possible to implement core
> > userspace functionality in a best-practice way without needing eBPF.
>
> I think the untrusted user scenario is paranoia in most cases,
> and the man page just says "if you really care, use BPF LSM".

Are you saying that you expect crash dumping services to be written
with the expectation that the system will not have multiple users or
multiple security contexts?

> If someone can listen on a name AND set it to core_pattern, most
> likely something worse already happened.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:44             ` Kuniyuki Iwashima
@ 2025-05-05 19:55               ` Jann Horn
  2025-05-05 20:41                 ` Kuniyuki Iwashima
  2025-05-06  7:39                 ` Christian Brauner
  0 siblings, 2 replies; 44+ messages in thread
From: Jann Horn @ 2025-05-05 19:55 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, brauner, daan.j.demeyer, davem, david, edumazet,
	horms, jack, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On Mon, May 5, 2025 at 9:45 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> and LSM can check if the source socket is a kernel socket too.

("a kernel socket" is not necessarily the same as "a kernel socket
intended for core dumping")

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:55             ` Jann Horn
@ 2025-05-05 20:30               ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 20:30 UTC (permalink / raw)
  To: jannh
  Cc: alexander, bluca, brauner, daan.j.demeyer, davem, david, edumazet,
	horms, jack, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Jann Horn <jannh@google.com>
Date: Mon, 5 May 2025 21:55:04 +0200
> On Mon, May 5, 2025 at 9:38 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Jann Horn <jannh@google.com>
> > Date: Mon, 5 May 2025 21:10:28 +0200
> > > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > From: Christian Brauner <brauner@kernel.org>
> > > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > > userspace processes would like to use it to generate their own
> > > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > > socket for that.
> > > > > >
> > > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > > could instead have a flag inside the af_unix client socket that says
> > > > > > "this is a special client socket for coredumping".
> > > > >
> > > > > Should be easily doable with a sock_flag().
> > > >
> > > > This restriction should be applied by BPF LSM.
> > >
> > > I think we shouldn't allow random userspace processes to connect to
> > > the core dump handling service and provide bogus inputs; that
> > > unnecessarily increases the risk that a crafted coredump can be used
> > > to exploit a bug in the service. So I think it makes sense to enforce
> > > this restriction in the kernel.
> >
> > It's already restricted by /proc/sys/kernel/core_pattern.
> > We don't need a duplicated logic.
> 
> The core_pattern does not restrict which processes can call connect()
> on the unix domain socket address.

I misread it, which is why I added that an LSM can filter the source as well.


> 
> > Even when the process holding the listener dies, you can
> > still avoid such a leak.
> >
> > e.g.
> >
> > 1. Set up a listener
> > 2. Put the socket into a bpf map
> > 3. Attach LSM at connect()
> >
> > Then, the LSM checks if the destination socket is
> 
> Where does the LSM get the destination socket pointer from? The
> socket_connect LSM hook happens before the point where the destination
> socket is looked up. What you have in that hook is the unix socket
> address structure.

The hook is invoked after the lookup.
"sk" is the source and "other" is the destination.

        err = security_unix_stream_connect(sk, other, newsk);
        if (err) {
                unix_state_unlock(sk);
                goto out_unlock;
        }


> 
> >   * listening on the name specified in /proc/sys/kernel/core_pattern
> >   * exists in the associated BPF map
> >
> > So, if the socket dies and a malicious user tries to hijack
> > the core_pattern name, LSM still rejects such connect().
> 
> This patch is not about a malicious user binding to the core dumping
> service's unix domain socket address, that is blocked in "[PATCH RFC
> v3 03/10] net: reserve prefix". This patch is about preventing
> userspace from connect()ing to the legitimate listening socket.

I mean both can be better handled by BPF.


> 
> > Later, the admin can restart the program with different core_pattern.
> >
> >
> > >
> > > My understanding is that BPF LSM creates fairly tight coupling between
> > > userspace and the kernel implementation, and it is kind of unwieldy
> > > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > > longer and describe more kernel implementation detail if you tried to
> > > show how to write a BPF LSM that is capable of detecting unix domain
> > > socket connections to a specific address that are not initiated by
> > > core dumping.) I would like to keep it possible to implement core
> > > userspace functionality in a best-practice way without needing eBPF.
> >
> > I think the untrusted user scenario is paranoia in most cases,
> > and the man page just says "if you really care, use BPF LSM".
> 
> Are you saying that you expect crash dumping services to be written
> with the expectation that the system will not have multiple users or
> multiple security contexts?

Do you mean another program could use a different LSM?

I think different LSMs can co-exist, see call_int_hook().

I remember BPF LSM was not stackable at some point, but I'm not sure
if that's still true.

I expect the service to use the necessary functionality if it's
needed for the specific use case.

Not everyone needs such a restriction, and it's optional. Why not
allow those who really need it to implement it?


> 
> > If someone can listen on a name AND set it to core_pattern, most
> > likely something worse already happened.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:55               ` Jann Horn
@ 2025-05-05 20:41                 ` Kuniyuki Iwashima
  2025-05-06  7:39                 ` Christian Brauner
  1 sibling, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 20:41 UTC (permalink / raw)
  To: jannh
  Cc: alexander, bluca, brauner, daan.j.demeyer, davem, david, edumazet,
	horms, jack, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Jann Horn <jannh@google.com>
Date: Mon, 5 May 2025 21:55:06 +0200
> On Mon, May 5, 2025 at 9:45 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > and LSM can check if the source socket is a kernel socket too.
> 
> ("a kernel socket" is not necessarily the same as "a kernel socket
> intended for core dumping")

Yes, but why do we need to care about it :)

It doesn't happen, or it comes from an out-of-tree driver that is out
of our control but should be under control on the host where the
service is running.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 00/10] coredump: add coredump socket
  2025-05-05 18:33 ` Kuniyuki Iwashima
@ 2025-05-06  7:33   ` Christian Brauner
  0 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-06  7:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

> > The coredump socket is located in the initial network namespace.
> 
> I understand this is a reasonable decision to avoid complicated
> path management in the mount ns but keep connectivity from any
> namespace.

Yes, path lookup would not just be horrid, it would also require playing
around with credentials and current->fs. The beauty of this
implementation is that it's the crash-dumping process itself which does
everything.

> > To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> > 
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are actually coredumping are allowed to connect to the initial coredump
> > socket.
> 
> This can be controlled by BPF (cgroup sockops or LSM) if a user
> really cares about spam clients.
> 
> I think how to set up coredump is userspace responsibility.

I'll reply to that in the other thread so we don't have millions of
branch points.

> > - Since unix_stream_connect() runs bpf programs during connect it's
> >   possible to even redirect or multiplex coredumps to other sockets.
> 
> If the socket is in a cgroup, yes, and even if not, BPF LSM can
> reject some requests.

Indeed. I've outlined that in an earlier version as well.

> > - The coredump server should mark itself as non-dumpable.
> >   To capture coredumps for the coredump server itself a bpf program
> >   should be run at connect to redirect it to another socket in
> >   userspace. This can be useful for debugging crashing coredump servers.
> > 
> > - A container coredump server in a separate network namespace can simply
> >   bind to linuxafsk/coredump.socket and systemd-coredump forwards
> >   coredumps to the container.
> 
> I think the name should be also configurable in non-initial netns.

I don't see a good reason for this. We can always relax that later if we
have to. The fixed address keeps the coredump setup very very dumb and
simple.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:55               ` Jann Horn
  2025-05-05 20:41                 ` Kuniyuki Iwashima
@ 2025-05-06  7:39                 ` Christian Brauner
  2025-05-06 14:51                   ` Jann Horn
  1 sibling, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2025-05-06  7:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kuniyuki Iwashima, alexander, bluca, daan.j.demeyer, davem, david,
	edumazet, horms, jack, kuba, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

> ("a kernel socket" is not necessarily the same as "a kernel socket
> intended for core dumping")

Indeed. The usermodehelper is a kernel protocol. Here it's the task with
its own credentials that's connecting to a userspace socket, which makes
this very elegant because it's just userspace IPC. No one is running
around with kernel credentials anywhere.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-05 19:10         ` Jann Horn
  2025-05-05 19:35           ` Kuniyuki Iwashima
@ 2025-05-06  8:06           ` Christian Brauner
  2025-05-06 14:37             ` Jann Horn
  2025-05-06 19:18             ` Kuniyuki Iwashima
  1 sibling, 2 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-06  8:06 UTC (permalink / raw)
  To: Jann Horn, Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, kuba, lennart, linux-fsdevel, linux-kernel, me, netdev,
	oleg, pabeni, viro, zbyszek

On Mon, May 05, 2025 at 09:10:28PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Mon, 5 May 2025 16:06:40 +0200
> > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > coredump socket. This restriction may be loosened later in case
> > > > > userspace processes would like to use it to generate their own
> > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > socket for that.
> > > >
> > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > could instead have a flag inside the af_unix client socket that says
> > > > "this is a special client socket for coredumping".
> > >
> > > Should be easily doable with a sock_flag().
> >
> > This restriction should be applied by BPF LSM.
> 
> I think we shouldn't allow random userspace processes to connect to
> the core dump handling service and provide bogus inputs; that
> unnecessarily increases the risk that a crafted coredump can be used
> to exploit a bug in the service. So I think it makes sense to enforce
> this restriction in the kernel.
> 
> My understanding is that BPF LSM creates fairly tight coupling between
> userspace and the kernel implementation, and it is kind of unwieldy
> for userspace. (I imagine the "man 5 core" manpage would get a bit
> longer and describe more kernel implementation detail if you tried to
> show how to write a BPF LSM that is capable of detecting unix domain
> socket connections to a specific address that are not initiated by
> core dumping.) I would like to keep it possible to implement core
> userspace functionality in a best-practice way without needing eBPF.
> 
> > It's hard to loosen such a default restriction as someone might
> > argue that it's unexpected and a regression.
> 
> If userspace wants to allow other processes to connect to the core
> dumping service, that's easy to implement - userspace can listen on a
> separate address that is not subject to these restrictions.

I think Kuniyuki's point is defensible. And I did discuss this with
Lennart when I wrote the patch and he didn't see a point in preventing
other processes from connecting to the core dump socket. He actually
would like this to be possible because there's some userspace programs
out there that generate their own coredumps (Python?) and he wanted them
to use the general coredump socket to send them to.

I just found it more elegant to simply guarantee that connections
made to that socket come from coredumping tasks.

But I should note there are two ways to cleanly handle this in
userspace. I had already mentioned the bpf LSM in the context of
rate-limiting in an earlier posting:

(1) complex:

    Use a bpf LSM to intercept the connection request via
    security_unix_stream_connect() in unix_stream_connect().

    The bpf program can simply check:

    current->signal->core_state

    and reject any connection if it isn't set to NULL.

    The big downside is that bpf (and security) need to be enabled.
    Neither is guaranteed and there's quite a few users out there that
    don't enable bpf.

(2) simple (and supported in this series):

    Userspace accepts a connection. It has to get SO_PEERPIDFD anyway.
    It then needs to verify:

    struct pidfd_info info = {
            info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
    };

    ioctl(pidfd, PIDFD_GET_INFO, &info);
    if (!(info.mask & PIDFD_INFO_COREDUMP)) {
            // Can't be from a coredumping task so we can close the
	    // connection without reading.
	    close(coredump_client_fd);
	    return;
    }

    /* This has to be set and is only settable by do_coredump(). */
    if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
            // Can't be from a coredumping task so we can close the
	    // connection without reading.
	    close(coredump_client_fd);
	    return;
    }

    // Ok, this is a connection from a task that has coredumped, let's
    // handle it.

    The crux is that the series guarantees that, by the time the
    connection is made, the information whether the task/thread-group
    coredumped is available via the pidfd.
 
I think if we document that most coredump servers have to do (2) then
this is fine. But I wouldn't mind a nod from Jann on this.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 04/10] coredump: add coredump socket
  2025-05-05 18:48   ` Kuniyuki Iwashima
@ 2025-05-06  8:24     ` Christian Brauner
  0 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2025-05-06  8:24 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel, me,
	netdev, oleg, pabeni, viro, zbyszek

On Mon, May 05, 2025 at 11:48:43AM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Mon, 05 May 2025 13:13:42 +0200
> > @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >  		}
> >  		break;
> >  	}
> > +	case COREDUMP_SOCK: {
> > +		struct file *file __free(fput) = NULL;
> > +#ifdef CONFIG_UNIX
> > +		struct socket *socket;
> > +
> > +		/*
> > +		 * It is possible that the userspace process which is
> > +		 * supposed to handle the coredump and is listening on
> > +		 * the AF_UNIX socket coredumps. Userspace should just
> > +		 * mark itself non dumpable.
> > +		 */
> > +
> > +		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > +		if (retval < 0)
> > +			goto close_fail;
> > +
> > +		file = sock_alloc_file(socket, 0, NULL);
> > +		if (IS_ERR(file)) {
> > +			sock_release(socket);
> > +			retval = PTR_ERR(file);
> > +			goto close_fail;
> > +		}
> > +
> > +		retval = kernel_connect(socket,
> > +					(struct sockaddr *)(&coredump_unix_socket),
> > +					COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
> 
> This blocks forever if the listener's accept() queue is full.
> 
> I think we don't want that and should pass O_NONBLOCK.
> 
> Keeping the queue clean is userspace's responsibility, and we don't
> need to care about a weird user.

That seems fine to me. I've changed that.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06  8:06           ` Christian Brauner
@ 2025-05-06 14:37             ` Jann Horn
  2025-05-06 19:18             ` Kuniyuki Iwashima
  1 sibling, 0 replies; 44+ messages in thread
From: Jann Horn @ 2025-05-06 14:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kuniyuki Iwashima, alexander, bluca, daan.j.demeyer, davem, david,
	edumazet, horms, jack, kuba, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

On Tue, May 6, 2025 at 10:06 AM Christian Brauner <brauner@kernel.org> wrote:
> On Mon, May 05, 2025 at 09:10:28PM +0200, Jann Horn wrote:
> > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > userspace processes would like to use it to generate their own
> > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > socket for that.
> > > > >
> > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > could instead have a flag inside the af_unix client socket that says
> > > > > "this is a special client socket for coredumping".
> > > >
> > > > Should be easily doable with a sock_flag().
> > >
> > > This restriction should be applied by BPF LSM.
> >
> > I think we shouldn't allow random userspace processes to connect to
> > the core dump handling service and provide bogus inputs; that
> > unnecessarily increases the risk that a crafted coredump can be used
> > to exploit a bug in the service. So I think it makes sense to enforce
> > this restriction in the kernel.
> >
> > My understanding is that BPF LSM creates fairly tight coupling between
> > userspace and the kernel implementation, and it is kind of unwieldy
> > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > longer and describe more kernel implementation detail if you tried to
> > show how to write a BPF LSM that is capable of detecting unix domain
> > socket connections to a specific address that are not initiated by
> > core dumping.) I would like to keep it possible to implement core
> > userspace functionality in a best-practice way without needing eBPF.
> >
> > > It's hard to loosen such a default restriction as someone might
> > > argue that it's unexpected and a regression.
> >
> > If userspace wants to allow other processes to connect to the core
> > dumping service, that's easy to implement - userspace can listen on a
> > separate address that is not subject to these restrictions.
>
> I think Kuniyuki's point is defensible. And I did discuss this with
> Lennart when I wrote the patch and he didn't see a point in preventing
> other processes from connecting to the core dump socket. He actually
> would like this to be possible because there's some userspace programs
> out there that generate their own coredumps (Python?) and he wanted them
> to use the general coredump socket to send them to.
>
> I just found it more elegant to simply guarantee that connections
> made to that socket come from coredumping tasks.
>
> But I should note there are two ways to cleanly handle this in
> userspace. I had already mentioned the bpf LSM in the context of
> rate-limiting in an earlier posting:
>
> (1) complex:
>
>     Use a bpf LSM to intercept the connection request via
>     security_unix_stream_connect() in unix_stream_connect().
>
>     The bpf program can simply check:
>
>     current->signal->core_state
>
>     and reject any connection if it isn't set to NULL.

I think that would be racy, since zap_threads sets that pointer before
ensuring that the other threads under the signal_struct are killed.

>     The big downside is that bpf (and security) need to be enabled.
>     Neither is guaranteed and there's quite a few users out there that
>     don't enable bpf.
>
> (2) simple (and supported in this series):
>
>     Userspace accepts a connection. It has to get SO_PEERPIDFD anyway.
>     It then needs to verify:
>
>     struct pidfd_info info = {
>             info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
>     };
>
>     ioctl(pidfd, PIDFD_GET_INFO, &info);
>     if (!(info.mask & PIDFD_INFO_COREDUMP)) {
>             // Can't be from a coredumping task so we can close the
>             // connection without reading.
>             close(coredump_client_fd);
>             return;
>     }
>
>     /* This has to be set and is only settable by do_coredump(). */
>     if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
>             // Can't be from a coredumping task so we can close the
>             // connection without reading.
>             close(coredump_client_fd);
>             return;
>     }
>
>     // Ok, this is a connection from a task that has coredumped, let's
>     // handle it.
>
>     The crux is that the series guarantees that, by the time the
>     connection is made, the information whether the task/thread-group
>     coredumped is available via the pidfd.
>
> I think if we document that most coredump servers have to do (2) then
> this is fine. But I wouldn't mind a nod from Jann on this.

I wouldn't recommend either of these as a way to verify that the data
coming over the socket is a core dump generated by the kernel, since
they both look racy in that regard.

But given that you're saying the initial userspace user wouldn't
actually want such a restriction, and that we could later provide a
separate way for userspace to check what initiated the connection, I
guess this is fine for now.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06  7:39                 ` Christian Brauner
@ 2025-05-06 14:51                   ` Jann Horn
  2025-05-06 15:16                     ` Christian Brauner
  2025-05-07 11:50                     ` Mickaël Salaün
  0 siblings, 2 replies; 44+ messages in thread
From: Jann Horn @ 2025-05-06 14:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kuniyuki Iwashima, alexander, bluca, daan.j.demeyer, davem, david,
	edumazet, horms, jack, kuba, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

On Tue, May 6, 2025 at 9:39 AM Christian Brauner <brauner@kernel.org> wrote:
> > ("a kernel socket" is not necessarily the same as "a kernel socket
> > intended for core dumping")
>
> Indeed. The usermodehelper is a kernel protocol. Here it's the task with
> its own credentials that's connecting to a userspace socket. Which makes
> this very elegant because it's just userspace IPC. No one is running
> around with kernel credentials anywhere.

To be clear: I think your current patch is using special kernel
privileges in one regard, because kernel_connect() bypasses the
security_socket_connect() security hook. I think it is a good thing
that it bypasses security hooks in this way; I think we wouldn't want
LSMs to get in the way of this special connect(), since the task in
whose context the connect() call happens is not in control of this
connection; the system administrator is the one who decided that this
connect() should happen on core dumps. It is kind of inconsistent
though that that separate security_unix_stream_connect() LSM hook will
still be invoked in this case, and we might have to watch out to make
sure that LSMs won't end up blocking such connections... which I think
is related to what Mickael was saying on the other thread. Landlock
currently doesn't filter abstract connections at that hook, so for now
this would only be relevant for SELinux and Smack. I guess those are
maybe less problematic in this regard because they work on full-system
policies rather than app-specific policies; but still, with the
current implementation, SELinux/Smack policies would need to be
designed to allow processes to connect to the core dumping socket to
make core dumping work.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06 14:51                   ` Jann Horn
@ 2025-05-06 15:16                     ` Christian Brauner
  2025-05-06 19:28                       ` Kuniyuki Iwashima
  2025-05-07 11:50                     ` Mickaël Salaün
  1 sibling, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2025-05-06 15:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kuniyuki Iwashima, alexander, bluca, daan.j.demeyer, davem, david,
	edumazet, horms, jack, kuba, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

On Tue, May 06, 2025 at 04:51:25PM +0200, Jann Horn wrote:
> On Tue, May 6, 2025 at 9:39 AM Christian Brauner <brauner@kernel.org> wrote:
> > > ("a kernel socket" is not necessarily the same as "a kernel socket
> > > intended for core dumping")
> >
> > Indeed. The usermodehelper is a kernel protocol. Here it's the task with
> > its own credentials that's connecting to a userspace socket. Which makes
> > this very elegant because it's just userspace IPC. No one is running
> > around with kernel credentials anywhere.
> 
> To be clear: I think your current patch is using special kernel
> privileges in one regard, because kernel_connect() bypasses the
> security_socket_connect() security hook. I think it is a good thing
> that it bypasses security hooks in this way; I think we wouldn't want
> LSMs to get in the way of this special connect(), since the task in
> whose context the connect() call happens is not in control of this
> connection; the system administrator is the one who decided that this
> connect() should happen on core dumps. It is kind of inconsistent
> though that that separate security_unix_stream_connect() LSM hook will
> still be invoked in this case, and we might have to watch out to make
> sure that LSMs won't end up blocking such connections... which I think

Right, it is the same as for the usermode helper. It calls
kernel_execve() which invokes at least security_bprm_creds_for_exec()
and security_bprm_check(). Both of which can be used to make the
usermodehelper execve fail.

Fwiw, it's even the case for dumping directly to a file as in that case
it's subject to all kinds of lookup and open security hooks like
security_file_open() and then another round in do_truncate().

All of that happens fully in the task's context as well via
file_open()/file_open_root() or do_truncate().

So there's nothing special here.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06  8:06           ` Christian Brauner
  2025-05-06 14:37             ` Jann Horn
@ 2025-05-06 19:18             ` Kuniyuki Iwashima
  2025-05-07 11:51               ` Mickaël Salaün
  1 sibling, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-06 19:18 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Tue, 6 May 2025 10:06:27 +0200
> On Mon, May 05, 2025 at 09:10:28PM +0200, Jann Horn wrote:
> > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > userspace processes would like to use it to generate their own
> > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > socket for that.
> > > > >
> > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > could instead have a flag inside the af_unix client socket that says
> > > > > "this is a special client socket for coredumping".
> > > >
> > > > Should be easily doable with a sock_flag().
> > >
> > > This restriction should be applied by BPF LSM.
> > 
> > I think we shouldn't allow random userspace processes to connect to
> > the core dump handling service and provide bogus inputs; that
> > unnecessarily increases the risk that a crafted coredump can be used
> > to exploit a bug in the service. So I think it makes sense to enforce
> > this restriction in the kernel.
> > 
> > My understanding is that BPF LSM creates fairly tight coupling between
> > userspace and the kernel implementation, and it is kind of unwieldy
> > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > longer and describe more kernel implementation detail if you tried to
> > show how to write a BPF LSM that is capable of detecting unix domain
> > socket connections to a specific address that are not initiated by
> > core dumping.) I would like to keep it possible to implement core
> > userspace functionality in a best-practice way without needing eBPF.
> > 
> > > It's hard to loosen such a default restriction as someone might
> > > argue that it's unexpected and a regression.
> > 
> > If userspace wants to allow other processes to connect to the core
> > dumping service, that's easy to implement - userspace can listen on a
> > separate address that is not subject to these restrictions.
> 
> I think Kuniyuki's point is defensible. And I did discuss this with
> Lennart when I wrote the patch and he didn't see a point in preventing
> other processes from connecting to the core dump socket. He actually
> would like this to be possible because there's some userspace programs
> out there that generate their own coredumps (Python?) and he wanted them
> to use the general coredump socket to send them to.
> 
> I just found it more elegant to simply guarantee that connections
> made to that socket come from coredumping tasks.
> 
> But I should note there are two ways to cleanly handle this in
> userspace. I had already mentioned the bpf LSM in the context of
> rate-limiting in an earlier posting:
> 
> (1) complex:
> 
>     Use a bpf LSM to intercept the connection request via
>     security_unix_stream_connect() in unix_stream_connect().
> 
>     The bpf program can simply check:
> 
>     current->signal->core_state
> 
>     and reject any connection if it isn't set to NULL.
> 
>     The big downside is that bpf (and security) need to be enabled.
>     Neither is guaranteed and there's quite a few users out there that
>     don't enable bpf.
> 
> (2) simple (and supported in this series):
> 
>     Userspace accepts a connection. It has to get SO_PEERPIDFD anyway.
>     It then needs to verify:
> 
>     struct pidfd_info info = {
>             info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
>     };
> 
>     ioctl(pidfd, PIDFD_GET_INFO, &info);
>     if (!(info.mask & PIDFD_INFO_COREDUMP)) {
>             // Can't be from a coredumping task so we can close the
> 	    // connection without reading.
> 	    close(coredump_client_fd);
> 	    return;
>     }
> 
>     /* This has to be set and is only settable by do_coredump(). */
>     if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
>             // Can't be from a coredumping task so we can close the
> 	    // connection without reading.
> 	    close(coredump_client_fd);
> 	    return;
>     }
> 
>     // Ok, this is a connection from a task that has coredumped, let's
>     // handle it.
> 
>     The crux is that the series guarantees that, by the time the
>     connection is made, the information whether the task/thread-group
>     coredumped is available via the pidfd.
>  
> I think if we document that most coredump servers have to do (2) then
> this is fine. But I wouldn't mind a nod from Jann on this.

I like this approach (2) allowing users to filter the right client.
This way we can extend the application flexibly for another coredump
service.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06 15:16                     ` Christian Brauner
@ 2025-05-06 19:28                       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-06 19:28 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, davem, david, edumazet, horms,
	jack, jannh, kuba, kuniyu, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Tue, 6 May 2025 17:16:13 +0200
> On Tue, May 06, 2025 at 04:51:25PM +0200, Jann Horn wrote:
> > On Tue, May 6, 2025 at 9:39 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > ("a kernel socket" is not necessarily the same as "a kernel socket
> > > > intended for core dumping")
> > >
> > > Indeed. The usermodehelper is a kernel protocol. Here it's the task with
> > > its own credentials that's connecting to a userspace socket. Which makes
> > > this very elegant because it's just userspace IPC. No one is running
> > > around with kernel credentials anywhere.
> > 
> > To be clear: I think your current patch is using special kernel
> > privileges in one regard, because kernel_connect() bypasses the
> > security_socket_connect() security hook.

Precisely; whether an LSM ignores kernel sockets depends on the LSM.

When we create a socket, kern=0/1 is passed to security_socket_create().
Currently, SELinux always ignores kernel sockets, and AppArmor's behaviour
depends on another condition.  Other LSMs don't care.  In particular, BPF LSM
is just a set of hooks to attach BPF programs to, so it can enforce whatever
it wants.


> > I think it is a good thing
> > that it bypasses security hooks in this way; I think we wouldn't want
> > LSMs to get in the way of this special connect(), since the task in
> > whose context the connect() call happens is not in control of this
> > connection; the system administrator is the one who decided that this
> > connect() should happen on core dumps. It is kind of inconsistent
> > though that that separate security_unix_stream_connect() LSM hook will
> > still be invoked in this case, and we might have to watch out to make
> > sure that LSMs won't end up blocking such connections... which I think
> 
> Right, it is the same as for the usermode helper. It calls
> kernel_execve() which invokes at least security_bprm_creds_for_exec()
> and security_bprm_check(). Both of which can be used to make the
> usermodehelper execve fail.
> 
> Fwiw, it's even the case for dumping directly to a file as in that case
> it's subject to all kinds of lookup and open security hooks like
> security_file_open() and then another round in do_truncate().
> 
> All of that happens fully in the task's context as well via
> file_open()/file_open_root() or do_truncate().
> 
> So there's nothing special here.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06 14:51                   ` Jann Horn
  2025-05-06 15:16                     ` Christian Brauner
@ 2025-05-07 11:50                     ` Mickaël Salaün
  1 sibling, 0 replies; 44+ messages in thread
From: Mickaël Salaün @ 2025-05-07 11:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Kuniyuki Iwashima, alexander, bluca,
	daan.j.demeyer, davem, david, edumazet, horms, jack, kuba,
	lennart, linux-fsdevel, linux-kernel, me, netdev, oleg, pabeni,
	viro, zbyszek

On Tue, May 06, 2025 at 04:51:25PM +0200, Jann Horn wrote:
> On Tue, May 6, 2025 at 9:39 AM Christian Brauner <brauner@kernel.org> wrote:
> > > ("a kernel socket" is not necessarily the same as "a kernel socket
> > > intended for core dumping")
> >
> > Indeed. The usermodehelper is a kernel protocol. Here it's the task with
> > its own credentials that's connecting to a userspace socket. Which makes
> > this very elegant because it's just userspace IPC. No one is running
> > around with kernel credentials anywhere.
> 
> To be clear: I think your current patch is using special kernel
> privileges in one regard, because kernel_connect() bypasses the
> security_socket_connect() security hook. I think it is a good thing
> that it bypasses security hooks in this way; I think we wouldn't want
> LSMs to get in the way of this special connect(), since the task in
> whose context the connect() call happens is not in control of this
> connection; the system administrator is the one who decided that this
> connect() should happen on core dumps. It is kind of inconsistent
> though that that separate security_unix_stream_connect() LSM hook will
> still be invoked in this case, and we might have to watch out to make
> sure that LSMs won't end up blocking such connections... which I think
> is related to what Mickael was saying on the other thread.

Right

> Landlock
> currently doesn't filter abstract connections at that hook, so for now

Landlock has implemented this hook since Linux 6.12 and can deny connections
from a sandboxed process to a peer outside the sandbox:
https://docs.kernel.org/userspace-api/landlock.html#ipc-scoping
I was worried that security_unix_stream_connect() would be called with
the task's credentials, which would block coredumps from sandboxed tasks.
This would also apply to other LSMs.

> this would only be relevant for SELinux and Smack. I guess those are
> maybe less problematic in this regard because they work on full-system
> policies rather than app-specific policies; but still, with the
> current implementation, SELinux/Smack policies would need to be
> designed to allow processes to connect to the core dumping socket to
> make core dumping work.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-06 19:18             ` Kuniyuki Iwashima
@ 2025-05-07 11:51               ` Mickaël Salaün
  2025-05-07 14:22                 ` Lennart Poettering
  2025-05-07 22:10                 ` Paul Moore
  0 siblings, 2 replies; 44+ messages in thread
From: Mickaël Salaün @ 2025-05-07 11:51 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: brauner, alexander, bluca, daan.j.demeyer, davem, david, edumazet,
	horms, jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel,
	me, netdev, oleg, pabeni, viro, zbyszek, linux-security-module

On Tue, May 06, 2025 at 12:18:12PM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Tue, 6 May 2025 10:06:27 +0200
> > On Mon, May 05, 2025 at 09:10:28PM +0200, Jann Horn wrote:
> > > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > From: Christian Brauner <brauner@kernel.org>
> > > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > > userspace processes would like to use it to generate their own
> > > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > > socket for that.
> > > > > >
> > > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > > could instead have a flag inside the af_unix client socket that says
> > > > > > "this is a special client socket for coredumping".
> > > > >
> > > > > Should be easily doable with a sock_flag().
> > > >
> > > > This restriction should be applied by BPF LSM.
> > > 
> > > I think we shouldn't allow random userspace processes to connect to
> > > the core dump handling service and provide bogus inputs; that
> > > unnecessarily increases the risk that a crafted coredump can be used
> > > to exploit a bug in the service. So I think it makes sense to enforce
> > > this restriction in the kernel.
> > > 
> > > My understanding is that BPF LSM creates fairly tight coupling between
> > > userspace and the kernel implementation, and it is kind of unwieldy
> > > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > > longer and describe more kernel implementation detail if you tried to
> > > show how to write a BPF LSM that is capable of detecting unix domain
> > > socket connections to a specific address that are not initiated by
> > > core dumping.) I would like to keep it possible to implement core
> > > userspace functionality in a best-practice way without needing eBPF.
> > > 
> > > > It's hard to loosen such a default restriction as someone might
> > > > argue that's unexpected and regression.
> > > 
> > > If userspace wants to allow other processes to connect to the core
> > > dumping service, that's easy to implement - userspace can listen on a
> > > separate address that is not subject to these restrictions.
> > 
> > I think Kuniyuki's point is defensible. And I did discuss this with
> > Lennart when I wrote the patch and he didn't see a point in preventing
> > other processes from connecting to the core dump socket. He actually
> > would like this to be possible because there's some userspace programs
> > out there that generate their own coredumps (Python?) and he wanted them
> > to use the general coredump socket to send them to.
> > 
> > I just found it more elegant to simply guarantee that the only
> > connections made to that socket come from coredumping tasks.
> > 
> > But I should note there are two ways to cleanly handle this in
> > userspace. I had already mentioned the bpf LSM in the context of
> > rate-limiting in an earlier posting:
> > 
> > (1) complex:
> > 
> >     Use a bpf LSM to intercept the connection request via
> >     security_unix_stream_connect() in unix_stream_connect().
> > 
> >     The bpf program can simply check:
> > 
> >     current->signal->core_state
> > 
> >     and reject any connection where it is NULL.
> > 
> >     The big downside is that bpf (and security) need to be enabled.
> >     Neither is guaranteed and there's quite a few users out there that
> >     don't enable bpf.

The kernel should indeed always have a minimal security policy in place.
LSMs can tailor it, but we should not assume that a specific LSM with
a specific policy is enabled and configured on the system.

> > 
> > (2) simple (and supported in this series):
> > 
> >     Userspace accepts a connection. It has to get SO_PEERPIDFD anyway.
> >     It then needs to verify:
> > 
> >     struct pidfd_info info = {
> >             .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
> >     };
> > 
> >     ioctl(pidfd, PIDFD_GET_INFO, &info);
> >     if (!(info.mask & PIDFD_INFO_COREDUMP)) {
> >             // Can't be from a coredumping task so we can close the
> >             // connection without reading.
> >             close(coredump_client_fd);
> >             return;
> >     }
> > 
> >     /* This has to be set and is only settable by do_coredump(). */
> >     if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
> >             // Can't be from a coredumping task so we can close the
> >             // connection without reading.
> >             close(coredump_client_fd);
> >             return;
> >     }
> > 
> >     // Ok, this is a connection from a task that has coredumped, let's
> >     // handle it.

What if the task sends a "fake" coredump and then really coredumps just
afterwards?  There could be a race condition on the server side when
checking the coredump property of this pidfd.

Could we add a trusted header to the coredump payload that is always
written by the kernel?  This would make it possible to read a trusted
flag indicating whether the following payload is a coredump generated
by the kernel or not.

> > 
> >     The crux is that the series guarantees that by the time the
> >     connection is made the info whether the task/thread-group did
> >     coredump is guaranteed to be available via the pidfd.
> >  
> > I think if we document that most coredump servers have to do (2) then
> > this is fine. But I wouldn't mind a nod from Jann on this.
> 
> I like this approach (2) allowing users to filter the right client.
> This way we can extend the application flexibly for another coredump
> service.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-07 11:51               ` Mickaël Salaün
@ 2025-05-07 14:22                 ` Lennart Poettering
  2025-05-07 22:10                 ` Paul Moore
  1 sibling, 0 replies; 44+ messages in thread
From: Lennart Poettering @ 2025-05-07 14:22 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Kuniyuki Iwashima, brauner, alexander, bluca, daan.j.demeyer,
	davem, david, edumazet, horms, jack, jannh, kuba, linux-fsdevel,
	linux-kernel, me, netdev, oleg, pabeni, viro, zbyszek,
	linux-security-module

On Mi, 07.05.25 13:51, Mickaël Salaün (mic@digikod.net) wrote:

> What if the task send a "fake" coredump and just after that really
> coredump?  There could be a race condition on the server side when
> checking the coredump property of this pidfd.
>
> Could we add a trusted header to the coredump payload that is always
> written by the kernel?  This would enable to read a trusted flag
> indicating if the following payload is a coredumped generated by the
> kernel or not.

With my systemd hat on I must say I don't really care if the coredump
is "authentic" or not. The coredump is untrusted data anyway, it needs
to be quarantined from systemd-coredump's PoV, there's no security
value in distinguishing if some random user's process was sent to the
handler because the user called raise(SIGSEGV) or because the user
called connect() to our socket. The process is under user control in
almost all ways anyway: they can rearrange its internals in almost
any way already and play any games they want. It's of very little value
if the receiving side can determine whether the serialization of
potentially complete garbage was written out by the kernel or by the process' own
code.

Moreover, in systemd-coredump these days we collect not only classic
ELF coredumps passed to us by the kernel, but also other forms of
crashes. For example if a Python program dies because of an uncaught
exception this is also passed to systemd-coredump, and treated the
same way in regards to metadata collection, logging, storage,
recycling and so on. Conceptually, a Python crash like that and a
process-level crash are the same thing for us; we treat them the
same. And of course, Python crashes like this are *inherently* a
userspace concept, hence we explicitly want to accept them. Hence even
if we'd be able to distinguish "true" from "fake" coredumps we'd still
not care or change our behaviour much, because we are *as* *much*
interested in user-level crashes as in kernel handled crashes.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
  2025-05-07 11:51               ` Mickaël Salaün
  2025-05-07 14:22                 ` Lennart Poettering
@ 2025-05-07 22:10                 ` Paul Moore
  1 sibling, 0 replies; 44+ messages in thread
From: Paul Moore @ 2025-05-07 22:10 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Kuniyuki Iwashima, brauner, alexander, bluca, daan.j.demeyer,
	davem, david, edumazet, horms, jack, jannh, kuba, lennart,
	linux-fsdevel, linux-kernel, me, netdev, oleg, pabeni, viro,
	zbyszek, linux-security-module

On Wed, May 7, 2025 at 7:54 AM Mickaël Salaün <mic@digikod.net> wrote:
> On Tue, May 06, 2025 at 12:18:12PM -0700, Kuniyuki Iwashima wrote:
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Tue, 6 May 2025 10:06:27 +0200
> > > On Mon, May 05, 2025 at 09:10:28PM +0200, Jann Horn wrote:
> > > > On Mon, May 5, 2025 at 8:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > From: Christian Brauner <brauner@kernel.org>
> > > > > Date: Mon, 5 May 2025 16:06:40 +0200
> > > > > > On Mon, May 05, 2025 at 03:08:07PM +0200, Jann Horn wrote:
> > > > > > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > > > > Make sure that only tasks that actually coredumped may connect to the
> > > > > > > > coredump socket. This restriction may be loosened later in case
> > > > > > > > userspace processes would like to use it to generate their own
> > > > > > > > coredumps. Though it'd be wiser if userspace just exposed a separate
> > > > > > > > socket for that.
> > > > > > >
> > > > > > > This implementation kinda feels a bit fragile to me... I wonder if we
> > > > > > > could instead have a flag inside the af_unix client socket that says
> > > > > > > "this is a special client socket for coredumping".
> > > > > >
> > > > > > Should be easily doable with a sock_flag().
> > > > >
> > > > > This restriction should be applied by BPF LSM.
> > > >
> > > > I think we shouldn't allow random userspace processes to connect to
> > > > the core dump handling service and provide bogus inputs; that
> > > > unnecessarily increases the risk that a crafted coredump can be used
> > > > to exploit a bug in the service. So I think it makes sense to enforce
> > > > this restriction in the kernel.
> > > >
> > > > My understanding is that BPF LSM creates fairly tight coupling between
> > > > userspace and the kernel implementation, and it is kind of unwieldy
> > > > for userspace. (I imagine the "man 5 core" manpage would get a bit
> > > > longer and describe more kernel implementation detail if you tried to
> > > > show how to write a BPF LSM that is capable of detecting unix domain
> > > > socket connections to a specific address that are not initiated by
> > > > core dumping.) I would like to keep it possible to implement core
> > > > userspace functionality in a best-practice way without needing eBPF.
> > > >
> > > > > It's hard to loosen such a default restriction as someone might
> > > > > argue that's unexpected and regression.
> > > >
> > > > If userspace wants to allow other processes to connect to the core
> > > > dumping service, that's easy to implement - userspace can listen on a
> > > > separate address that is not subject to these restrictions.
> > >
> > > I think Kuniyuki's point is defensible. And I did discuss this with
> > > Lennart when I wrote the patch and he didn't see a point in preventing
> > > other processes from connecting to the core dump socket. He actually
> > > would like this to be possible because there's some userspace programs
> > > out there that generate their own coredumps (Python?) and he wanted them
> > > to use the general coredump socket to send them to.

From a security perspective, it seems very reasonable to me that an
LSM would want to potentially control which processes are allowed to
bind or connect to the coredump socket.  Assuming that the socket
creation, bind(), and connect() operations go through all of the
normal LSM hooks like any other socket that shouldn't be a problem.
Some LSMs may have challenges with the lack of a filesystem path
associated with the socket, but abstract sockets are far from a new
concept and those LSMs should already have a mechanism for dealing
with such sockets.

I also want to stress that when we think about LSM controls, we need
to think in generic terms and not solely on a specific LSM, e.g. BPF.
It's fine and good to have documentation about how one might use a BPF
LSM to mitigate access to a coredump socket, but it should be made
clear in that same documentation that other LSMs may also be enforcing
access controls on that socket.  Further, and I believe everyone here
already knows this, but just to be clear, the kernel code should
definitely not assume either the presence of a specific LSM or of the
LSM framework in general.

> > > I just found it more elegant to simply guarantee that the only
> > > connections made to that socket come from coredumping tasks.
> > >
> > > But I should note there are two ways to cleanly handle this in
> > > userspace. I had already mentioned the bpf LSM in the context of
> > > rate-limiting in an earlier posting:
> > >
> > > (1) complex:
> > >
> > >     Use a bpf LSM to intercept the connection request via
> > >     security_unix_stream_connect() in unix_stream_connect().
> > >
> > >     The bpf program can simply check:
> > >
> > >     current->signal->core_state
> > >
> > >     and reject any connection where it is NULL.
> > >
> > >     The big downside is that bpf (and security) need to be enabled.
> > >     Neither is guaranteed and there's quite a few users out there that
> > >     don't enable bpf.
>
> The kernel should indeed always have a minimal security policy in place.
> LSMs can tailor it, but we should not assume that a specific LSM with
> a specific policy is enabled and configured on the system.

None of the LSM mailing lists were CC'd so I haven't seen the full
thread yet, and haven't had the chance to dig it up on lore, but at
the very least I would think there should be some basic controls
around who can bind/receive coredumps.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-05-07 22:10 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-05 11:13 [PATCH RFC v3 00/10] coredump: add coredump socket Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 01/10] coredump: massage format_corname() Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 02/10] coredump: massage do_coredump() Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 03/10] net: reserve prefix Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 04/10] coredump: add coredump socket Christian Brauner
2025-05-05 12:55   ` Jann Horn
2025-05-05 13:06     ` Luca Boccassi
2025-05-05 14:46     ` Christian Brauner
2025-05-05 18:48   ` Kuniyuki Iwashima
2025-05-06  8:24     ` Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 05/10] coredump: validate socket name as it is written Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 06/10] coredump: show supported coredump modes Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 07/10] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 08/10] net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket Christian Brauner
2025-05-05 13:08   ` Jann Horn
2025-05-05 14:06     ` Christian Brauner
2025-05-05 18:40       ` Kuniyuki Iwashima
2025-05-05 19:10         ` Jann Horn
2025-05-05 19:35           ` Kuniyuki Iwashima
2025-05-05 19:44             ` Kuniyuki Iwashima
2025-05-05 19:55               ` Jann Horn
2025-05-05 20:41                 ` Kuniyuki Iwashima
2025-05-06  7:39                 ` Christian Brauner
2025-05-06 14:51                   ` Jann Horn
2025-05-06 15:16                     ` Christian Brauner
2025-05-06 19:28                       ` Kuniyuki Iwashima
2025-05-07 11:50                     ` Mickaël Salaün
2025-05-05 19:55             ` Jann Horn
2025-05-05 20:30               ` Kuniyuki Iwashima
2025-05-06  8:06           ` Christian Brauner
2025-05-06 14:37             ` Jann Horn
2025-05-06 19:18             ` Kuniyuki Iwashima
2025-05-07 11:51               ` Mickaël Salaün
2025-05-07 14:22                 ` Lennart Poettering
2025-05-07 22:10                 ` Paul Moore
2025-05-05 11:13 ` [PATCH RFC v3 09/10] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
2025-05-05 11:13 ` [PATCH RFC v3 10/10] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
2025-05-05 14:41 ` [PATCH RFC v3 00/10] coredump: add coredump socket Mickaël Salaün
2025-05-05 14:56   ` Christian Brauner
2025-05-05 15:38     ` Mickaël Salaün
2025-05-05 14:59   ` Jann Horn
2025-05-05 15:39     ` Mickaël Salaün
2025-05-05 18:33 ` Kuniyuki Iwashima
2025-05-06  7:33   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).