linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v7 0/9] coredump: add coredump socket
@ 2025-05-14 22:03 Christian Brauner
  2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there, but processing coredumps in this way can be
considered adventurous, especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There are various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not this is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is a weird hybrid
  upcall, which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.
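
To illustrate the first consequence above: a pipe helper that starts
with only fd 0 open could defend itself roughly as follows. This is a
minimal sketch and not part of this series; the function name
ensure_std_fds() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Make sure file descriptors 0, 1 and 2 are all open.
 *
 * open() always returns the lowest free descriptor, so opening
 * /dev/null repeatedly fills any hole in 0..2. Once a descriptor
 * above 2 is returned, 0..2 are known to be occupied and the extra
 * descriptor is closed again.
 *
 * Returns 0 on success, -1 if /dev/null could not be opened.
 */
int ensure_std_fds(void)
{
	int fd;

	do {
		fd = open("/dev/null", O_RDWR);
	} while (fd >= 0 && fd <= 2);

	if (fd < 0)
		return -1;

	close(fd);
	return 0;
}
```

A helper would call this before anything that may write to stdout or
stderr, so that a later open() cannot accidentally land on fd 1 or 2.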

This series adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.
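
The three modes are selected purely by the first character of
core_pattern. A userspace tool that wants to know which mode is
configured could mirror that logic roughly like this. It is a sketch
only; the enum and function names are hypothetical, and the absolute
path requirement follows the check added to format_corename() in
patch 4.

```c
#include <stddef.h>

/* Hypothetical userspace mirror of the kernel's coredump modes. */
enum core_mode {
	CORE_MODE_FILE,
	CORE_MODE_PIPE,
	CORE_MODE_SOCK,
	CORE_MODE_INVALID,
};

/* Classify a core_pattern string by its leading character. */
enum core_mode classify_core_pattern(const char *pattern)
{
	if (!pattern)
		return CORE_MODE_INVALID;

	switch (pattern[0]) {
	case '|':
		/* A pipe needs a helper path after the '|'. */
		return pattern[1] ? CORE_MODE_PIPE : CORE_MODE_INVALID;
	case '@':
		/* The socket path must be absolute. */
		return pattern[1] == '/' ? CORE_MODE_SOCK : CORE_MODE_INVALID;
	default:
		/* Anything else is treated as a file name template. */
		return CORE_MODE_FILE;
	}
}
```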

The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.
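
On the receiving side, a coredump server needs a listening AF_UNIX
stream socket bound to the path named in core_pattern. A minimal sketch,
assuming the caller has already arranged suitable permissions on the
containing directory; the function name coredump_server_listen() is
hypothetical and error reporting is reduced to a -1 return:

```c
#include <string.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Create a listening AF_UNIX stream socket at @path.
 * Returns the listening fd, or -1 on error.
 */
int coredump_server_listen(const char *path, int backlog)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd;

	/* The path, including its NUL terminator, must fit in sun_path. */
	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;
	strcpy(addr.sun_path, path);

	/* The server should not produce coredumps itself. */
	if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (fd < 0)
		return -1;

	/* Remove a stale socket left over from a previous run. */
	unlink(path);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, backlog) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

The returned fd is what the server would pass to accept4() to receive
coredump connections.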

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- When a coredump connection is initiated, the kernel uses the socket
  cookie as the coredump cookie and stores it in the pidfd. The receiver
  can now easily authenticate that the connection is coming from the
  kernel.

  Unless the coredump server expects to handle connections from
  non-crashing tasks, it can validate that the connection has been made
  from a crashing task:

     fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
     getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);

     struct pidfd_info info = {
             .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
     };

     ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
     /* Refuse connections that aren't from a crashing task. */
     if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
             close(fd_coredump);

     /*
      * Make sure that the coredump cookie matches the connection cookie.
      * If they don't it's not the coredump connection from the kernel.
      * We'll get another connection request in a bit.
      */
     getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
     if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
             close(fd_coredump);

  The kernel guarantees that by the time the connection is made the
  coredump info is available.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX sockets directly.

- The pidfd for the crashing task will contain information about how the
  task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
  information.

  If the coredump server gets a new coredump client connection the kernel
  guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.

- The coredump server should mark itself as non-dumpable.

- A container coredump server in a separate network namespace can simply
  bind to another well-known address and systemd-coredump forwards
  coredumps to the container.

- Coredumps could in the future also be handled via per-user/session
  coredump servers that run only with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler, which runs with the user's privileges only
  (it must of course pay close attention to not forward crashing suid
  binaries).
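
One way a coredump server might turn the @coredump_mask bits listed
above into a handling decision is sketched below. This is an
illustration only: the numeric bit values are placeholders defined just
so the example compiles on its own (the real ones come from
include/uapi/linux/pidfd.h in this series), and the enum, the function
name and the policy itself are hypothetical.

```c
/* Placeholder bit values for illustration; use the uapi header instead. */
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED	(1U << 0)
#define PIDFD_COREDUMP_SKIP	(1U << 1)
#define PIDFD_COREDUMP_USER	(1U << 2)
#define PIDFD_COREDUMP_ROOT	(1U << 3)
#endif

/* Hypothetical handling decisions for an accepted connection. */
enum dump_policy {
	DUMP_REJECT,		/* not a coredump connection, close it */
	DUMP_SKIPPED,		/* task skipped coredumping, nothing to read */
	DUMP_USER,		/* regular coredump */
	DUMP_RESTRICTED,	/* sensitive, restrict access to the result */
};

enum dump_policy classify_coredump_mask(unsigned int coredump_mask)
{
	if (!(coredump_mask & PIDFD_COREDUMPED)) {
		if (coredump_mask & PIDFD_COREDUMP_SKIP)
			return DUMP_SKIPPED;
		return DUMP_REJECT;
	}

	if (coredump_mask & PIDFD_COREDUMP_ROOT)
		return DUMP_RESTRICTED;
	if (coredump_mask & PIDFD_COREDUMP_USER)
		return DUMP_USER;

	/* Unknown combination: err on the side of restricting access. */
	return DUMP_RESTRICTED;
}
```

Checking PIDFD_COREDUMP_ROOT before PIDFD_COREDUMP_USER and falling back
to the restricted case is a deliberately conservative choice for the
sketch.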

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v7:
- Use regular AF_UNIX sockets instead of abstract AF_UNIX sockets. This
  fixes the permission problems as userspace can ensure that the socket
  path cannot be rebound by arbitrary unprivileged userspace via regular
  path permissions.

  This means:
  - We don't require privilege checks on a reserved abstract AF_UNIX
    namespace
  - We don't require a fixed address for the coredump socket.
  - We don't need to use abstract unix sockets at all.
  - We don't need special socket cookie magic in the
    /proc/sys/kernel/core_pattern handler.
  - We are able to set /proc/sys/kernel/core_pattern statically without
    having any socket bound.

  That's all complaints addressed.

  Simply massage unix_find_bsd() to be able to handle this and always
  look up the coredump socket in the initial mount namespace with
  appropriate credentials. This is the same thing we do for similar
  lookups elsewhere in the kernel. Only the lookup happens this way.
  Actual connection credentials are obviously from the coredumping task.
- Link to v6: https://lore.kernel.org/20250512-work-coredump-socket-v6-0-c51bc3450727@kernel.org

Changes in v6:
- Use the socket cookie to verify the coredump server.
- Link to v5: https://lore.kernel.org/20250509-work-coredump-socket-v5-0-23c5b14df1bc@kernel.org

Changes in v5:
- Don't use a prefix, just the specific address.
- Link to v4: https://lore.kernel.org/20250507-work-coredump-socket-v4-0-af0ef317b2d0@kernel.org

Changes in v4:
- Expose the coredump socket cookie through the pidfd. This allows the
  coredump server to easily recognize coredump socket connections.
- Link to v3: https://lore.kernel.org/20250505-work-coredump-socket-v3-0-e1832f0e1eae@kernel.org

Changes in v3:
- Use an abstract unix socket.
- Add documentation.
- Add selftests.
- Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org

Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org

---
Christian Brauner (9):
      coredump: massage format_corname()
      coredump: massage do_coredump()
      coredump: reflow dump helpers a little
      coredump: add coredump socket
      pidfs, coredump: add PIDFD_INFO_COREDUMP
      coredump: show supported coredump modes
      coredump: validate socket name as it is written
      selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
      selftests/coredump: add tests for AF_UNIX coredumps

 fs/coredump.c                                     | 392 +++++++++++++----
 fs/pidfs.c                                        |  79 ++++
 include/linux/net.h                               |   1 +
 include/linux/pidfs.h                             |  10 +
 include/uapi/linux/pidfd.h                        |  22 +
 net/unix/af_unix.c                                |  60 ++-
 tools/testing/selftests/coredump/stackdump_test.c | 514 +++++++++++++++++++++-
 tools/testing/selftests/pidfd/pidfd.h             |  23 +
 8 files changed, 996 insertions(+), 105 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c



* [PATCH v7 1/9] coredump: massage format_corname()
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 13:19   ` Alexander Mikhalitsyn
                     ` (2 more replies)
  2025-05-14 22:03 ` [PATCH v7 2/9] coredump: massage do_coredump() Christian Brauner
                   ` (10 subsequent siblings)
  11 siblings, 3 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index d740a0411266..368751d98781 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -76,9 +76,15 @@ static char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 
+enum coredump_type_t {
+	COREDUMP_FILE = 1,
+	COREDUMP_PIPE = 2,
+};
+
 struct core_name {
 	char *corename;
 	int used, size;
+	enum coredump_type_t core_type;
 };
 
 static int expand_corename(struct core_name *cn, int size)
@@ -218,18 +224,21 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 {
 	const struct cred *cred = current_cred();
 	const char *pat_ptr = core_pattern;
-	int ispipe = (*pat_ptr == '|');
 	bool was_space = false;
 	int pid_in_pattern = 0;
 	int err = 0;
 
 	cn->used = 0;
 	cn->corename = NULL;
+	if (*pat_ptr == '|')
+		cn->core_type = COREDUMP_PIPE;
+	else
+		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (ispipe) {
+	if (cn->core_type == COREDUMP_PIPE) {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,7 +256,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		 * Split on spaces before doing template expansion so that
 		 * %e and %E don't get split if they have spaces in them
 		 */
-		if (ispipe) {
+		if (cn->core_type == COREDUMP_PIPE) {
 			if (isspace(*pat_ptr)) {
 				if (cn->used != 0)
 					was_space = true;
@@ -353,7 +362,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 				 * Installing a pidfd only makes sense if
 				 * we actually spawn a usermode helper.
 				 */
-				if (!ispipe)
+				if (cn->core_type != COREDUMP_PIPE)
 					break;
 
 				/*
@@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	 * If core_pattern does not include a %p (as is the default)
 	 * and core_uses_pid is set, then .%pid will be appended to
 	 * the filename. Do not do this for piped commands. */
-	if (!ispipe && !pid_in_pattern && core_uses_pid) {
+	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
 		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
 		if (err)
 			return err;
 	}
-	return ispipe;
+	return 0;
 }
 
 static int zap_process(struct signal_struct *signal, int exit_code)
@@ -583,7 +592,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	const struct cred *old_cred;
 	struct cred *cred;
 	int retval = 0;
-	int ispipe;
 	size_t *argv = NULL;
 	int argc = 0;
 	/* require nonrelative corefile path and be extra careful */
@@ -632,19 +640,18 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm, &argv, &argc);
+	retval = format_corename(&cn, &cprm, &argv, &argc);
+	if (retval < 0) {
+		coredump_report_failure("format_corename failed, aborting core");
+		goto fail_unlock;
+	}
 
-	if (ispipe) {
+	if (cn.core_type == COREDUMP_PIPE) {
 		int argi;
 		int dump_count;
 		char **helper_argv;
 		struct subprocess_info *sub_info;
 
-		if (ispipe < 0) {
-			coredump_report_failure("format_corename failed, aborting core");
-			goto fail_unlock;
-		}
-
 		if (cprm.limit == 1) {
 			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
 			 *
@@ -695,7 +702,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			coredump_report_failure("|%s pipe failed", cn.corename);
 			goto close_fail;
 		}
-	} else {
+	} else if (cn.core_type == COREDUMP_FILE) {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -823,13 +830,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if (ispipe && core_pipe_limit)
+	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
 		wait_for_dump_helpers(cprm.file);
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
 fail_dropcount:
-	if (ispipe)
+	if (cn.core_type == COREDUMP_PIPE)
 		atomic_dec(&core_dump_count);
 fail_unlock:
 	kfree(argv);

-- 
2.47.2



* [PATCH v7 2/9] coredump: massage do_coredump()
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
  2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 13:21   ` Alexander Mikhalitsyn
  2025-05-15 20:52   ` Jann Horn
  2025-05-14 22:03 ` [PATCH v7 3/9] coredump: reflow dump helpers a little Christian Brauner
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 122 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 57 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 368751d98781..0e97c21b35e3 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -646,63 +646,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		goto fail_unlock;
 	}
 
-	if (cn.core_type == COREDUMP_PIPE) {
-		int argi;
-		int dump_count;
-		char **helper_argv;
-		struct subprocess_info *sub_info;
-
-		if (cprm.limit == 1) {
-			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
-			 *
-			 * Normally core limits are irrelevant to pipes, since
-			 * we're not writing to the file system, but we use
-			 * cprm.limit of 1 here as a special value, this is a
-			 * consistent way to catch recursive crashes.
-			 * We can still crash if the core_pattern binary sets
-			 * RLIM_CORE = !1, but it runs as root, and can do
-			 * lots of stupid things.
-			 *
-			 * Note that we use task_tgid_vnr here to grab the pid
-			 * of the process group leader.  That way we get the
-			 * right pid if a thread in a multi-threaded
-			 * core_pattern process dies.
-			 */
-			coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
-			goto fail_unlock;
-		}
-		cprm.limit = RLIM_INFINITY;
-
-		dump_count = atomic_inc_return(&core_dump_count);
-		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
-			coredump_report_failure("over core_pipe_limit, skipping core dump");
-			goto fail_dropcount;
-		}
-
-		helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
-					    GFP_KERNEL);
-		if (!helper_argv) {
-			coredump_report_failure("%s failed to allocate memory", __func__);
-			goto fail_dropcount;
-		}
-		for (argi = 0; argi < argc; argi++)
-			helper_argv[argi] = cn.corename + argv[argi];
-		helper_argv[argi] = NULL;
-
-		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_coredump_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
-
-		kfree(helper_argv);
-		if (retval) {
-			coredump_report_failure("|%s pipe failed", cn.corename);
-			goto close_fail;
-		}
-	} else if (cn.core_type == COREDUMP_FILE) {
+	switch (cn.core_type) {
+	case COREDUMP_FILE: {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
 		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -796,6 +741,69 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		if (do_truncate(idmap, cprm.file->f_path.dentry,
 				0, 0, cprm.file))
 			goto close_fail;
+		break;
+	}
+	case COREDUMP_PIPE: {
+		int argi;
+		int dump_count;
+		char **helper_argv;
+		struct subprocess_info *sub_info;
+
+		if (cprm.limit == 1) {
+			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
+			 *
+			 * Normally core limits are irrelevant to pipes, since
+			 * we're not writing to the file system, but we use
+			 * cprm.limit of 1 here as a special value, this is a
+			 * consistent way to catch recursive crashes.
+			 * We can still crash if the core_pattern binary sets
+			 * RLIM_CORE = !1, but it runs as root, and can do
+			 * lots of stupid things.
+			 *
+			 * Note that we use task_tgid_vnr here to grab the pid
+			 * of the process group leader.  That way we get the
+			 * right pid if a thread in a multi-threaded
+			 * core_pattern process dies.
+			 */
+			coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
+			goto fail_unlock;
+		}
+		cprm.limit = RLIM_INFINITY;
+
+		dump_count = atomic_inc_return(&core_dump_count);
+		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+			coredump_report_failure("over core_pipe_limit, skipping core dump");
+			goto fail_dropcount;
+		}
+
+		helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
+					    GFP_KERNEL);
+		if (!helper_argv) {
+			coredump_report_failure("%s failed to allocate memory", __func__);
+			goto fail_dropcount;
+		}
+		for (argi = 0; argi < argc; argi++)
+			helper_argv[argi] = cn.corename + argv[argi];
+		helper_argv[argi] = NULL;
+
+		retval = -ENOMEM;
+		sub_info = call_usermodehelper_setup(helper_argv[0],
+						helper_argv, NULL, GFP_KERNEL,
+						umh_coredump_setup, NULL, &cprm);
+		if (sub_info)
+			retval = call_usermodehelper_exec(sub_info,
+							  UMH_WAIT_EXEC);
+
+		kfree(helper_argv);
+		if (retval) {
+			coredump_report_failure("|%s pipe failed", cn.corename);
+			goto close_fail;
+		}
+		break;
+	}
+	default:
+		WARN_ON_ONCE(true);
+		goto close_fail;
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */

-- 
2.47.2



* [PATCH v7 3/9] coredump: reflow dump helpers a little
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
  2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
  2025-05-14 22:03 ` [PATCH v7 2/9] coredump: massage do_coredump() Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 13:22   ` Alexander Mikhalitsyn
  2025-05-15 20:53   ` Jann Horn
  2025-05-14 22:03 ` [PATCH v7 4/9] coredump: add coredump socket Christian Brauner
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

They look rather messy right now.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 0e97c21b35e3..a70929c3585b 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -867,10 +867,9 @@ static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	struct file *file = cprm->file;
 	loff_t pos = file->f_pos;
 	ssize_t n;
+
 	if (cprm->written + nr > cprm->limit)
 		return 0;
-
-
 	if (dump_interrupted())
 		return 0;
 	n = __kernel_write(file, addr, nr, &pos);
@@ -887,20 +886,21 @@ static int __dump_skip(struct coredump_params *cprm, size_t nr)
 {
 	static char zeroes[PAGE_SIZE];
 	struct file *file = cprm->file;
+
 	if (file->f_mode & FMODE_LSEEK) {
-		if (dump_interrupted() ||
-		    vfs_llseek(file, nr, SEEK_CUR) < 0)
+		if (dump_interrupted() || vfs_llseek(file, nr, SEEK_CUR) < 0)
 			return 0;
 		cprm->pos += nr;
 		return 1;
-	} else {
-		while (nr > PAGE_SIZE) {
-			if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
-				return 0;
-			nr -= PAGE_SIZE;
-		}
-		return __dump_emit(cprm, zeroes, nr);
 	}
+
+	while (nr > PAGE_SIZE) {
+		if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
+			return 0;
+		nr -= PAGE_SIZE;
+	}
+
+	return __dump_emit(cprm, zeroes, nr);
 }
 
 int dump_emit(struct coredump_params *cprm, const void *addr, int nr)

-- 
2.47.2



* [PATCH v7 4/9] coredump: add coredump socket
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (2 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 3/9] coredump: reflow dump helpers a little Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 13:47   ` Alexander Mikhalitsyn
                     ` (2 more replies)
  2025-05-14 22:03 ` [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there, but processing coredumps in this way can be
considered adventurous, especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There are various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not this is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is a weird hybrid
  upcall, which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.

The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.

- The coredump server uses SO_PEERPIDFD to get a stable handle on the
  connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX sockets directly.

- The pidfd for the crashing task will grow new information about how
  the task coredumps.

- The coredump server should mark itself as non-dumpable.

- A container coredump server in a separate network namespace can simply
  bind to another well-known address and systemd-coredump forwards
  coredumps to the container.

- Coredumps could in the future also be handled via per-user/session
  coredump servers that run only with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler, which runs with the user's privileges only
  (it must of course pay close attention to not forward crashing suid
  binaries).

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers
that have caused and continue to cause significant CVEs.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can e.g., just keep a worker pool.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c       | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/net.h |   1 +
 net/unix/af_unix.c  |  53 ++++++++++++++++-----
 3 files changed, 166 insertions(+), 21 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a70929c3585b..e1256ebb89c1 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -44,7 +44,11 @@
 #include <linux/sysctl.h>
 #include <linux/elf.h>
 #include <linux/pidfs.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+#include <net/net_namespace.h>
 #include <uapi/linux/pidfd.h>
+#include <uapi/linux/un.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 enum coredump_type_t {
 	COREDUMP_FILE = 1,
 	COREDUMP_PIPE = 2,
+	COREDUMP_SOCK = 3,
 };
 
 struct core_name {
@@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	cn->corename = NULL;
 	if (*pat_ptr == '|')
 		cn->core_type = COREDUMP_PIPE;
+	else if (*pat_ptr == '@')
+		cn->core_type = COREDUMP_SOCK;
 	else
 		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (cn->core_type == COREDUMP_PIPE) {
+	switch (cn->core_type) {
+	case COREDUMP_PIPE: {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,6 +255,33 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		++pat_ptr;
 		if (!(*pat_ptr))
 			return -ENOMEM;
+		break;
+	}
+	case COREDUMP_SOCK: {
+		/* skip the @ */
+		pat_ptr++;
+		err = cn_printf(cn, "%s", pat_ptr);
+		if (err)
+			return err;
+
+		/* Require absolute paths. */
+		if (cn->corename[0] != '/')
+			return -EINVAL;
+
+		/*
+		 * Currently no need to parse any other options.
+		 * Relevant information can be retrieved from the peer
+		 * pidfd retrievable via SO_PEERPIDFD by the receiver or
+		 * via /proc/<pid>, using the SO_PEERPIDFD to guard
+		 * against pid recycling when opening /proc/<pid>.
+		 */
+		return 0;
+	}
+	case COREDUMP_FILE:
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		return -EINVAL;
 	}
 
 	/* Repeat as long as we have more pattern to process and more output
@@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	 * If core_pattern does not include a %p (as is the default)
 	 * and core_uses_pid is set, then .%pid will be appended to
 	 * the filename. Do not do this for piped commands. */
-	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
-		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
-		if (err)
-			return err;
+	if (!pid_in_pattern && core_uses_pid) {
+		switch (cn->core_type) {
+		case COREDUMP_FILE:
+			return cn_printf(cn, ".%d", task_tgid_vnr(current));
+		case COREDUMP_PIPE:
+			break;
+		case COREDUMP_SOCK:
+			break;
+		default:
+			WARN_ON_ONCE(true);
+			return -EINVAL;
+		}
 	}
+
 	return 0;
 }
 
@@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		}
 		break;
 	}
+	case COREDUMP_SOCK: {
+#ifdef CONFIG_UNIX
+		struct file *file __free(fput) = NULL;
+		struct sockaddr_un addr = {
+			.sun_family = AF_UNIX,
+		};
+		ssize_t addr_len;
+		struct socket *socket;
+
+		retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));
+		if (retval < 0)
+			goto close_fail;
+		addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
+
+		/*
+		 * It is possible that the userspace process which is
+		 * supposed to handle the coredump and is listening on
+		 * the AF_UNIX socket coredumps. Userspace should just
+		 * mark itself non dumpable.
+		 */
+
+		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
+		if (retval < 0)
+			goto close_fail;
+
+		file = sock_alloc_file(socket, 0, NULL);
+		if (IS_ERR(file)) {
+			sock_release(socket);
+			goto close_fail;
+		}
+
+		retval = kernel_connect(socket, (struct sockaddr *)(&addr),
+					addr_len, O_NONBLOCK | SOCK_COREDUMP);
+		if (retval) {
+			if (retval == -EAGAIN)
+				coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
+			else
+				coredump_report_failure("Coredump socket connection %s failed %d", addr.sun_path, retval);
+			goto close_fail;
+		}
+
+		cprm.limit = RLIM_INFINITY;
+		cprm.file = no_free_ptr(file);
+#else
+		coredump_report_failure("Core dump socket support %s disabled", cn.corename);
+		goto close_fail;
+#endif
+		break;
+	}
 	default:
 		WARN_ON_ONCE(true);
 		goto close_fail;
@@ -838,8 +931,32 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
-		wait_for_dump_helpers(cprm.file);
+
+	/*
+	 * When core_pipe_limit is set we wait for the coredump server
+	 * or usermodehelper to finish before exiting so it can e.g.,
+	 * inspect /proc/<pid>.
+	 */
+	if (core_pipe_limit) {
+		switch (cn.core_type) {
+		case COREDUMP_PIPE:
+			wait_for_dump_helpers(cprm.file);
+			break;
+		case COREDUMP_SOCK: {
+			/*
+			 * We use a simple read to wait for the coredump
+			 * processing to finish. Either the socket is
+			 * closed or we get sent unexpected data. In
+			 * both cases, we're done.
+			 */
+			__kernel_read(cprm.file, &(char){ 0 }, 1, NULL);
+			break;
+		}
+		default:
+			break;
+		}
+	}
+
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
@@ -1069,7 +1186,7 @@ EXPORT_SYMBOL(dump_align);
 void validate_coredump_safety(void)
 {
 	if (suid_dumpable == SUID_DUMP_ROOT &&
-	    core_pattern[0] != '/' && core_pattern[0] != '|') {
+	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
 
 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
 			"pipe handler or fully qualified core dump path required. "
diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..139c85d0f2ea 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -81,6 +81,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
+#define SOCK_COREDUMP	O_NOCTTY
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 472f8aa9ea15..a9d1c9ba2961 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -85,10 +85,13 @@
 #include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/fs.h>
+#include <linux/fs_struct.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
+#include <linux/net.h>
+#include <linux/pidfs.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
 #include <linux/sched/signal.h>
@@ -100,7 +103,6 @@
 #include <linux/splice.h>
 #include <linux/string.h>
 #include <linux/uaccess.h>
-#include <linux/pidfs.h>
 #include <net/af_unix.h>
 #include <net/net_namespace.h>
 #include <net/scm.h>
@@ -1146,7 +1148,7 @@ static int unix_release(struct socket *sock)
 }
 
 static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
-				  int type)
+				  int type, unsigned int flags)
 {
 	struct inode *inode;
 	struct path path;
@@ -1154,13 +1156,38 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 	int err;
 
 	unix_mkname_bsd(sunaddr, addr_len);
-	err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
-	if (err)
-		goto fail;
 
-	err = path_permission(&path, MAY_WRITE);
-	if (err)
-		goto path_put;
+	if (flags & SOCK_COREDUMP) {
+		struct path root;
+		struct cred *kcred;
+		const struct cred *cred;
+
+		err = -ENOMEM;
+		kcred = prepare_kernel_cred(&init_task);
+		if (!kcred)
+			goto fail;
+
+		task_lock(&init_task);
+		get_fs_root(init_task.fs, &root);
+		task_unlock(&init_task);
+
+		cred = override_creds(kcred);
+		err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
+				      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
+				      LOOKUP_NO_MAGICLINKS, &path);
+		put_cred(revert_creds(cred));
+		path_put(&root);
+		if (err)
+			goto fail;
+	} else {
+		err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
+		if (err)
+			goto fail;
+
+		err = path_permission(&path, MAY_WRITE);
+		if (err)
+			goto path_put;
+	}
 
 	err = -ECONNREFUSED;
 	inode = d_backing_inode(path.dentry);
@@ -1210,12 +1237,12 @@ static struct sock *unix_find_abstract(struct net *net,
 
 static struct sock *unix_find_other(struct net *net,
 				    struct sockaddr_un *sunaddr,
-				    int addr_len, int type)
+				    int addr_len, int type, int flags)
 {
 	struct sock *sk;
 
 	if (sunaddr->sun_path[0])
-		sk = unix_find_bsd(sunaddr, addr_len, type);
+		sk = unix_find_bsd(sunaddr, addr_len, type, flags);
 	else
 		sk = unix_find_abstract(net, sunaddr, addr_len, type);
 
@@ -1473,7 +1500,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
 		}
 
 restart:
-		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type);
+		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
 		if (IS_ERR(other)) {
 			err = PTR_ERR(other);
 			goto out;
@@ -1620,7 +1647,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 
 restart:
 	/*  Find listening sock. */
-	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type);
+	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
 	if (IS_ERR(other)) {
 		err = PTR_ERR(other);
 		goto out_free_skb;
@@ -2089,7 +2116,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	if (msg->msg_namelen) {
 lookup:
 		other = unix_find_other(sock_net(sk), msg->msg_name,
-					msg->msg_namelen, sk->sk_type);
+					msg->msg_namelen, sk->sk_type, 0);
 		if (IS_ERR(other)) {
 			err = PTR_ERR(other);
 			goto out_free;

-- 
2.47.2



* [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (3 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 4/9] coredump: add coredump socket Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 14:08   ` Alexander Mikhalitsyn
  2025-05-15 20:56   ` Jann Horn
  2025-05-14 22:03 ` [PATCH v7 6/9] coredump: show supported coredump modes Christian Brauner
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Extend the PIDFD_GET_INFO ioctl() with the new PIDFD_INFO_COREDUMP
mask flag. This adds the fields @coredump_mask and @coredump_cookie to
struct pidfd_info.

When a task coredumps the kernel will provide the following information
to userspace in @coredump_mask:

* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
  undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
  doesn't need special care by the coredump server.
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
  treated as sensitive and the coredump server should restrict access
  to the generated coredump to sufficiently privileged users.

If userspace uses the coredump socket to process coredumps it needs to
be able to distinguish connections made by the kernel from connections
made by userspace (e.g., Python generating its own coredumps and
forwarding them to systemd). The @coredump_cookie extension uses the
SO_COOKIE of the new connection. This allows userspace to validate that
the connection has been made from the kernel by a crashing task:

   fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
   getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);

   struct pidfd_info info = {
           .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
   };

   ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
   /* Refuse connections that aren't from a crashing task. */
   if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
           close(fd_coredump);

   /*
    * Make sure that the coredump cookie matches the connection cookie.
    * If they don't it's not the coredump connection from the kernel.
    * We'll get another connection request in a bit.
    */
   getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
   if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
           close(fd_coredump);

The kernel guarantees that by the time the connection is made all
PIDFD_INFO_COREDUMP info is available.
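
As a rough illustration of how a coredump server might act on
@coredump_mask, the sketch below maps the four flags described above to
a handling decision. The flag values are copied from the uapi header
added by this patch. The enum coredump_action and the function
classify_coredump() are hypothetical names used only for this example;
they are not part of the kernel or of any existing coredump server.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pidfd.h by this patch. */
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED	(1U << 0)
#define PIDFD_COREDUMP_SKIP	(1U << 1)
#define PIDFD_COREDUMP_USER	(1U << 2)
#define PIDFD_COREDUMP_ROOT	(1U << 3)
#endif

/* Hypothetical policy outcomes for a coredump server. */
enum coredump_action {
	COREDUMP_ACTION_REJECT,		/* not a coredump we should store */
	COREDUMP_ACTION_STORE_USER,	/* regular coredump */
	COREDUMP_ACTION_STORE_ROOT,	/* sensitive, privileged access only */
};

/* Hypothetical helper: decide what to do based on @coredump_mask. */
static enum coredump_action classify_coredump(uint32_t coredump_mask)
{
	/* The peer did not coredump at all, so refuse the connection. */
	if (!(coredump_mask & PIDFD_COREDUMPED))
		return COREDUMP_ACTION_REJECT;

	/* Coredump generation was skipped, so there is nothing to store. */
	if (coredump_mask & PIDFD_COREDUMP_SKIP)
		return COREDUMP_ACTION_REJECT;

	/* Check the sensitive case first so it can never be downgraded. */
	if (coredump_mask & PIDFD_COREDUMP_ROOT)
		return COREDUMP_ACTION_STORE_ROOT;

	if (coredump_mask & PIDFD_COREDUMP_USER)
		return COREDUMP_ACTION_STORE_USER;

	return COREDUMP_ACTION_REJECT;
}
```

A server would call classify_coredump(info.coredump_mask) after a
successful PIDFD_GET_INFO and only after confirming that
PIDFD_INFO_COREDUMP is set in info.mask.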

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c              | 34 ++++++++++++++++++++
 fs/pidfs.c                 | 79 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pidfs.h      | 10 ++++++
 include/uapi/linux/pidfd.h | 22 +++++++++++++
 net/unix/af_unix.c         |  7 ++++
 5 files changed, 152 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e1256ebb89c1..bfc4a32f737c 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -46,7 +46,9 @@
 #include <linux/pidfs.h>
 #include <linux/net.h>
 #include <linux/socket.h>
+#include <net/af_unix.h>
 #include <net/net_namespace.h>
+#include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 #include <uapi/linux/un.h>
 
@@ -598,6 +600,8 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
 		if (IS_ERR(pidfs_file))
 			return PTR_ERR(pidfs_file);
 
+		pidfs_coredump(cp);
+
 		/*
 		 * Usermode helpers are childen of either
 		 * system_unbound_wq or of kthreadd. So we know that
@@ -876,8 +880,34 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		}
 
+		/*
+		 * Set the thread-group leader pid which is used for the
+		 * peer credentials during connect() below. Then
+		 * immediately register it in pidfs...
+		 */
+		cprm.pid = task_tgid(current);
+		retval = pidfs_register_pid(cprm.pid);
+		if (retval) {
+			sock_release(socket);
+			goto close_fail;
+		}
+
+		/*
+		 * ... and set the coredump information so userspace
+		 * has it available after connect()...
+		 */
+		pidfs_coredump(&cprm);
+
+		/*
+		 * ... On connect() the peer credentials are recorded
+		 * and @cprm.pid registered in pidfs...
+		 */
 		retval = kernel_connect(socket, (struct sockaddr *)(&addr),
 					addr_len, O_NONBLOCK | SOCK_COREDUMP);
+
+		/* ... So we can safely put our pidfs reference now... */
+		pidfs_put_pid(cprm.pid);
+
 		if (retval) {
 			if (retval == -EAGAIN)
 				coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
@@ -886,6 +916,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		}
 
+		/* ... and validate that @sk_peer_pid matches @cprm.pid. */
+		if (WARN_ON_ONCE(unix_peer(socket->sk)->sk_peer_pid != cprm.pid))
+			goto close_fail;
+
 		cprm.limit = RLIM_INFINITY;
 		cprm.file = no_free_ptr(file);
 #else
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 3b39e471840b..d7b9a0dd2db6 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -20,6 +20,7 @@
 #include <linux/time_namespace.h>
 #include <linux/utsname.h>
 #include <net/net_namespace.h>
+#include <linux/coredump.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -33,6 +34,8 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
 struct pidfs_exit_info {
 	__u64 cgroupid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u64 coredump_cookie;
 };
 
 struct pidfs_inode {
@@ -240,6 +243,22 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
 	return false;
 }
 
+static __u32 pidfs_coredump_mask(unsigned long mm_flags)
+{
+	switch (__get_dumpable(mm_flags)) {
+	case SUID_DUMP_USER:
+		return PIDFD_COREDUMP_USER;
+	case SUID_DUMP_ROOT:
+		return PIDFD_COREDUMP_ROOT;
+	case SUID_DUMP_DISABLE:
+		return PIDFD_COREDUMP_SKIP;
+	default:
+		WARN_ON_ONCE(true);
+	}
+
+	return 0;
+}
+
 static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
@@ -280,6 +299,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 		}
 	}
 
+	if (mask & PIDFD_INFO_COREDUMP) {
+		kinfo.mask |= PIDFD_INFO_COREDUMP;
+		smp_rmb();
+		kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
+		kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
+	}
+
 	task = get_pid_task(pid, PIDTYPE_PID);
 	if (!task) {
 		/*
@@ -296,6 +322,16 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 	if (!c)
 		return -ESRCH;
 
+	if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
+		task_lock(task);
+		if (task->mm) {
+			smp_rmb();
+			kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
+			kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
+		}
+		task_unlock(task);
+	}
+
 	/* Unconditionally return identifiers and credentials, the rest only on request */
 
 	user_ns = current_user_ns();
@@ -559,6 +595,49 @@ void pidfs_exit(struct task_struct *tsk)
 	}
 }
 
+#if defined(CONFIG_COREDUMP) && defined(CONFIG_UNIX)
+void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie)
+{
+	struct pidfs_exit_info *exit_info;
+	struct dentry *dentry = pid->stashed;
+	struct inode *inode;
+
+	if (WARN_ON_ONCE(!dentry))
+		return;
+
+	inode = d_inode(dentry);
+	exit_info = &pidfs_i(inode)->__pei;
+	/* Can't use smp_store_release() because of 32bit. */
+	smp_wmb();
+	WRITE_ONCE(exit_info->coredump_cookie, coredump_cookie);
+}
+#endif
+
+#ifdef CONFIG_COREDUMP
+void pidfs_coredump(const struct coredump_params *cprm)
+{
+	struct pid *pid = cprm->pid;
+	struct pidfs_exit_info *exit_info;
+	struct dentry *dentry;
+	struct inode *inode;
+	__u32 coredump_mask = 0;
+
+	dentry = pid->stashed;
+	if (WARN_ON_ONCE(!dentry))
+		return;
+
+	inode = d_inode(dentry);
+	exit_info = &pidfs_i(inode)->__pei;
+	/* Note how we were coredumped. */
+	coredump_mask = pidfs_coredump_mask(cprm->mm_flags);
+	/* Note that we actually did coredump. */
+	coredump_mask |= PIDFD_COREDUMPED;
+	/* If coredumping is set to skip we should never end up here. */
+	VFS_WARN_ON_ONCE(coredump_mask & PIDFD_COREDUMP_SKIP);
+	smp_store_release(&exit_info->coredump_mask, coredump_mask);
+}
+#endif
+
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
 /*
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 2676890c4d0d..497997bc5e34 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -2,11 +2,21 @@
 #ifndef _LINUX_PID_FS_H
 #define _LINUX_PID_FS_H
 
+struct coredump_params;
+
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
 void pidfs_add_pid(struct pid *pid);
 void pidfs_remove_pid(struct pid *pid);
 void pidfs_exit(struct task_struct *tsk);
+#ifdef CONFIG_COREDUMP
+void pidfs_coredump(const struct coredump_params *cprm);
+#endif
+#if defined(CONFIG_COREDUMP) && defined(CONFIG_UNIX)
+void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie);
+#elif defined(CONFIG_UNIX)
+static inline void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie) { }
+#endif
 extern const struct dentry_operations pidfs_dentry_operations;
 int pidfs_register_pid(struct pid *pid);
 void pidfs_get_pid(struct pid *pid);
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 8c1511edd0e9..69267c5ae6d0 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -25,9 +25,28 @@
 #define PIDFD_INFO_CREDS		(1UL << 1) /* Always returned, even if not requested */
 #define PIDFD_INFO_CGROUPID		(1UL << 2) /* Always returned if available, even if not requested */
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Only returned if requested. */
+#define PIDFD_INFO_COREDUMP		(1UL << 4) /* Only returned if requested. */
 
 #define PIDFD_INFO_SIZE_VER0		64 /* sizeof first published struct */
 
+/*
+ * Values for @coredump_mask in pidfd_info.
+ * Only valid if PIDFD_INFO_COREDUMP is set in @mask.
+ *
+ * Note, the @PIDFD_COREDUMP_ROOT flag indicates that the generated
+ * coredump should be treated as sensitive and access should only be
+ * granted to privileged users.
+ *
+ * If the coredump AF_UNIX socket is used for processing coredumps
+ * @coredump_cookie will be set to the socket SO_COOKIE of the receivers
+ * client socket. This allows the coredump handler to detect whether an
+ * incoming coredump connection was initiated from the crashing task.
+ */
+#define PIDFD_COREDUMPED	(1U << 0) /* Did crash and... */
+#define PIDFD_COREDUMP_SKIP	(1U << 1) /* coredumping generation was skipped. */
+#define PIDFD_COREDUMP_USER	(1U << 2) /* coredump was done as the user. */
+#define PIDFD_COREDUMP_ROOT	(1U << 3) /* coredump was done as root. */
+
 /*
  * The concept of process and threads in userland and the kernel is a confusing
  * one - within the kernel every thread is a 'task' with its own individual PID,
@@ -92,6 +111,9 @@ struct pidfd_info {
 	__u32 fsuid;
 	__u32 fsgid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u32 __spare1;
+	__u64 coredump_cookie;
 };
 
 #define PIDFS_IOCTL_MAGIC 0xFF
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a9d1c9ba2961..053d2e48e918 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -99,6 +99,7 @@
 #include <linux/seq_file.h>
 #include <linux/skbuff.h>
 #include <linux/slab.h>
+#include <linux/sock_diag.h>
 #include <linux/socket.h>
 #include <linux/splice.h>
 #include <linux/string.h>
@@ -742,6 +743,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
 
 struct unix_peercred {
 	struct pid *peer_pid;
+	u64 cookie;
 	const struct cred *peer_cred;
 };
 
@@ -777,6 +779,8 @@ static void drop_peercred(struct unix_peercred *peercred)
 static inline void init_peercred(struct sock *sk,
 				 const struct unix_peercred *peercred)
 {
+	if (peercred->cookie)
+		pidfs_coredump_cookie(peercred->peer_pid, peercred->cookie);
 	sk->sk_peer_pid = peercred->peer_pid;
 	sk->sk_peer_cred = peercred->peer_cred;
 }
@@ -1713,6 +1717,9 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	unix_peer(newsk)	= sk;
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
+	/* Prepare a new socket cookie for the receiver. */
+	if (flags & SOCK_COREDUMP)
+		peercred.cookie = sock_gen_cookie(newsk);
 	init_peercred(newsk, &peercred);
 	newu = unix_sk(newsk);
 	newu->listener = other;

-- 
2.47.2



* [PATCH v7 6/9] coredump: show supported coredump modes
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (4 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 13:56   ` Alexander Mikhalitsyn
  2025-05-15 20:56   ` Jann Horn
  2025-05-14 22:03 ` [PATCH v7 7/9] coredump: validate socket name as it is written Christian Brauner
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Allow userspace to discover what coredump modes are supported.
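
For illustration, userspace could check for socket support roughly as
sketched below. The patch exposes the modes as a newline-separated
string ("file\npipe" plus "\nsocket" when CONFIG_UNIX is enabled); the
sketch assumes the caller has already read that string from the new
core_modes sysctl, presumably found next to core_pattern. The function
coredump_mode_supported() is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if @mode appears as a complete line
 * in the newline-separated list @modes.
 */
static bool coredump_mode_supported(const char *modes, const char *mode)
{
	size_t mode_len;

	assert(modes && mode);
	mode_len = strlen(mode);

	while (*modes) {
		/* Length of the current line, excluding the newline. */
		size_t line_len = strcspn(modes, "\n");

		if (line_len == mode_len && memcmp(modes, mode, mode_len) == 0)
			return true;

		modes += line_len;
		if (*modes == '\n')
			modes++;
	}

	return false;
}
```

Matching whole lines rather than substrings keeps a query for "sock"
from matching "socket".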

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index bfc4a32f737c..6ee38e3da108 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1240,6 +1240,12 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
 
 static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
 static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
+static char core_modes[] = {
+	"file\npipe"
+#ifdef CONFIG_UNIX
+	"\nsocket"
+#endif
+};
 
 static const struct ctl_table coredump_sysctls[] = {
 	{
@@ -1283,6 +1289,13 @@ static const struct ctl_table coredump_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "core_modes",
+		.data		= core_modes,
+		.maxlen		= sizeof(core_modes) - 1,
+		.mode		= 0444,
+		.proc_handler	= proc_dostring,
+	},
 };
 
 static int __init init_fs_coredump_sysctls(void)

-- 
2.47.2



* [PATCH v7 7/9] coredump: validate socket name as it is written
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (5 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 6/9] coredump: show supported coredump modes Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 14:03   ` Alexander Mikhalitsyn
  2025-05-15 20:56   ` Jann Horn
  2025-05-14 22:03 ` [PATCH v7 8/9] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Writes of other values into /proc/sys/kernel/core_pattern never fail.
In contrast, we can validate a write that enables the new AF_UNIX
support and reject it if it is invalid. This is obviously racy as hell
but it's always been that way.
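
A userspace tool that writes core_pattern could mirror the syntactic
part of this check before attempting the write. The sketch below
assumes only what the patch enforces on the string itself: a pattern
starting with '@' must be followed by an absolute path. The function
name core_pattern_socket_ok() is hypothetical. The kernel additionally
requires the writer to be in the initial mount namespace, which this
sketch does not attempt to check.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return false only for socket patterns ('@')
 * that are not followed by an absolute path. Other pattern types are
 * accepted unchanged.
 */
static bool core_pattern_socket_ok(const char *pattern)
{
	assert(pattern);

	/* Not a socket pattern, nothing to validate here. */
	if (pattern[0] != '@')
		return true;

	/* Must be an absolute path; this also rejects a bare "@". */
	return pattern[1] == '/';
}
```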

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 6ee38e3da108..d4ff08ef03e5 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1228,13 +1228,44 @@ void validate_coredump_safety(void)
 	}
 }
 
+static inline bool check_coredump_socket(void)
+{
+	if (core_pattern[0] != '@')
+		return true;
+
+	/*
+	 * Coredump socket must be located in the initial mount
+	 * namespace. Don't give the impression that anything else is
+	 * supported right now.
+	 */
+	if (current->nsproxy->mnt_ns != init_task.nsproxy->mnt_ns)
+		return false;
+
+	/* Must be an absolute path. */
+	if (*(core_pattern + 1) != '/')
+		return false;
+
+	return true;
+}
+
 static int proc_dostring_coredump(const struct ctl_table *table, int write,
 		  void *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
+	int error;
+	ssize_t retval;
+	char old_core_pattern[CORENAME_MAX_SIZE];
+
+	retval = strscpy(old_core_pattern, core_pattern, CORENAME_MAX_SIZE);
+
+	error = proc_dostring(table, write, buffer, lenp, ppos);
+	if (error)
+		return error;
+	if (!check_coredump_socket()) {
+		strscpy(core_pattern, old_core_pattern, retval + 1);
+		return -EINVAL;
+	}
 
-	if (!error)
-		validate_coredump_safety();
+	validate_coredump_safety();
 	return error;
 }
 

-- 
2.47.2



* [PATCH v7 8/9] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (6 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 7/9] coredump: validate socket name as it is written Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 14:35   ` Alexander Mikhalitsyn
  2025-05-14 22:03 ` [PATCH v7 9/9] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Add PIDFD_INFO_COREDUMP infrastructure so we can use it in tests.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 55bcf81a2b9a..887c74007086 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -131,6 +131,26 @@
 #define PIDFD_INFO_EXIT			(1UL << 3) /* Always returned if available, even if not requested */
 #endif
 
+#ifndef PIDFD_INFO_COREDUMP
+#define PIDFD_INFO_COREDUMP	(1UL << 4)
+#endif
+
+#ifndef PIDFD_COREDUMPED
+#define PIDFD_COREDUMPED	(1U << 0) /* Did crash and... */
+#endif
+
+#ifndef PIDFD_COREDUMP_SKIP
+#define PIDFD_COREDUMP_SKIP	(1U << 1) /* coredumping generation was skipped. */
+#endif
+
+#ifndef PIDFD_COREDUMP_USER
+#define PIDFD_COREDUMP_USER	(1U << 2) /* coredump was done as the user. */
+#endif
+
+#ifndef PIDFD_COREDUMP_ROOT
+#define PIDFD_COREDUMP_ROOT	(1U << 3) /* coredump was done as root. */
+#endif
+
 #ifndef PIDFD_THREAD
 #define PIDFD_THREAD O_EXCL
 #endif
@@ -150,6 +170,9 @@ struct pidfd_info {
 	__u32 fsuid;
 	__u32 fsgid;
 	__s32 exit_code;
+	__u32 coredump_mask;
+	__u32 __spare1;
+	__u64 coredump_cookie;
 };
 
 /*

-- 
2.47.2



* [PATCH v7 9/9] selftests/coredump: add tests for AF_UNIX coredumps
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (7 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 8/9] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
@ 2025-05-14 22:03 ` Christian Brauner
  2025-05-15 14:37   ` Alexander Mikhalitsyn
  2025-05-14 22:38 ` [PATCH v7 0/9] coredump: add coredump socket Luca Boccassi
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Christian Brauner @ 2025-05-14 22:03 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Christian Brauner,
	Alexander Mikhalitsyn

Add a simple test for generating coredumps via AF_UNIX sockets.
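
The test's coredump server binds a listening AF_UNIX socket at the path
named in core_pattern, passing an address length of
offsetof(struct sockaddr_un, sun_path) plus the path length including
the terminating NUL. A minimal sketch of that address setup follows.
The helper coredump_sockaddr() is a hypothetical name for this example
and is not part of the selftest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: fill @addr for the absolute path @path.
 * Returns the address length to pass to bind() or connect(), counting
 * the terminating NUL, or -1 if @path is relative or does not fit.
 */
static int coredump_sockaddr(struct sockaddr_un *addr, const char *path)
{
	size_t len;

	assert(addr && path);
	len = strlen(path);

	/* The coredump socket must be named by an absolute path. */
	if (path[0] != '/' || len >= sizeof(addr->sun_path))
		return -1;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);

	return (int)(offsetof(struct sockaddr_un, sun_path) + len + 1);
}
```

The returned length can be passed directly to bind() on the server
side, matching how the test computes coredump_sk_len.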

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/coredump/stackdump_test.c | 514 +++++++++++++++++++++-
 1 file changed, 513 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index fe3c728cd6be..42ddcf0bdaf2 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -1,14 +1,20 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include <fcntl.h>
+#include <inttypes.h>
 #include <libgen.h>
 #include <linux/limits.h>
 #include <pthread.h>
 #include <string.h>
+#include <sys/mount.h>
 #include <sys/resource.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/un.h>
 #include <unistd.h>
 
 #include "../kselftest_harness.h"
+#include "../pidfd/pidfd.h"
 
 #define STACKDUMP_FILE "stack_values"
 #define STACKDUMP_SCRIPT "stackdump"
@@ -35,6 +41,7 @@ static void crashing_child(void)
 FIXTURE(coredump)
 {
 	char original_core_pattern[256];
+	pid_t pid_coredump_server;
 };
 
 FIXTURE_SETUP(coredump)
@@ -44,6 +51,7 @@ FIXTURE_SETUP(coredump)
 	char *dir;
 	int ret;
 
+	self->pid_coredump_server = -ESRCH;
 	file = fopen("/proc/sys/kernel/core_pattern", "r");
 	ASSERT_NE(NULL, file);
 
@@ -61,10 +69,17 @@ FIXTURE_TEARDOWN(coredump)
 {
 	const char *reason;
 	FILE *file;
-	int ret;
+	int ret, status;
 
 	unlink(STACKDUMP_FILE);
 
+	if (self->pid_coredump_server > 0) {
+		kill(self->pid_coredump_server, SIGTERM);
+		waitpid(self->pid_coredump_server, &status, 0);
+	}
+	unlink("/tmp/coredump.file");
+	unlink("/tmp/coredump.socket");
+
 	file = fopen("/proc/sys/kernel/core_pattern", "w");
 	if (!file) {
 		reason = "Unable to open core_pattern";
@@ -154,4 +169,501 @@ TEST_F_TIMEOUT(coredump, stackdump, 120)
 	fclose(file);
 }
 
+TEST_F(coredump, socket)
+{
+	int fd, pidfd, ret, status;
+	FILE *file;
+	pid_t pid, pid_coredump_server;
+	struct stat st;
+	char core_file[PATH_MAX];
+	struct pidfd_info info = {};
+	int ipc_sockets[2];
+	char c;
+	const struct sockaddr_un coredump_sk = {
+		.sun_family = AF_UNIX,
+		.sun_path = "/tmp/coredump.socket",
+	};
+	size_t coredump_sk_len = offsetof(struct sockaddr_un, sun_path) +
+				 sizeof("/tmp/coredump.socket");
+
+	ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	file = fopen("/proc/sys/kernel/core_pattern", "w");
+	ASSERT_NE(file, NULL);
+
+	ret = fprintf(file, "@/tmp/coredump.socket");
+	ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
+	ASSERT_EQ(fclose(file), 0);
+
+	pid_coredump_server = fork();
+	ASSERT_GE(pid_coredump_server, 0);
+	if (pid_coredump_server == 0) {
+		int fd_server, fd_coredump, fd_peer_pidfd, fd_core_file;
+		__u64 peer_cookie;
+		socklen_t fd_peer_pidfd_len, peer_cookie_len;
+
+		close(ipc_sockets[0]);
+
+		fd_server = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+		if (fd_server < 0)
+			_exit(EXIT_FAILURE);
+
+		ret = bind(fd_server, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to bind coredump socket\n");
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		ret = listen(fd_server, 1);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to listen on coredump socket\n");
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		close(ipc_sockets[1]);
+
+		fd_coredump = accept4(fd_server, NULL, NULL, SOCK_CLOEXEC);
+		if (fd_coredump < 0) {
+			fprintf(stderr, "Failed to accept coredump socket connection\n");
+			close(fd_server);
+			_exit(EXIT_FAILURE);
+		}
+
+		peer_cookie_len = sizeof(peer_cookie);
+		ret = getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE,
+				 &peer_cookie, &peer_cookie_len);
+		if (ret < 0) {
+			fprintf(stderr, "%m - Failed to retrieve cookie for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_server);
+			_exit(EXIT_FAILURE);
+		}
+
+		fd_peer_pidfd_len = sizeof(fd_peer_pidfd);
+		ret = getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD,
+				 &fd_peer_pidfd, &fd_peer_pidfd_len);
+		if (ret < 0) {
+			fprintf(stderr, "%m - Failed to retrieve peer pidfd for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_server);
+			_exit(EXIT_FAILURE);
+		}
+
+		memset(&info, 0, sizeof(info));
+		info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+		ret = ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to retrieve pidfd info from peer pidfd for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!(info.mask & PIDFD_INFO_COREDUMP)) {
+			fprintf(stderr, "Missing coredump information from coredumping task\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
+			fprintf(stderr, "Received connection from non-coredumping task\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!info.coredump_cookie) {
+			fprintf(stderr, "Missing coredump cookie\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (info.coredump_cookie != peer_cookie) {
+			fprintf(stderr, "Mismatching coredump cookies\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		fd_core_file = creat("/tmp/coredump.file", 0644);
+		if (fd_core_file < 0) {
+			fprintf(stderr, "Failed to create coredump file\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		for (;;) {
+			char buffer[4096];
+			ssize_t bytes_read, bytes_write;
+
+			bytes_read = read(fd_coredump, buffer, sizeof(buffer));
+			if (bytes_read < 0) {
+				close(fd_coredump);
+				close(fd_server);
+				close(fd_peer_pidfd);
+				close(fd_core_file);
+				_exit(EXIT_FAILURE);
+			}
+
+			if (bytes_read == 0)
+				break;
+
+			bytes_write = write(fd_core_file, buffer, bytes_read);
+			if (bytes_read != bytes_write) {
+				close(fd_coredump);
+				close(fd_server);
+				close(fd_peer_pidfd);
+				close(fd_core_file);
+				_exit(EXIT_FAILURE);
+			}
+		}
+
+		close(fd_coredump);
+		close(fd_server);
+		close(fd_peer_pidfd);
+		close(fd_core_file);
+		_exit(EXIT_SUCCESS);
+	}
+	self->pid_coredump_server = pid_coredump_server;
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		crashing_child();
+
+	pidfd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFSIGNALED(status));
+	ASSERT_TRUE(WCOREDUMP(status));
+
+	info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+	ASSERT_EQ(ioctl(pidfd, PIDFD_GET_INFO, &info), 0);
+	ASSERT_GT((info.mask & PIDFD_INFO_COREDUMP), 0);
+	ASSERT_GT((info.coredump_mask & PIDFD_COREDUMPED), 0);
+
+	waitpid(pid_coredump_server, &status, 0);
+	self->pid_coredump_server = -ESRCH;
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	ASSERT_EQ(stat("/tmp/coredump.file", &st), 0);
+	ASSERT_GT(st.st_size, 0);
+	/*
+	 * We should somehow validate the produced core file.
+	 * For now just allow for visual inspection
+	 */
+	system("file /tmp/coredump.file");
+}
+
+TEST_F(coredump, socket_detect_userspace_client)
+{
+	int fd, pidfd, ret, status;
+	FILE *file;
+	pid_t pid, pid_coredump_server;
+	struct stat st;
+	char core_file[PATH_MAX];
+	struct pidfd_info info = {};
+	int ipc_sockets[2];
+	char c;
+	const struct sockaddr_un coredump_sk = {
+		.sun_family = AF_UNIX,
+		.sun_path = "/tmp/coredump.socket",
+	};
+	size_t coredump_sk_len = offsetof(struct sockaddr_un, sun_path) +
+				 sizeof("/tmp/coredump.socket");
+
+	file = fopen("/proc/sys/kernel/core_pattern", "w");
+	ASSERT_NE(file, NULL);
+
+	ret = fprintf(file, "@/tmp/coredump.socket");
+	ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
+	ASSERT_EQ(fclose(file), 0);
+
+	ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	pid_coredump_server = fork();
+	ASSERT_GE(pid_coredump_server, 0);
+	if (pid_coredump_server == 0) {
+		int fd_server, fd_coredump, fd_peer_pidfd, fd_core_file;
+		__u64 peer_cookie;
+		socklen_t fd_peer_pidfd_len, peer_cookie_len;
+
+		close(ipc_sockets[0]);
+
+		fd_server = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+		if (fd_server < 0)
+			_exit(EXIT_FAILURE);
+
+		ret = bind(fd_server, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to bind coredump socket\n");
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		ret = listen(fd_server, 1);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to listen on coredump socket\n");
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		close(ipc_sockets[1]);
+
+		fd_coredump = accept4(fd_server, NULL, NULL, SOCK_CLOEXEC);
+		if (fd_coredump < 0) {
+			fprintf(stderr, "Failed to accept coredump socket connection\n");
+			close(fd_server);
+			_exit(EXIT_FAILURE);
+		}
+
+		peer_cookie_len = sizeof(peer_cookie);
+		ret = getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE,
+				 &peer_cookie, &peer_cookie_len);
+		if (ret < 0) {
+			fprintf(stderr, "%m - Failed to retrieve cookie for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_server);
+			_exit(EXIT_FAILURE);
+		}
+
+		fd_peer_pidfd_len = sizeof(fd_peer_pidfd);
+		ret = getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD,
+				 &fd_peer_pidfd, &fd_peer_pidfd_len);
+		if (ret < 0) {
+			fprintf(stderr, "%m - Failed to retrieve peer pidfd for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_server);
+			_exit(EXIT_FAILURE);
+		}
+
+		memset(&info, 0, sizeof(info));
+		info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+		ret = ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to retrieve pidfd info from peer pidfd for coredump socket connection\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (!(info.mask & PIDFD_INFO_COREDUMP)) {
+			fprintf(stderr, "Missing coredump information from coredumping task\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (info.coredump_mask & PIDFD_COREDUMPED) {
+			fprintf(stderr, "Received unexpected connection from coredumping task\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (info.coredump_cookie) {
+			fprintf(stderr, "Received unexpected coredump cookie\n");
+			close(fd_coredump);
+			close(fd_server);
+			close(fd_peer_pidfd);
+			_exit(EXIT_FAILURE);
+		}
+
+		close(fd_coredump);
+		close(fd_server);
+		close(fd_peer_pidfd);
+		close(fd_core_file);
+		_exit(EXIT_SUCCESS);
+	}
+	self->pid_coredump_server = pid_coredump_server;
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		int fd_socket;
+		ssize_t ret;
+
+		fd_socket = socket(AF_UNIX, SOCK_STREAM, 0);
+		if (fd_socket < 0)
+			_exit(EXIT_FAILURE);
+
+
+		ret = connect(fd_socket, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
+		if (ret < 0)
+			_exit(EXIT_FAILURE);
+
+		(void)write(fd_socket, &(char){ 0 }, 1);
+		close(fd_socket);
+		_exit(EXIT_SUCCESS);
+	}
+
+	pidfd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
+	ASSERT_EQ(ioctl(pidfd, PIDFD_GET_INFO, &info), 0);
+	ASSERT_GT((info.mask & PIDFD_INFO_COREDUMP), 0);
+	ASSERT_EQ((info.coredump_mask & PIDFD_COREDUMPED), 0);
+
+	waitpid(pid_coredump_server, &status, 0);
+	self->pid_coredump_server = -ESRCH;
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	ASSERT_NE(stat("/tmp/coredump.file", &st), 0);
+	ASSERT_EQ(errno, ENOENT);
+}
+
+TEST_F(coredump, socket_enoent)
+{
+	int pidfd, ret, status;
+	FILE *file;
+	pid_t pid;
+	char core_file[PATH_MAX];
+
+	file = fopen("/proc/sys/kernel/core_pattern", "w");
+	ASSERT_NE(file, NULL);
+
+	ret = fprintf(file, "@/tmp/coredump.socket");
+	ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
+	ASSERT_EQ(fclose(file), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		crashing_child();
+
+	pidfd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFSIGNALED(status));
+	ASSERT_FALSE(WCOREDUMP(status));
+}
+
+TEST_F(coredump, socket_no_listener)
+{
+	int pidfd, ret, status;
+	FILE *file;
+	pid_t pid, pid_coredump_server;
+	int ipc_sockets[2];
+	char c;
+	const struct sockaddr_un coredump_sk = {
+		.sun_family = AF_UNIX,
+		.sun_path = "/tmp/coredump.socket",
+	};
+	size_t coredump_sk_len = offsetof(struct sockaddr_un, sun_path) +
+				 sizeof("/tmp/coredump.socket");
+
+	ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
+	ASSERT_EQ(ret, 0);
+
+	file = fopen("/proc/sys/kernel/core_pattern", "w");
+	ASSERT_NE(file, NULL);
+
+	ret = fprintf(file, "@/tmp/coredump.socket");
+	ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
+	ASSERT_EQ(fclose(file), 0);
+
+	pid_coredump_server = fork();
+	ASSERT_GE(pid_coredump_server, 0);
+	if (pid_coredump_server == 0) {
+		int fd_server, fd_coredump, fd_peer_pidfd, fd_core_file;
+		__u64 peer_cookie;
+		socklen_t fd_peer_pidfd_len, peer_cookie_len;
+
+		close(ipc_sockets[0]);
+
+		fd_server = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+		if (fd_server < 0)
+			_exit(EXIT_FAILURE);
+
+		ret = bind(fd_server, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to bind coredump socket\n");
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
+			close(fd_server);
+			close(ipc_sockets[1]);
+			_exit(EXIT_FAILURE);
+		}
+
+		close(fd_server);
+		close(ipc_sockets[1]);
+		_exit(EXIT_SUCCESS);
+	}
+	self->pid_coredump_server = pid_coredump_server;
+
+	EXPECT_EQ(close(ipc_sockets[1]), 0);
+	ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
+	EXPECT_EQ(close(ipc_sockets[0]), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0)
+		crashing_child();
+
+	pidfd = sys_pidfd_open(pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFSIGNALED(status));
+	ASSERT_FALSE(WCOREDUMP(status));
+
+	waitpid(pid_coredump_server, &status, 0);
+	self->pid_coredump_server = -ESRCH;
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.2
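
A side note on the copy loop in the socket test above: it treats a short
write() as a failure, which is fine for a selftest that writes to a
regular file. A coredump server that wants to tolerate short writes and
EINTR could use something like the following minimal sketch. The helper
names write_all() and copy_until_eof() are made up for illustration and
are not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, retrying on EINTR and on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Copy everything from fd_in to fd_out until EOF, using the same 4096
 * byte buffer size as the selftest. Returns the number of bytes copied
 * or -1 on error. Hypothetical helper, not taken from the patch.
 */
static long long copy_until_eof(int fd_in, int fd_out)
{
	char buffer[4096];
	long long total = 0;

	for (;;) {
		ssize_t n = read(fd_in, buffer, sizeof(buffer));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return total;
		if (write_all(fd_out, buffer, (size_t)n) < 0)
			return -1;
		total += n;
	}
}
```

In the server child this would be called as
copy_until_eof(fd_coredump, fd_core_file), with a negative return value
mapped to _exit(EXIT_FAILURE).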


* Re: [PATCH v7 0/9] coredump: add coredump socket
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (8 preceding siblings ...)
  2025-05-14 22:03 ` [PATCH v7 9/9] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
@ 2025-05-14 22:38 ` Luca Boccassi
  2025-05-15  9:17 ` Christian Brauner
  2025-05-15  9:26 ` Lennart Poettering
  11 siblings, 0 replies; 43+ messages in thread
From: Luca Boccassi @ 2025-05-14 22:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, netdev, linux-security-module

On Wed, 14 May 2025 at 23:04, Christian Brauner <brauner@kernel.org> wrote:
>
> Coredumping currently supports two modes:
>
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
>
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
>
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
>
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
>
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
>
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
>
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
>
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall which is difficult for userspace to control correctly.
>
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping exercises in
>   userspace to make this safe.
>
> - A new usermode helper has to be spawned for each crashing process.
>
> This series adds a new mode:
>
> (3) Dumping into an AF_UNIX socket.
>
> Userspace can set /proc/sys/kernel/core_pattern to:
>
>         @/path/to/coredump.socket
>
> The "@" at the beginning indicates to the kernel that an AF_UNIX
> coredump socket will be used to process coredumps.
>
> The coredump socket must be located in the initial mount namespace.
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket.
>
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
>   the connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
>
> - When a coredump connection is initiated use the socket cookie as the
>   coredump cookie and store it in the pidfd. The receiver can now easily
>   authenticate that the connection is coming from the kernel.
>
>   Unless the coredump server expects to handle connections from
>   non-crashing tasks it can validate that the connection has been made
>   from a crashing task:
>
>      fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
>      getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);
>
>      struct pidfd_info info = {
>              .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
>      };
>
>      ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
>      /* Refuse connections that aren't from a crashing task. */
>      if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
>              close(fd_coredump);
>
>      /*
>       * Make sure that the coredump cookie matches the connection cookie.
>       * If they don't it's not the coredump connection from the kernel.
>       * We'll get another connection request in a bit.
>       */
>      getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
>      if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
>              close(fd_coredump);
>
>   The kernel guarantees that by the time the connection is made the
>   coredump info is available.
>
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind its back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
>
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via AF_UNIX socket directly.
>
> - The pidfd for the crashing task will contain information about how
>   the task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
>   PIDFD_INFO_COREDUMP which can be used to retrieve the coredump
>   information.
>
>   If the coredump server gets a new coredump client connection the kernel
>   guarantees that PIDFD_INFO_COREDUMP information is available.
>   Currently the following information is provided in the new
>   @coredump_mask extension to struct pidfd_info:
>
>   * PIDFD_COREDUMPED is raised if the task did actually coredump.
>   * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
>     undumpable).
>   * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
>     doesn't need special care by the coredump server.
>   * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
>     treated as sensitive and the coredump server should restrict access
>     to the generated coredump to sufficiently privileged users.
>
> - The coredump server should mark itself as non-dumpable.
>
> - A container coredump server in a separate network namespace can simply
>   bind to another well-known address and systemd-coredump forwards
>   coredumps to the container.
>
> - Coredumps could in the future also be handled via per-user/session
>   coredump servers that run only with that user's privileges.
>
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the user's
>   own coredump handler which runs with the user's privileges only
>   (It must of course pay close attention to not forward crashing suid
>   binaries.).
>
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
>
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Changes in v7:
> - Use regular AF_UNIX sockets instead of abstract AF_UNIX sockets. This
>   fixes the permission problems as userspace can ensure that the socket
>   path cannot be rebound by arbitrary unprivileged userspace via regular
>   path permissions.
>
>   This means:
>   - We don't require privilege checks on a reserved abstract AF_UNIX
>     namespace
>   - We don't require a fixed address for the coredump socket.
>   - We don't need to use abstract unix sockets at all.
>   - We don't need special socket cookie magic in the
>     /proc/sys/kernel/core_pattern handler.
>   - We are able to set /proc/sys/kernel/core_pattern statically without
>     having any socket bound.
>
>   That's all complaints addressed.
>
>   Simply massage unix_find_bsd() to be able to handle this and always
>   lookup the coredump socket in the initial mount namespace with
>   appropriate credentials. The same thing we do for looking up other
>   parts in the kernel like this. Only the lookup happens this way.
>   Actual connection credentials are obviously from the coredumping task.
> - Link to v6: https://lore.kernel.org/20250512-work-coredump-socket-v6-0-c51bc3450727@kernel.org
>
> Changes in v6:
> - Use the socket cookie to verify the coredump server.
> - Link to v5: https://lore.kernel.org/20250509-work-coredump-socket-v5-0-23c5b14df1bc@kernel.org
>
> Changes in v5:
> - Don't use a prefix just the specific address.
> - Link to v4: https://lore.kernel.org/20250507-work-coredump-socket-v4-0-af0ef317b2d0@kernel.org
>
> Changes in v4:
> - Expose the coredump socket cookie through the pidfd. This allows the
>   coredump server to easily recognize coredump socket connections.
> - Link to v3: https://lore.kernel.org/20250505-work-coredump-socket-v3-0-e1832f0e1eae@kernel.org
>
> Changes in v3:
> - Use an abstract unix socket.
> - Add documentation.
> - Add selftests.
> - Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org
>
> Changes in v2:
> - Expose dumpability via PIDFD_GET_INFO.
> - Place COREDUMP_SOCK handling under CONFIG_UNIX.
> - Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org
>
> ---
> Christian Brauner (9):
>       coredump: massage format_corname()
>       coredump: massage do_coredump()
>       coredump: reflow dump helpers a little
>       coredump: add coredump socket
>       pidfs, coredump: add PIDFD_INFO_COREDUMP
>       coredump: show supported coredump modes
>       coredump: validate socket name as it is written
>       selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
>       selftests/coredump: add tests for AF_UNIX coredumps
>
>  fs/coredump.c                                     | 392 +++++++++++++----
>  fs/pidfs.c                                        |  79 ++++
>  include/linux/net.h                               |   1 +
>  include/linux/pidfs.h                             |  10 +
>  include/uapi/linux/pidfd.h                        |  22 +
>  net/unix/af_unix.c                                |  60 ++-
>  tools/testing/selftests/coredump/stackdump_test.c | 514 +++++++++++++++++++++-
>  tools/testing/selftests/pidfd/pidfd.h             |  23 +
>  8 files changed, 996 insertions(+), 105 deletions(-)
> ---
> base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
> change-id: 20250429-work-coredump-socket-87cc0f17729c

Looks great to me and we can for sure use this in systemd-coredump,
thanks, for the series:

Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
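
For anyone wiring this up on the server side, the two checks from the
cover letter (peer is a coredumping task, and the coredump cookie
matches the socket cookie) can be folded into one predicate. This is
only a sketch under stated assumptions: struct coredump_peer and
coredump_peer_is_kernel() are hypothetical names, and the EX_* bit
values are local stand-ins so the snippet compiles without the new uapi
header. Real code should take PIDFD_INFO_COREDUMP and PIDFD_COREDUMPED
from <linux/pidfd.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for illustration only; use <linux/pidfd.h> in real code. */
#define EX_PIDFD_INFO_COREDUMP	(1ULL << 4)
#define EX_PIDFD_COREDUMPED	(1U << 0)

/* Values a server has collected for one accepted connection. */
struct coredump_peer {
	uint64_t mask;			/* pidfd_info.mask after PIDFD_GET_INFO */
	uint32_t coredump_mask;		/* pidfd_info.coredump_mask */
	uint64_t coredump_cookie;	/* pidfd_info.coredump_cookie */
	uint64_t socket_cookie;		/* SO_COOKIE of the accepted socket */
};

/* True only if the connection looks like a kernel coredump connection. */
static bool coredump_peer_is_kernel(const struct coredump_peer *p)
{
	/* The kernel must have filled in the coredump information. */
	if (!(p->mask & EX_PIDFD_INFO_COREDUMP))
		return false;
	/* The peer must actually be coredumping. */
	if (!(p->coredump_mask & EX_PIDFD_COREDUMPED))
		return false;
	/* A zero cookie means no coredump connection was recorded. */
	if (p->coredump_cookie == 0)
		return false;
	/* The recorded cookie must be the cookie of this very connection. */
	return p->coredump_cookie == p->socket_cookie;
}
```

A server would close fd_coredump whenever this returns false, which is
the same outcome as the two close() calls in the cover letter example.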

* Re: [PATCH v7 0/9] coredump: add coredump socket
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (9 preceding siblings ...)
  2025-05-14 22:38 ` [PATCH v7 0/9] coredump: add coredump socket Luca Boccassi
@ 2025-05-15  9:17 ` Christian Brauner
  2025-05-15  9:26 ` Lennart Poettering
  11 siblings, 0 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-15  9:17 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima
  Cc: Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:03:33AM +0200, Christian Brauner wrote:
> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall which is difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping exercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an abstract AF_UNIX socket.

s/abstract//
Forgot to remove that. Fixed in-tree.
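
For context on the distinction: an abstract AF_UNIX address is one whose
sun_path starts with a NUL byte, while v7 uses an ordinary filesystem
path that is protected by regular path permissions. A minimal sketch of
the difference follows. The helper names unix_path_addr() and
unix_addr_is_abstract() are hypothetical and only serve the
illustration. The length computation mirrors the one used in the
selftests.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Fill in a path-based AF_UNIX address. Returns the length to pass to
 * bind() or connect(), or 0 if the path is empty or does not fit.
 */
static socklen_t unix_path_addr(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len == 0 || len >= sizeof(addr->sun_path))
		return 0;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);
	/* Header plus path plus the terminating NUL byte. */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len + 1);
}

/* Abstract addresses have a leading NUL byte in sun_path (Linux-specific). */
static bool unix_addr_is_abstract(const struct sockaddr_un *addr, socklen_t len)
{
	return len > offsetof(struct sockaddr_un, sun_path) &&
	       addr->sun_path[0] == '\0';
}
```

With "/tmp/coredump.socket" the first helper returns the same value as
the coredump_sk_len computation in the tests, and the second helper
reports the address as not abstract.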

* Re: [PATCH v7 0/9] coredump: add coredump socket
  2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
                   ` (10 preceding siblings ...)
  2025-05-15  9:17 ` Christian Brauner
@ 2025-05-15  9:26 ` Lennart Poettering
  11 siblings, 0 replies; 43+ messages in thread
From: Lennart Poettering @ 2025-05-15  9:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, 15.05.25 00:03, Christian Brauner (brauner@kernel.org) wrote:

> Coredumping currently supports two modes:

[...]
> ---
> base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
> change-id: 20250429-work-coredump-socket-87cc0f17729c

Looks lovely, thank you!

Looking forward to hooking this up with systemd-coredump!

Lennart

--
Lennart Poettering, Berlin

* Re: [PATCH v7 1/9] coredump: massage format_corname()
  2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
@ 2025-05-15 13:19   ` Alexander Mikhalitsyn
  2025-05-15 13:36   ` Serge E. Hallyn
  2025-05-15 20:52   ` Jann Horn
  2 siblings, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 13:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, 15 May 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> We're going to extend the coredump code in follow-up patches.
> Clean it up so we can do this more easily.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
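
For readers following along: the enum in the patch quoted below replaces
the old ispipe int, which makes room for the socket mode that the cover
letter selects with a leading "@". A minimal userspace sketch of that
prefix dispatch, assuming the third type is called COREDUMP_SOCK (the
name shows up in the v2 changelog) and assuming the value 3 for it.
classify_core_pattern() is a hypothetical helper and not a kernel
function.

```c
enum coredump_type_t {
	COREDUMP_FILE = 1,
	COREDUMP_PIPE = 2,
	COREDUMP_SOCK = 3,	/* assumed value, not part of patch 1/9 */
};

/* Pick the coredump mode from the first character of core_pattern. */
static enum coredump_type_t classify_core_pattern(const char *pattern)
{
	switch (pattern[0]) {
	case '|':
		return COREDUMP_PIPE;	/* usermode helper via pipe */
	case '@':
		return COREDUMP_SOCK;	/* AF_UNIX coredump socket */
	default:
		return COREDUMP_FILE;	/* plain file, e.g. "core" */
	}
}
```

Returning the type instead of a boolean is what lets do_coredump() use
an explicit "else if (cn.core_type == COREDUMP_FILE)" branch rather than
a bare else.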

> ---
>  fs/coredump.c | 41 ++++++++++++++++++++++++-----------------
>  1 file changed, 24 insertions(+), 17 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index d740a0411266..368751d98781 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -76,9 +76,15 @@ static char core_pattern[CORENAME_MAX_SIZE] = "core";
>  static int core_name_size = CORENAME_MAX_SIZE;
>  unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
>
> +enum coredump_type_t {
> +       COREDUMP_FILE = 1,
> +       COREDUMP_PIPE = 2,
> +};
> +
>  struct core_name {
>         char *corename;
>         int used, size;
> +       enum coredump_type_t core_type;
>  };
>
>  static int expand_corename(struct core_name *cn, int size)
> @@ -218,18 +224,21 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  {
>         const struct cred *cred = current_cred();
>         const char *pat_ptr = core_pattern;
> -       int ispipe = (*pat_ptr == '|');
>         bool was_space = false;
>         int pid_in_pattern = 0;
>         int err = 0;
>
>         cn->used = 0;
>         cn->corename = NULL;
> +       if (*pat_ptr == '|')
> +               cn->core_type = COREDUMP_PIPE;
> +       else
> +               cn->core_type = COREDUMP_FILE;
>         if (expand_corename(cn, core_name_size))
>                 return -ENOMEM;
>         cn->corename[0] = '\0';
>
> -       if (ispipe) {
> +       if (cn->core_type == COREDUMP_PIPE) {
>                 int argvs = sizeof(core_pattern) / 2;
>                 (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
>                 if (!(*argv))
> @@ -247,7 +256,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>                  * Split on spaces before doing template expansion so that
>                  * %e and %E don't get split if they have spaces in them
>                  */
> -               if (ispipe) {
> +               if (cn->core_type == COREDUMP_PIPE) {
>                         if (isspace(*pat_ptr)) {
>                                 if (cn->used != 0)
>                                         was_space = true;
> @@ -353,7 +362,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>                                  * Installing a pidfd only makes sense if
>                                  * we actually spawn a usermode helper.
>                                  */
> -                               if (!ispipe)
> +                               if (cn->core_type != COREDUMP_PIPE)
>                                         break;
>
>                                 /*
> @@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>          * If core_pattern does not include a %p (as is the default)
>          * and core_uses_pid is set, then .%pid will be appended to
>          * the filename. Do not do this for piped commands. */
> -       if (!ispipe && !pid_in_pattern && core_uses_pid) {
> +       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
>                 err = cn_printf(cn, ".%d", task_tgid_vnr(current));
>                 if (err)
>                         return err;
>         }
> -       return ispipe;
> +       return 0;
>  }
>
>  static int zap_process(struct signal_struct *signal, int exit_code)
> @@ -583,7 +592,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>         const struct cred *old_cred;
>         struct cred *cred;
>         int retval = 0;
> -       int ispipe;
>         size_t *argv = NULL;
>         int argc = 0;
>         /* require nonrelative corefile path and be extra careful */
> @@ -632,19 +640,18 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>
>         old_cred = override_creds(cred);
>
> -       ispipe = format_corename(&cn, &cprm, &argv, &argc);
> +       retval = format_corename(&cn, &cprm, &argv, &argc);
> +       if (retval < 0) {
> +               coredump_report_failure("format_corename failed, aborting core");
> +               goto fail_unlock;
> +       }
>
> -       if (ispipe) {
> +       if (cn.core_type == COREDUMP_PIPE) {
>                 int argi;
>                 int dump_count;
>                 char **helper_argv;
>                 struct subprocess_info *sub_info;
>
> -               if (ispipe < 0) {
> -                       coredump_report_failure("format_corename failed, aborting core");
> -                       goto fail_unlock;
> -               }
> -
>                 if (cprm.limit == 1) {
>                         /* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
>                          *
> @@ -695,7 +702,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                         coredump_report_failure("|%s pipe failed", cn.corename);
>                         goto close_fail;
>                 }
> -       } else {
> +       } else if (cn.core_type == COREDUMP_FILE) {
>                 struct mnt_idmap *idmap;
>                 struct inode *inode;
>                 int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
> @@ -823,13 +830,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 file_end_write(cprm.file);
>                 free_vma_snapshot(&cprm);
>         }
> -       if (ispipe && core_pipe_limit)
> +       if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
>                 wait_for_dump_helpers(cprm.file);
>  close_fail:
>         if (cprm.file)
>                 filp_close(cprm.file, NULL);
>  fail_dropcount:
> -       if (ispipe)
> +       if (cn.core_type == COREDUMP_PIPE)
>                 atomic_dec(&core_dump_count);
>  fail_unlock:
>         kfree(argv);
>
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 2/9] coredump: massage do_coredump()
  2025-05-14 22:03 ` [PATCH v7 2/9] coredump: massage do_coredump() Christian Brauner
@ 2025-05-15 13:21   ` Alexander Mikhalitsyn
  2025-05-15 20:52   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 13:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> We're going to extend the coredump code in follow-up patches.
> Clean it up so we can do this more easily.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>

> ---
>  fs/coredump.c | 122 +++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 65 insertions(+), 57 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 368751d98781..0e97c21b35e3 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -646,63 +646,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 goto fail_unlock;
>         }
>
> -       if (cn.core_type == COREDUMP_PIPE) {
> -               int argi;
> -               int dump_count;
> -               char **helper_argv;
> -               struct subprocess_info *sub_info;
> -
> -               if (cprm.limit == 1) {
> -                       /* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
> -                        *
> -                        * Normally core limits are irrelevant to pipes, since
> -                        * we're not writing to the file system, but we use
> -                        * cprm.limit of 1 here as a special value, this is a
> -                        * consistent way to catch recursive crashes.
> -                        * We can still crash if the core_pattern binary sets
> -                        * RLIM_CORE = !1, but it runs as root, and can do
> -                        * lots of stupid things.
> -                        *
> -                        * Note that we use task_tgid_vnr here to grab the pid
> -                        * of the process group leader.  That way we get the
> -                        * right pid if a thread in a multi-threaded
> -                        * core_pattern process dies.
> -                        */
> -                       coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
> -                       goto fail_unlock;
> -               }
> -               cprm.limit = RLIM_INFINITY;
> -
> -               dump_count = atomic_inc_return(&core_dump_count);
> -               if (core_pipe_limit && (core_pipe_limit < dump_count)) {
> -                       coredump_report_failure("over core_pipe_limit, skipping core dump");
> -                       goto fail_dropcount;
> -               }
> -
> -               helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
> -                                           GFP_KERNEL);
> -               if (!helper_argv) {
> -                       coredump_report_failure("%s failed to allocate memory", __func__);
> -                       goto fail_dropcount;
> -               }
> -               for (argi = 0; argi < argc; argi++)
> -                       helper_argv[argi] = cn.corename + argv[argi];
> -               helper_argv[argi] = NULL;
> -
> -               retval = -ENOMEM;
> -               sub_info = call_usermodehelper_setup(helper_argv[0],
> -                                               helper_argv, NULL, GFP_KERNEL,
> -                                               umh_coredump_setup, NULL, &cprm);
> -               if (sub_info)
> -                       retval = call_usermodehelper_exec(sub_info,
> -                                                         UMH_WAIT_EXEC);
> -
> -               kfree(helper_argv);
> -               if (retval) {
> -                       coredump_report_failure("|%s pipe failed", cn.corename);
> -                       goto close_fail;
> -               }
> -       } else if (cn.core_type == COREDUMP_FILE) {
> +       switch (cn.core_type) {
> +       case COREDUMP_FILE: {
>                 struct mnt_idmap *idmap;
>                 struct inode *inode;
>                 int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
> @@ -796,6 +741,69 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 if (do_truncate(idmap, cprm.file->f_path.dentry,
>                                 0, 0, cprm.file))
>                         goto close_fail;
> +               break;
> +       }
> +       case COREDUMP_PIPE: {
> +               int argi;
> +               int dump_count;
> +               char **helper_argv;
> +               struct subprocess_info *sub_info;
> +
> +               if (cprm.limit == 1) {
> +                       /* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
> +                        *
> +                        * Normally core limits are irrelevant to pipes, since
> +                        * we're not writing to the file system, but we use
> +                        * cprm.limit of 1 here as a special value, this is a
> +                        * consistent way to catch recursive crashes.
> +                        * We can still crash if the core_pattern binary sets
> +                        * RLIM_CORE = !1, but it runs as root, and can do
> +                        * lots of stupid things.
> +                        *
> +                        * Note that we use task_tgid_vnr here to grab the pid
> +                        * of the process group leader.  That way we get the
> +                        * right pid if a thread in a multi-threaded
> +                        * core_pattern process dies.
> +                        */
> +                       coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
> +                       goto fail_unlock;
> +               }
> +               cprm.limit = RLIM_INFINITY;
> +
> +               dump_count = atomic_inc_return(&core_dump_count);
> +               if (core_pipe_limit && (core_pipe_limit < dump_count)) {
> +                       coredump_report_failure("over core_pipe_limit, skipping core dump");
> +                       goto fail_dropcount;
> +               }
> +
> +               helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
> +                                           GFP_KERNEL);
> +               if (!helper_argv) {
> +                       coredump_report_failure("%s failed to allocate memory", __func__);
> +                       goto fail_dropcount;
> +               }
> +               for (argi = 0; argi < argc; argi++)
> +                       helper_argv[argi] = cn.corename + argv[argi];
> +               helper_argv[argi] = NULL;
> +
> +               retval = -ENOMEM;
> +               sub_info = call_usermodehelper_setup(helper_argv[0],
> +                                               helper_argv, NULL, GFP_KERNEL,
> +                                               umh_coredump_setup, NULL, &cprm);
> +               if (sub_info)
> +                       retval = call_usermodehelper_exec(sub_info,
> +                                                         UMH_WAIT_EXEC);
> +
> +               kfree(helper_argv);
> +               if (retval) {
> +                       coredump_report_failure("|%s pipe failed", cn.corename);
> +                       goto close_fail;
> +               }
> +               break;
> +       }
> +       default:
> +               WARN_ON_ONCE(true);
> +               goto close_fail;
>         }
>
>         /* get us an unshared descriptor table; almost always a no-op */
>
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 3/9] coredump: reflow dump helpers a little
  2025-05-14 22:03 ` [PATCH v7 3/9] coredump: reflow dump helpers a little Christian Brauner
@ 2025-05-15 13:22   ` Alexander Mikhalitsyn
  2025-05-15 20:53   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 13:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> They look rather messy right now.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
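
For readers following along, the reflowed __dump_skip() logic can be
approximated in userspace. This is only a sketch under assumptions: the
names sketch_emit(), sketch_skip() and SKETCH_CHUNK are made up here,
lseek() failing with ESPIPE stands in for the FMODE_LSEEK check, and
dump_interrupted() has no equivalent. It shows the same shape: seek
forward if the target is seekable, otherwise write zeroes in fixed-size
chunks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_CHUNK 4096 /* stands in for PAGE_SIZE; an assumption */

/* Write all nr bytes; returns 1 on success and 0 on failure. */
static int sketch_emit(int fd, const char *buf, size_t nr)
{
	while (nr > 0) {
		ssize_t n = write(fd, buf, nr);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return 0;
		buf += n;
		nr -= (size_t)n;
	}
	return 1;
}

/* Skip nr bytes of output; returns 1 on success and 0 on failure. */
static int sketch_skip(int fd, size_t nr)
{
	static const char zeroes[SKETCH_CHUNK];

	assert(fd >= 0);

	/* Seekable target: just move the file position forward. */
	if (lseek(fd, (off_t)nr, SEEK_CUR) >= 0)
		return 1;
	if (errno != ESPIPE)
		return 0;

	/* Pipe or socket: emit zeroes one chunk at a time. */
	while (nr > SKETCH_CHUNK) {
		if (!sketch_emit(fd, zeroes, SKETCH_CHUNK))
			return 0;
		nr -= SKETCH_CHUNK;
	}

	return sketch_emit(fd, zeroes, nr);
}
```

Dropping the else branch after the early return keeps the fallback loop
at the top indentation level, which is all the patch changes here.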

> ---
>  fs/coredump.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 0e97c21b35e3..a70929c3585b 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -867,10 +867,9 @@ static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
>         struct file *file = cprm->file;
>         loff_t pos = file->f_pos;
>         ssize_t n;
> +
>         if (cprm->written + nr > cprm->limit)
>                 return 0;
> -
> -
>         if (dump_interrupted())
>                 return 0;
>         n = __kernel_write(file, addr, nr, &pos);
> @@ -887,20 +886,21 @@ static int __dump_skip(struct coredump_params *cprm, size_t nr)
>  {
>         static char zeroes[PAGE_SIZE];
>         struct file *file = cprm->file;
> +
>         if (file->f_mode & FMODE_LSEEK) {
> -               if (dump_interrupted() ||
> -                   vfs_llseek(file, nr, SEEK_CUR) < 0)
> +               if (dump_interrupted() || vfs_llseek(file, nr, SEEK_CUR) < 0)
>                         return 0;
>                 cprm->pos += nr;
>                 return 1;
> -       } else {
> -               while (nr > PAGE_SIZE) {
> -                       if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
> -                               return 0;
> -                       nr -= PAGE_SIZE;
> -               }
> -               return __dump_emit(cprm, zeroes, nr);
>         }
> +
> +       while (nr > PAGE_SIZE) {
> +               if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
> +                       return 0;
> +               nr -= PAGE_SIZE;
> +       }
> +
> +       return __dump_emit(cprm, zeroes, nr);
>  }
>
>  int dump_emit(struct coredump_params *cprm, const void *addr, int nr)
>
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 1/9] coredump: massage format_corname()
  2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
  2025-05-15 13:19   ` Alexander Mikhalitsyn
@ 2025-05-15 13:36   ` Serge E. Hallyn
  2025-05-15 20:52   ` Jann Horn
  2 siblings, 0 replies; 43+ messages in thread
From: Serge E. Hallyn @ 2025-05-15 13:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:03:34AM +0200, Christian Brauner wrote:
> We're going to extend the coredump code in follow-up patches.
> Clean it up so we can do this more easily.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Not my wheelhouse, but this is a nice cleanup.

Acked-by: Serge Hallyn <serge@hallyn.com>
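
As an illustration of what the cleanup buys, here is a minimal userspace
sketch of the new convention: the dump type is stored in the name
structure and the return value carries only 0 or a negative errno,
instead of one int doubling as "is pipe" flag and error code. The names
sketch_type, sketch_name and sketch_format() are hypothetical and not
taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_type {
	SKETCH_FILE = 1,
	SKETCH_PIPE = 2,
};

struct sketch_name {
	enum sketch_type type;
	const char *rest; /* pattern with a leading '|' stripped */
};

/* Returns 0 on success or a negative errno; the type goes into *cn. */
static int sketch_format(struct sketch_name *cn, const char *pattern)
{
	assert(cn != NULL && pattern != NULL);

	if (pattern[0] == '|') {
		cn->type = SKETCH_PIPE;
		cn->rest = pattern + 1;
		/* Mirrors the quoted code: a bare "|" is an error. */
		if (cn->rest[0] == '\0')
			return -ENOMEM;
	} else {
		cn->type = SKETCH_FILE;
		cn->rest = pattern;
	}
	return 0;
}
```

A caller checks the return value for failure first and then dispatches
on cn->type, which is what do_coredump() does after this patch.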

> ---
>  fs/coredump.c | 41 ++++++++++++++++++++++++-----------------
>  1 file changed, 24 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index d740a0411266..368751d98781 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -76,9 +76,15 @@ static char core_pattern[CORENAME_MAX_SIZE] = "core";
>  static int core_name_size = CORENAME_MAX_SIZE;
>  unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
>  
> +enum coredump_type_t {
> +	COREDUMP_FILE = 1,
> +	COREDUMP_PIPE = 2,
> +};
> +
>  struct core_name {
>  	char *corename;
>  	int used, size;
> +	enum coredump_type_t core_type;
>  };
>  
>  static int expand_corename(struct core_name *cn, int size)
> @@ -218,18 +224,21 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  {
>  	const struct cred *cred = current_cred();
>  	const char *pat_ptr = core_pattern;
> -	int ispipe = (*pat_ptr == '|');
>  	bool was_space = false;
>  	int pid_in_pattern = 0;
>  	int err = 0;
>  
>  	cn->used = 0;
>  	cn->corename = NULL;
> +	if (*pat_ptr == '|')
> +		cn->core_type = COREDUMP_PIPE;
> +	else
> +		cn->core_type = COREDUMP_FILE;
>  	if (expand_corename(cn, core_name_size))
>  		return -ENOMEM;
>  	cn->corename[0] = '\0';
>  
> -	if (ispipe) {
> +	if (cn->core_type == COREDUMP_PIPE) {
>  		int argvs = sizeof(core_pattern) / 2;
>  		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
>  		if (!(*argv))
> @@ -247,7 +256,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  		 * Split on spaces before doing template expansion so that
>  		 * %e and %E don't get split if they have spaces in them
>  		 */
> -		if (ispipe) {
> +		if (cn->core_type == COREDUMP_PIPE) {
>  			if (isspace(*pat_ptr)) {
>  				if (cn->used != 0)
>  					was_space = true;
> @@ -353,7 +362,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  				 * Installing a pidfd only makes sense if
>  				 * we actually spawn a usermode helper.
>  				 */
> -				if (!ispipe)
> +				if (cn->core_type != COREDUMP_PIPE)
>  					break;
>  
>  				/*
> @@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  	 * If core_pattern does not include a %p (as is the default)
>  	 * and core_uses_pid is set, then .%pid will be appended to
>  	 * the filename. Do not do this for piped commands. */
> -	if (!ispipe && !pid_in_pattern && core_uses_pid) {
> +	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
>  		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
>  		if (err)
>  			return err;
>  	}
> -	return ispipe;
> +	return 0;
>  }
>  
>  static int zap_process(struct signal_struct *signal, int exit_code)
> @@ -583,7 +592,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	const struct cred *old_cred;
>  	struct cred *cred;
>  	int retval = 0;
> -	int ispipe;
>  	size_t *argv = NULL;
>  	int argc = 0;
>  	/* require nonrelative corefile path and be extra careful */
> @@ -632,19 +640,18 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  
>  	old_cred = override_creds(cred);
>  
> -	ispipe = format_corename(&cn, &cprm, &argv, &argc);
> +	retval = format_corename(&cn, &cprm, &argv, &argc);
> +	if (retval < 0) {
> +		coredump_report_failure("format_corename failed, aborting core");
> +		goto fail_unlock;
> +	}
>  
> -	if (ispipe) {
> +	if (cn.core_type == COREDUMP_PIPE) {
>  		int argi;
>  		int dump_count;
>  		char **helper_argv;
>  		struct subprocess_info *sub_info;
>  
> -		if (ispipe < 0) {
> -			coredump_report_failure("format_corename failed, aborting core");
> -			goto fail_unlock;
> -		}
> -
>  		if (cprm.limit == 1) {
>  			/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
>  			 *
> @@ -695,7 +702,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  			coredump_report_failure("|%s pipe failed", cn.corename);
>  			goto close_fail;
>  		}
> -	} else {
> +	} else if (cn.core_type == COREDUMP_FILE) {
>  		struct mnt_idmap *idmap;
>  		struct inode *inode;
>  		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
> @@ -823,13 +830,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		file_end_write(cprm.file);
>  		free_vma_snapshot(&cprm);
>  	}
> -	if (ispipe && core_pipe_limit)
> +	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
>  		wait_for_dump_helpers(cprm.file);
>  close_fail:
>  	if (cprm.file)
>  		filp_close(cprm.file, NULL);
>  fail_dropcount:
> -	if (ispipe)
> +	if (cn.core_type == COREDUMP_PIPE)
>  		atomic_dec(&core_dump_count);
>  fail_unlock:
>  	kfree(argv);
> 
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-14 22:03 ` [PATCH v7 4/9] coredump: add coredump socket Christian Brauner
@ 2025-05-15 13:47   ` Alexander Mikhalitsyn
  2025-05-16  8:30     ` Christian Brauner
  2025-05-15 17:00   ` Kuniyuki Iwashima
  2025-05-15 20:54   ` Jann Horn
  2 siblings, 1 reply; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 13:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> Coredumping currently supports two modes:
>
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
>
> For simplicity I'm mostly ignoring (1). There are probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
>
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
>
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
>
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
>
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There are various conceptual consequences of this
> (non-exhaustive list):
>
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (whether or not this is a sane assumption is irrelevant).
>
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and runs as a weird hybrid
>   upcall, which is difficult for userspace to control correctly.
>
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege-dropping exercises in
>   userspace to make this safe.
>
> - A new usermode helper has to be spawned for each crashing process.
>
> This series adds a new mode:
>
> (3) Dumping into an AF_UNIX socket.
>
> Userspace can set /proc/sys/kernel/core_pattern to:
>
>         @/path/to/coredump.socket
>
> The "@" at the beginning indicates to the kernel that an AF_UNIX
> coredump socket will be used to process coredumps.
>
> The coredump socket must be located in the initial mount namespace.
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket.
>
> - The coredump server uses SO_PEERPIDFD to get a stable handle on the
>   connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
>
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind its back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
>
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via AF_UNIX sockets directly.
>
> - The pidfd for the crashing task will grow new information about how
>   the task coredumps.
>
> - The coredump server should mark itself as non-dumpable.
>
> - A container coredump server in a separate network namespace can simply
>   bind to another well-known address and systemd-coredump forwards
>   coredumps to the container.
>
> - Coredumps could in the future also be handled via per-user/session
>   coredump servers that run only with that user's privileges.
>
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the user's
>   own coredump handler, which runs with the user's privileges only
>   (it must of course take care not to forward crashing suid
>   binaries).
>
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super-privileged coredumping helpers
> that have caused and continue to cause significant CVEs.
>
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can e.g., just keep a worker pool.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
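
To make the userspace side of the description concrete, here is a rough
sketch of a coredump server as I understand the commit message. It is
not taken from systemd-coredump or from this series. Assumptions: the
function names are made up, the fallback value 77 for SO_PEERPIDFD is
the asm-generic one and is only used when the libc headers lack the
definition, and the server reads until EOF, which assumes
core_pipe_limit is 0 so that the kernel simply closes its end after
writing. With core_pipe_limit set the quoted code waits in a read, so a
real server would need another way to detect the end of the dump.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77 /* asm-generic value; assumption for old headers */
#endif

/* Bind and listen on the coredump socket; returns the fd or -1. */
static int coredump_server_listen(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	size_t len;
	int fd;

	assert(path != NULL);
	len = strlen(path);
	if (len >= sizeof(addr.sun_path)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	memcpy(addr.sun_path, path, len + 1);

	/* The server should not dump core into its own socket. */
	if (prctl(PR_SET_DUMPABLE, 0) < 0)
		return -1;

	fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
	if (fd < 0)
		return -1;

	unlink(path); /* sketch convenience: drop a stale socket file */
	if (bind(fd, (struct sockaddr *)&addr,
		 offsetof(struct sockaddr_un, sun_path) + len + 1) < 0 ||
	    listen(fd, 16) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Accept one connection, fetch a pidfd for the peer (set to -1 if the
 * kernel does not support SO_PEERPIDFD) and copy the dump to out_fd.
 * Returns the number of bytes copied or -1 on error.
 */
static ssize_t coredump_server_handle_one(int listen_fd, int out_fd,
					  int *pidfd)
{
	char buf[65536];
	socklen_t optlen = sizeof(*pidfd);
	ssize_t n, total = 0;
	int conn;

	conn = accept(listen_fd, NULL, NULL);
	if (conn < 0)
		return -1;

	if (getsockopt(conn, SOL_SOCKET, SO_PEERPIDFD, pidfd, &optlen) < 0)
		*pidfd = -1;

	while ((n = read(conn, buf, sizeof(buf))) > 0) {
		if (write_all(out_fd, buf, (size_t)n) < 0) {
			n = -1;
			break;
		}
		total += n;
	}
	close(conn);
	return n < 0 ? -1 : total;
}
```

A real server would hand each accepted connection to a worker and would
use the pidfd to check that /proc/<pid> still refers to the crashing
task before reading from it.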

> ---
>  fs/coredump.c       | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  include/linux/net.h |   1 +
>  net/unix/af_unix.c  |  53 ++++++++++++++++-----
>  3 files changed, 166 insertions(+), 21 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index a70929c3585b..e1256ebb89c1 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -44,7 +44,11 @@
>  #include <linux/sysctl.h>
>  #include <linux/elf.h>
>  #include <linux/pidfs.h>
> +#include <linux/net.h>
> +#include <linux/socket.h>
> +#include <net/net_namespace.h>
>  #include <uapi/linux/pidfd.h>
> +#include <uapi/linux/un.h>
>
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
>  enum coredump_type_t {
>         COREDUMP_FILE = 1,
>         COREDUMP_PIPE = 2,
> +       COREDUMP_SOCK = 3,
>  };
>
>  struct core_name {
> @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>         cn->corename = NULL;
>         if (*pat_ptr == '|')
>                 cn->core_type = COREDUMP_PIPE;
> +       else if (*pat_ptr == '@')
> +               cn->core_type = COREDUMP_SOCK;
>         else
>                 cn->core_type = COREDUMP_FILE;
>         if (expand_corename(cn, core_name_size))
>                 return -ENOMEM;
>         cn->corename[0] = '\0';
>
> -       if (cn->core_type == COREDUMP_PIPE) {
> +       switch (cn->core_type) {
> +       case COREDUMP_PIPE: {
>                 int argvs = sizeof(core_pattern) / 2;
>                 (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
>                 if (!(*argv))
> @@ -247,6 +255,33 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>                 ++pat_ptr;
>                 if (!(*pat_ptr))
>                         return -ENOMEM;
> +               break;
> +       }
> +       case COREDUMP_SOCK: {
> +               /* skip the @ */
> +               pat_ptr++;

nit: I would add
if (!(*pat_ptr))
   return -ENOMEM;
here, as we do for the COREDUMP_PIPE case above, so that an empty
pattern cannot cause a crash even if cn_printf() changes in the future.
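
To spell out the validation order, here is a small userspace sketch of
the checks for the "@" pattern, including the extra empty-path test
suggested above. It is an illustration only: coredump_socket_addr() is
a hypothetical helper, and it returns -EINVAL for the empty case where
the suggestion uses -ENOMEM. As far as I can tell, the existing
absolute-path check already rejects an empty path, so the extra test is
purely defensive. The address length is computed the same way as in
the quoted do_coredump() hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Turn a core_pattern of the form "@/abs/path" into a sockaddr_un.
 * Returns 0 on success or a negative errno.
 */
static int coredump_socket_addr(const char *pattern,
				struct sockaddr_un *addr,
				socklen_t *addr_len)
{
	const char *path;
	size_t len;

	assert(pattern != NULL && addr != NULL && addr_len != NULL);

	if (pattern[0] != '@')
		return -EINVAL;

	/* Skip the '@'. */
	path = pattern + 1;

	/* Explicit empty check, as suggested in the review. */
	if (path[0] == '\0')
		return -EINVAL;

	/* Require absolute paths. */
	if (path[0] != '/')
		return -EINVAL;

	len = strlen(path);
	if (len >= sizeof(addr->sun_path))
		return -E2BIG;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);

	/* Header, path bytes and the terminating NUL. */
	*addr_len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) +
				len + 1);
	return 0;
}
```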

> +               err = cn_printf(cn, "%s", pat_ptr);
> +               if (err)
> +                       return err;
> +
> +               /* Require absolute paths. */
> +               if (cn->corename[0] != '/')
> +                       return -EINVAL;
> +
> +               /*
> +                * Currently no need to parse any other options.
> +                * Relevant information can be retrieved from the peer
> +                * pidfd retrievable via SO_PEERPIDFD by the receiver or
> +                * via /proc/<pid>, using the SO_PEERPIDFD to guard
> +                * against pid recycling when opening /proc/<pid>.
> +                */
> +               return 0;
> +       }
> +       case COREDUMP_FILE:
> +               break;
> +       default:
> +               WARN_ON_ONCE(true);
> +               return -EINVAL;
>         }
>
>         /* Repeat as long as we have more pattern to process and more output
> @@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>          * If core_pattern does not include a %p (as is the default)
>          * and core_uses_pid is set, then .%pid will be appended to
>          * the filename. Do not do this for piped commands. */
> -       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> -               err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> -               if (err)
> -                       return err;
> +       if (!pid_in_pattern && core_uses_pid) {
> +               switch (cn->core_type) {
> +               case COREDUMP_FILE:
> +                       return cn_printf(cn, ".%d", task_tgid_vnr(current));
> +               case COREDUMP_PIPE:
> +                       break;
> +               case COREDUMP_SOCK:
> +                       break;
> +               default:
> +                       WARN_ON_ONCE(true);
> +                       return -EINVAL;
> +               }
>         }
> +
>         return 0;
>  }
>
> @@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 }
>                 break;
>         }
> +       case COREDUMP_SOCK: {
> +#ifdef CONFIG_UNIX
> +               struct file *file __free(fput) = NULL;
> +               struct sockaddr_un addr = {
> +                       .sun_family = AF_UNIX,
> +               };
> +               ssize_t addr_len;
> +               struct socket *socket;
> +
> +               retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));
> +               if (retval < 0)
> +                       goto close_fail;
> +               addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
> +
> +               /*
> +                * It is possible that the userspace process which is
> +                * supposed to handle the coredump and is listening on
> +                * the AF_UNIX socket coredumps. Userspace should just
> +                * mark itself non dumpable.
> +                */
> +
> +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> +               if (retval < 0)
> +                       goto close_fail;
> +
> +               file = sock_alloc_file(socket, 0, NULL);
> +               if (IS_ERR(file)) {
> +                       sock_release(socket);
> +                       goto close_fail;
> +               }
> +
> +               retval = kernel_connect(socket, (struct sockaddr *)(&addr),
> +                                       addr_len, O_NONBLOCK | SOCK_COREDUMP);
> +               if (retval) {
> +                       if (retval == -EAGAIN)
> +                               coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
> +                       else
> +                               coredump_report_failure("Coredump socket connection %s failed %d", addr.sun_path, retval);
> +                       goto close_fail;
> +               }
> +
> +               cprm.limit = RLIM_INFINITY;
> +               cprm.file = no_free_ptr(file);
> +#else
> +               coredump_report_failure("Core dump socket support %s disabled", cn.corename);
> +               goto close_fail;
> +#endif
> +               break;
> +       }
>         default:
>                 WARN_ON_ONCE(true);
>                 goto close_fail;
> @@ -838,8 +931,32 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 file_end_write(cprm.file);
>                 free_vma_snapshot(&cprm);
>         }
> -       if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> -               wait_for_dump_helpers(cprm.file);
> +
> +       /*
> +        * When core_pipe_limit is set we wait for the coredump server
> +        * or usermodehelper to finish before exiting so it can e.g.,
> +        * inspect /proc/<pid>.
> +        */
> +       if (core_pipe_limit) {
> +               switch (cn.core_type) {
> +               case COREDUMP_PIPE:
> +                       wait_for_dump_helpers(cprm.file);
> +                       break;
> +               case COREDUMP_SOCK: {
> +                       /*
> +                        * We use a simple read to wait for the coredump
> +                        * processing to finish. Either the socket is
> +                        * closed or we get sent unexpected data. In
> +                        * both cases, we're done.
> +                        */
> +                       __kernel_read(cprm.file, &(char){ 0 }, 1, NULL);
> +                       break;
> +               }
> +               default:
> +                       break;
> +               }
> +       }
> +
>  close_fail:
>         if (cprm.file)
>                 filp_close(cprm.file, NULL);
> @@ -1069,7 +1186,7 @@ EXPORT_SYMBOL(dump_align);
>  void validate_coredump_safety(void)
>  {
>         if (suid_dumpable == SUID_DUMP_ROOT &&
> -           core_pattern[0] != '/' && core_pattern[0] != '|') {
> +           core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
>
>                 coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
>                         "pipe handler or fully qualified core dump path required. "
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 0ff950eecc6b..139c85d0f2ea 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -81,6 +81,7 @@ enum sock_type {
>  #ifndef SOCK_NONBLOCK
>  #define SOCK_NONBLOCK  O_NONBLOCK
>  #endif
> +#define SOCK_COREDUMP  O_NOCTTY
>
>  #endif /* ARCH_HAS_SOCKET_TYPES */
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 472f8aa9ea15..a9d1c9ba2961 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -85,10 +85,13 @@
>  #include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/fs.h>
> +#include <linux/fs_struct.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
>  #include <linux/mount.h>
>  #include <linux/namei.h>
> +#include <linux/net.h>
> +#include <linux/pidfs.h>
>  #include <linux/poll.h>
>  #include <linux/proc_fs.h>
>  #include <linux/sched/signal.h>
> @@ -100,7 +103,6 @@
>  #include <linux/splice.h>
>  #include <linux/string.h>
>  #include <linux/uaccess.h>
> -#include <linux/pidfs.h>
>  #include <net/af_unix.h>
>  #include <net/net_namespace.h>
>  #include <net/scm.h>
> @@ -1146,7 +1148,7 @@ static int unix_release(struct socket *sock)
>  }
>
>  static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
> -                                 int type)
> +                                 int type, unsigned int flags)
>  {
>         struct inode *inode;
>         struct path path;
> @@ -1154,13 +1156,38 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
>         int err;
>
>         unix_mkname_bsd(sunaddr, addr_len);
> -       err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
> -       if (err)
> -               goto fail;
>
> -       err = path_permission(&path, MAY_WRITE);
> -       if (err)
> -               goto path_put;
> +       if (flags & SOCK_COREDUMP) {
> +               struct path root;
> +               struct cred *kcred;
> +               const struct cred *cred;
> +
> +               err = -ENOMEM;
> +               kcred = prepare_kernel_cred(&init_task);
> +               if (!kcred)
> +                       goto fail;
> +
> +               task_lock(&init_task);
> +               get_fs_root(init_task.fs, &root);
> +               task_unlock(&init_task);
> +
> +               cred = override_creds(kcred);
> +               err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
> +                                     LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
> +                                     LOOKUP_NO_MAGICLINKS, &path);
> +               put_cred(revert_creds(cred));
> +               path_put(&root);
> +               if (err)
> +                       goto fail;
> +       } else {
> +               err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
> +               if (err)
> +                       goto fail;
> +
> +               err = path_permission(&path, MAY_WRITE);
> +               if (err)
> +                       goto path_put;
> +       }
>
>         err = -ECONNREFUSED;
>         inode = d_backing_inode(path.dentry);
> @@ -1210,12 +1237,12 @@ static struct sock *unix_find_abstract(struct net *net,
>
>  static struct sock *unix_find_other(struct net *net,
>                                     struct sockaddr_un *sunaddr,
> -                                   int addr_len, int type)
> +                                   int addr_len, int type, int flags)
>  {
>         struct sock *sk;
>
>         if (sunaddr->sun_path[0])
> -               sk = unix_find_bsd(sunaddr, addr_len, type);
> +               sk = unix_find_bsd(sunaddr, addr_len, type, flags);
>         else
>                 sk = unix_find_abstract(net, sunaddr, addr_len, type);
>
> @@ -1473,7 +1500,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
>                 }
>
>  restart:
> -               other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type);
> +               other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
>                 if (IS_ERR(other)) {
>                         err = PTR_ERR(other);
>                         goto out;
> @@ -1620,7 +1647,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
>
>  restart:
>         /*  Find listening sock. */
> -       other = unix_find_other(net, sunaddr, addr_len, sk->sk_type);
> +       other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
>         if (IS_ERR(other)) {
>                 err = PTR_ERR(other);
>                 goto out_free_skb;
> @@ -2089,7 +2116,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>         if (msg->msg_namelen) {
>  lookup:
>                 other = unix_find_other(sock_net(sk), msg->msg_name,
> -                                       msg->msg_namelen, sk->sk_type);
> +                                       msg->msg_namelen, sk->sk_type, 0);
>                 if (IS_ERR(other)) {
>                         err = PTR_ERR(other);
>                         goto out_free;
>
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 6/9] coredump: show supported coredump modes
  2025-05-14 22:03 ` [PATCH v7 6/9] coredump: show supported coredump modes Christian Brauner
@ 2025-05-15 13:56   ` Alexander Mikhalitsyn
  2025-05-15 20:56   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 13:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> Allow userspace to discover what coredump modes are supported.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>

> ---
>  fs/coredump.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index bfc4a32f737c..6ee38e3da108 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1240,6 +1240,12 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
>
>  static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
>  static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
> +static char core_modes[] = {
> +       "file\npipe"
> +#ifdef CONFIG_UNIX
> +       "\nsocket"
> +#endif
> +};
>
>  static const struct ctl_table coredump_sysctls[] = {
>         {
> @@ -1283,6 +1289,13 @@ static const struct ctl_table coredump_sysctls[] = {
>                 .extra1         = SYSCTL_ZERO,
>                 .extra2         = SYSCTL_ONE,
>         },
> +       {
> +               .procname       = "core_modes",
> +               .data           = core_modes,
> +               .maxlen         = sizeof(core_modes) - 1,
> +               .mode           = 0444,
> +               .proc_handler   = proc_dostring,
> +       },
>  };
>
>  static int __init init_fs_coredump_sysctls(void)
>
> --
> 2.47.2
>
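
For illustration only: a minimal userspace sketch of how a coredump
handler might test for socket support, assuming the sysctl lands as
proposed and /proc/sys/kernel/core_modes returns a newline-separated
list such as "file\npipe\nsocket". The helper name core_modes_has() is
hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if @mode appears as a complete, newline-separated entry
 * in @modes (e.g. the text read from /proc/sys/kernel/core_modes).
 * Hypothetical helper for illustration; not part of the patch.
 */
bool core_modes_has(const char *modes, const char *mode)
{
	size_t want = strlen(mode);

	while (*modes) {
		/* Length of the current entry, up to newline or end. */
		size_t len = strcspn(modes, "\n");

		if (len == want && memcmp(modes, mode, len) == 0)
			return true;

		modes += len;
		if (*modes == '\n')
			modes++;
	}
	return false;
}
```

A caller would read the sysctl file into a NUL-terminated buffer and
pass it in. Entries are compared whole so that a prefix such as "sock"
does not match "socket".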

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 7/9] coredump: validate socket name as it is written
  2025-05-14 22:03 ` [PATCH v7 7/9] coredump: validate socket name as it is written Christian Brauner
@ 2025-05-15 14:03   ` Alexander Mikhalitsyn
  2025-05-15 20:56   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 14:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> In contrast to other parameters written into
> /proc/sys/kernel/core_pattern that never fail we can validate enabling
> the new AF_UNIX support. This is obviously racy as hell but it's always
> been that way.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>

> ---
>  fs/coredump.c | 37 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 6ee38e3da108..d4ff08ef03e5 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1228,13 +1228,44 @@ void validate_coredump_safety(void)
>         }
>  }
>
> +static inline bool check_coredump_socket(void)
> +{
> +       if (core_pattern[0] != '@')
> +               return true;
> +
> +       /*
> +        * Coredump socket must be located in the initial mount
> +        * namespace. Don't give the impression that anything else is
> +        * supported right now.
> +        */
> +       if (current->nsproxy->mnt_ns != init_task.nsproxy->mnt_ns)
> +               return false;
> +
> +       /* Must be an absolute path. */
> +       if (*(core_pattern + 1) != '/')
> +               return false;
> +
> +       return true;
> +}
> +
>  static int proc_dostring_coredump(const struct ctl_table *table, int write,
>                   void *buffer, size_t *lenp, loff_t *ppos)
>  {
> -       int error = proc_dostring(table, write, buffer, lenp, ppos);
> +       int error;
> +       ssize_t retval;
> +       char old_core_pattern[CORENAME_MAX_SIZE];
> +
> +       retval = strscpy(old_core_pattern, core_pattern, CORENAME_MAX_SIZE);
> +
> +       error = proc_dostring(table, write, buffer, lenp, ppos);
> +       if (error)
> +               return error;
> +       if (!check_coredump_socket()) {
> +               strscpy(core_pattern, old_core_pattern, retval + 1);
> +               return -EINVAL;
> +       }
>
> -       if (!error)
> -               validate_coredump_safety();
> +       validate_coredump_safety();
>         return error;
>  }
>
>
> --
> 2.47.2
>
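
For illustration only: a userspace sketch that mirrors the string part
of check_coredump_socket() above, so a tool could reject a bad pattern
before writing it to core_pattern. The function name is hypothetical.
The mount namespace check from the patch is deliberately not
reproduced here, since it only makes sense inside the kernel.

```c
#include <stdbool.h>

/*
 * Userspace mirror of the string checks in check_coredump_socket().
 * Hypothetical helper; the initial-mount-namespace check is omitted.
 */
bool coredump_socket_pattern_ok(const char *pattern)
{
	/* Patterns that do not select the socket mode are not checked. */
	if (pattern[0] != '@')
		return true;

	/* A socket pattern must be followed by an absolute path. */
	return pattern[1] == '/';
}
```

With this, "@/tmp/coredump.socket" is accepted, while "@coredump.socket"
and a bare "@" are rejected, matching the -EINVAL cases in the patch.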

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-14 22:03 ` [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
@ 2025-05-15 14:08   ` Alexander Mikhalitsyn
  2025-05-15 20:56   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 14:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> Extend the PIDFD_GET_INFO ioctl() with the new PIDFD_INFO_COREDUMP
> mask flag. This adds the fields @coredump_mask and @coredump_cookie to
> struct pidfd_info.
>
> When a task coredumps the kernel will provide the following information
> to userspace in @coredump_mask:
>
> * PIDFD_COREDUMPED is raised if the task did actually coredump.
> * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
>   undumpable).
> * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
>   doesn't need special care by the coredump server.
> * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
>   treated as sensitive and the coredump server should restrict access to
>   the generated coredump to sufficiently privileged users.
>
> If userspace uses the coredump socket to process coredumps it needs to
> be able to discern connections from the kernel from connections from
> userspace (e.g., Python generating its own coredumps and forwarding
> them to systemd). The @coredump_cookie extension uses the SO_COOKIE of
> the new connection. This allows userspace to validate that the
> connection has been made from the kernel by a crashing task:
>
>    fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC);
>    getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len);
>
>    struct pidfd_info info = {
>            .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
>    };
>
>    ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
>    /* Refuse connections that aren't from a crashing task. */
>    if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED))
>            close(fd_coredump);
>
>    /*
>     * Make sure that the coredump cookie matches the connection cookie.
>     * If they don't it's not the coredump connection from the kernel.
>     * We'll get another connection request in a bit.
>     */
>    getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len);
>    if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie))
>            close(fd_coredump);
>
> The kernel guarantees that by the time the connection is made all
> PIDFD_INFO_COREDUMP info is available.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>

> ---
>  fs/coredump.c              | 34 ++++++++++++++++++++
>  fs/pidfs.c                 | 79 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pidfs.h      | 10 ++++++
>  include/uapi/linux/pidfd.h | 22 +++++++++++++
>  net/unix/af_unix.c         |  7 ++++
>  5 files changed, 152 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e1256ebb89c1..bfc4a32f737c 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -46,7 +46,9 @@
>  #include <linux/pidfs.h>
>  #include <linux/net.h>
>  #include <linux/socket.h>
> +#include <net/af_unix.h>
>  #include <net/net_namespace.h>
> +#include <net/sock.h>
>  #include <uapi/linux/pidfd.h>
>  #include <uapi/linux/un.h>
>
> @@ -598,6 +600,8 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
>                 if (IS_ERR(pidfs_file))
>                         return PTR_ERR(pidfs_file);
>
> +               pidfs_coredump(cp);
> +
>                 /*
>                  * Usermode helpers are childen of either
>                  * system_unbound_wq or of kthreadd. So we know that
> @@ -876,8 +880,34 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                         goto close_fail;
>                 }
>
> +               /*
> +                * Set the thread-group leader pid which is used for the
> +                * peer credentials during connect() below. Then
> +                * immediately register it in pidfs...
> +                */
> +               cprm.pid = task_tgid(current);
> +               retval = pidfs_register_pid(cprm.pid);
> +               if (retval) {
> +                       sock_release(socket);
> +                       goto close_fail;
> +               }
> +
> +               /*
> +                * ... and set the coredump information so userspace
> +                * has it available after connect()...
> +                */
> +               pidfs_coredump(&cprm);
> +
> +               /*
> +                * ... On connect() the peer credentials are recorded
> +                * and @cprm.pid registered in pidfs...
> +                */
>                 retval = kernel_connect(socket, (struct sockaddr *)(&addr),
>                                         addr_len, O_NONBLOCK | SOCK_COREDUMP);
> +
> +               /* ... So we can safely put our pidfs reference now... */
> +               pidfs_put_pid(cprm.pid);
> +
>                 if (retval) {
>                         if (retval == -EAGAIN)
>                                 coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
> @@ -886,6 +916,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                         goto close_fail;
>                 }
>
> +               /* ... and validate that @sk_peer_pid matches @cprm.pid. */
> +               if (WARN_ON_ONCE(unix_peer(socket->sk)->sk_peer_pid != cprm.pid))
> +                       goto close_fail;
> +
>                 cprm.limit = RLIM_INFINITY;
>                 cprm.file = no_free_ptr(file);
>  #else
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 3b39e471840b..d7b9a0dd2db6 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -20,6 +20,7 @@
>  #include <linux/time_namespace.h>
>  #include <linux/utsname.h>
>  #include <net/net_namespace.h>
> +#include <linux/coredump.h>
>
>  #include "internal.h"
>  #include "mount.h"
> @@ -33,6 +34,8 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
>  struct pidfs_exit_info {
>         __u64 cgroupid;
>         __s32 exit_code;
> +       __u32 coredump_mask;
> +       __u64 coredump_cookie;
>  };
>
>  struct pidfs_inode {
> @@ -240,6 +243,22 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
>         return false;
>  }
>
> +static __u32 pidfs_coredump_mask(unsigned long mm_flags)
> +{
> +       switch (__get_dumpable(mm_flags)) {
> +       case SUID_DUMP_USER:
> +               return PIDFD_COREDUMP_USER;
> +       case SUID_DUMP_ROOT:
> +               return PIDFD_COREDUMP_ROOT;
> +       case SUID_DUMP_DISABLE:
> +               return PIDFD_COREDUMP_SKIP;
> +       default:
> +               WARN_ON_ONCE(true);
> +       }
> +
> +       return 0;
> +}
> +
>  static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
>  {
>         struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> @@ -280,6 +299,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
>                 }
>         }
>
> +       if (mask & PIDFD_INFO_COREDUMP) {
> +               kinfo.mask |= PIDFD_INFO_COREDUMP;
> +               smp_rmb();
> +               kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
> +               kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
> +       }
> +
>         task = get_pid_task(pid, PIDTYPE_PID);
>         if (!task) {
>                 /*
> @@ -296,6 +322,16 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
>         if (!c)
>                 return -ESRCH;
>
> +       if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
> +               task_lock(task);
> +               if (task->mm) {
> +                       smp_rmb();
> +                       kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
> +                       kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
> +               }
> +               task_unlock(task);
> +       }
> +
>         /* Unconditionally return identifiers and credentials, the rest only on request */
>
>         user_ns = current_user_ns();
> @@ -559,6 +595,49 @@ void pidfs_exit(struct task_struct *tsk)
>         }
>  }
>
> +#if defined(CONFIG_COREDUMP) && defined(CONFIG_UNIX)
> +void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie)
> +{
> +       struct pidfs_exit_info *exit_info;
> +       struct dentry *dentry = pid->stashed;
> +       struct inode *inode;
> +
> +       if (WARN_ON_ONCE(!dentry))
> +               return;
> +
> +       inode = d_inode(dentry);
> +       exit_info = &pidfs_i(inode)->__pei;
> +       /* Can't use smp_store_release() because of 32bit. */
> +       smp_wmb();
> +       WRITE_ONCE(exit_info->coredump_cookie, coredump_cookie);
> +}
> +#endif
> +
> +#ifdef CONFIG_COREDUMP
> +void pidfs_coredump(const struct coredump_params *cprm)
> +{
> +       struct pid *pid = cprm->pid;
> +       struct pidfs_exit_info *exit_info;
> +       struct dentry *dentry;
> +       struct inode *inode;
> +       __u32 coredump_mask = 0;
> +
> +       dentry = pid->stashed;
> +       if (WARN_ON_ONCE(!dentry))
> +               return;
> +
> +       inode = d_inode(dentry);
> +       exit_info = &pidfs_i(inode)->__pei;
> +       /* Note how we were coredumped. */
> +       coredump_mask = pidfs_coredump_mask(cprm->mm_flags);
> +       /* Note that we actually did coredump. */
> +       coredump_mask |= PIDFD_COREDUMPED;
> +       /* If coredumping is set to skip we should never end up here. */
> +       VFS_WARN_ON_ONCE(coredump_mask & PIDFD_COREDUMP_SKIP);
> +       smp_store_release(&exit_info->coredump_mask, coredump_mask);
> +}
> +#endif
> +
>  static struct vfsmount *pidfs_mnt __ro_after_init;
>
>  /*
> diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
> index 2676890c4d0d..497997bc5e34 100644
> --- a/include/linux/pidfs.h
> +++ b/include/linux/pidfs.h
> @@ -2,11 +2,21 @@
>  #ifndef _LINUX_PID_FS_H
>  #define _LINUX_PID_FS_H
>
> +struct coredump_params;
> +
>  struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
>  void __init pidfs_init(void);
>  void pidfs_add_pid(struct pid *pid);
>  void pidfs_remove_pid(struct pid *pid);
>  void pidfs_exit(struct task_struct *tsk);
> +#ifdef CONFIG_COREDUMP
> +void pidfs_coredump(const struct coredump_params *cprm);
> +#endif
> +#if defined(CONFIG_COREDUMP) && defined(CONFIG_UNIX)
> +void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie);
> +#elif defined(CONFIG_UNIX)
> +static inline void pidfs_coredump_cookie(struct pid *pid, u64 coredump_cookie) { }
> +#endif
>  extern const struct dentry_operations pidfs_dentry_operations;
>  int pidfs_register_pid(struct pid *pid);
>  void pidfs_get_pid(struct pid *pid);
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 8c1511edd0e9..69267c5ae6d0 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -25,9 +25,28 @@
>  #define PIDFD_INFO_CREDS               (1UL << 1) /* Always returned, even if not requested */
>  #define PIDFD_INFO_CGROUPID            (1UL << 2) /* Always returned if available, even if not requested */
>  #define PIDFD_INFO_EXIT                        (1UL << 3) /* Only returned if requested. */
> +#define PIDFD_INFO_COREDUMP            (1UL << 4) /* Only returned if requested. */
>
>  #define PIDFD_INFO_SIZE_VER0           64 /* sizeof first published struct */
>
> +/*
> + * Values for @coredump_mask in pidfd_info.
> + * Only valid if PIDFD_INFO_COREDUMP is set in @mask.
> + *
> + * Note, the @PIDFD_COREDUMP_ROOT flag indicates that the generated
> + * coredump should be treated as sensitive and access should only be
> + * granted to privileged users.
> + *
> + * If the coredump AF_UNIX socket is used for processing coredumps
> + * @coredump_cookie will be set to the socket SO_COOKIE of the receiver's
> + * client socket. This allows the coredump handler to detect whether an
> + * incoming coredump connection was initiated from the crashing task.
> + */
> +#define PIDFD_COREDUMPED       (1U << 0) /* Did crash and... */
> +#define PIDFD_COREDUMP_SKIP    (1U << 1) /* coredumping generation was skipped. */
> +#define PIDFD_COREDUMP_USER    (1U << 2) /* coredump was done as the user. */
> +#define PIDFD_COREDUMP_ROOT    (1U << 3) /* coredump was done as root. */
> +
>  /*
>   * The concept of process and threads in userland and the kernel is a confusing
>   * one - within the kernel every thread is a 'task' with its own individual PID,
> @@ -92,6 +111,9 @@ struct pidfd_info {
>         __u32 fsuid;
>         __u32 fsgid;
>         __s32 exit_code;
> +       __u32 coredump_mask;
> +       __u32 __spare1;
> +       __u64 coredump_cookie;
>  };
>
>  #define PIDFS_IOCTL_MAGIC 0xFF
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index a9d1c9ba2961..053d2e48e918 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -99,6 +99,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/skbuff.h>
>  #include <linux/slab.h>
> +#include <linux/sock_diag.h>
>  #include <linux/socket.h>
>  #include <linux/splice.h>
>  #include <linux/string.h>
> @@ -742,6 +743,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
>
>  struct unix_peercred {
>         struct pid *peer_pid;
> +       u64 cookie;
>         const struct cred *peer_cred;
>  };
>
> @@ -777,6 +779,8 @@ static void drop_peercred(struct unix_peercred *peercred)
>  static inline void init_peercred(struct sock *sk,
>                                  const struct unix_peercred *peercred)
>  {
> +       if (peercred->cookie)
> +               pidfs_coredump_cookie(peercred->peer_pid, peercred->cookie);
>         sk->sk_peer_pid = peercred->peer_pid;
>         sk->sk_peer_cred = peercred->peer_cred;
>  }
> @@ -1713,6 +1717,9 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
>         unix_peer(newsk)        = sk;
>         newsk->sk_state         = TCP_ESTABLISHED;
>         newsk->sk_type          = sk->sk_type;
> +       /* Prepare a new socket cookie for the receiver. */
> +       if (flags & SOCK_COREDUMP)
> +               peercred.cookie = sock_gen_cookie(newsk);
>         init_peercred(newsk, &peercred);
>         newu = unix_sk(newsk);
>         newu->listener = other;
>
> --
> 2.47.2
>
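
For illustration only: the acceptance logic from the commit message,
restated as pure functions so it can be reasoned about separately from
the accept4()/getsockopt()/ioctl() calls. The function names are
hypothetical. The flag values are copied from the uapi hunk in this
revision and are only defined here when the installed headers lack
them; treat them as assumptions tied to this version of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as proposed in this revision of include/uapi/linux/pidfd.h. */
#ifndef PIDFD_INFO_COREDUMP
#define PIDFD_INFO_COREDUMP	(1UL << 4)
#endif
#ifndef PIDFD_COREDUMPED
#define PIDFD_COREDUMPED	(1U << 0)
#endif
#ifndef PIDFD_COREDUMP_ROOT
#define PIDFD_COREDUMP_ROOT	(1U << 3)
#endif

/*
 * Decide whether an accepted connection is the kernel's coredump
 * connection for the peer task. @info_mask, @coredump_mask and
 * @coredump_cookie come from PIDFD_GET_INFO on the peer pidfd;
 * @peer_cookie comes from SO_COOKIE on the accepted socket.
 */
bool coredump_conn_from_kernel(uint64_t info_mask, uint32_t coredump_mask,
			       uint64_t coredump_cookie, uint64_t peer_cookie)
{
	/* The kernel must have filled in the coredump fields. */
	if (!(info_mask & PIDFD_INFO_COREDUMP))
		return false;

	/* The peer must actually be a crashing task. */
	if (!(coredump_mask & PIDFD_COREDUMPED))
		return false;

	/* The cookie must be set and match this very connection. */
	if (!coredump_cookie || coredump_cookie != peer_cookie)
		return false;

	return true;
}

/* True if the dump should only be readable by privileged users. */
bool coredump_is_sensitive(uint32_t coredump_mask)
{
	return (coredump_mask & PIDFD_COREDUMP_ROOT) != 0;
}
```

A server would close the accepted fd whenever
coredump_conn_from_kernel() returns false, and apply restrictive
permissions to the stored dump when coredump_is_sensitive() is true.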

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 8/9] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
  2025-05-14 22:03 ` [PATCH v7 8/9] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
@ 2025-05-15 14:35   ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 14:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> Add PIDFD_INFO_COREDUMP infrastructure so we can use it in tests.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>

> ---
>  tools/testing/selftests/pidfd/pidfd.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
>
> diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
> index 55bcf81a2b9a..887c74007086 100644
> --- a/tools/testing/selftests/pidfd/pidfd.h
> +++ b/tools/testing/selftests/pidfd/pidfd.h
> @@ -131,6 +131,26 @@
>  #define PIDFD_INFO_EXIT                        (1UL << 3) /* Always returned if available, even if not requested */
>  #endif
>
> +#ifndef PIDFD_INFO_COREDUMP
> +#define PIDFD_INFO_COREDUMP    (1UL << 4)
> +#endif
> +
> +#ifndef PIDFD_COREDUMPED
> +#define PIDFD_COREDUMPED       (1U << 0) /* Did crash and... */
> +#endif
> +
> +#ifndef PIDFD_COREDUMP_SKIP
> +#define PIDFD_COREDUMP_SKIP    (1U << 1) /* coredumping generation was skipped. */
> +#endif
> +
> +#ifndef PIDFD_COREDUMP_USER
> +#define PIDFD_COREDUMP_USER    (1U << 2) /* coredump was done as the user. */
> +#endif
> +
> +#ifndef PIDFD_COREDUMP_ROOT
> +#define PIDFD_COREDUMP_ROOT    (1U << 3) /* coredump was done as root. */
> +#endif
> +
>  #ifndef PIDFD_THREAD
>  #define PIDFD_THREAD O_EXCL
>  #endif
> @@ -150,6 +170,9 @@ struct pidfd_info {
>         __u32 fsuid;
>         __u32 fsgid;
>         __s32 exit_code;
> +       __u32 coredump_mask;
> +       __u32 __spare1;
> +       __u64 coredump_cookie;
>  };
>
>  /*
>
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 9/9] selftests/coredump: add tests for AF_UNIX coredumps
  2025-05-14 22:03 ` [PATCH v7 9/9] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
@ 2025-05-15 14:37   ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Mikhalitsyn @ 2025-05-15 14:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 00:04, Christian Brauner
<brauner@kernel.org> wrote:
>
> Add a simple test for generating coredumps via AF_UNIX sockets.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>

> ---
>  tools/testing/selftests/coredump/stackdump_test.c | 514 +++++++++++++++++++++-
>  1 file changed, 513 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
> index fe3c728cd6be..42ddcf0bdaf2 100644
> --- a/tools/testing/selftests/coredump/stackdump_test.c
> +++ b/tools/testing/selftests/coredump/stackdump_test.c
> @@ -1,14 +1,20 @@
>  // SPDX-License-Identifier: GPL-2.0
>
>  #include <fcntl.h>
> +#include <inttypes.h>
>  #include <libgen.h>
>  #include <linux/limits.h>
>  #include <pthread.h>
>  #include <string.h>
> +#include <sys/mount.h>
>  #include <sys/resource.h>
> +#include <sys/stat.h>
> +#include <sys/socket.h>
> +#include <sys/un.h>
>  #include <unistd.h>
>
>  #include "../kselftest_harness.h"
> +#include "../pidfd/pidfd.h"
>
>  #define STACKDUMP_FILE "stack_values"
>  #define STACKDUMP_SCRIPT "stackdump"
> @@ -35,6 +41,7 @@ static void crashing_child(void)
>  FIXTURE(coredump)
>  {
>         char original_core_pattern[256];
> +       pid_t pid_coredump_server;
>  };
>
>  FIXTURE_SETUP(coredump)
> @@ -44,6 +51,7 @@ FIXTURE_SETUP(coredump)
>         char *dir;
>         int ret;
>
> +       self->pid_coredump_server = -ESRCH;
>         file = fopen("/proc/sys/kernel/core_pattern", "r");
>         ASSERT_NE(NULL, file);
>
> @@ -61,10 +69,17 @@ FIXTURE_TEARDOWN(coredump)
>  {
>         const char *reason;
>         FILE *file;
> -       int ret;
> +       int ret, status;
>
>         unlink(STACKDUMP_FILE);
>
> +       if (self->pid_coredump_server > 0) {
> +               kill(self->pid_coredump_server, SIGTERM);
> +               waitpid(self->pid_coredump_server, &status, 0);
> +       }
> +       unlink("/tmp/coredump.file");
> +       unlink("/tmp/coredump.socket");
> +
>         file = fopen("/proc/sys/kernel/core_pattern", "w");
>         if (!file) {
>                 reason = "Unable to open core_pattern";
> @@ -154,4 +169,501 @@ TEST_F_TIMEOUT(coredump, stackdump, 120)
>         fclose(file);
>  }
>
> +TEST_F(coredump, socket)
> +{
> +       int fd, pidfd, ret, status;
> +       FILE *file;
> +       pid_t pid, pid_coredump_server;
> +       struct stat st;
> +       char core_file[PATH_MAX];
> +       struct pidfd_info info = {};
> +       int ipc_sockets[2];
> +       char c;
> +       const struct sockaddr_un coredump_sk = {
> +               .sun_family = AF_UNIX,
> +               .sun_path = "/tmp/coredump.socket",
> +       };
> +       size_t coredump_sk_len = offsetof(struct sockaddr_un, sun_path) +
> +                                sizeof("/tmp/coredump.socket");
> +
> +       ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
> +       ASSERT_EQ(ret, 0);
> +
> +       file = fopen("/proc/sys/kernel/core_pattern", "w");
> +       ASSERT_NE(file, NULL);
> +
> +       ret = fprintf(file, "@/tmp/coredump.socket");
> +       ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
> +       ASSERT_EQ(fclose(file), 0);
> +
> +       pid_coredump_server = fork();
> +       ASSERT_GE(pid_coredump_server, 0);
> +       if (pid_coredump_server == 0) {
> +               int fd_server, fd_coredump, fd_peer_pidfd, fd_core_file;
> +               __u64 peer_cookie;
> +               socklen_t fd_peer_pidfd_len, peer_cookie_len;
> +
> +               close(ipc_sockets[0]);
> +
> +               fd_server = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +               if (fd_server < 0)
> +                       _exit(EXIT_FAILURE);
> +
> +               ret = bind(fd_server, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to bind coredump socket\n");
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               ret = listen(fd_server, 1);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to listen on coredump socket\n");
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               close(ipc_sockets[1]);
> +
> +               fd_coredump = accept4(fd_server, NULL, NULL, SOCK_CLOEXEC);
> +               if (fd_coredump < 0) {
> +                       fprintf(stderr, "Failed to accept coredump socket connection\n");
> +                       close(fd_server);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               peer_cookie_len = sizeof(peer_cookie);
> +               ret = getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE,
> +                                &peer_cookie, &peer_cookie_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "%m - Failed to retrieve cookie for coredump socket connection\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               fd_peer_pidfd_len = sizeof(fd_peer_pidfd);
> +               ret = getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD,
> +                                &fd_peer_pidfd, &fd_peer_pidfd_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "%m - Failed to retrieve peer pidfd for coredump socket connection\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               memset(&info, 0, sizeof(info));
> +               info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
> +               ret = ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to retrieve pidfd info from peer pidfd for coredump socket connection\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (!(info.mask & PIDFD_INFO_COREDUMP)) {
> +                       fprintf(stderr, "Missing coredump information from coredumping task\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (!(info.coredump_mask & PIDFD_COREDUMPED)) {
> +                       fprintf(stderr, "Received connection from non-coredumping task\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (!info.coredump_cookie) {
> +                       fprintf(stderr, "Missing coredump cookie\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (info.coredump_cookie != peer_cookie) {
> +                       fprintf(stderr, "Mismatching coredump cookies\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               fd_core_file = creat("/tmp/coredump.file", 0644);
> +               if (fd_core_file < 0) {
> +                       fprintf(stderr, "Failed to create coredump file\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               for (;;) {
> +                       char buffer[4096];
> +                       ssize_t bytes_read, bytes_write;
> +
> +                       bytes_read = read(fd_coredump, buffer, sizeof(buffer));
> +                       if (bytes_read < 0) {
> +                               close(fd_coredump);
> +                               close(fd_server);
> +                               close(fd_peer_pidfd);
> +                               close(fd_core_file);
> +                               _exit(EXIT_FAILURE);
> +                       }
> +
> +                       if (bytes_read == 0)
> +                               break;
> +
> +                       bytes_write = write(fd_core_file, buffer, bytes_read);
> +                       if (bytes_read != bytes_write) {
> +                               close(fd_coredump);
> +                               close(fd_server);
> +                               close(fd_peer_pidfd);
> +                               close(fd_core_file);
> +                               _exit(EXIT_FAILURE);
> +                       }
> +               }
> +
> +               close(fd_coredump);
> +               close(fd_server);
> +               close(fd_peer_pidfd);
> +               close(fd_core_file);
> +               _exit(EXIT_SUCCESS);
> +       }
> +       self->pid_coredump_server = pid_coredump_server;
> +
> +       EXPECT_EQ(close(ipc_sockets[1]), 0);
> +       ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
> +       EXPECT_EQ(close(ipc_sockets[0]), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0)
> +               crashing_child();
> +
> +       pidfd = sys_pidfd_open(pid, 0);
> +       ASSERT_GE(pidfd, 0);
> +
> +       waitpid(pid, &status, 0);
> +       ASSERT_TRUE(WIFSIGNALED(status));
> +       ASSERT_TRUE(WCOREDUMP(status));
> +
> +       info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
> +       ASSERT_EQ(ioctl(pidfd, PIDFD_GET_INFO, &info), 0);
> +       ASSERT_GT((info.mask & PIDFD_INFO_COREDUMP), 0);
> +       ASSERT_GT((info.coredump_mask & PIDFD_COREDUMPED), 0);
> +
> +       waitpid(pid_coredump_server, &status, 0);
> +       self->pid_coredump_server = -ESRCH;
> +       ASSERT_TRUE(WIFEXITED(status));
> +       ASSERT_EQ(WEXITSTATUS(status), 0);
> +
> +       ASSERT_EQ(stat("/tmp/coredump.file", &st), 0);
> +       ASSERT_GT(st.st_size, 0);
> +       /*
> +        * We should somehow validate the produced core file.
> +        * For now just allow for visual inspection
> +        */
> +       system("file /tmp/coredump.file");
> +}
> +
> +TEST_F(coredump, socket_detect_userspace_client)
> +{
> +       int fd, pidfd, ret, status;
> +       FILE *file;
> +       pid_t pid, pid_coredump_server;
> +       struct stat st;
> +       char core_file[PATH_MAX];
> +       struct pidfd_info info = {};
> +       int ipc_sockets[2];
> +       char c;
> +       const struct sockaddr_un coredump_sk = {
> +               .sun_family = AF_UNIX,
> +               .sun_path = "/tmp/coredump.socket",
> +       };
> +       size_t coredump_sk_len = offsetof(struct sockaddr_un, sun_path) +
> +                                sizeof("/tmp/coredump.socket");
> +
> +       file = fopen("/proc/sys/kernel/core_pattern", "w");
> +       ASSERT_NE(file, NULL);
> +
> +       ret = fprintf(file, "@/tmp/coredump.socket");
> +       ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
> +       ASSERT_EQ(fclose(file), 0);
> +
> +       ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
> +       ASSERT_EQ(ret, 0);
> +
> +       pid_coredump_server = fork();
> +       ASSERT_GE(pid_coredump_server, 0);
> +       if (pid_coredump_server == 0) {
> +               int fd_server, fd_coredump, fd_peer_pidfd, fd_core_file;
> +               __u64 peer_cookie;
> +               socklen_t fd_peer_pidfd_len, peer_cookie_len;
> +
> +               close(ipc_sockets[0]);
> +
> +               fd_server = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +               if (fd_server < 0)
> +                       _exit(EXIT_FAILURE);
> +
> +               ret = bind(fd_server, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to bind coredump socket\n");
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               ret = listen(fd_server, 1);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to listen on coredump socket\n");
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               close(ipc_sockets[1]);
> +
> +               fd_coredump = accept4(fd_server, NULL, NULL, SOCK_CLOEXEC);
> +               if (fd_coredump < 0) {
> +                       fprintf(stderr, "Failed to accept coredump socket connection\n");
> +                       close(fd_server);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               peer_cookie_len = sizeof(peer_cookie);
> +               ret = getsockopt(fd_coredump, SOL_SOCKET, SO_COOKIE,
> +                                &peer_cookie, &peer_cookie_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "%m - Failed to retrieve cookie for coredump socket connection\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               fd_peer_pidfd_len = sizeof(fd_peer_pidfd);
> +               ret = getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD,
> +                                &fd_peer_pidfd, &fd_peer_pidfd_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "%m - Failed to retrieve peer pidfd for coredump socket connection\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               memset(&info, 0, sizeof(info));
> +               info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
> +               ret = ioctl(fd_peer_pidfd, PIDFD_GET_INFO, &info);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to retrieve pidfd info from peer pidfd for coredump socket connection\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (!(info.mask & PIDFD_INFO_COREDUMP)) {
> +                       fprintf(stderr, "Missing coredump information from coredumping task\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (info.coredump_mask & PIDFD_COREDUMPED) {
> +                       fprintf(stderr, "Received unexpected connection from coredumping task\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (info.coredump_cookie) {
> +                       fprintf(stderr, "Received unexpected coredump cookie\n");
> +                       close(fd_coredump);
> +                       close(fd_server);
> +                       close(fd_peer_pidfd);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               close(fd_coredump);
> +               close(fd_server);
> +               close(fd_peer_pidfd);
> +               close(fd_core_file);
> +               _exit(EXIT_SUCCESS);
> +       }
> +       self->pid_coredump_server = pid_coredump_server;
> +
> +       EXPECT_EQ(close(ipc_sockets[1]), 0);
> +       ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
> +       EXPECT_EQ(close(ipc_sockets[0]), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0) {
> +               int fd_socket;
> +               ssize_t ret;
> +
> +               fd_socket = socket(AF_UNIX, SOCK_STREAM, 0);
> +               if (fd_socket < 0)
> +                       _exit(EXIT_FAILURE);
> +
> +
> +               ret = connect(fd_socket, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
> +               if (ret < 0)
> +                       _exit(EXIT_FAILURE);
> +
> +               (void *)write(fd_socket, &(char){ 0 }, 1);
> +               close(fd_socket);
> +               _exit(EXIT_SUCCESS);
> +       }
> +
> +       pidfd = sys_pidfd_open(pid, 0);
> +       ASSERT_GE(pidfd, 0);
> +
> +       waitpid(pid, &status, 0);
> +       ASSERT_TRUE(WIFEXITED(status));
> +       ASSERT_EQ(WEXITSTATUS(status), 0);
> +
> +       info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP;
> +       ASSERT_EQ(ioctl(pidfd, PIDFD_GET_INFO, &info), 0);
> +       ASSERT_GT((info.mask & PIDFD_INFO_COREDUMP), 0);
> +       ASSERT_EQ((info.coredump_mask & PIDFD_COREDUMPED), 0);
> +
> +       waitpid(pid_coredump_server, &status, 0);
> +       self->pid_coredump_server = -ESRCH;
> +       ASSERT_TRUE(WIFEXITED(status));
> +       ASSERT_EQ(WEXITSTATUS(status), 0);
> +
> +       ASSERT_NE(stat("/tmp/coredump.file", &st), 0);
> +       ASSERT_EQ(errno, ENOENT);
> +}
> +
> +TEST_F(coredump, socket_enoent)
> +{
> +       int pidfd, ret, status;
> +       FILE *file;
> +       pid_t pid;
> +       char core_file[PATH_MAX];
> +
> +       file = fopen("/proc/sys/kernel/core_pattern", "w");
> +       ASSERT_NE(file, NULL);
> +
> +       ret = fprintf(file, "@/tmp/coredump.socket");
> +       ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
> +       ASSERT_EQ(fclose(file), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0)
> +               crashing_child();
> +
> +       pidfd = sys_pidfd_open(pid, 0);
> +       ASSERT_GE(pidfd, 0);
> +
> +       waitpid(pid, &status, 0);
> +       ASSERT_TRUE(WIFSIGNALED(status));
> +       ASSERT_FALSE(WCOREDUMP(status));
> +}
> +
> +TEST_F(coredump, socket_no_listener)
> +{
> +       int pidfd, ret, status;
> +       FILE *file;
> +       pid_t pid, pid_coredump_server;
> +       int ipc_sockets[2];
> +       char c;
> +       const struct sockaddr_un coredump_sk = {
> +               .sun_family = AF_UNIX,
> +               .sun_path = "/tmp/coredump.socket",
> +       };
> +       size_t coredump_sk_len = offsetof(struct sockaddr_un, sun_path) +
> +                                sizeof("/tmp/coredump.socket");
> +
> +       ret = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc_sockets);
> +       ASSERT_EQ(ret, 0);
> +
> +       file = fopen("/proc/sys/kernel/core_pattern", "w");
> +       ASSERT_NE(file, NULL);
> +
> +       ret = fprintf(file, "@/tmp/coredump.socket");
> +       ASSERT_EQ(ret, strlen("@/tmp/coredump.socket"));
> +       ASSERT_EQ(fclose(file), 0);
> +
> +       pid_coredump_server = fork();
> +       ASSERT_GE(pid_coredump_server, 0);
> +       if (pid_coredump_server == 0) {
> +               int fd_server, fd_coredump, fd_peer_pidfd, fd_core_file;
> +               __u64 peer_cookie;
> +               socklen_t fd_peer_pidfd_len, peer_cookie_len;
> +
> +               close(ipc_sockets[0]);
> +
> +               fd_server = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +               if (fd_server < 0)
> +                       _exit(EXIT_FAILURE);
> +
> +               ret = bind(fd_server, (const struct sockaddr *)&coredump_sk, coredump_sk_len);
> +               if (ret < 0) {
> +                       fprintf(stderr, "Failed to bind coredump socket\n");
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               if (write_nointr(ipc_sockets[1], "1", 1) < 0) {
> +                       close(fd_server);
> +                       close(ipc_sockets[1]);
> +                       _exit(EXIT_FAILURE);
> +               }
> +
> +               close(fd_server);
> +               close(ipc_sockets[1]);
> +               _exit(EXIT_SUCCESS);
> +       }
> +       self->pid_coredump_server = pid_coredump_server;
> +
> +       EXPECT_EQ(close(ipc_sockets[1]), 0);
> +       ASSERT_EQ(read_nointr(ipc_sockets[0], &c, 1), 1);
> +       EXPECT_EQ(close(ipc_sockets[0]), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0)
> +               crashing_child();
> +
> +       pidfd = sys_pidfd_open(pid, 0);
> +       ASSERT_GE(pidfd, 0);
> +
> +       waitpid(pid, &status, 0);
> +       ASSERT_TRUE(WIFSIGNALED(status));
> +       ASSERT_FALSE(WCOREDUMP(status));
> +
> +       waitpid(pid_coredump_server, &status, 0);
> +       self->pid_coredump_server = -ESRCH;
> +       ASSERT_TRUE(WIFEXITED(status));
> +       ASSERT_EQ(WEXITSTATUS(status), 0);
> +}
> +
>  TEST_HARNESS_MAIN
>
> --
> 2.47.2
>


* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-14 22:03 ` [PATCH v7 4/9] coredump: add coredump socket Christian Brauner
  2025-05-15 13:47   ` Alexander Mikhalitsyn
@ 2025-05-15 17:00   ` Kuniyuki Iwashima
  2025-05-15 20:52     ` Jann Horn
  2025-05-16 10:14     ` Christian Brauner
  2025-05-15 20:54   ` Jann Horn
  2 siblings, 2 replies; 43+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-15 17:00 UTC (permalink / raw)
  To: brauner
  Cc: alexander, bluca, daan.j.demeyer, daniel, davem, david, edumazet,
	horms, jack, jannh, kuba, kuniyu, lennart, linux-fsdevel,
	linux-kernel, linux-security-module, me, netdev, oleg, pabeni,
	viro, zbyszek

From: Christian Brauner <brauner@kernel.org>
Date: Thu, 15 May 2025 00:03:37 +0200
> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is a weird hybrid
>   upcall, which is difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping exercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an AF_UNIX socket.
> 
> Userspace can set /proc/sys/kernel/core_pattern to:
> 
>         @/path/to/coredump.socket
> 
> The "@" at the beginning indicates to the kernel that an AF_UNIX
> coredump socket will be used to process coredumps.
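
For readers following along, the prefix handling described here can be
sketched in userspace roughly as below. This is only an illustration that
mirrors the format_corename() hunk further down; classify_core_pattern()
and the CORE_* names are made up for the example and are not kernel
symbols.

```c
#include <errno.h>

/* Hypothetical mirror of enum coredump_type_t from the patch. */
enum core_type { CORE_FILE = 1, CORE_PIPE = 2, CORE_SOCK = 3 };

/*
 * Classify a core_pattern string by its first character.
 * Returns a CORE_* value, or -EINVAL when a socket pattern is not
 * followed by an absolute path.
 */
int classify_core_pattern(const char *pat)
{
	if (pat[0] == '|')
		return CORE_PIPE;
	if (pat[0] == '@')
		return pat[1] == '/' ? CORE_SOCK : -EINVAL;
	return CORE_FILE;
}
```
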
> 
> The coredump socket must be located in the initial mount namespace.
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket.
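
As a rough sketch of the address setup on either side of that connection:
the kernel hunk below computes the length as offsetof(sun_path) plus the
string length plus one, and the selftest uses offsetof(sun_path) plus
sizeof() of the literal, which comes to the same value. The helper name
coredump_sockaddr() is hypothetical and only serves the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Fill @addr for a pathname AF_UNIX socket at @path.
 * Returns the address length to hand to bind() or connect(),
 * including the terminating NUL byte, or -ENAMETOOLONG if the
 * path does not fit into sun_path.
 */
int coredump_sockaddr(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path))
		return -ENAMETOOLONG;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);

	return (int)(offsetof(struct sockaddr_un, sun_path) + len + 1);
}
```
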
> 
> - The coredump server uses SO_PEERPIDFD to get a stable handle on the
>   connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
> 
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind its back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
> 
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via AF_UNIX sockets directly.
> 
> - The pidfd for the crashing task will grow new information about how
>   the task coredumps.
> 
> - The coredump server should mark itself as non-dumpable.
> 
> - A container coredump server in a separate network namespace can simply
>   bind to another well-known address and systemd-coredump forwards
>   coredumps to the container.
> 
> - Coredumps could in the future also be handled via per-user/session
>   coredump servers that run only with that user's privileges.
> 
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the user's
>   own coredump handler which runs with the user's privileges only
>   (it must of course pay close attention to not forward crashing suid
>   binaries).
> 
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers
> that have caused and continue to cause significant CVEs.
> 
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can e.g., just keep a worker pool.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/coredump.c       | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  include/linux/net.h |   1 +
>  net/unix/af_unix.c  |  53 ++++++++++++++++-----
>  3 files changed, 166 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index a70929c3585b..e1256ebb89c1 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -44,7 +44,11 @@
>  #include <linux/sysctl.h>
>  #include <linux/elf.h>
>  #include <linux/pidfs.h>
> +#include <linux/net.h>
> +#include <linux/socket.h>
> +#include <net/net_namespace.h>
>  #include <uapi/linux/pidfd.h>
> +#include <uapi/linux/un.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
>  enum coredump_type_t {
>  	COREDUMP_FILE = 1,
>  	COREDUMP_PIPE = 2,
> +	COREDUMP_SOCK = 3,
>  };
>  
>  struct core_name {
> @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  	cn->corename = NULL;
>  	if (*pat_ptr == '|')
>  		cn->core_type = COREDUMP_PIPE;
> +	else if (*pat_ptr == '@')
> +		cn->core_type = COREDUMP_SOCK;
>  	else
>  		cn->core_type = COREDUMP_FILE;
>  	if (expand_corename(cn, core_name_size))
>  		return -ENOMEM;
>  	cn->corename[0] = '\0';
>  
> -	if (cn->core_type == COREDUMP_PIPE) {
> +	switch (cn->core_type) {
> +	case COREDUMP_PIPE: {
>  		int argvs = sizeof(core_pattern) / 2;
>  		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
>  		if (!(*argv))
> @@ -247,6 +255,33 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  		++pat_ptr;
>  		if (!(*pat_ptr))
>  			return -ENOMEM;
> +		break;
> +	}
> +	case COREDUMP_SOCK: {
> +		/* skip the @ */
> +		pat_ptr++;
> +		err = cn_printf(cn, "%s", pat_ptr);
> +		if (err)
> +			return err;
> +
> +		/* Require absolute paths. */
> +		if (cn->corename[0] != '/')
> +			return -EINVAL;
> +
> +		/*
> +		 * Currently no need to parse any other options.
> +		 * Relevant information can be retrieved from the peer
> +		 * pidfd retrievable via SO_PEERPIDFD by the receiver or
> +		 * via /proc/<pid>, using the SO_PEERPIDFD to guard
> +		 * against pid recycling when opening /proc/<pid>.
> +		 */
> +		return 0;
> +	}
> +	case COREDUMP_FILE:
> +		break;
> +	default:
> +		WARN_ON_ONCE(true);
> +		return -EINVAL;
>  	}
>  
>  	/* Repeat as long as we have more pattern to process and more output
> @@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>  	 * If core_pattern does not include a %p (as is the default)
>  	 * and core_uses_pid is set, then .%pid will be appended to
>  	 * the filename. Do not do this for piped commands. */
> -	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> -		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> -		if (err)
> -			return err;
> +	if (!pid_in_pattern && core_uses_pid) {
> +		switch (cn->core_type) {
> +		case COREDUMP_FILE:
> +			return cn_printf(cn, ".%d", task_tgid_vnr(current));
> +		case COREDUMP_PIPE:
> +			break;
> +		case COREDUMP_SOCK:
> +			break;
> +		default:
> +			WARN_ON_ONCE(true);
> +			return -EINVAL;
> +		}
>  	}
> +
>  	return 0;
>  }
>  
> @@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		}
>  		break;
>  	}
> +	case COREDUMP_SOCK: {
> +#ifdef CONFIG_UNIX
> +		struct file *file __free(fput) = NULL;
> +		struct sockaddr_un addr = {
> +			.sun_family = AF_UNIX,
> +		};
> +		ssize_t addr_len;
> +		struct socket *socket;
> +
> +		retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));
> +		if (retval < 0)
> +			goto close_fail;
> +		addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
> +
> +		/*
> +		 * It is possible that the userspace process which is
> +		 * supposed to handle the coredump and is listening on
> +		 * the AF_UNIX socket coredumps. Userspace should just
> +		 * mark itself non dumpable.
> +		 */
> +
> +		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> +		if (retval < 0)
> +			goto close_fail;
> +
> +		file = sock_alloc_file(socket, 0, NULL);
> +		if (IS_ERR(file)) {
> +			sock_release(socket);
> +			goto close_fail;
> +		}
> +
> +		retval = kernel_connect(socket, (struct sockaddr *)(&addr),
> +					addr_len, O_NONBLOCK | SOCK_COREDUMP);
> +		if (retval) {
> +			if (retval == -EAGAIN)
> +				coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
> +			else
> +				coredump_report_failure("Coredump socket connection %s failed %d", addr.sun_path, retval);
> +			goto close_fail;
> +		}
> +
> +		cprm.limit = RLIM_INFINITY;
> +		cprm.file = no_free_ptr(file);
> +#else
> +		coredump_report_failure("Core dump socket support %s disabled", cn.corename);
> +		goto close_fail;
> +#endif
> +		break;
> +	}
>  	default:
>  		WARN_ON_ONCE(true);
>  		goto close_fail;
> @@ -838,8 +931,32 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		file_end_write(cprm.file);
>  		free_vma_snapshot(&cprm);
>  	}
> -	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> -		wait_for_dump_helpers(cprm.file);
> +
> +	/*
> +	 * When core_pipe_limit is set we wait for the coredump server
> +	 * or usermodehelper to finish before exiting so it can e.g.,
> +	 * inspect /proc/<pid>.
> +	 */
> +	if (core_pipe_limit) {
> +		switch (cn.core_type) {
> +		case COREDUMP_PIPE:
> +			wait_for_dump_helpers(cprm.file);
> +			break;
> +		case COREDUMP_SOCK: {
> +			/*
> +			 * We use a simple read to wait for the coredump
> +			 * processing to finish. Either the socket is
> +			 * closed or we get sent unexpected data. In
> +			 * both cases, we're done.
> +			 */
> +			__kernel_read(cprm.file, &(char){ 0 }, 1, NULL);
> +			break;
> +		}
> +		default:
> +			break;
> +		}
> +	}
> +
>  close_fail:
>  	if (cprm.file)
>  		filp_close(cprm.file, NULL);
> @@ -1069,7 +1186,7 @@ EXPORT_SYMBOL(dump_align);
>  void validate_coredump_safety(void)
>  {
>  	if (suid_dumpable == SUID_DUMP_ROOT &&
> -	    core_pattern[0] != '/' && core_pattern[0] != '|') {
> +	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
>  
>  		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
>  			"pipe handler or fully qualified core dump path required. "
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 0ff950eecc6b..139c85d0f2ea 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -81,6 +81,7 @@ enum sock_type {
>  #ifndef SOCK_NONBLOCK
>  #define SOCK_NONBLOCK	O_NONBLOCK
>  #endif
> +#define SOCK_COREDUMP	O_NOCTTY
>  
>  #endif /* ARCH_HAS_SOCKET_TYPES */
>  
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 472f8aa9ea15..a9d1c9ba2961 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -85,10 +85,13 @@
>  #include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/fs.h>
> +#include <linux/fs_struct.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
>  #include <linux/mount.h>
>  #include <linux/namei.h>
> +#include <linux/net.h>
> +#include <linux/pidfs.h>
>  #include <linux/poll.h>
>  #include <linux/proc_fs.h>
>  #include <linux/sched/signal.h>
> @@ -100,7 +103,6 @@
>  #include <linux/splice.h>
>  #include <linux/string.h>
>  #include <linux/uaccess.h>
> -#include <linux/pidfs.h>
>  #include <net/af_unix.h>
>  #include <net/net_namespace.h>
>  #include <net/scm.h>
> @@ -1146,7 +1148,7 @@ static int unix_release(struct socket *sock)
>  }
>  
>  static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
> -				  int type)
> +				  int type, unsigned int flags)
  				      	    ^^^
nit: int flags


>  {
>  	struct inode *inode;
>  	struct path path;
> @@ -1154,13 +1156,38 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
>  	int err;
>  
>  	unix_mkname_bsd(sunaddr, addr_len);
> -	err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
> -	if (err)
> -		goto fail;
>  
> -	err = path_permission(&path, MAY_WRITE);
> -	if (err)
> -		goto path_put;
> +	if (flags & SOCK_COREDUMP) {
> +		struct path root;
> +		struct cred *kcred;
> +		const struct cred *cred;

nit: please keep these in the reverse xmas tree order.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
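
For illustration only, this is roughly what that ordering would look like for the three locals in this hunk. The struct definitions below are placeholders so the sketch compiles in userspace; they are not the kernel types, and lookup_sketch() is a made-up name.

```c
#include <stddef.h>

/* Placeholder types so the sketch compiles outside the kernel. */
struct cred { int uid; };
struct path { const char *name; };

static int lookup_sketch(void)
{
	/* Reverse xmas tree: longest declaration line first. */
	const struct cred *cred = NULL;
	struct cred *kcred = NULL;
	struct path root = { 0 };

	/* Touch every local so the compiler does not warn about them. */
	return (cred == NULL) && (kcred == NULL) && (root.name == NULL);
}
```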


> +
> +		err = -ENOMEM;

While at it, please move this into the "if (!kcred)" branch, as that is
the only place it's used.

Otherwise looks good to me.  I think you can just fix up nits
before pushing to the vfs tree unless there is any other feedback.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Thanks!


> +		kcred = prepare_kernel_cred(&init_task);
> +		if (!kcred)
> +			goto fail;
> +
> +		task_lock(&init_task);
> +		get_fs_root(init_task.fs, &root);
> +		task_unlock(&init_task);
> +
> +		cred = override_creds(kcred);
> +		err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
> +				      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
> +				      LOOKUP_NO_MAGICLINKS, &path);
> +		put_cred(revert_creds(cred));
> +		path_put(&root);
> +		if (err)
> +			goto fail;
> +	} else {
> +		err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
> +		if (err)
> +			goto fail;
> +
> +		err = path_permission(&path, MAY_WRITE);
> +		if (err)
> +			goto path_put;
> +	}
>  
>  	err = -ECONNREFUSED;
>  	inode = d_backing_inode(path.dentry);
> @@ -1210,12 +1237,12 @@ static struct sock *unix_find_abstract(struct net *net,
>  
>  static struct sock *unix_find_other(struct net *net,
>  				    struct sockaddr_un *sunaddr,
> -				    int addr_len, int type)
> +				    int addr_len, int type, int flags)
>  {
>  	struct sock *sk;
>  
>  	if (sunaddr->sun_path[0])
> -		sk = unix_find_bsd(sunaddr, addr_len, type);
> +		sk = unix_find_bsd(sunaddr, addr_len, type, flags);
>  	else
>  		sk = unix_find_abstract(net, sunaddr, addr_len, type);
>  
> @@ -1473,7 +1500,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
>  		}
>  
>  restart:
> -		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type);
> +		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
>  		if (IS_ERR(other)) {
>  			err = PTR_ERR(other);
>  			goto out;
> @@ -1620,7 +1647,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
>  
>  restart:
>  	/*  Find listening sock. */
> -	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type);
> +	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
>  	if (IS_ERR(other)) {
>  		err = PTR_ERR(other);
>  		goto out_free_skb;
> @@ -2089,7 +2116,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>  	if (msg->msg_namelen) {
>  lookup:
>  		other = unix_find_other(sock_net(sk), msg->msg_name,
> -					msg->msg_namelen, sk->sk_type);
> +					msg->msg_namelen, sk->sk_type, 0);
>  		if (IS_ERR(other)) {
>  			err = PTR_ERR(other);
>  			goto out_free;
> 
> -- 
> 2.47.2
> 


* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-15 17:00   ` Kuniyuki Iwashima
@ 2025-05-15 20:52     ` Jann Horn
  2025-05-15 21:04       ` Kuniyuki Iwashima
  2025-05-16 10:14     ` Christian Brauner
  1 sibling, 1 reply; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:52 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: brauner, alexander, bluca, daan.j.demeyer, daniel, davem, david,
	edumazet, horms, jack, kuba, lennart, linux-fsdevel, linux-kernel,
	linux-security-module, me, netdev, oleg, pabeni, viro, zbyszek

On Thu, May 15, 2025 at 7:01 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> nit: please keep these in the reverse xmas tree order.
> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs

Isn't that rule specific to things that go through the net tree?


* Re: [PATCH v7 1/9] coredump: massage format_corname()
  2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
  2025-05-15 13:19   ` Alexander Mikhalitsyn
  2025-05-15 13:36   ` Serge E. Hallyn
@ 2025-05-15 20:52   ` Jann Horn
  2 siblings, 0 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> We're going to extend the coredump code in follow-up patches.
> Clean it up so we can do this more easily.

typo nit: format_corename() is misspelled as format_corname() in the patch title

> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>

> @@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>          * If core_pattern does not include a %p (as is the default)
>          * and core_uses_pid is set, then .%pid will be appended to
>          * the filename. Do not do this for piped commands. */
> -       if (!ispipe && !pid_in_pattern && core_uses_pid) {
> +       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {

non-actionable note: "!(cn->core_type == COREDUMP_PIPE)" can be
simplified to "cn->core_type != COREDUMP_PIPE"; but patch 4 rewrites
this anyway, so no need to change this


* Re: [PATCH v7 2/9] coredump: massage do_coredump()
  2025-05-14 22:03 ` [PATCH v7 2/9] coredump: massage do_coredump() Christian Brauner
  2025-05-15 13:21   ` Alexander Mikhalitsyn
@ 2025-05-15 20:52   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> We're going to extend the coredump code in follow-up patches.
> Clean it up so we can do this more easily.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>


* Re: [PATCH v7 3/9] coredump: reflow dump helpers a little
  2025-05-14 22:03 ` [PATCH v7 3/9] coredump: reflow dump helpers a little Christian Brauner
  2025-05-15 13:22   ` Alexander Mikhalitsyn
@ 2025-05-15 20:53   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> They look rather messy right now.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>


* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-14 22:03 ` [PATCH v7 4/9] coredump: add coredump socket Christian Brauner
  2025-05-15 13:47   ` Alexander Mikhalitsyn
  2025-05-15 17:00   ` Kuniyuki Iwashima
@ 2025-05-15 20:54   ` Jann Horn
  2025-05-15 21:15     ` Kuniyuki Iwashima
  2025-05-16 10:09     ` Christian Brauner
  2 siblings, 2 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> diff --git a/fs/coredump.c b/fs/coredump.c
> index a70929c3585b..e1256ebb89c1 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
[...]
> @@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>          * If core_pattern does not include a %p (as is the default)
>          * and core_uses_pid is set, then .%pid will be appended to
>          * the filename. Do not do this for piped commands. */
> -       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> -               err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> -               if (err)
> -                       return err;
> +       if (!pid_in_pattern && core_uses_pid) {
> +               switch (cn->core_type) {
> +               case COREDUMP_FILE:
> +                       return cn_printf(cn, ".%d", task_tgid_vnr(current));
> +               case COREDUMP_PIPE:
> +                       break;
> +               case COREDUMP_SOCK:
> +                       break;

This branch is dead code, we can't get this far down with
COREDUMP_SOCK. Maybe you could remove the "break;" and fall through to
the default WARN_ON_ONCE() here. Or better, revert this hunk and
instead just change the check to check for "cn->core_type ==
COREDUMP_FILE" (in patch 1), since this whole block is legacy logic
specific to dumping into files (COREDUMP_FILE).

> +               default:
> +                       WARN_ON_ONCE(true);
> +                       return -EINVAL;
> +               }
>         }
> +
>         return 0;
>  }
>
> @@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 }
>                 break;
>         }
> +       case COREDUMP_SOCK: {
> +#ifdef CONFIG_UNIX
> +               struct file *file __free(fput) = NULL;
> +               struct sockaddr_un addr = {
> +                       .sun_family = AF_UNIX,
> +               };
> +               ssize_t addr_len;
> +               struct socket *socket;
> +
> +               retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));

nit: strscpy() explicitly supports eliding the last argument in this
case, thanks to macro magic:

 * The size argument @... is only required when @dst is not an array, or
 * when the copy needs to be smaller than sizeof(@dst).
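
To show the idea outside the kernel: when the destination is a real array, sizeof() already knows its capacity, so a wrapper macro can supply it. This is only a userspace analogy; bounded_copy() and copy_to_array() are hypothetical names and this is not how the kernel macro is implemented.

```c
#include <string.h>
#include <sys/types.h>

/*
 * Copies @src into @dst (capacity @size), always NUL-terminating.
 * Returns the copied length, or -1 on truncation (the kernel helper
 * returns -E2BIG in that case).
 */
static ssize_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size == 0)
		return -1;
	if (len >= size) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}
	memcpy(dst, src, len + 1);
	return (ssize_t)len;
}

/* Only valid when @dst is an array, not a pointer. */
#define copy_to_array(dst, src) bounded_copy((dst), (src), sizeof(dst))
```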

> +               if (retval < 0)
> +                       goto close_fail;
> +               addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;

nit: On a 64-bit system, strscpy() returns a 64-bit value, and
addr_len is also 64-bit, but retval is 32-bit. Implicitly moving
length values back and forth between 64-bit and 32-bit is slightly
dodgy and might generate suboptimal code (it could force the compiler
to emit instructions to explicitly truncate the value if it can't
prove that the value fits in 32 bits). It would be nice to keep the
value 64-bit throughout by storing the return value in a ssize_t.

And actually, you don't have to compute addr_len here at all; that's
needed for abstract unix domain sockets, but for path-based unix
domain socket, you should be able to just use sizeof(struct
sockaddr_un) as addrlen. (This is documented in "man 7 unix".)
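
A small userspace check of that claim, as a sketch: it binds and connects a path-based socket while passing sizeof(struct sockaddr_un) for both calls. The function name and the socket path used by a caller are made up for the example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns 0 if a client could connect using sizeof(struct sockaddr_un). */
static int connect_with_full_addrlen(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int server, client, ret = -1;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;
	strcpy(addr.sun_path, path);	/* rest of sun_path stays zeroed */

	unlink(path);
	server = socket(AF_UNIX, SOCK_STREAM, 0);
	if (server < 0)
		return -1;
	client = socket(AF_UNIX, SOCK_STREAM, 0);
	if (client < 0)
		goto out_server;

	/* No offsetof() + strlen() + 1 arithmetic needed for pathnames. */
	if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;
	if (listen(server, 1) < 0)
		goto out;
	ret = connect(client, (struct sockaddr *)&addr, sizeof(addr));
out:
	close(client);
out_server:
	close(server);
	unlink(path);
	return ret;
}
```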

> +
> +               /*
> +                * It is possible that the userspace process which is
> +                * supposed to handle the coredump and is listening on
> +                * the AF_UNIX socket coredumps. Userspace should just
> +                * mark itself non dumpable.
> +                */
> +
> +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> +               if (retval < 0)
> +                       goto close_fail;
> +
> +               file = sock_alloc_file(socket, 0, NULL);
> +               if (IS_ERR(file)) {
> +                       sock_release(socket);

I think you missed an API gotcha here. See the sock_alloc_file() documentation:

 * On failure @sock is released, and an ERR pointer is returned.

So I think basically sock_alloc_file() always consumes the socket
reference provided by the caller, and the sock_release() in this
branch is a double-free?
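
A toy model of the ownership rule as I read it from that comment, with every name invented for the sketch: the allocator releases the socket itself on failure, so the caller's error path must not release it again.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: counts how often a "socket" is released. */
struct toy_sock { int releases; };
struct toy_file { struct toy_sock *sock; };

static void toy_release(struct toy_sock *s)
{
	s->releases++;
}

/*
 * Mirrors the documented contract: on failure the socket is released by
 * the callee and NULL is returned.
 */
static struct toy_file *toy_alloc_file(struct toy_sock *s,
				       struct toy_file *slot, bool fail)
{
	if (fail) {
		toy_release(s);
		return NULL;
	}
	slot->sock = s;
	return slot;
}

static int open_sketch(struct toy_sock *s, bool fail)
{
	struct toy_file slot;
	struct toy_file *f = toy_alloc_file(s, &slot, fail);

	if (!f)
		return -1;	/* no toy_release() here: already done */
	toy_release(f->sock);	/* stands in for the final fput() */
	return 0;
}
```

In both the success and the failure path the release count ends up at exactly one.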

> +                       goto close_fail;
> +               }
[...]
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 0ff950eecc6b..139c85d0f2ea 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -81,6 +81,7 @@ enum sock_type {
>  #ifndef SOCK_NONBLOCK
>  #define SOCK_NONBLOCK  O_NONBLOCK
>  #endif
> +#define SOCK_COREDUMP  O_NOCTTY

Hrrrm. I looked through all the paths from which the ->connect() call
can come, and I think this is currently safe; but I wonder if it would
make sense to either give this highly privileged bit a separate value
that can never come from userspace, or explicitly strip it away in
__sys_connect_file() just to be safe.
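
The second option would amount to a one-line mask. A userspace sketch, assuming the flag keeps aliasing O_NOCTTY as in this patch; the _SKETCH macro and the helper name are hypothetical:

```c
#include <fcntl.h>

/* Same aliasing as in the patch; the _SKETCH suffix marks it as local. */
#define SOCK_COREDUMP_SKETCH O_NOCTTY

/* Hypothetical helper: drop the kernel-only bit from user-derived flags. */
static int sanitize_user_connect_flags(int flags)
{
	return flags & ~SOCK_COREDUMP_SKETCH;
}
```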


* Re: [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-14 22:03 ` [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
  2025-05-15 14:08   ` Alexander Mikhalitsyn
@ 2025-05-15 20:56   ` Jann Horn
  2025-05-15 21:37     ` Jann Horn
  2025-05-16 10:34     ` Christian Brauner
  1 sibling, 2 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
> mask flag. This adds the fields @coredump_mask and @coredump_cookie to
> struct pidfd_info.

FWIW, now that you're using path-based sockets and override_creds(),
one option may be to drop this patch and say "if you don't want
untrusted processes to directly connect to the coredumping socket,
just set the listening socket to mode 0000 or mode 0600"...

> Signed-off-by: Christian Brauner <brauner@kernel.org>
[...]
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e1256ebb89c1..bfc4a32f737c 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
[...]
> @@ -876,8 +880,34 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                         goto close_fail;
>                 }
>
> +               /*
> +                * Set the thread-group leader pid which is used for the
> +                * peer credentials during connect() below. Then
> +                * immediately register it in pidfs...
> +                */
> +               cprm.pid = task_tgid(current);
> +               retval = pidfs_register_pid(cprm.pid);
> +               if (retval) {
> +                       sock_release(socket);
> +                       goto close_fail;
> +               }
> +
> +               /*
> +                * ... and set the coredump information so userspace
> +                * has it available after connect()...
> +                */
> +               pidfs_coredump(&cprm);
> +
> +               /*
> +                * ... On connect() the peer credentials are recorded
> +                * and @cprm.pid registered in pidfs...

I don't understand this comment. Wasn't "@cprm.pid registered in
pidfs" above with the explicit `pidfs_register_pid(cprm.pid)`?

> +                */
>                 retval = kernel_connect(socket, (struct sockaddr *)(&addr),
>                                         addr_len, O_NONBLOCK | SOCK_COREDUMP);
> +
> +               /* ... So we can safely put our pidfs reference now... */
> +               pidfs_put_pid(cprm.pid);

Why can we safely put the pidfs reference now but couldn't do it
before the kernel_connect()? Does the kernel_connect() look up this
pidfs entry by calling something like pidfs_alloc_file()? Or does that
only happen later on, when the peer does getsockopt(SO_PEERPIDFD)?
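
For context, this is the userspace side I have in mind, as a sketch. It doesn't answer when the kernel looks up the pidfs entry; it only shows the getsockopt() call a coredump server would make. The fallback value 77 is the asm-generic definition and is an assumption for older headers; peer_pidfd() is a made-up helper name.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_PEERPIDFD
#define SO_PEERPIDFD 77	/* asm-generic value; assumption for older headers */
#endif

/*
 * Returns a pidfd for the peer of @fd, or -1 with errno set (for example
 * ENOPROTOOPT on kernels without SO_PEERPIDFD).
 */
static int peer_pidfd(int fd)
{
	int pidfd = -1;
	socklen_t len = sizeof(pidfd);

	if (getsockopt(fd, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
		return -1;
	return pidfd;
}
```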

>                 if (retval) {
>                         if (retval == -EAGAIN)
>                                 coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
[...]
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 3b39e471840b..d7b9a0dd2db6 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
[...]
> @@ -280,6 +299,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
>                 }
>         }
>
> +       if (mask & PIDFD_INFO_COREDUMP) {
> +               kinfo.mask |= PIDFD_INFO_COREDUMP;
> +               smp_rmb();

I assume I would regret it if I asked what these barriers are for,
because the answer is something terrifying about how we otherwise
don't have a guarantee that memory accesses can't be reordered between
multiple subsequent syscalls or something like that?

checkpatch complains about the lack of comments on these memory barriers.

> +               kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
> +               kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
> +       }
> +
>         task = get_pid_task(pid, PIDTYPE_PID);
>         if (!task) {
>                 /*
[...]
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index a9d1c9ba2961..053d2e48e918 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
[...]
> @@ -742,6 +743,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
>
>  struct unix_peercred {
>         struct pid *peer_pid;
> +       u64 cookie;

Maybe add a comment here documenting that for now, this is assumed to
be used exclusively for coredump sockets.


>         const struct cred *peer_cred;
>  };
>


* Re: [PATCH v7 6/9] coredump: show supported coredump modes
  2025-05-14 22:03 ` [PATCH v7 6/9] coredump: show supported coredump modes Christian Brauner
  2025-05-15 13:56   ` Alexander Mikhalitsyn
@ 2025-05-15 20:56   ` Jann Horn
  1 sibling, 0 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> Allow userspace to discover what coredump modes are supported.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>


* Re: [PATCH v7 7/9] coredump: validate socket name as it is written
  2025-05-14 22:03 ` [PATCH v7 7/9] coredump: validate socket name as it is written Christian Brauner
  2025-05-15 14:03   ` Alexander Mikhalitsyn
@ 2025-05-15 20:56   ` Jann Horn
  2025-05-16  9:54     ` Christian Brauner
  1 sibling, 1 reply; 43+ messages in thread
From: Jann Horn @ 2025-05-15 20:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> In contrast to other parameters written into
> /proc/sys/kernel/core_pattern that never fail we can validate enabling
> the new AF_UNIX support. This is obviously racy as hell but it's always
> been that way.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Jann Horn <jannh@google.com>

> ---
>  fs/coredump.c | 37 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 6ee38e3da108..d4ff08ef03e5 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1228,13 +1228,44 @@ void validate_coredump_safety(void)
>         }
>  }
>
> +static inline bool check_coredump_socket(void)
> +{
> +       if (core_pattern[0] != '@')
> +               return true;
> +
> +       /*
> +        * Coredump socket must be located in the initial mount
> +        * namespace. Don't give the impression that anything else is
> +        * supported right now.
> +        */
> +       if (current->nsproxy->mnt_ns != init_task.nsproxy->mnt_ns)
> +               return false;

(Ah, dereferencing init_task.nsproxy without locks is safe because
init_task is actually the boot cpu's swapper/idle task, which never
switches namespaces, right?)

> +       /* Must be an absolute path. */
> +       if (*(core_pattern + 1) != '/')
> +               return false;
> +
> +       return true;
> +}
> +
>  static int proc_dostring_coredump(const struct ctl_table *table, int write,
>                   void *buffer, size_t *lenp, loff_t *ppos)
>  {
> -       int error = proc_dostring(table, write, buffer, lenp, ppos);
> +       int error;
> +       ssize_t retval;
> +       char old_core_pattern[CORENAME_MAX_SIZE];
> +
> +       retval = strscpy(old_core_pattern, core_pattern, CORENAME_MAX_SIZE);
> +
> +       error = proc_dostring(table, write, buffer, lenp, ppos);
> +       if (error)
> +               return error;
> +       if (!check_coredump_socket()) {

(non-actionable note: This is kiiinda dodgy under
SYSCTL_WRITES_LEGACY, but I guess we can assume that new users of the
new coredump socket feature aren't actually going to write the
coredump path one byte at a time, so I guess it's fine.)

> +               strscpy(core_pattern, old_core_pattern, retval + 1);

The third strscpy() argument is semantically supposed to be the
destination buffer size, not the amount of data to copy. For trivial
invocations like here, strscpy() actually allows you to leave out the
third argument.
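
A userspace sketch of the save/validate/restore flow with the size argument used the way I mean, i.e. always the capacity of the destination. PATTERN_MAX, pattern_ok() and set_pattern() are stand-ins for this example, the mount namespace check is left out, and snprintf() takes the place of strscpy().

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PATTERN_MAX 128		/* stand-in for CORENAME_MAX_SIZE */

static char pattern[PATTERN_MAX] = "core";

/* Socket patterns ("@...") must name an absolute path. */
static bool pattern_ok(const char *p)
{
	return p[0] != '@' || p[1] == '/';
}

/* Write the new value, validate, and restore the old one on failure. */
static int set_pattern(const char *new_value)
{
	char old[PATTERN_MAX];

	/* The size passed is the destination's capacity, not the data length. */
	snprintf(old, sizeof(old), "%s", pattern);
	snprintf(pattern, sizeof(pattern), "%s", new_value);
	if (!pattern_ok(pattern)) {
		snprintf(pattern, sizeof(pattern), "%s", old);
		return -1;
	}
	return 0;
}
```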


> +               return -EINVAL;
> +       }
>
> -       if (!error)
> -               validate_coredump_safety();
> +       validate_coredump_safety();
>         return error;
>  }
>
>
> --
> 2.47.2
>


* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-15 20:52     ` Jann Horn
@ 2025-05-15 21:04       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 43+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-15 21:04 UTC (permalink / raw)
  To: jannh
  Cc: alexander, bluca, brauner, daan.j.demeyer, daniel, davem, david,
	edumazet, horms, jack, kuba, kuniyu, lennart, linux-fsdevel,
	linux-kernel, linux-security-module, me, netdev, oleg, pabeni,
	viro, zbyszek

From: Jann Horn <jannh@google.com>
Date: Thu, 15 May 2025 22:52:22 +0200
> On Thu, May 15, 2025 at 7:01 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > nit: please keep these in the reverse xmas tree order.
> > https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> 
> Isn't that rule specific to things that go through the net tree?

Which tree it goes through doesn't matter; the rule applies to
code maintained by netdev maintainers, especially net/ and
drivers/net/.


* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-15 20:54   ` Jann Horn
@ 2025-05-15 21:15     ` Kuniyuki Iwashima
  2025-05-16 10:09     ` Christian Brauner
  1 sibling, 0 replies; 43+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-15 21:15 UTC (permalink / raw)
  To: jannh
  Cc: alexander, bluca, brauner, daan.j.demeyer, daniel, davem, david,
	edumazet, horms, jack, kuba, kuniyu, lennart, linux-fsdevel,
	linux-kernel, linux-security-module, me, netdev, oleg, pabeni,
	viro, zbyszek

From: Jann Horn <jannh@google.com>
Date: Thu, 15 May 2025 22:54:14 +0200
> > +               /*
> > +                * It is possible that the userspace process which is
> > +                * supposed to handle the coredump and is listening on
> > +                * the AF_UNIX socket coredumps. Userspace should just
> > +                * mark itself non dumpable.
> > +                */
> > +
> > +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > +               if (retval < 0)
> > +                       goto close_fail;
> > +
> > +               file = sock_alloc_file(socket, 0, NULL);
> > +               if (IS_ERR(file)) {
> > +                       sock_release(socket);
> 
> I think you missed an API gotcha here. See the sock_alloc_file() documentation:
> 
>  * On failure @sock is released, and an ERR pointer is returned.
> 
> So I think basically sock_alloc_file() always consumes the socket
> reference provided by the caller, and the sock_release() in this
> branch is a double-free?

Good catch, yes, sock_release() is not needed here.


> 
> > +                       goto close_fail;
> > +               }
> [...]
> > diff --git a/include/linux/net.h b/include/linux/net.h
> > index 0ff950eecc6b..139c85d0f2ea 100644
> > --- a/include/linux/net.h
> > +++ b/include/linux/net.h
> > @@ -81,6 +81,7 @@ enum sock_type {
> >  #ifndef SOCK_NONBLOCK
> >  #define SOCK_NONBLOCK  O_NONBLOCK
> >  #endif
> > +#define SOCK_COREDUMP  O_NOCTTY
> 
> Hrrrm. I looked through all the paths from which the ->connect() call
> can come, and I think this is currently safe; but I wonder if it would
> make sense to either give this highly privileged bit a separate value
> that can never come from userspace, or explicitly strip it away in
> __sys_connect_file() just to be safe.

I had the same thought, but I think it's fine to leave the code as
is for now.  We can revisit it later if someone reports a strange
regression, which is very unlikely.


* Re: [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-15 20:56   ` Jann Horn
@ 2025-05-15 21:37     ` Jann Horn
  2025-05-16 10:34     ` Christian Brauner
  1 sibling, 0 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-15 21:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 10:56 PM Jann Horn <jannh@google.com> wrote:
> On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> > Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
> > mask flag. This adds the fields @coredump_mask and @coredump_cookie to
> > struct pidfd_info.
>
> FWIW, now that you're using path-based sockets and override_creds(),
> one option may be to drop this patch and say "if you don't want
> untrusted processes to directly connect to the coredumping socket,
> just set the listening socket to mode 0000 or mode 0600"...

Er, forget I said that, of course we'd still want to have at least the
@coredump_mask.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-15 13:47   ` Alexander Mikhalitsyn
@ 2025-05-16  8:30     ` Christian Brauner
  0 siblings, 0 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-16  8:30 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: linux-fsdevel, Jann Horn, Daniel Borkmann, Kuniyuki Iwashima,
	Eric Dumazet, Oleg Nesterov, David S. Miller, Alexander Viro,
	Daan De Meyer, David Rheinsberg, Jakub Kicinski, Jan Kara,
	Lennart Poettering, Luca Boccassi, Mike Yuan, Paolo Abeni,
	Simon Horman, Zbigniew Jędrzejewski-Szmek, linux-kernel,
	netdev, linux-security-module

On Thu, May 15, 2025 at 03:47:33PM +0200, Alexander Mikhalitsyn wrote:
> Am Do., 15. Mai 2025 um 00:04 Uhr schrieb Christian Brauner
> <brauner@kernel.org>:
> >
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and runs as a weird hybrid
> >   upcall, which is difficult for userspace to control correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> >         @/path/to/coredump.socket
> >
> > The "@" at the beginning indicates to the kernel that an AF_UNIX
> > coredump socket will be used to process coredumps.
> >
> > The coredump socket must be located in the initial mount namespace.
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket.
> >
> > - The coredump server uses SO_PEERPIDFD to get a stable handle on the
> >   connected crashing task. The retrieved pidfd will provide a stable
> >   reference even if the crashing task gets SIGKILLed while generating
> >   the coredump.
> >
> > - By setting core_pipe_limit non-zero userspace can guarantee that the
> >   crashing task cannot be reaped behind its back and thus process all
> >   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> >   detect whether /proc/<pid> still refers to the same process.
> >
> >   The core_pipe_limit isn't used to rate-limit connections to the
> >   socket. This can simply be done via AF_UNIX sockets directly.
> >
> > - The pidfd for the crashing task will grow new information about how
> >   the task coredumps.
> >
> > - The coredump server should mark itself as non-dumpable.
> >
> > - A container coredump server in a separate network namespace can simply
> >   bind to another well-known address and systemd-coredump forwards
> >   coredumps to the container.
> >
> > - Coredumps could in the future also be handled via per-user/session
> >   coredump servers that run only with that user's privileges.
> >
> >   The coredump server listens on the coredump socket and accepts a
> >   new coredump connection. It then retrieves SO_PEERPIDFD for the
> >   client, inspects uid/gid and hands the accepted client to the user's
> >   own coredump handler which runs with the user's privileges only
> >   (It must of course pay close attention to not forward crashing suid
> >   binaries.).
> >
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > handle them instead of relying on super privileged coredumping helpers
> > that have caused and continue to cause significant CVEs.
> >
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermodehelper is required for each crashing process. The
> > coredump server in userspace can e.g., just keep a worker pool.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> 
> > ---
> >  fs/coredump.c       | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  include/linux/net.h |   1 +
> >  net/unix/af_unix.c  |  53 ++++++++++++++++-----
> >  3 files changed, 166 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index a70929c3585b..e1256ebb89c1 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -44,7 +44,11 @@
> >  #include <linux/sysctl.h>
> >  #include <linux/elf.h>
> >  #include <linux/pidfs.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +#include <net/net_namespace.h>
> >  #include <uapi/linux/pidfd.h>
> > +#include <uapi/linux/un.h>
> >
> >  #include <linux/uaccess.h>
> >  #include <asm/mmu_context.h>
> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> >  enum coredump_type_t {
> >         COREDUMP_FILE = 1,
> >         COREDUMP_PIPE = 2,
> > +       COREDUMP_SOCK = 3,
> >  };
> >
> >  struct core_name {
> > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >         cn->corename = NULL;
> >         if (*pat_ptr == '|')
> >                 cn->core_type = COREDUMP_PIPE;
> > +       else if (*pat_ptr == '@')
> > +               cn->core_type = COREDUMP_SOCK;
> >         else
> >                 cn->core_type = COREDUMP_FILE;
> >         if (expand_corename(cn, core_name_size))
> >                 return -ENOMEM;
> >         cn->corename[0] = '\0';
> >
> > -       if (cn->core_type == COREDUMP_PIPE) {
> > +       switch (cn->core_type) {
> > +       case COREDUMP_PIPE: {
> >                 int argvs = sizeof(core_pattern) / 2;
> >                 (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> >                 if (!(*argv))
> > @@ -247,6 +255,33 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >                 ++pat_ptr;
> >                 if (!(*pat_ptr))
> >                         return -ENOMEM;
> > +               break;
> > +       }
> > +       case COREDUMP_SOCK: {
> > +               /* skip the @ */
> > +               pat_ptr++;
> 
> nit: I would do
> if (!(*pat_ptr))
>    return -ENOMEM;
> as we do for the COREDUMP_PIPE case above, just in case something
> changes in cn_printf(), to eliminate any chance of crashes in there.

Ok.
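
For illustration, the prefix handling and the extra empty-path check
suggested above can be sketched in userspace C. This is only a sketch
under stated assumptions, not the code in fs/coredump.c: the names
pattern_type and classify_core_pattern() are hypothetical, and the
-ENOMEM return for an empty remainder simply mirrors what the quoted
COREDUMP_PIPE branch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the kernel's coredump_type_t. */
enum pattern_type {
	PATTERN_FILE = 1,
	PATTERN_PIPE = 2,
	PATTERN_SOCK = 3,
};

/*
 * Classify a core_pattern string and hand back, via @rest, the part
 * that follows the prefix character. Returns 0 on success or a
 * negative errno value.
 */
static int classify_core_pattern(const char *pat, enum pattern_type *type,
				 const char **rest)
{
	assert(pat != NULL && type != NULL && rest != NULL);

	switch (pat[0]) {
	case '|':
		*type = PATTERN_PIPE;
		pat++;
		/* A bare "|" names no helper binary. */
		if (!*pat)
			return -ENOMEM;
		break;
	case '@':
		*type = PATTERN_SOCK;
		pat++;
		/* The suggested extra check: a bare "@" names no socket. */
		if (!*pat)
			return -ENOMEM;
		/* Socket paths must be absolute. */
		if (pat[0] != '/')
			return -EINVAL;
		break;
	default:
		*type = PATTERN_FILE;
		break;
	}

	*rest = pat;
	return 0;
}
```

With this sketch, "@/run/coredump.socket" classifies as a socket
pattern, while "@" alone and "@relative" are both rejected before any
formatting would take place.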

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 7/9] coredump: validate socket name as it is written
  2025-05-15 20:56   ` Jann Horn
@ 2025-05-16  9:54     ` Christian Brauner
  2025-05-16 13:29       ` Christian Brauner
  0 siblings, 1 reply; 43+ messages in thread
From: Christian Brauner @ 2025-05-16  9:54 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 10:56:51PM +0200, Jann Horn wrote:
> On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> > In contrast to other parameters written into
> > /proc/sys/kernel/core_pattern that never fail we can validate enabling
> > the new AF_UNIX support. This is obviously racy as hell but it's always
> > been that way.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> Reviewed-by: Jann Horn <jannh@google.com>
> 
> > ---
> >  fs/coredump.c | 37 ++++++++++++++++++++++++++++++++++---
> >  1 file changed, 34 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 6ee38e3da108..d4ff08ef03e5 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -1228,13 +1228,44 @@ void validate_coredump_safety(void)
> >         }
> >  }
> >
> > +static inline bool check_coredump_socket(void)
> > +{
> > +       if (core_pattern[0] != '@')
> > +               return true;
> > +
> > +       /*
> > +        * Coredump socket must be located in the initial mount
> > +        * namespace. Don't give the impression that anything else is
> > +        * supported right now.
> > +        */
> > +       if (current->nsproxy->mnt_ns != init_task.nsproxy->mnt_ns)
> > +               return false;
> 
> (Ah, dereferencing init_task.nsproxy without locks is safe because
> init_task is actually the boot cpu's swapper/idle task, which never
> switches namespaces, right?)

I would be very worried if it did. It would fsck everyone over that
relies on copying its credentials and assumes that the set of namespaces
is stable.

> 
> > +       /* Must be an absolute path. */
> > +       if (*(core_pattern + 1) != '/')
> > +               return false;
> > +
> > +       return true;
> > +}
> > +
> >  static int proc_dostring_coredump(const struct ctl_table *table, int write,
> >                   void *buffer, size_t *lenp, loff_t *ppos)
> >  {
> > -       int error = proc_dostring(table, write, buffer, lenp, ppos);
> > +       int error;
> > +       ssize_t retval;
> > +       char old_core_pattern[CORENAME_MAX_SIZE];
> > +
> > +       retval = strscpy(old_core_pattern, core_pattern, CORENAME_MAX_SIZE);
> > +
> > +       error = proc_dostring(table, write, buffer, lenp, ppos);
> > +       if (error)
> > +               return error;
> > +       if (!check_coredump_socket()) {
> 
> (non-actionable note: This is kiiinda dodgy under
> SYSCTL_WRITES_LEGACY, but I guess we can assume that new users of the
> new coredump socket feature aren't actually going to write the
> coredump path one byte at a time, so I guess it's fine.)

So this is all kinds of broken already imho, because there's not really
mutual exclusion between multiple writers to such sysctls from what I
remember. That means this buffer can be trampled in all kinds of ways
if multiple tasks decide to update it at the same time. That's super
unlikely of course but whatever.

> 
> > +               strscpy(core_pattern, old_core_pattern, retval + 1);
> 
> The third strscpy() argument is semantically supposed to be the
> destination buffer size, not the amount of data to copy. For trivial
> invocations like here, strscpy() actually allows you to leave out the
> third argument.

Eeeeewww, that's really implicit behavior. I can use the destination
buffer size. Given that retval will always be smaller than that I
didn't bother, but ok. I'll fix that in-tree.
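
As a rough userspace sketch of the save, validate and roll back pattern
discussed here: bounded_copy() below is a hypothetical stand-in for
strscpy() (userspace has no strscpy()), and pattern, PATTERN_MAX,
pattern_is_valid() and set_pattern() are likewise made-up names. The
mount namespace check from the patch is left out on purpose. The point
is only that the size argument of the restoring copy is the destination
buffer size.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define PATTERN_MAX 128 /* stand-in for CORENAME_MAX_SIZE */

static char pattern[PATTERN_MAX] = "core";

/*
 * Userspace stand-in for strscpy(): copy at most @size - 1 bytes,
 * always NUL-terminate, and return the copied length or -E2BIG if
 * @src had to be truncated.
 */
static ssize_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(size > 0);
	for (len = 0; len < size - 1 && src[len]; len++)
		dst[len] = src[len];
	dst[len] = '\0';
	return src[len] ? -E2BIG : (ssize_t)len;
}

/* A socket pattern must be "@" followed by an absolute path. */
static bool pattern_is_valid(const char *pat)
{
	if (pat[0] != '@')
		return true;
	return pat[1] == '/';
}

/* Install @new_pattern, rolling back to the old value if it is invalid. */
static int set_pattern(const char *new_pattern)
{
	char old[PATTERN_MAX];

	bounded_copy(old, pattern, sizeof(old));
	if (bounded_copy(pattern, new_pattern, sizeof(pattern)) < 0 ||
	    !pattern_is_valid(pattern)) {
		/* Size argument is the destination size, not the data length. */
		bounded_copy(pattern, old, sizeof(pattern));
		return -EINVAL;
	}
	return 0;
}
```

Like the sysctl handler, this has no locking, so concurrent writers
could still trample the buffer.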

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-15 20:54   ` Jann Horn
  2025-05-15 21:15     ` Kuniyuki Iwashima
@ 2025-05-16 10:09     ` Christian Brauner
  2025-05-16 10:20       ` Christian Brauner
  1 sibling, 1 reply; 43+ messages in thread
From: Christian Brauner @ 2025-05-16 10:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 10:54:14PM +0200, Jann Horn wrote:
> On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index a70929c3585b..e1256ebb89c1 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> [...]
> > @@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >          * If core_pattern does not include a %p (as is the default)
> >          * and core_uses_pid is set, then .%pid will be appended to
> >          * the filename. Do not do this for piped commands. */
> > -       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> > -               err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> > -               if (err)
> > -                       return err;
> > +       if (!pid_in_pattern && core_uses_pid) {
> > +               switch (cn->core_type) {
> > +               case COREDUMP_FILE:
> > +                       return cn_printf(cn, ".%d", task_tgid_vnr(current));
> > +               case COREDUMP_PIPE:
> > +                       break;
> > +               case COREDUMP_SOCK:
> > +                       break;
> 
> This branch is dead code, we can't get this far down with
> COREDUMP_SOCK. Maybe you could remove the "break;" and fall through to
> the default WARN_ON_ONCE() here. Or better, revert this hunk and
> instead just change the check to check for "cn->core_type ==
> COREDUMP_FILE" (in patch 1), since this whole block is legacy logic
> specific to dumping into files (COREDUMP_FILE).

Ok, folded:

diff --git a/fs/coredump.c b/fs/coredump.c
index 368751d98781..45725465c299 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -393,11 +393,8 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
         * If core_pattern does not include a %p (as is the default)
         * and core_uses_pid is set, then .%pid will be appended to
         * the filename. Do not do this for piped commands. */
-       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
-               err = cn_printf(cn, ".%d", task_tgid_vnr(current));
-               if (err)
-                       return err;
-       }
+       if (cn->core_type == COREDUMP_FILE && !pid_in_pattern && core_uses_pid)
+               return cn_printf(cn, ".%d", task_tgid_vnr(current));
        return 0;
 }

into the first patch.

> 
> > +               default:
> > +                       WARN_ON_ONCE(true);
> > +                       return -EINVAL;
> > +               }
> >         }
> > +
> >         return 0;
> >  }
> >
> > @@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >                 }
> >                 break;
> >         }
> > +       case COREDUMP_SOCK: {
> > +#ifdef CONFIG_UNIX
> > +               struct file *file __free(fput) = NULL;
> > +               struct sockaddr_un addr = {
> > +                       .sun_family = AF_UNIX,
> > +               };
> > +               ssize_t addr_len;
> > +               struct socket *socket;
> > +
> > +               retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));
> 
> nit: strscpy() explicitly supports eliding the last argument in this
> case, thanks to macro magic:
> 
>  * The size argument @... is only required when @dst is not an array, or
>  * when the copy needs to be smaller than sizeof(@dst).

Ok.

> 
> > +               if (retval < 0)
> > +                       goto close_fail;
> > +               addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
> 
> nit: On a 64-bit system, strscpy() returns a 64-bit value, and
> addr_len is also 64-bit, but retval is 32-bit. Implicitly moving
> length values back and forth between 64-bit and 32-bit is slightly
> dodgy and might generate suboptimal code (it could force the compiler
> to emit instructions to explicitly truncate the value if it can't
> prove that the value fits in 32 bits). It would be nice to keep the
> value 64-bit throughout by storing the return value in a ssize_t.
> 
> And actually, you don't have to compute addr_len here at all; that's
> needed for abstract unix domain sockets, but for path-based unix
> domain socket, you should be able to just use sizeof(struct
> sockaddr_un) as addrlen. (This is documented in "man 7 unix".)

Ok, folded:

@@ -845,10 +845,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
                ssize_t addr_len;
                struct socket *socket;

-               retval = strscpy(addr.sun_path, cn.corename);
-               if (retval < 0)
+               addr_len = strscpy(addr.sun_path, cn.corename);
+               if (addr_len < 0)
                        goto close_fail;
-               addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
+               addr_len += offsetof(struct sockaddr_un, sun_path) + 1;
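
For reference, the address length computation for a path-based AF_UNIX
socket looks roughly like this in userspace. It is a sketch only:
unix_path_addr() is a hypothetical helper, and -ENAMETOOLONG is an
arbitrary choice for the error value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>

/*
 * Fill @addr for a path-based AF_UNIX socket and return the address
 * length to pass to connect() or bind(), or -ENAMETOOLONG if @path
 * does not fit into sun_path together with its terminating NUL.
 */
static ssize_t unix_path_addr(struct sockaddr_un *addr, const char *path)
{
	size_t len;

	assert(addr != NULL && path != NULL);

	len = strlen(path);
	if (len >= sizeof(addr->sun_path))
		return -ENAMETOOLONG;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);

	/* Header up to sun_path, plus the path, plus its NUL byte. */
	return (ssize_t)(offsetof(struct sockaddr_un, sun_path) + len + 1);
}
```

As noted in the review, unix(7) also permits passing sizeof(struct
sockaddr_un) for path-based sockets, so the exact length is not
strictly required there. It does matter for abstract sockets.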

> 
> > +
> > +               /*
> > +                * It is possible that the userspace process which is
> > +                * supposed to handle the coredump and is listening on
> > +                * the AF_UNIX socket coredumps. Userspace should just
> > +                * mark itself non dumpable.
> > +                */
> > +
> > +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > +               if (retval < 0)
> > +                       goto close_fail;
> > +
> > +               file = sock_alloc_file(socket, 0, NULL);
> > +               if (IS_ERR(file)) {
> > +                       sock_release(socket);
> 
> I think you missed an API gotcha here. See the sock_alloc_file() documentation:
> 
>  * On failure @sock is released, and an ERR pointer is returned.

Thanks, fixed.

> 
> So I think basically sock_alloc_file() always consumes the socket
> reference provided by the caller, and the sock_release() in this
> branch is a double-free?
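
The ownership rule in question (the callee releases the socket itself
when it fails) can be illustrated with a small userspace analogy. All
names here are hypothetical stand-ins, nothing below is kernel API, and
the sketch returns NULL on failure where sock_alloc_file() returns an
ERR pointer.

```c
#include <assert.h>
#include <stdlib.h>

struct raw_sock { int id; };                    /* stand-in for struct socket */
struct wrapped_file { struct raw_sock *sock; }; /* stand-in for struct file */

static int release_calls; /* counts releases so the behavior can be checked */

static void raw_sock_release(struct raw_sock *sock)
{
	release_calls++;
	free(sock);
}

/*
 * Wrap @sock in a file-like object. This always consumes @sock: on
 * failure it releases @sock itself and returns NULL.
 */
static struct wrapped_file *wrap_sock(struct raw_sock *sock, int simulate_failure)
{
	struct wrapped_file *file;

	assert(sock != NULL);

	file = simulate_failure ? NULL : malloc(sizeof(*file));
	if (!file) {
		raw_sock_release(sock);
		return NULL;
	}
	file->sock = sock;
	return file;
}

/* Closing the wrapper releases the socket it owns. */
static void wrapped_file_close(struct wrapped_file *file)
{
	raw_sock_release(file->sock);
	free(file);
}

/* Caller: must not release the socket again when wrap_sock() fails. */
static int open_dump_target(int simulate_failure)
{
	struct raw_sock *sock = malloc(sizeof(*sock));
	struct wrapped_file *file;

	if (!sock)
		return -1;
	sock->id = 1;

	file = wrap_sock(sock, simulate_failure);
	if (!file)
		return -1; /* no raw_sock_release(sock): already consumed */

	wrapped_file_close(file);
	return 0;
}
```

An extra release call in the failure branch of open_dump_target() would
free the socket a second time, which is the pattern being pointed out
in the review.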

> 
> > +                       goto close_fail;
> > +               }
> [...]
> > diff --git a/include/linux/net.h b/include/linux/net.h
> > index 0ff950eecc6b..139c85d0f2ea 100644
> > --- a/include/linux/net.h
> > +++ b/include/linux/net.h
> > @@ -81,6 +81,7 @@ enum sock_type {
> >  #ifndef SOCK_NONBLOCK
> >  #define SOCK_NONBLOCK  O_NONBLOCK
> >  #endif
> > +#define SOCK_COREDUMP  O_NOCTTY
> 
> Hrrrm. I looked through all the paths from which the ->connect() call
> can come, and I think this is currently safe; but I wonder if it would

Yes, I made sure that unknown bits are excluded.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-15 17:00   ` Kuniyuki Iwashima
  2025-05-15 20:52     ` Jann Horn
@ 2025-05-16 10:14     ` Christian Brauner
  1 sibling, 0 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-16 10:14 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: alexander, bluca, daan.j.demeyer, daniel, davem, david, edumazet,
	horms, jack, jannh, kuba, lennart, linux-fsdevel, linux-kernel,
	linux-security-module, me, netdev, oleg, pabeni, viro, zbyszek

On Thu, May 15, 2025 at 10:00:43AM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Thu, 15 May 2025 00:03:37 +0200
> > Coredumping currently supports two modes:
> > 
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> > 
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> > 
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > 
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > 
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> > 
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> > 
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> > 
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> > >   child of PID 1. It cannot be waited upon and runs as a weird hybrid
> > >   upcall, which is difficult for userspace to control correctly.
> > 
> > - systemd-coredump is spawned with full kernel privileges. This
> > >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> > 
> > - A new usermode helper has to be spawned for each crashing process.
> > 
> > This series adds a new mode:
> > 
> > (3) Dumping into an AF_UNIX socket.
> > 
> > Userspace can set /proc/sys/kernel/core_pattern to:
> > 
> >         @/path/to/coredump.socket
> > 
> > The "@" at the beginning indicates to the kernel that an AF_UNIX
> > coredump socket will be used to process coredumps.
> > 
> > The coredump socket must be located in the initial mount namespace.
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket.
> > 
> > - The coredump server uses SO_PEERPIDFD to get a stable handle on the
> >   connected crashing task. The retrieved pidfd will provide a stable
> >   reference even if the crashing task gets SIGKILLed while generating
> >   the coredump.
> > 
> > - By setting core_pipe_limit non-zero userspace can guarantee that the
> > >   crashing task cannot be reaped behind its back and thus process all
> >   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> >   detect whether /proc/<pid> still refers to the same process.
> > 
> >   The core_pipe_limit isn't used to rate-limit connections to the
> >   socket. This can simply be done via AF_UNIX sockets directly.
> > 
> > > - The pidfd for the crashing task will grow new information about how
> > >   the task coredumps.
> > 
> > - The coredump server should mark itself as non-dumpable.
> > 
> > - A container coredump server in a separate network namespace can simply
> > >   bind to another well-known address and systemd-coredump forwards
> > >   coredumps to the container.
> > > 
> > > - Coredumps could in the future also be handled via per-user/session
> > >   coredump servers that run only with that user's privileges.
> > > 
> > >   The coredump server listens on the coredump socket and accepts a
> > >   new coredump connection. It then retrieves SO_PEERPIDFD for the
> > >   client, inspects uid/gid and hands the accepted client to the user's
> > >   own coredump handler which runs with the user's privileges only
> > >   (It must of course pay close attention to not forward crashing suid
> >   binaries.).
> > 
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > > handle them instead of relying on super privileged coredumping helpers
> > > that have caused and continue to cause significant CVEs.
> > 
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermodehelper is required for each crashing process. The
> > coredump server in userspace can e.g., just keep a worker pool.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/coredump.c       | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  include/linux/net.h |   1 +
> >  net/unix/af_unix.c  |  53 ++++++++++++++++-----
> >  3 files changed, 166 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index a70929c3585b..e1256ebb89c1 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -44,7 +44,11 @@
> >  #include <linux/sysctl.h>
> >  #include <linux/elf.h>
> >  #include <linux/pidfs.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +#include <net/net_namespace.h>
> >  #include <uapi/linux/pidfd.h>
> > +#include <uapi/linux/un.h>
> >  
> >  #include <linux/uaccess.h>
> >  #include <asm/mmu_context.h>
> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> >  enum coredump_type_t {
> >  	COREDUMP_FILE = 1,
> >  	COREDUMP_PIPE = 2,
> > +	COREDUMP_SOCK = 3,
> >  };
> >  
> >  struct core_name {
> > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >  	cn->corename = NULL;
> >  	if (*pat_ptr == '|')
> >  		cn->core_type = COREDUMP_PIPE;
> > +	else if (*pat_ptr == '@')
> > +		cn->core_type = COREDUMP_SOCK;
> >  	else
> >  		cn->core_type = COREDUMP_FILE;
> >  	if (expand_corename(cn, core_name_size))
> >  		return -ENOMEM;
> >  	cn->corename[0] = '\0';
> >  
> > -	if (cn->core_type == COREDUMP_PIPE) {
> > +	switch (cn->core_type) {
> > +	case COREDUMP_PIPE: {
> >  		int argvs = sizeof(core_pattern) / 2;
> >  		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> >  		if (!(*argv))
> > @@ -247,6 +255,33 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >  		++pat_ptr;
> >  		if (!(*pat_ptr))
> >  			return -ENOMEM;
> > +		break;
> > +	}
> > +	case COREDUMP_SOCK: {
> > +		/* skip the @ */
> > +		pat_ptr++;
> > +		err = cn_printf(cn, "%s", pat_ptr);
> > +		if (err)
> > +			return err;
> > +
> > +		/* Require absolute paths. */
> > +		if (cn->corename[0] != '/')
> > +			return -EINVAL;
> > +
> > +		/*
> > +		 * Currently no need to parse any other options.
> > +		 * Relevant information can be retrieved from the peer
> > +		 * pidfd retrievable via SO_PEERPIDFD by the receiver or
> > +		 * via /proc/<pid>, using the SO_PEERPIDFD to guard
> > +		 * against pid recycling when opening /proc/<pid>.
> > +		 */
> > +		return 0;
> > +	}
> > +	case COREDUMP_FILE:
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(true);
> > +		return -EINVAL;
> >  	}
> >  
> >  	/* Repeat as long as we have more pattern to process and more output
> > @@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >  	 * If core_pattern does not include a %p (as is the default)
> >  	 * and core_uses_pid is set, then .%pid will be appended to
> >  	 * the filename. Do not do this for piped commands. */
> > -	if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> > -		err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> > -		if (err)
> > -			return err;
> > +	if (!pid_in_pattern && core_uses_pid) {
> > +		switch (cn->core_type) {
> > +		case COREDUMP_FILE:
> > +			return cn_printf(cn, ".%d", task_tgid_vnr(current));
> > +		case COREDUMP_PIPE:
> > +			break;
> > +		case COREDUMP_SOCK:
> > +			break;
> > +		default:
> > +			WARN_ON_ONCE(true);
> > +			return -EINVAL;
> > +		}
> >  	}
> > +
> >  	return 0;
> >  }
> >  
> > @@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >  		}
> >  		break;
> >  	}
> > +	case COREDUMP_SOCK: {
> > +#ifdef CONFIG_UNIX
> > +		struct file *file __free(fput) = NULL;
> > +		struct sockaddr_un addr = {
> > +			.sun_family = AF_UNIX,
> > +		};
> > +		ssize_t addr_len;
> > +		struct socket *socket;
> > +
> > +		retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));
> > +		if (retval < 0)
> > +			goto close_fail;
> > +		addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
> > +
> > +		/*
> > +		 * It is possible that the userspace process which is
> > +		 * supposed to handle the coredump and is listening on
> > +		 * the AF_UNIX socket coredumps. Userspace should just
> > +		 * mark itself non dumpable.
> > +		 */
> > +
> > +		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > +		if (retval < 0)
> > +			goto close_fail;
> > +
> > +		file = sock_alloc_file(socket, 0, NULL);
> > +		if (IS_ERR(file)) {
> > +			sock_release(socket);
> > +			goto close_fail;
> > +		}
> > +
> > +		retval = kernel_connect(socket, (struct sockaddr *)(&addr),
> > +					addr_len, O_NONBLOCK | SOCK_COREDUMP);
> > +		if (retval) {
> > +			if (retval == -EAGAIN)
> > +				coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
> > +			else
> > +				coredump_report_failure("Coredump socket connection %s failed %d", addr.sun_path, retval);
> > +			goto close_fail;
> > +		}
> > +
> > +		cprm.limit = RLIM_INFINITY;
> > +		cprm.file = no_free_ptr(file);
> > +#else
> > +		coredump_report_failure("Core dump socket support %s disabled", cn.corename);
> > +		goto close_fail;
> > +#endif
> > +		break;
> > +	}
> >  	default:
> >  		WARN_ON_ONCE(true);
> >  		goto close_fail;
> > @@ -838,8 +931,32 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >  		file_end_write(cprm.file);
> >  		free_vma_snapshot(&cprm);
> >  	}
> > -	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> > -		wait_for_dump_helpers(cprm.file);
> > +
> > +	/*
> > +	 * When core_pipe_limit is set we wait for the coredump server
> > +	 * or usermodehelper to finish before exiting so it can e.g.,
> > +	 * inspect /proc/<pid>.
> > +	 */
> > +	if (core_pipe_limit) {
> > +		switch (cn.core_type) {
> > +		case COREDUMP_PIPE:
> > +			wait_for_dump_helpers(cprm.file);
> > +			break;
> > +		case COREDUMP_SOCK: {
> > +			/*
> > +			 * We use a simple read to wait for the coredump
> > +			 * processing to finish. Either the socket is
> > +			 * closed or we get sent unexpected data. In
> > +			 * both cases, we're done.
> > +			 */
> > +			__kernel_read(cprm.file, &(char){ 0 }, 1, NULL);
> > +			break;
> > +		}
> > +		default:
> > +			break;
> > +		}
> > +	}
> > +
> >  close_fail:
> >  	if (cprm.file)
> >  		filp_close(cprm.file, NULL);
> > @@ -1069,7 +1186,7 @@ EXPORT_SYMBOL(dump_align);
> >  void validate_coredump_safety(void)
> >  {
> >  	if (suid_dumpable == SUID_DUMP_ROOT &&
> > -	    core_pattern[0] != '/' && core_pattern[0] != '|') {
> > +	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
> >  
> >  		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
> >  			"pipe handler or fully qualified core dump path required. "
> > diff --git a/include/linux/net.h b/include/linux/net.h
> > index 0ff950eecc6b..139c85d0f2ea 100644
> > --- a/include/linux/net.h
> > +++ b/include/linux/net.h
> > @@ -81,6 +81,7 @@ enum sock_type {
> >  #ifndef SOCK_NONBLOCK
> >  #define SOCK_NONBLOCK	O_NONBLOCK
> >  #endif
> > +#define SOCK_COREDUMP	O_NOCTTY
> >  
> >  #endif /* ARCH_HAS_SOCKET_TYPES */
> >  
> > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > index 472f8aa9ea15..a9d1c9ba2961 100644
> > --- a/net/unix/af_unix.c
> > +++ b/net/unix/af_unix.c
> > @@ -85,10 +85,13 @@
> >  #include <linux/file.h>
> >  #include <linux/filter.h>
> >  #include <linux/fs.h>
> > +#include <linux/fs_struct.h>
> >  #include <linux/init.h>
> >  #include <linux/kernel.h>
> >  #include <linux/mount.h>
> >  #include <linux/namei.h>
> > +#include <linux/net.h>
> > +#include <linux/pidfs.h>
> >  #include <linux/poll.h>
> >  #include <linux/proc_fs.h>
> >  #include <linux/sched/signal.h>
> > @@ -100,7 +103,6 @@
> >  #include <linux/splice.h>
> >  #include <linux/string.h>
> >  #include <linux/uaccess.h>
> > -#include <linux/pidfs.h>
> >  #include <net/af_unix.h>
> >  #include <net/net_namespace.h>
> >  #include <net/scm.h>
> > @@ -1146,7 +1148,7 @@ static int unix_release(struct socket *sock)
> >  }
> >  
> >  static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
> > -				  int type)
> > +				  int type, unsigned int flags)
>   				      	    ^^^
> nit: int flags

done

> 
> 
> >  {
> >  	struct inode *inode;
> >  	struct path path;
> > @@ -1154,13 +1156,38 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
> >  	int err;
> >  
> >  	unix_mkname_bsd(sunaddr, addr_len);
> > -	err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
> > -	if (err)
> > -		goto fail;
> >  
> > -	err = path_permission(&path, MAY_WRITE);
> > -	if (err)
> > -		goto path_put;
> > +	if (flags & SOCK_COREDUMP) {
> > +		struct path root;
> > +		struct cred *kcred;
> > +		const struct cred *cred;
> 
> nit: please keep these in the reverse xmas tree order.
> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs

Done. I keep forgetting this. Another decade and maybe I'll remember.
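
The COREDUMP_SOCK wait in the quoted hunk relies on a one-byte blocking
read returning once the peer closes the connection (or sends something
unexpected). A minimal userspace sketch of that idea follows. It is an
analog only: the patch itself uses __kernel_read() on cprm.file, and the
helper name wait_for_peer() is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Block until the peer of a connected stream socket closes its end or
 * sends a byte. Returns 0 on close (EOF), 1 if a byte arrived, -1 on
 * error. Hypothetical helper, not part of the patch.
 */
static int wait_for_peer(int fd)
{
	char byte;
	ssize_t n;

	assert(fd >= 0);

	/* Retry if a signal interrupts the blocking read. */
	do {
		n = read(fd, &byte, 1);
	} while (n < 0 && errno == EINTR);

	return n < 0 ? -1 : (int)n;
}
```

Either return value of 0 or 1 means the waiting side is done, which
mirrors the "in both cases, we're done" comment in the hunk.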


* Re: [PATCH v7 4/9] coredump: add coredump socket
  2025-05-16 10:09     ` Christian Brauner
@ 2025-05-16 10:20       ` Christian Brauner
  0 siblings, 0 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-16 10:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 6956 bytes --]

On Fri, May 16, 2025 at 12:09:21PM +0200, Christian Brauner wrote:
> On Thu, May 15, 2025 at 10:54:14PM +0200, Jann Horn wrote:
> > On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > index a70929c3585b..e1256ebb89c1 100644
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > [...]
> > > @@ -393,11 +428,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> > >          * If core_pattern does not include a %p (as is the default)
> > >          * and core_uses_pid is set, then .%pid will be appended to
> > >          * the filename. Do not do this for piped commands. */
> > > -       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> > > -               err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> > > -               if (err)
> > > -                       return err;
> > > +       if (!pid_in_pattern && core_uses_pid) {
> > > +               switch (cn->core_type) {
> > > +               case COREDUMP_FILE:
> > > +                       return cn_printf(cn, ".%d", task_tgid_vnr(current));
> > > +               case COREDUMP_PIPE:
> > > +                       break;
> > > +               case COREDUMP_SOCK:
> > > +                       break;
> > 
> > This branch is dead code, we can't get this far down with
> > COREDUMP_SOCK. Maybe you could remove the "break;" and fall through to
> > the default WARN_ON_ONCE() here. Or better, revert this hunk and
> > instead just change the check to check for "cn->core_type ==
> > COREDUMP_FILE" (in patch 1), since this whole block is legacy logic
> > specific to dumping into files (COREDUMP_FILE).
> 
> Ok, folded:
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 368751d98781..45725465c299 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -393,11 +393,8 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
>          * If core_pattern does not include a %p (as is the default)
>          * and core_uses_pid is set, then .%pid will be appended to
>          * the filename. Do not do this for piped commands. */
> -       if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
> -               err = cn_printf(cn, ".%d", task_tgid_vnr(current));
> -               if (err)
> -                       return err;
> -       }
> +       if (cn->core_type == COREDUMP_FILE && !pid_in_pattern && core_uses_pid)
> +               return cn_printf(cn, ".%d", task_tgid_vnr(current));
>         return 0;
>  }
> 
> into the first patch.
> 
> > 
> > > +               default:
> > > +                       WARN_ON_ONCE(true);
> > > +                       return -EINVAL;
> > > +               }
> > >         }
> > > +
> > >         return 0;
> > >  }
> > >
> > > @@ -801,6 +845,55 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > >                 }
> > >                 break;
> > >         }
> > > +       case COREDUMP_SOCK: {
> > > +#ifdef CONFIG_UNIX
> > > +               struct file *file __free(fput) = NULL;
> > > +               struct sockaddr_un addr = {
> > > +                       .sun_family = AF_UNIX,
> > > +               };
> > > +               ssize_t addr_len;
> > > +               struct socket *socket;
> > > +
> > > +               retval = strscpy(addr.sun_path, cn.corename, sizeof(addr.sun_path));
> > 
> > nit: strscpy() explicitly supports eliding the last argument in this
> > case, thanks to macro magic:
> > 
> >  * The size argument @... is only required when @dst is not an array, or
> >  * when the copy needs to be smaller than sizeof(@dst).
> 
> Ok.
> 
> > 
> > > +               if (retval < 0)
> > > +                       goto close_fail;
> > > +               addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
> > 
> > nit: On a 64-bit system, strscpy() returns a 64-bit value, and
> > addr_len is also 64-bit, but retval is 32-bit. Implicitly moving
> > length values back and forth between 64-bit and 32-bit is slightly
> > dodgy and might generate suboptimal code (it could force the compiler
> > to emit instructions to explicitly truncate the value if it can't
> > prove that the value fits in 32 bits). It would be nice to keep the
> > value 64-bit throughout by storing the return value in a ssize_t.
> > 
> > And actually, you don't have to compute addr_len here at all; that's
> > needed for abstract unix domain sockets, but for path-based unix
> > domain socket, you should be able to just use sizeof(struct
> > sockaddr_un) as addrlen. (This is documented in "man 7 unix".)
> 
> Ok, folded:
> 
> @@ -845,10 +845,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 ssize_t addr_len;
>                 struct socket *socket;
> 
> -               retval = strscpy(addr.sun_path, cn.corename);
> -               if (retval < 0)
> +               addr_len = strscpy(addr.sun_path, cn.corename);
> +               if (addr_len < 0)
>                         goto close_fail;
> -               addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
> +               addr_len += offsetof(struct sockaddr_un, sun_path) + 1;
> 
> > 
> > > +
> > > +               /*
> > > +                * It is possible that the userspace process which is
> > > +                * supposed to handle the coredump and is listening on
> > > +                * the AF_UNIX socket coredumps. Userspace should just
> > > +                * mark itself non dumpable.
> > > +                */
> > > +
> > > +               retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > > +               if (retval < 0)
> > > +                       goto close_fail;
> > > +
> > > +               file = sock_alloc_file(socket, 0, NULL);
> > > +               if (IS_ERR(file)) {
> > > +                       sock_release(socket);
> > 
> > I think you missed an API gotcha here. See the sock_alloc_file() documentation:
> > 
> >  * On failure @sock is released, and an ERR pointer is returned.
> 
> Thanks, fixed.
> 
> > 
> > So I think basically sock_alloc_file() always consumes the socket
> > reference provided by the caller, and the sock_release() in this
> > branch is a double-free?
> 
> > 
> > > +                       goto close_fail;
> > > +               }
> > [...]
> > > diff --git a/include/linux/net.h b/include/linux/net.h
> > > index 0ff950eecc6b..139c85d0f2ea 100644
> > > --- a/include/linux/net.h
> > > +++ b/include/linux/net.h
> > > @@ -81,6 +81,7 @@ enum sock_type {
> > >  #ifndef SOCK_NONBLOCK
> > >  #define SOCK_NONBLOCK  O_NONBLOCK
> > >  #endif
> > > +#define SOCK_COREDUMP  O_NOCTTY
> > 
> > Hrrrm. I looked through all the paths from which the ->connect() call
> > can come, and I think this is currently safe; but I wonder if it would
> 
> Yes, I made sure that unknown bits are excluded.

See the appended updated version for completeness' sake.
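
The address-length handling discussed in the review above can be
sketched in userspace as follows. This is a minimal sketch, not the
kernel code: fill_unix_addr() is a hypothetical helper, and it uses
strlen()/memcpy() where the patch uses strscpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>

/*
 * Fill @addr with a pathname AF_UNIX address and return the address
 * length covering the family field, the path, and its terminating NUL.
 * Returns -1 if @path does not fit into sun_path.
 */
static ssize_t fill_unix_addr(struct sockaddr_un *addr, const char *path)
{
	size_t len;

	assert(addr && path);

	len = strlen(path);
	if (len >= sizeof(addr->sun_path))
		return -1;

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);

	/* Same arithmetic as the folded hunk: offset + length + NUL. */
	return (ssize_t)(offsetof(struct sockaddr_un, sun_path) + len + 1);
}
```

As noted in the review, for path-based sockets passing
sizeof(struct sockaddr_un) as the length should work as well; the exact
length only matters for abstract sockets.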

[-- Attachment #2: 0001-coredump-add-coredump-socket.patch --]
[-- Type: text/x-diff, Size: 14148 bytes --]

From f365092f4cb84af265b3f8134802f625e68d6da0 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 15 May 2025 00:03:37 +0200
Subject: [PATCH] coredump: add coredump socket

Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (whether or not this is a sane assumption is irrelevant).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is a weird hybrid
  upcall, which is difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping exercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.

The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.

- The coredump server uses SO_PEERPIDFD to get a stable handle on the
  connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind its back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX sockets directly.

- The pidfd for the crashing task will grow new information about how
  the task coredumps.

- The coredump server should mark itself as non-dumpable.

- A container coredump server in a separate network namespace can simply
  bind to another well-known address and systemd-coredump forwards
  coredumps to the container.

- Coredumps could in the future also be handled via per-user/session
  coredump servers that run only with that user's privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the user's
  own coredump handler which runs with the user's privileges only
  (it must of course pay close attention to not forward crashing suid
  binaries).

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers
that have caused and continue to cause significant CVEs.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can e.g., just keep a worker pool.

Link: https://lore.kernel.org/20250515-work-coredump-socket-v7-4-0a1329496c31@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c       | 117 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/net.h |   1 +
 net/unix/af_unix.c  |  54 +++++++++++++++-----
 3 files changed, 155 insertions(+), 17 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 4b9ea455a59c..22c1730e8eaf 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -44,7 +44,11 @@
 #include <linux/sysctl.h>
 #include <linux/elf.h>
 #include <linux/pidfs.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+#include <net/net_namespace.h>
 #include <uapi/linux/pidfd.h>
+#include <uapi/linux/un.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
 enum coredump_type_t {
 	COREDUMP_FILE = 1,
 	COREDUMP_PIPE = 2,
+	COREDUMP_SOCK = 3,
 };
 
 struct core_name {
@@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	cn->corename = NULL;
 	if (*pat_ptr == '|')
 		cn->core_type = COREDUMP_PIPE;
+	else if (*pat_ptr == '@')
+		cn->core_type = COREDUMP_SOCK;
 	else
 		cn->core_type = COREDUMP_FILE;
 	if (expand_corename(cn, core_name_size))
 		return -ENOMEM;
 	cn->corename[0] = '\0';
 
-	if (cn->core_type == COREDUMP_PIPE) {
+	switch (cn->core_type) {
+	case COREDUMP_PIPE: {
 		int argvs = sizeof(core_pattern) / 2;
 		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
 		if (!(*argv))
@@ -247,6 +255,35 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 		++pat_ptr;
 		if (!(*pat_ptr))
 			return -ENOMEM;
+		break;
+	}
+	case COREDUMP_SOCK: {
+		/* skip the @ */
+		pat_ptr++;
+		if (!(*pat_ptr))
+			return -ENOMEM;
+		err = cn_printf(cn, "%s", pat_ptr);
+		if (err)
+			return err;
+
+		/* Require absolute paths. */
+		if (cn->corename[0] != '/')
+			return -EINVAL;
+
+		/*
+		 * Currently no need to parse any other options.
+		 * Relevant information can be retrieved from the peer
+		 * pidfd retrievable via SO_PEERPIDFD by the receiver or
+		 * via /proc/<pid>, using the SO_PEERPIDFD to guard
+		 * against pid recycling when opening /proc/<pid>.
+		 */
+		return 0;
+	}
+	case COREDUMP_FILE:
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		return -EINVAL;
 	}
 
 	/* Repeat as long as we have more pattern to process and more output
@@ -395,6 +432,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
 	 * the filename. Do not do this for piped commands. */
 	if (cn->core_type == COREDUMP_FILE && !pid_in_pattern && core_uses_pid)
 		return cn_printf(cn, ".%d", task_tgid_vnr(current));
+
 	return 0;
 }
 
@@ -798,6 +836,53 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		}
 		break;
 	}
+	case COREDUMP_SOCK: {
+#ifdef CONFIG_UNIX
+		struct file *file __free(fput) = NULL;
+		struct sockaddr_un addr = {
+			.sun_family = AF_UNIX,
+		};
+		ssize_t addr_len;
+		struct socket *socket;
+
+		addr_len = strscpy(addr.sun_path, cn.corename);
+		if (addr_len < 0)
+			goto close_fail;
+		addr_len += offsetof(struct sockaddr_un, sun_path) + 1;
+
+		/*
+		 * It is possible that the userspace process which is
+		 * supposed to handle the coredump and is listening on
+		 * the AF_UNIX socket coredumps. Userspace should just
+		 * mark itself non dumpable.
+		 */
+
+		retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
+		if (retval < 0)
+			goto close_fail;
+
+		file = sock_alloc_file(socket, 0, NULL);
+		if (IS_ERR(file))
+			goto close_fail;
+
+		retval = kernel_connect(socket, (struct sockaddr *)(&addr),
+					addr_len, O_NONBLOCK | SOCK_COREDUMP);
+		if (retval) {
+			if (retval == -EAGAIN)
+				coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
+			else
+				coredump_report_failure("Coredump socket connection %s failed %d", addr.sun_path, retval);
+			goto close_fail;
+		}
+
+		cprm.limit = RLIM_INFINITY;
+		cprm.file = no_free_ptr(file);
+#else
+		coredump_report_failure("Core dump socket support %s disabled", cn.corename);
+		goto close_fail;
+#endif
+		break;
+	}
 	default:
 		WARN_ON_ONCE(true);
 		goto close_fail;
@@ -835,8 +920,32 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		file_end_write(cprm.file);
 		free_vma_snapshot(&cprm);
 	}
-	if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
-		wait_for_dump_helpers(cprm.file);
+
+	/*
+	 * When core_pipe_limit is set we wait for the coredump server
+	 * or usermodehelper to finish before exiting so it can e.g.,
+	 * inspect /proc/<pid>.
+	 */
+	if (core_pipe_limit) {
+		switch (cn.core_type) {
+		case COREDUMP_PIPE:
+			wait_for_dump_helpers(cprm.file);
+			break;
+		case COREDUMP_SOCK: {
+			/*
+			 * We use a simple read to wait for the coredump
+			 * processing to finish. Either the socket is
+			 * closed or we get sent unexpected data. In
+			 * both cases, we're done.
+			 */
+			__kernel_read(cprm.file, &(char){ 0 }, 1, NULL);
+			break;
+		}
+		default:
+			break;
+		}
+	}
+
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
@@ -1066,7 +1175,7 @@ EXPORT_SYMBOL(dump_align);
 void validate_coredump_safety(void)
 {
 	if (suid_dumpable == SUID_DUMP_ROOT &&
-	    core_pattern[0] != '/' && core_pattern[0] != '|') {
+	    core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
 
 		coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
 			"pipe handler or fully qualified core dump path required. "
diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..139c85d0f2ea 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -81,6 +81,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
+#define SOCK_COREDUMP	O_NOCTTY
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 472f8aa9ea15..59a64b2ced6e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -85,10 +85,13 @@
 #include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/fs.h>
+#include <linux/fs_struct.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
+#include <linux/net.h>
+#include <linux/pidfs.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
 #include <linux/sched/signal.h>
@@ -100,7 +103,6 @@
 #include <linux/splice.h>
 #include <linux/string.h>
 #include <linux/uaccess.h>
-#include <linux/pidfs.h>
 #include <net/af_unix.h>
 #include <net/net_namespace.h>
 #include <net/scm.h>
@@ -1146,7 +1148,7 @@ static int unix_release(struct socket *sock)
 }
 
 static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
-				  int type)
+				  int type, int flags)
 {
 	struct inode *inode;
 	struct path path;
@@ -1154,13 +1156,39 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 	int err;
 
 	unix_mkname_bsd(sunaddr, addr_len);
-	err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
-	if (err)
-		goto fail;
 
-	err = path_permission(&path, MAY_WRITE);
-	if (err)
-		goto path_put;
+	if (flags & SOCK_COREDUMP) {
+		const struct cred *cred;
+		struct cred *kcred;
+		struct path root;
+
+		kcred = prepare_kernel_cred(&init_task);
+		if (!kcred) {
+			err = -ENOMEM;
+			goto fail;
+		}
+
+		task_lock(&init_task);
+		get_fs_root(init_task.fs, &root);
+		task_unlock(&init_task);
+
+		cred = override_creds(kcred);
+		err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
+				      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
+				      LOOKUP_NO_MAGICLINKS, &path);
+		put_cred(revert_creds(cred));
+		path_put(&root);
+		if (err)
+			goto fail;
+	} else {
+		err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
+		if (err)
+			goto fail;
+
+		err = path_permission(&path, MAY_WRITE);
+		if (err)
+			goto path_put;
+	}
 
 	err = -ECONNREFUSED;
 	inode = d_backing_inode(path.dentry);
@@ -1210,12 +1238,12 @@ static struct sock *unix_find_abstract(struct net *net,
 
 static struct sock *unix_find_other(struct net *net,
 				    struct sockaddr_un *sunaddr,
-				    int addr_len, int type)
+				    int addr_len, int type, int flags)
 {
 	struct sock *sk;
 
 	if (sunaddr->sun_path[0])
-		sk = unix_find_bsd(sunaddr, addr_len, type);
+		sk = unix_find_bsd(sunaddr, addr_len, type, flags);
 	else
 		sk = unix_find_abstract(net, sunaddr, addr_len, type);
 
@@ -1473,7 +1501,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
 		}
 
 restart:
-		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type);
+		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
 		if (IS_ERR(other)) {
 			err = PTR_ERR(other);
 			goto out;
@@ -1620,7 +1648,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 
 restart:
 	/*  Find listening sock. */
-	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type);
+	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
 	if (IS_ERR(other)) {
 		err = PTR_ERR(other);
 		goto out_free_skb;
@@ -2089,7 +2117,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	if (msg->msg_namelen) {
 lookup:
 		other = unix_find_other(sock_net(sk), msg->msg_name,
-					msg->msg_namelen, sk->sk_type);
+					msg->msg_namelen, sk->sk_type, 0);
 		if (IS_ERR(other)) {
 			err = PTR_ERR(other);
 			goto out_free;
-- 
2.47.2



* Re: [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-15 20:56   ` Jann Horn
  2025-05-15 21:37     ` Jann Horn
@ 2025-05-16 10:34     ` Christian Brauner
  2025-05-16 14:26       ` Jann Horn
  1 sibling, 1 reply; 43+ messages in thread
From: Christian Brauner @ 2025-05-16 10:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Thu, May 15, 2025 at 10:56:26PM +0200, Jann Horn wrote:
> On Thu, May 15, 2025 at 12:04 AM Christian Brauner <brauner@kernel.org> wrote:
> > Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
> > mask flag. This adds the fields @coredump_mask and @coredump_cookie to
> > struct pidfd_info.
> 
> FWIW, now that you're using path-based sockets and override_creds(),
> one option may be to drop this patch and say "if you don't want
> untrusted processes to directly connect to the coredumping socket,
> just set the listening socket to mode 0000 or mode 0600"...
> 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> [...]
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index e1256ebb89c1..bfc4a32f737c 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> [...]
> > @@ -876,8 +880,34 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >                         goto close_fail;
> >                 }
> >
> > +               /*
> > +                * Set the thread-group leader pid which is used for the
> > +                * peer credentials during connect() below. Then
> > +                * immediately register it in pidfs...
> > +                */
> > +               cprm.pid = task_tgid(current);
> > +               retval = pidfs_register_pid(cprm.pid);
> > +               if (retval) {
> > +                       sock_release(socket);
> > +                       goto close_fail;
> > +               }
> > +
> > +               /*
> > +                * ... and set the coredump information so userspace
> > +                * has it available after connect()...
> > +                */
> > +               pidfs_coredump(&cprm);
> > +
> > +               /*
> > +                * ... On connect() the peer credentials are recorded
> > +                * and @cprm.pid registered in pidfs...
> 
> I don't understand this comment. Wasn't "@cprm.pid registered in
> pidfs" above with the explicit `pidfs_register_pid(cprm.pid)`?

I'll answer both questions in one go below...

> 
> > +                */
> >                 retval = kernel_connect(socket, (struct sockaddr *)(&addr),
> >                                         addr_len, O_NONBLOCK | SOCK_COREDUMP);
> > +
> > +               /* ... So we can safely put our pidfs reference now... */
> > +               pidfs_put_pid(cprm.pid);
> 
> Why can we safely put the pidfs reference now but couldn't do it
> before the kernel_connect()? Does the kernel_connect() look up this
> pidfs entry by calling something like pidfs_alloc_file()? Or does that
> only happen later on, when the peer does getsockopt(SO_PEERPIDFD)?

AF_UNIX sockets support SO_PEERPIDFD as you know. Users such as dbus or
systemd want to be able to retrieve a pidfd for the peer even if the
peer has already been reaped. To support this AF_UNIX ensures that when
the peer credentials are set up (connect(), listen()) the corresponding
@pid will also be registered in pidfs. This ensures that exit
information is stored in the inode if we hand out a pidfd for a reaped
task. IOW, we only hand out pidfds for reaped tasks if at the time of
reaping a pidfs entry existed for it.

Since we're setting coredump information on the pidfd here we're calling
pidfs_register_pid() even before connect() sets up the peer credentials
so we're sure that the coredump information is stored in the inode.

Then we delay our pidfs_put_pid() call until the connect() took its own
reference and thus continues pinning the inode. IOW, connect() will also
call pidfs_register_pid() but it will ofc just increment the reference
count ensuring that our pidfs_put_pid() doesn't drop the inode.

If we immediately did a pidfs_put_pid() before connect() we'd lose the
coredump information.
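
The ordering argument can be illustrated with a toy reference-count
model. This is purely illustrative and makes simplifying assumptions:
struct toy_entry and the toy_*() helpers are invented stand-ins for the
pidfs inode, pidfs_register_pid(), pidfs_put_pid() and the reference
taken by connect(), and "dropping the inode" is modelled as zeroing the
struct.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for a pidfs inode: state that lives only while pinned. */
struct toy_entry {
	int refs;
	unsigned int coredump_mask;
};

/* Models pidfs_register_pid(): take a reference. */
static void toy_register(struct toy_entry *e)
{
	e->refs++;
}

/* Models pidfs_put_pid(): the last put drops the entry and its info. */
static void toy_put(struct toy_entry *e)
{
	assert(e->refs > 0);
	if (--e->refs == 0)
		memset(e, 0, sizeof(*e));
}

/* Models connect() recording peer credentials and taking its own ref. */
static void toy_connect(struct toy_entry *e)
{
	toy_register(e);
}

/* Order described above: register, set info, connect, then put. */
static unsigned int put_after_connect(void)
{
	struct toy_entry e = { 0 };

	toy_register(&e);
	e.coredump_mask = 0x1;
	toy_connect(&e);
	toy_put(&e);	/* connect()'s reference keeps the entry alive */
	return e.coredump_mask;
}

/* Wrong order: the early put drops the entry before connect() pins it. */
static unsigned int put_before_connect(void)
{
	struct toy_entry e = { 0 };

	toy_register(&e);
	e.coredump_mask = 0x1;
	toy_put(&e);	/* count hits zero, stored info is gone */
	toy_connect(&e);
	return e.coredump_mask;
}
```

In this model put_after_connect() still sees the mask while
put_before_connect() sees it cleared.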

> 
> >                 if (retval) {
> >                         if (retval == -EAGAIN)
> >                                 coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path);
> [...]
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index 3b39e471840b..d7b9a0dd2db6 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> [...]
> > @@ -280,6 +299,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
> >                 }
> >         }
> >
> > +       if (mask & PIDFD_INFO_COREDUMP) {
> > +               kinfo.mask |= PIDFD_INFO_COREDUMP;
> > +               smp_rmb();
> 
> I assume I would regret it if I asked what these barriers are for,
> because the answer is something terrifying about how we otherwise
> don't have a guarantee that memory accesses can't be reordered between
> multiple subsequent syscalls or something like that?

No, not really. It's just so that when someone calls PIDFD_GET_INFO with
PIDFD_INFO_COREDUMP but one gotten from the coredump socket that they
don't see half-initialized information. I can just use WRITE_ONCE() for
that.

> 
> checkpatch complains about the lack of comments on these memory barriers.

I'll just use WRITE_ONCE().

> 
> > +               kinfo.coredump_cookie = READ_ONCE(pidfs_i(inode)->__pei.coredump_cookie);
> > +               kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
> > +       }
> > +
> >         task = get_pid_task(pid, PIDTYPE_PID);
> >         if (!task) {
> >                 /*
> [...]
> > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > index a9d1c9ba2961..053d2e48e918 100644
> > --- a/net/unix/af_unix.c
> > +++ b/net/unix/af_unix.c
> [...]
> > @@ -742,6 +743,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
> >
> >  struct unix_peercred {
> >         struct pid *peer_pid;
> > +       u64 cookie;
> 
> Maybe add a comment here documenting that for now, this is assumed to
> be used exclusively for coredump sockets.

I think we should just drop it.


* Re: [PATCH v7 7/9] coredump: validate socket name as it is written
  2025-05-16  9:54     ` Christian Brauner
@ 2025-05-16 13:29       ` Christian Brauner
  0 siblings, 0 replies; 43+ messages in thread
From: Christian Brauner @ 2025-05-16 13:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

> > The third strscpy() argument is semantically supposed to be the
> > destination buffer size, not the amount of data to copy. For trivial
> > invocations like here, strscpy() actually allows you to leave out the
> > third argument.
> 
> Eeeeewww, that's really implicit behavior. I can use the destination

Ah, I see the argument is optional. I thought you could pass 0 or
something weird.
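
For illustration, a userspace sketch of how a copy helper can accept an
optional size argument. This is not the kernel's strscpy()
implementation: my_strscpy(), my_strscpy_n() and the MY_STRSCPY*
macros are hypothetical names, and the -1 return stands in for the
kernel's -E2BIG. The two-argument form falls back to sizeof(dst), which
is only meaningful when dst is a real array; this sketch has no
build-time check against passing a pointer.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for strscpy(): copy at most size - 1 bytes,
 * always NUL-terminate, and return the number of bytes copied, or -1
 * if the source had to be truncated.
 */
static long my_strscpy_n(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -1;
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] == '\0' ? (long)i : -1;
}

/* Two-argument form: the size defaults to the destination array size. */
#define MY_STRSCPY2(dst, src)       my_strscpy_n((dst), (src), sizeof(dst))
/* Three-argument form: the caller passes the destination size. */
#define MY_STRSCPY3(dst, src, size) my_strscpy_n((dst), (src), (size))

/* Select one of the two forms by counting the arguments. */
#define MY_STRSCPY_PICK(_1, _2, _3, name, ...) name
#define my_strscpy(...) \
	MY_STRSCPY_PICK(__VA_ARGS__, MY_STRSCPY3, MY_STRSCPY2, unused)(__VA_ARGS__)
```

With char buf[8], my_strscpy(buf, src) and my_strscpy(buf, src,
sizeof(buf)) behave the same, so the short form leaves no room to pass
the source length by mistake.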


* Re: [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP
  2025-05-16 10:34     ` Christian Brauner
@ 2025-05-16 14:26       ` Jann Horn
  0 siblings, 0 replies; 43+ messages in thread
From: Jann Horn @ 2025-05-16 14:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Daniel Borkmann, Kuniyuki Iwashima, Eric Dumazet,
	Oleg Nesterov, David S. Miller, Alexander Viro, Daan De Meyer,
	David Rheinsberg, Jakub Kicinski, Jan Kara, Lennart Poettering,
	Luca Boccassi, Mike Yuan, Paolo Abeni, Simon Horman,
	Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
	linux-security-module, Alexander Mikhalitsyn

On Fri, May 16, 2025 at 12:34 PM Christian Brauner <brauner@kernel.org> wrote:
> On Thu, May 15, 2025 at 10:56:26PM +0200, Jann Horn wrote:
> > Why can we safely put the pidfs reference now but couldn't do it
> > before the kernel_connect()? Does the kernel_connect() look up this
> > pidfs entry by calling something like pidfs_alloc_file()? Or does that
> > only happen later on, when the peer does getsockopt(SO_PEERPIDFD)?
>
> AF_UNIX sockets support SO_PEERPIDFD as you know. Users such as dbus or
> systemd want to be able to retrieve a pidfd for the peer even if the
> peer has already been reaped. To support this AF_UNIX ensures that when
> the peer credentials are set up (connect(), listen()) the corresponding
> @pid will also be registered in pidfs. This ensures that exit
> information is stored in the inode if we hand out a pidfd for a reaped
> task. IOW, we only hand out pidfds for reaped tasks if at the time of
> reaping a pidfs entry existed for it.
>
> Since we're setting coredump information on the pidfd here we're calling
> pidfs_register_pid() even before connect() sets up the peer credentials
> so we're sure that the coredump information is stored in the inode.
>
> Then we delay our pidfs_put_pid() call until the connect() took its own
> reference and thus continues pinning the inode. IOW, connect() will also
> call pidfs_register_pid() but it will ofc just increment the reference
> count ensuring that our pidfs_put_pid() doesn't drop the inode.

Aah, so the call graph looks like this:

unix_stream_connect
  prepare_peercred
    pidfs_register_pid
      [pidfs reference taken]
  [point of no return]
  init_peercred
    [copies creds to socket, moving ref ownership]
  copy_peercred
    [copies creds from socket to peer socket, taking refs]

Thanks for explaining!
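
The reference ordering described above can be modelled in a few lines
of userspace C. This is a single-threaded sketch under stated
assumptions, not the kernel code: struct pidfs_entry, entry_get(),
entry_put(), model_connect() and coredump_connect_model() are
hypothetical names, and a plain int replaces the real reference count.

```c
/* Hypothetical userspace model of a reference-counted pidfs entry. */
struct pidfs_entry {
	int refs;              /* number of holders pinning the entry */
	int has_coredump_info; /* set once coredump info is recorded */
	int freed;             /* set when the last reference is dropped */
};

static void entry_get(struct pidfs_entry *e)
{
	e->refs++;
}

static void entry_put(struct pidfs_entry *e)
{
	if (--e->refs == 0) {
		/* Last holder gone: the entry and its info are lost. */
		e->has_coredump_info = 0;
		e->freed = 1;
	}
}

/* Models connect(): the peer credentials take their own reference. */
static void model_connect(struct pidfs_entry *e)
{
	entry_get(e);
}

/* Models the coredump path: pin first, connect, then drop our pin. */
static int coredump_connect_model(struct pidfs_entry *e)
{
	entry_get(e);             /* like pidfs_register_pid() before connect() */
	e->has_coredump_info = 1; /* record coredump info while pinned */
	model_connect(e);         /* the socket now holds its own reference */
	entry_put(e);             /* like the delayed pidfs_put_pid() */
	return e->has_coredump_info;
}
```

Dropping the early reference before model_connect() would bring the
count to zero and lose the recorded information, which is why the put
is delayed until after the connect.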


Thread overview: 43+ messages
2025-05-14 22:03 [PATCH v7 0/9] coredump: add coredump socket Christian Brauner
2025-05-14 22:03 ` [PATCH v7 1/9] coredump: massage format_corname() Christian Brauner
2025-05-15 13:19   ` Alexander Mikhalitsyn
2025-05-15 13:36   ` Serge E. Hallyn
2025-05-15 20:52   ` Jann Horn
2025-05-14 22:03 ` [PATCH v7 2/9] coredump: massage do_coredump() Christian Brauner
2025-05-15 13:21   ` Alexander Mikhalitsyn
2025-05-15 20:52   ` Jann Horn
2025-05-14 22:03 ` [PATCH v7 3/9] coredump: reflow dump helpers a little Christian Brauner
2025-05-15 13:22   ` Alexander Mikhalitsyn
2025-05-15 20:53   ` Jann Horn
2025-05-14 22:03 ` [PATCH v7 4/9] coredump: add coredump socket Christian Brauner
2025-05-15 13:47   ` Alexander Mikhalitsyn
2025-05-16  8:30     ` Christian Brauner
2025-05-15 17:00   ` Kuniyuki Iwashima
2025-05-15 20:52     ` Jann Horn
2025-05-15 21:04       ` Kuniyuki Iwashima
2025-05-16 10:14     ` Christian Brauner
2025-05-15 20:54   ` Jann Horn
2025-05-15 21:15     ` Kuniyuki Iwashima
2025-05-16 10:09     ` Christian Brauner
2025-05-16 10:20       ` Christian Brauner
2025-05-14 22:03 ` [PATCH v7 5/9] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
2025-05-15 14:08   ` Alexander Mikhalitsyn
2025-05-15 20:56   ` Jann Horn
2025-05-15 21:37     ` Jann Horn
2025-05-16 10:34     ` Christian Brauner
2025-05-16 14:26       ` Jann Horn
2025-05-14 22:03 ` [PATCH v7 6/9] coredump: show supported coredump modes Christian Brauner
2025-05-15 13:56   ` Alexander Mikhalitsyn
2025-05-15 20:56   ` Jann Horn
2025-05-14 22:03 ` [PATCH v7 7/9] coredump: validate socket name as it is written Christian Brauner
2025-05-15 14:03   ` Alexander Mikhalitsyn
2025-05-15 20:56   ` Jann Horn
2025-05-16  9:54     ` Christian Brauner
2025-05-16 13:29       ` Christian Brauner
2025-05-14 22:03 ` [PATCH v7 8/9] selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure Christian Brauner
2025-05-15 14:35   ` Alexander Mikhalitsyn
2025-05-14 22:03 ` [PATCH v7 9/9] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
2025-05-15 14:37   ` Alexander Mikhalitsyn
2025-05-14 22:38 ` [PATCH v7 0/9] coredump: add coredump socket Luca Boccassi
2025-05-15  9:17 ` Christian Brauner
2025-05-15  9:26 ` Lennart Poettering
