* [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets
@ 2025-05-02 12:42 Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 1/6] coredump: massage format_corname() Christian Brauner
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
I need some help with the following questions:
(i) The core_pipe_limit setting is of vital importance to userspace
because it allows it to a) limit the number of concurrent coredumps
and b) causes the kernel to wait until userspace closes the pipe and
thus prevents the process from being reaped, allowing userspace to
parse information out of /proc/<pid>/.
Pipes already support this. I need to know from the networking
people (or Oleg :)) how to wait for the userspace side to shutdown
the socket/terminate the connection.
I don't want to just read() because then userspace can send us
SCM_RIGHTS messages and it's really ugly anyway.
(ii) The dumpability setting is of importance for userspace in order to
know how a given binary is dumped: as regular user or as root user.
This helps guard against exploits abusing set*id binaries. The
setting needs to be the same as used at the time of the coredump.
I'm exposing this as part of PIDFD_GET_INFO. I would like some
input whether it's fine to simply expose the dumpability this way.
I'm pretty sure it is. But it'd be good to have @Jann give his
thoughts here.
Now the actual blurb:
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In this case systemd-coredump is spawned as a usermode helper. There's
various conceptual consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
to the read-end of the pipe. All other file descriptors are closed.
That specifically includes 1 (stdout) and 2 (stderr). This has already
caused bugs because userspace assumed that this cannot happen (Whether
or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1 so it cannot be waited upon and is in general a weird
hybrid upcall.
- systemd-coredump is spawned highly privileged as it is spawned with
full kernel credentials requiring all kinds of weird privilege
dropping excercises in userspaces.
This adds another mode:
(3) Dumping into a AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
:/run/coredump.socket
The ":" at the beginning indicates to the kernel that an AF_UNIX socket
is used to process coredumps. The task generating the coredump simply
connects to the socket and writes the coredump into the socket.
Userspace can get a stable handle on the task generating the coredump by
using the SO_PEERPIDFD socket option. SO_PEERPIDFD uses the thread-group
leader pid stashed during connect(). Even if the task generating the
coredump is a subthread in the thread-group the pidfd of the
thread-group leader is a reliable stable handle. Userspace that's
interested in the credentials of the specific thread that crashed can
use SCM_PIDFD to retrieve them.
The pidfd can be used to safely open and parse /proc/<pid> of the task
and it can also be used to retrieve additional meta information via the
PIDFD_GET_INFO ioctl().
This will allow userspace to not have to rely on usermode helpers for
processing coredumps and thus to stop having to handle super privileged
coredumping helpers.
This is easy to test:
(a) coredump processing (we're using socat):
> cat coredump_socket.sh
#!/bin/bash
set -x
sudo bash -c "echo ':/tmp/stream.sock' > /proc/sys/kernel/core_pattern"
socat --statistics unix-listen:/tmp/stream.sock,fork FILE:core_file,create,append,truncate
(b) trigger a coredump:
user1@localhost:~/data/scripts$ cat crash.c
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
fprintf(stderr, "%u\n", (1 / 0));
_exit(0);
}
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org
---
Christian Brauner (6):
coredump: massage format_corname()
coredump: massage do_coredump()
coredump: support AF_UNIX sockets
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
selftests/coredump: add tests for AF_UNIX coredumps
fs/coredump.c | 312 ++++++++++++++++------
fs/pidfs.c | 58 ++++
include/linux/pidfs.h | 3 +
include/uapi/linux/pidfd.h | 11 +
tools/testing/selftests/coredump/stackdump_test.c | 50 ++++
5 files changed, 359 insertions(+), 75 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH RFC v2 1/6] coredump: massage format_corname()
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
@ 2025-05-02 12:42 ` Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 2/6] coredump: massage do_coredump() Christian Brauner
` (5 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 41 ++++++++++++++++++++++++-----------------
1 file changed, 24 insertions(+), 17 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index d740a0411266..281320ea351f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -76,9 +76,15 @@ static char core_pattern[CORENAME_MAX_SIZE] = "core";
static int core_name_size = CORENAME_MAX_SIZE;
unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
+enum coredump_type_t {
+ COREDUMP_FILE = 1,
+ COREDUMP_PIPE = 2,
+};
+
struct core_name {
char *corename;
int used, size;
+ enum coredump_type_t core_type;
};
static int expand_corename(struct core_name *cn, int size)
@@ -218,18 +224,21 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
{
const struct cred *cred = current_cred();
const char *pat_ptr = core_pattern;
- int ispipe = (*pat_ptr == '|');
bool was_space = false;
int pid_in_pattern = 0;
int err = 0;
cn->used = 0;
cn->corename = NULL;
+ if (*pat_ptr == '|')
+ cn->core_type = COREDUMP_PIPE;
+ else
+ cn->core_type = COREDUMP_FILE;
if (expand_corename(cn, core_name_size))
return -ENOMEM;
cn->corename[0] = '\0';
- if (ispipe) {
+ if (cn->core_type == COREDUMP_PIPE) {
int argvs = sizeof(core_pattern) / 2;
(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
if (!(*argv))
@@ -247,7 +256,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
* Split on spaces before doing template expansion so that
* %e and %E don't get split if they have spaces in them
*/
- if (ispipe) {
+ if (cn->core_type == COREDUMP_PIPE) {
if (isspace(*pat_ptr)) {
if (cn->used != 0)
was_space = true;
@@ -353,7 +362,7 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
* Installing a pidfd only makes sense if
* we actually spawn a usermode helper.
*/
- if (!ispipe)
+ if (!(cn->core_type != COREDUMP_PIPE))
break;
/*
@@ -384,12 +393,12 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
* If core_pattern does not include a %p (as is the default)
* and core_uses_pid is set, then .%pid will be appended to
* the filename. Do not do this for piped commands. */
- if (!ispipe && !pid_in_pattern && core_uses_pid) {
+ if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
err = cn_printf(cn, ".%d", task_tgid_vnr(current));
if (err)
return err;
}
- return ispipe;
+ return 0;
}
static int zap_process(struct signal_struct *signal, int exit_code)
@@ -583,7 +592,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
const struct cred *old_cred;
struct cred *cred;
int retval = 0;
- int ispipe;
size_t *argv = NULL;
int argc = 0;
/* require nonrelative corefile path and be extra careful */
@@ -632,19 +640,18 @@ void do_coredump(const kernel_siginfo_t *siginfo)
old_cred = override_creds(cred);
- ispipe = format_corename(&cn, &cprm, &argv, &argc);
+ retval = format_corename(&cn, &cprm, &argv, &argc);
+ if (retval < 0) {
+ coredump_report_failure("format_corename failed, aborting core");
+ goto fail_unlock;
+ }
- if (ispipe) {
+ if (cn.core_type == COREDUMP_PIPE) {
int argi;
int dump_count;
char **helper_argv;
struct subprocess_info *sub_info;
- if (ispipe < 0) {
- coredump_report_failure("format_corename failed, aborting core");
- goto fail_unlock;
- }
-
if (cprm.limit == 1) {
/* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
*
@@ -695,7 +702,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
coredump_report_failure("|%s pipe failed", cn.corename);
goto close_fail;
}
- } else {
+ } else if (cn.core_type == COREDUMP_FILE) {
struct mnt_idmap *idmap;
struct inode *inode;
int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -823,13 +830,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
file_end_write(cprm.file);
free_vma_snapshot(&cprm);
}
- if (ispipe && core_pipe_limit)
+ if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
wait_for_dump_helpers(cprm.file);
close_fail:
if (cprm.file)
filp_close(cprm.file, NULL);
fail_dropcount:
- if (ispipe)
+ if (cn.core_type == COREDUMP_PIPE)
atomic_dec(&core_dump_count);
fail_unlock:
kfree(argv);
--
2.47.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC v2 2/6] coredump: massage do_coredump()
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 1/6] coredump: massage format_corname() Christian Brauner
@ 2025-05-02 12:42 ` Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets Christian Brauner
` (4 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 66 insertions(+), 57 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index 281320ea351f..1779299b8c61 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -646,63 +646,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
goto fail_unlock;
}
- if (cn.core_type == COREDUMP_PIPE) {
- int argi;
- int dump_count;
- char **helper_argv;
- struct subprocess_info *sub_info;
-
- if (cprm.limit == 1) {
- /* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
- *
- * Normally core limits are irrelevant to pipes, since
- * we're not writing to the file system, but we use
- * cprm.limit of 1 here as a special value, this is a
- * consistent way to catch recursive crashes.
- * We can still crash if the core_pattern binary sets
- * RLIM_CORE = !1, but it runs as root, and can do
- * lots of stupid things.
- *
- * Note that we use task_tgid_vnr here to grab the pid
- * of the process group leader. That way we get the
- * right pid if a thread in a multi-threaded
- * core_pattern process dies.
- */
- coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
- goto fail_unlock;
- }
- cprm.limit = RLIM_INFINITY;
-
- dump_count = atomic_inc_return(&core_dump_count);
- if (core_pipe_limit && (core_pipe_limit < dump_count)) {
- coredump_report_failure("over core_pipe_limit, skipping core dump");
- goto fail_dropcount;
- }
-
- helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
- GFP_KERNEL);
- if (!helper_argv) {
- coredump_report_failure("%s failed to allocate memory", __func__);
- goto fail_dropcount;
- }
- for (argi = 0; argi < argc; argi++)
- helper_argv[argi] = cn.corename + argv[argi];
- helper_argv[argi] = NULL;
-
- retval = -ENOMEM;
- sub_info = call_usermodehelper_setup(helper_argv[0],
- helper_argv, NULL, GFP_KERNEL,
- umh_coredump_setup, NULL, &cprm);
- if (sub_info)
- retval = call_usermodehelper_exec(sub_info,
- UMH_WAIT_EXEC);
-
- kfree(helper_argv);
- if (retval) {
- coredump_report_failure("|%s pipe failed", cn.corename);
- goto close_fail;
- }
- } else if (cn.core_type == COREDUMP_FILE) {
+ switch (cn.core_type) {
+ case COREDUMP_FILE: {
struct mnt_idmap *idmap;
struct inode *inode;
int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
@@ -796,6 +741,70 @@ void do_coredump(const kernel_siginfo_t *siginfo)
if (do_truncate(idmap, cprm.file->f_path.dentry,
0, 0, cprm.file))
goto close_fail;
+ break;
+ }
+ case COREDUMP_PIPE: {
+ int argi;
+ int dump_count;
+ char **helper_argv;
+ struct subprocess_info *sub_info;
+
+ if (cprm.limit == 1) {
+ /* See umh_coredump_setup() which sets RLIMIT_CORE = 1.
+ *
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * cprm.limit of 1 here as a special value, this is a
+ * consistent way to catch recursive crashes.
+ * We can still crash if the core_pattern binary sets
+ * RLIM_CORE = !1, but it runs as root, and can do
+ * lots of stupid things.
+ *
+ * Note that we use task_tgid_vnr here to grab the pid
+ * of the process group leader. That way we get the
+ * right pid if a thread in a multi-threaded
+ * core_pattern process dies.
+ */
+ coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
+ goto fail_unlock;
+ }
+ cprm.limit = RLIM_INFINITY;
+
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ coredump_report_failure("over core_pipe_limit, skipping core dump");
+ goto fail_dropcount;
+ }
+
+ helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
+ GFP_KERNEL);
+ if (!helper_argv) {
+ coredump_report_failure("%s failed to allocate memory", __func__);
+ goto fail_dropcount;
+ }
+ for (argi = 0; argi < argc; argi++)
+ helper_argv[argi] = cn.corename + argv[argi];
+ helper_argv[argi] = NULL;
+
+ retval = -ENOMEM;
+ sub_info = call_usermodehelper_setup(helper_argv[0],
+ helper_argv, NULL, GFP_KERNEL,
+ umh_coredump_setup, NULL, &cprm);
+ if (sub_info)
+ retval = call_usermodehelper_exec(sub_info,
+ UMH_WAIT_EXEC);
+
+ kfree(helper_argv);
+ if (retval) {
+ coredump_report_failure("|%s pipe failed", cn.corename);
+ goto close_fail;
+ }
+ break;
+ }
+ default:
+ WARN_ON_ONCE(true);
+ retval = -EINVAL;
+ goto close_fail;
}
/* get us an unshared descriptor table; almost always a no-op */
--
2.47.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 1/6] coredump: massage format_corname() Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 2/6] coredump: massage do_coredump() Christian Brauner
@ 2025-05-02 12:42 ` Christian Brauner
2025-05-02 14:04 ` Jann Horn
2025-05-02 12:42 ` [PATCH RFC v2 4/6] coredump: show supported coredump modes Christian Brauner
` (3 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In this case systemd-coredump is spawned as a usermode helper. There's
various conceptual consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
to the read-end of the pipe. All other file descriptors are closed.
That specifically includes 1 (stdout) and 2 (stderr). This has already
caused bugs because userspace assumed that this cannot happen (Whether
or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1 so it cannot be waited upon and is in general a weird
hybrid upcall.
- systemd-coredump is spawned highly privileged as it is spawned with
full kernel credentials requiring all kinds of weird privilege
dropping excercises in userspaces.
This adds another mode:
(3) Dumping into a AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
:/run/coredump.socket
The ":" at the beginning indicates to the kernel that an AF_UNIX socket
is used to process coredumps. The task generating the coredump simply
connects to the socket and writes the coredump into the socket.
Userspace can get a stable handle on the task generating the coredump by
using the SO_PEERPIDFD socket option. SO_PEERPIDFD uses the thread-group
leader pid stashed during connect(). Even if the task generating the
coredump is a subthread in the thread-group the pidfd of the
thread-group leader is a reliable stable handle. Userspace that's
interested in the credentials of the specific thread that crashed can
use SCM_PIDFD to retrieve them.
The pidfd can be used to safely open and parse /proc/<pid> of the task
and it can also be used to retrieve additional meta information via the
PIDFD_GET_INFO ioctl().
This will allow userspace to not have to rely on usermode helpers for
processing coredumps and thus to stop having to handle super privileged
coredumping helpers.
This is easy to test:
(a) coredump processing (we're using socat):
> cat coredump_socket.sh
#!/bin/bash
set -x
sudo bash -c "echo ':/tmp/stream.sock' > /proc/sys/kernel/core_pattern"
socat --statistics unix-listen:/tmp/stream.sock,fork FILE:core_file,create,append,truncate
(b) trigger a coredump:
user1@localhost:~/data/scripts$ cat crash.c
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
fprintf(stderr, "%u\n", (1 / 0));
_exit(0);
}
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 132 insertions(+), 5 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index 1779299b8c61..9a6cba233db9 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -45,6 +45,9 @@
#include <linux/elf.h>
#include <linux/pidfs.h>
#include <uapi/linux/pidfd.h>
+#include <linux/net.h>
+#include <uapi/linux/un.h>
+#include <linux/socket.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -79,6 +82,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
enum coredump_type_t {
COREDUMP_FILE = 1,
COREDUMP_PIPE = 2,
+ COREDUMP_SOCK = 3,
};
struct core_name {
@@ -232,13 +236,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
cn->corename = NULL;
if (*pat_ptr == '|')
cn->core_type = COREDUMP_PIPE;
+ else if (*pat_ptr == ':')
+ cn->core_type = COREDUMP_SOCK;
else
cn->core_type = COREDUMP_FILE;
if (expand_corename(cn, core_name_size))
return -ENOMEM;
cn->corename[0] = '\0';
- if (cn->core_type == COREDUMP_PIPE) {
+ switch (cn->core_type) {
+ case COREDUMP_PIPE: {
int argvs = sizeof(core_pattern) / 2;
(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
if (!(*argv))
@@ -247,6 +254,39 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
++pat_ptr;
if (!(*pat_ptr))
return -ENOMEM;
+ break;
+ }
+ case COREDUMP_SOCK: {
+ /* skip ':' */
+ ++pat_ptr;
+ /* no spaces */
+ if (!(*pat_ptr))
+ return -EINVAL;
+ /* must be an absolute path */
+ if (!(*pat_ptr == '/'))
+ return -EINVAL;
+ err = cn_printf(cn, "%s", pat_ptr);
+ if (err)
+ return err;
+ /*
+ * For simplicitly we simply refuse spaces in the socket
+ * path. It's in line with what we do for pipes.
+ */
+ if (strchr(cn->corename, ' '))
+ return -EINVAL;
+
+ /*
+ * Currently no need to parse any other options.
+ * Relevant information can be retrieved from the peer
+ * pidfd retrievable via SO_PEERPIDFD by the receiver or
+ * via /proc/<pid>, using the SO_PEERPIDFD to guard
+ * against pid recycling when opening /proc/<pid>.
+ */
+ return 0;
+ }
+ default:
+ WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
+ break;
}
/* Repeat as long as we have more pattern to process and more output
@@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
}
break;
}
+ case COREDUMP_SOCK: {
+ struct file *file __free(fput) = NULL;
+#ifdef CONFIG_UNIX
+ ssize_t addr_size;
+ struct sockaddr_un unix_addr = {
+ .sun_family = AF_UNIX,
+ };
+ struct sockaddr_storage *addr;
+
+ /*
+ * TODO: We need to really support core_pipe_limit to
+ * prevent the task from being reaped before userspace
+ * had a chance to look at /proc/<pid>.
+ *
+ * I need help from the networking people (or maybe Oleg
+ * also knows?) how to do this.
+ *
+ * IOW, we need to wait for the other side to shutdown
+ * the socket/terminate the connection.
+ *
+ * We could just read but then userspace could sent us
+ * SCM_RIGHTS and we just shouldn't need to deal with
+ * any of that.
+ */
+ if (WARN_ON_ONCE(core_pipe_limit)) {
+ retval = -EINVAL;
+ goto close_fail;
+ }
+
+ retval = strscpy(unix_addr.sun_path, cn.corename, sizeof(unix_addr.sun_path));
+ if (retval < 0)
+ goto close_fail;
+ addr_size = offsetof(struct sockaddr_un, sun_path) + retval + 1,
+
+ file = __sys_socket_file(AF_UNIX, SOCK_STREAM, 0);
+ if (IS_ERR(file))
+ goto close_fail;
+
+ /*
+ * It is possible that the userspace process which is
+ * supposed to handle the coredump and is listening on
+ * the AF_UNIX socket coredumps. This should be fine
+ * though. If this was the only process which was
+ * listen()ing on the AF_UNIX socket for coredumps it
+ * obviously won't be listen()ing anymore by the time it
+ * gets here. So the __sys_connect_file() call will
+ * often fail with ECONNREFUSED and the coredump.
+ *
+ * In general though, userspace should just mark itself
+ * non dumpable and not do any of this nonsense. We
+ * shouldn't work around this.
+ */
+ addr = (struct sockaddr_storage *)(&unix_addr);
+ retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);
+ if (retval)
+ goto close_fail;
+
+ /* The peer isn't supposed to write and we for sure won't read. */
+ retval = __sys_shutdown_sock(sock_from_file(file), SHUT_RD);
+ if (retval)
+ goto close_fail;
+
+ cprm.limit = RLIM_INFINITY;
+#endif
+ cprm.file = no_free_ptr(file);
+ break;
+ }
default:
WARN_ON_ONCE(true);
retval = -EINVAL;
@@ -818,7 +925,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
* have this set to NULL.
*/
if (!cprm.file) {
- coredump_report_failure("Core dump to |%s disabled", cn.corename);
+ if (cn.core_type == COREDUMP_PIPE)
+ coredump_report_failure("Core dump to |%s disabled", cn.corename);
+ else
+ coredump_report_failure("Core dump to :%s disabled", cn.corename);
goto close_fail;
}
if (!dump_vma_snapshot(&cprm))
@@ -839,8 +949,25 @@ void do_coredump(const kernel_siginfo_t *siginfo)
file_end_write(cprm.file);
free_vma_snapshot(&cprm);
}
- if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
- wait_for_dump_helpers(cprm.file);
+
+ if (core_pipe_limit) {
+ switch (cn.core_type) {
+ case COREDUMP_PIPE:
+ wait_for_dump_helpers(cprm.file);
+ break;
+ case COREDUMP_SOCK: {
+ /*
+ * TODO: Wait for the coredump handler to shut
+ * down the socket so we prevent the task from
+ * being reaped.
+ */
+ break;
+ }
+ default:
+ break;
+ }
+ }
+
close_fail:
if (cprm.file)
filp_close(cprm.file, NULL);
@@ -1070,7 +1197,7 @@ EXPORT_SYMBOL(dump_align);
void validate_coredump_safety(void)
{
if (suid_dumpable == SUID_DUMP_ROOT &&
- core_pattern[0] != '/' && core_pattern[0] != '|') {
+ core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != ':') {
coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
"pipe handler or fully qualified core dump path required. "
--
2.47.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC v2 4/6] coredump: show supported coredump modes
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
` (2 preceding siblings ...)
2025-05-02 12:42 ` [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets Christian Brauner
@ 2025-05-02 12:42 ` Christian Brauner
2025-05-02 14:07 ` Jann Horn
2025-05-02 12:42 ` [PATCH RFC v2 5/6] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
` (2 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
Allow userspace to discover what coredump modes are supported.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/coredump.c b/fs/coredump.c
index 9a6cba233db9..1c7428c23878 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1217,6 +1217,13 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
+static char core_modes[] = {
+#ifdef CONFIG_UNIX
+ "file\npipe\nunix"
+#else
+ "file\npipe"
+#endif
+};
static const struct ctl_table coredump_sysctls[] = {
{
@@ -1260,6 +1267,13 @@ static const struct ctl_table coredump_sysctls[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
},
+ {
+ .procname = "core_modes",
+ .data = core_modes,
+ .maxlen = sizeof(core_modes) - 1,
+ .mode = 0444,
+ .proc_handler = proc_dostring,
+ },
};
static int __init init_fs_coredump_sysctls(void)
--
2.47.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC v2 5/6] pidfs, coredump: add PIDFD_INFO_COREDUMP
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
` (3 preceding siblings ...)
2025-05-02 12:42 ` [PATCH RFC v2 4/6] coredump: show supported coredump modes Christian Brauner
@ 2025-05-02 12:42 ` Christian Brauner
2025-05-02 14:10 ` Jann Horn
2025-05-02 12:42 ` [PATCH RFC v2 6/6] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
2025-05-02 14:04 ` [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Jann Horn
6 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
Let userspace know that the task coredumped and whether it was dumped as
root or as regular user. The latter is needed so that access permissions
to the executable are correctly handled.
I don't think this requires any additional privileges checks. The
missing exposure of the dumpability attribute of a given task is an
issue we should fix given that we already expose whether a task is
coredumping or not.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 5 ++++
fs/pidfs.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/pidfs.h | 3 +++
include/uapi/linux/pidfd.h | 11 +++++++++
4 files changed, 77 insertions(+)
diff --git a/fs/coredump.c b/fs/coredump.c
index 1c7428c23878..735b5b94c518 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -48,6 +48,7 @@
#include <linux/net.h>
#include <uapi/linux/un.h>
#include <linux/socket.h>
+#include <net/sock.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -594,6 +595,8 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
if (IS_ERR(pidfs_file))
return PTR_ERR(pidfs_file);
+ pidfs_coredump(cp);
+
/*
* Usermode helpers are childen of either
* system_unbound_wq or of kthreadd. So we know that
@@ -904,6 +907,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
goto close_fail;
cprm.limit = RLIM_INFINITY;
+ cprm.pid = task_tgid(current);
+ pidfs_coredump(&cprm);
#endif
cprm.file = no_free_ptr(file);
break;
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 3b39e471840b..c891f4ef9c38 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -20,6 +20,7 @@
#include <linux/time_namespace.h>
#include <linux/utsname.h>
#include <net/net_namespace.h>
+#include <linux/coredump.h>
#include "internal.h"
#include "mount.h"
@@ -33,6 +34,7 @@ static struct kmem_cache *pidfs_cachep __ro_after_init;
struct pidfs_exit_info {
__u64 cgroupid;
__s32 exit_code;
+ __s32 coredump_mask;
};
struct pidfs_inode {
@@ -240,6 +242,22 @@ static inline bool pid_in_current_pidns(const struct pid *pid)
return false;
}
+static __u32 pidfs_coredump_mask(unsigned long mm_flags)
+{
+ switch (__get_dumpable(mm_flags)) {
+ case SUID_DUMP_USER:
+ return PIDFD_COREDUMP_USER;
+ case SUID_DUMP_ROOT:
+ return PIDFD_COREDUMP_ROOT;
+ case SUID_DUMP_DISABLE:
+ return PIDFD_COREDUMP_SKIP;
+ default:
+ WARN_ON_ONCE(true);
+ }
+
+ return 0;
+}
+
static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
{
struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
@@ -292,6 +310,21 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
goto copy_out;
}
+ /*
+ * Should we require privilege to access that information? Seems
+ * pointless because we already expose whether a task is
+ * coredumping in /proc/<pid>/status.
+ */
+ if (mask & PIDFD_INFO_COREDUMP) {
+ kinfo.mask |= PIDFD_INFO_COREDUMP;
+ kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
+ if (!kinfo.coredump_mask) {
+ task_lock(task);
+ kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
+ task_unlock(task);
+ }
+ }
+
c = get_task_cred(task);
if (!c)
return -ESRCH;
@@ -559,6 +592,31 @@ void pidfs_exit(struct task_struct *tsk)
}
}
+void pidfs_coredump(const struct coredump_params *cprm)
+{
+ struct pid *pid = cprm->pid;
+ struct pidfs_exit_info *exit_info;
+ struct dentry *dentry;
+ struct inode *inode;
+ __u32 coredump_mask = 0;
+
+ dentry = stashed_dentry_get(&pid->stashed);
+ if (WARN_ON_ONCE(!dentry))
+ return;
+
+ inode = d_inode(dentry);
+ exit_info = &pidfs_i(inode)->__pei;
+ /* Note how we were coredumped. */
+ coredump_mask = pidfs_coredump_mask(cprm->mm_flags);
+ /* Note that we actually did coredump. */
+ coredump_mask |= PIDFD_COREDUMPED;
+ /* If coredumping is set to skip we should never end up here. */
+ VFS_WARN_ON_ONCE(coredump_mask & PIDFD_COREDUMP_SKIP);
+ smp_store_release(&exit_info->coredump_mask, coredump_mask);
+ /* Fwiw, this cannot be the last reference. */
+ dput(dentry);
+}
+
static struct vfsmount *pidfs_mnt __ro_after_init;
/*
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 2676890c4d0d..f7729b9371bc 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -2,11 +2,14 @@
#ifndef _LINUX_PID_FS_H
#define _LINUX_PID_FS_H
+struct coredump_params;
+
struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
void __init pidfs_init(void);
void pidfs_add_pid(struct pid *pid);
void pidfs_remove_pid(struct pid *pid);
void pidfs_exit(struct task_struct *tsk);
+void pidfs_coredump(const struct coredump_params *cprm);
extern const struct dentry_operations pidfs_dentry_operations;
int pidfs_register_pid(struct pid *pid);
void pidfs_get_pid(struct pid *pid);
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 8c1511edd0e9..b7c831f4cf4b 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -25,9 +25,19 @@
#define PIDFD_INFO_CREDS (1UL << 1) /* Always returned, even if not requested */
#define PIDFD_INFO_CGROUPID (1UL << 2) /* Always returned if available, even if not requested */
#define PIDFD_INFO_EXIT (1UL << 3) /* Only returned if requested. */
+#define PIDFD_INFO_COREDUMP (1UL << 4) /* Only returned if requested. */
#define PIDFD_INFO_SIZE_VER0 64 /* sizeof first published struct */
+/*
+ * Values for @coredump_mask in pidfd_info.
+ * Only valid if PIDFD_INFO_SUID_COREDUMP is set in @mask.
+ */
+#define PIDFD_COREDUMPED (1U << 0) /* Did crash and... */
+#define PIDFD_COREDUMP_SKIP (1U << 1) /* coredumping generation was skipped. */
+#define PIDFD_COREDUMP_USER (1U << 2) /* coredump was done as the user. */
+#define PIDFD_COREDUMP_ROOT (1U << 3) /* coredump was done as root. */
+
/*
* The concept of process and threads in userland and the kernel is a confusing
* one - within the kernel every thread is a 'task' with its own individual PID,
@@ -92,6 +102,7 @@ struct pidfd_info {
__u32 fsuid;
__u32 fsgid;
__s32 exit_code;
+ __u32 coredump_mask;
};
#define PIDFS_IOCTL_MAGIC 0xFF
--
2.47.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH RFC v2 6/6] selftests/coredump: add tests for AF_UNIX coredumps
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
` (4 preceding siblings ...)
2025-05-02 12:42 ` [PATCH RFC v2 5/6] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
@ 2025-05-02 12:42 ` Christian Brauner
2025-05-02 14:04 ` [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Jann Horn
6 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 12:42 UTC (permalink / raw)
To: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
Jann Horn
Cc: David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev,
Christian Brauner
Add a simple test for generating coredumps via AF_UNIX sockets.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
tools/testing/selftests/coredump/stackdump_test.c | 50 +++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index fe3c728cd6be..ccaee98dbee3 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -5,7 +5,9 @@
#include <linux/limits.h>
#include <pthread.h>
#include <string.h>
+#include <sys/mount.h>
#include <sys/resource.h>
+#include <sys/stat.h>
#include <unistd.h>
#include "../kselftest_harness.h"
@@ -154,4 +156,52 @@ TEST_F_TIMEOUT(coredump, stackdump, 120)
fclose(file);
}
+TEST_F(coredump, socket)
+{
+ int fd, ret, status;
+ FILE *file;
+ pid_t pid, pid_socat;
+ struct stat st;
+ char core_file[PATH_MAX];
+
+ ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+ ASSERT_EQ(mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), 0);
+ ASSERT_EQ(mount(NULL, "/tmp", "tmpfs", 0, NULL), 0);
+
+ pid_socat = fork();
+ ASSERT_GE(pid_socat, 0);
+ if (pid_socat == 0) {
+ execlp("socat", "socat",
+ "unix-listen:/tmp/coredump.socket,fork",
+ "FILE:/tmp/coredump_file,create,append,trunc",
+ (char *)NULL);
+ _exit(EXIT_FAILURE);
+ }
+
+ file = fopen("/proc/sys/kernel/core_pattern", "w");
+ ASSERT_NE(NULL, file);
+
+ ret = fprintf(file, ":/tmp/coredump.socket");
+ ASSERT_EQ(ret, strlen(":/tmp/coredump.socket"));
+ ASSERT_EQ(fclose(file), 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+ if (pid == 0)
+ crashing_child();
+
+ waitpid(pid, &status, 0);
+ ASSERT_TRUE(WIFSIGNALED(status));
+ ASSERT_TRUE(WCOREDUMP(status));
+
+ ASSERT_EQ(kill(pid_socat, SIGTERM), 0);
+ waitpid(pid_socat, &status, 0);
+ ASSERT_TRUE(WIFEXITED(status));
+ ASSERT_EQ(WEXITSTATUS(status), 143);
+
+ /* We should somehow validate the produced core file. */
+ ASSERT_EQ(stat("/tmp/coredump_file", &st), 0);
+ ASSERT_GT(st.st_size, 0);
+}
+
TEST_HARNESS_MAIN
--
2.47.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
` (5 preceding siblings ...)
2025-05-02 12:42 ` [PATCH RFC v2 6/6] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
@ 2025-05-02 14:04 ` Jann Horn
2025-05-02 19:25 ` Christian Brauner
6 siblings, 1 reply; 17+ messages in thread
From: Jann Horn @ 2025-05-02 14:04 UTC (permalink / raw)
To: Christian Brauner
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> I need some help with the following questions:
>
> (i) The core_pipe_limit setting is of vital importance to userspace
> because it allows it to a) limit the number of concurrent coredumps
> and b) causes the kernel to wait until userspace closes the pipe and
> thus prevents the process from being reaped, allowing userspace to
> parse information out of /proc/<pid>/.
>
> Pipes already support this. I need to know from the networking
> people (or Oleg :)) how to wait for the userspace side to shutdown
> the socket/terminate the connection.
>
> I don't want to just read() because then userspace can send us
> SCM_RIGHTS messages and it's really ugly anyway.
>
> (ii) The dumpability setting is of importance for userspace in order to
> know how a given binary is dumped: as regular user or as root user.
> This helps guard against exploits abusing set*id binaries. The
> setting needs to be the same as used at the time of the coredump.
>
> I'm exposing this as part of PIDFD_GET_INFO. I would like some
> input whether it's fine to simply expose the dumpability this way.
> I'm pretty sure it is. But it'd be good to have @Jann give his
> thoughts here.
My only concern here is that if we expect the userspace daemon to look
at the dumpability field and treat nondumpable tasks as "this may
contain secret data and resources owned by various UIDs mixed
together, only root should see the dump", we should have at least very
clear documentation around this.
[...]
> Userspace can get a stable handle on the task generating the coredump by
> using the SO_PEERPIDFD socket option. SO_PEERPIDFD uses the thread-group
> leader pid stashed during connect(). Even if the task generating the
Unrelated to this series: Huh, I think I haven't seen SO_PEERPIDFD
before. I guess one interesting consequence of that feature is that if
you get a unix domain socket whose peer is in another PID namespace,
you can call pidfd_getfd() on that peer, which wouldn't normally be
possible? Though of course it'll still be subject to the normal ptrace
checks.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
2025-05-02 12:42 ` [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets Christian Brauner
@ 2025-05-02 14:04 ` Jann Horn
2025-05-02 20:10 ` Christian Brauner
0 siblings, 1 reply; 17+ messages in thread
From: Jann Horn @ 2025-05-02 14:04 UTC (permalink / raw)
To: Christian Brauner
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> diff --git a/fs/coredump.c b/fs/coredump.c
[...]
> @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> }
> break;
> }
> + case COREDUMP_SOCK: {
> + struct file *file __free(fput) = NULL;
> +#ifdef CONFIG_UNIX
> + ssize_t addr_size;
> + struct sockaddr_un unix_addr = {
> + .sun_family = AF_UNIX,
> + };
> + struct sockaddr_storage *addr;
> +
> + /*
> + * TODO: We need to really support core_pipe_limit to
> + * prevent the task from being reaped before userspace
> + * had a chance to look at /proc/<pid>.
> + *
> + * I need help from the networking people (or maybe Oleg
> + * also knows?) how to do this.
> + *
> + * IOW, we need to wait for the other side to shutdown
> + * the socket/terminate the connection.
> + *
> + * We could just read but then userspace could sent us
> + * SCM_RIGHTS and we just shouldn't need to deal with
> + * any of that.
> + */
I don't think userspace can send you SCM_RIGHTS if you don't do a
recvmsg() with a control data buffer?
> + if (WARN_ON_ONCE(core_pipe_limit)) {
> + retval = -EINVAL;
> + goto close_fail;
> + }
> +
> + retval = strscpy(unix_addr.sun_path, cn.corename, sizeof(unix_addr.sun_path));
> + if (retval < 0)
> + goto close_fail;
> + addr_size = offsetof(struct sockaddr_un, sun_path) + retval + 1,
> +
> + file = __sys_socket_file(AF_UNIX, SOCK_STREAM, 0);
> + if (IS_ERR(file))
> + goto close_fail;
> +
> + /*
> + * It is possible that the userspace process which is
> + * supposed to handle the coredump and is listening on
> + * the AF_UNIX socket coredumps. This should be fine
> + * though. If this was the only process which was
> + * listen()ing on the AF_UNIX socket for coredumps it
> + * obviously won't be listen()ing anymore by the time it
> + * gets here. So the __sys_connect_file() call will
> + * often fail with ECONNREFUSED and the coredump.
Why will the server not be listening anymore? Have the task's file
descriptors already been closed by the time we get here?
(Maybe just get rid of this comment, I agree with the following
comment saying we should let userspace deal with this.)
> + * In general though, userspace should just mark itself
> + * non dumpable and not do any of this nonsense. We
> + * shouldn't work around this.
> + */
> + addr = (struct sockaddr_storage *)(&unix_addr);
> + retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);
Have you made an intentional decision on whether you want to connect
to a unix domain socket with a path relative to current->fs->root (so
that containers can do their own core dump handling) or relative to
the root namespace root (so that core dumps always reach the init
namespace's core dumping even if a process sandboxes itself with
namespaces or such)? Also, I think this connection attempt will be
subject to restrictions imposed by (for example) Landlock or AppArmor,
I'm not sure if that is desired here (since this is not actually a
connection that the process in whose context the call happens decided
to make, it's something the system administrator decided to do, and
especially with Landlock, policies are controlled by individual
applications that may not know how core dumps work on the system).
I guess if we keep the current behavior where the socket path is
namespaced, then we also need to keep the security checks, since an
unprivileged user could probably set up a namespace and chroot() to a
place where the socket path (indirectly, through a symlink) refers to
an arbitrary socket...
An alternative design might be to directly register the server socket
on the userns/mountns/netns or such in some magic way, and then have
the core dumping walk up the namespace hierarchy until it finds a
namespace that has opted in to using its own core dumping socket, and
connect to that socket bypassing security checks. (A bit like how
namespaced binfmt_misc works.) Like, maybe userspace with namespaced
CAP_SYS_ADMIN could bind() to some magic UNIX socket address, or use
some new setsockopt() on the socket or such, to become the handler of
core dumps? This would also have the advantage that malicious
userspace wouldn't be able to send fake bogus core dumps to the
server, and the server would provide clear consent to being connected
to without security checks at connection time.
> + if (retval)
> + goto close_fail;
> +
> + /* The peer isn't supposed to write and we for sure won't read. */
> + retval = __sys_shutdown_sock(sock_from_file(file), SHUT_RD);
> + if (retval)
> + goto close_fail;
> +
> + cprm.limit = RLIM_INFINITY;
> +#endif
> + cprm.file = no_free_ptr(file);
> + break;
> + }
> default:
> WARN_ON_ONCE(true);
> retval = -EINVAL;
> @@ -818,7 +925,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> * have this set to NULL.
> */
> if (!cprm.file) {
> - coredump_report_failure("Core dump to |%s disabled", cn.corename);
> + if (cn.core_type == COREDUMP_PIPE)
> + coredump_report_failure("Core dump to |%s disabled", cn.corename);
> + else
> + coredump_report_failure("Core dump to :%s disabled", cn.corename);
> goto close_fail;
> }
> if (!dump_vma_snapshot(&cprm))
> @@ -839,8 +949,25 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> file_end_write(cprm.file);
> free_vma_snapshot(&cprm);
> }
> - if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> - wait_for_dump_helpers(cprm.file);
> +
> + if (core_pipe_limit) {
> + switch (cn.core_type) {
> + case COREDUMP_PIPE:
> + wait_for_dump_helpers(cprm.file);
> + break;
> + case COREDUMP_SOCK: {
> + /*
> + * TODO: Wait for the coredump handler to shut
> + * down the socket so we prevent the task from
> + * being reaped.
> + */
Hmm, I'm no expert but maybe you could poll for the POLLRDHUP event...
though that might require writing your own helper with a loop that
does vfs_poll() and waits for a poll wakeup, since I don't think there
is a kernel helper analogous to a synchronous poll() syscall yet.
> + break;
> + }
> + default:
> + break;
> + }
> + }
> +
> close_fail:
> if (cprm.file)
> filp_close(cprm.file, NULL);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 4/6] coredump: show supported coredump modes
2025-05-02 12:42 ` [PATCH RFC v2 4/6] coredump: show supported coredump modes Christian Brauner
@ 2025-05-02 14:07 ` Jann Horn
2025-05-02 20:11 ` Christian Brauner
0 siblings, 1 reply; 17+ messages in thread
From: Jann Horn @ 2025-05-02 14:07 UTC (permalink / raw)
To: Christian Brauner
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 2, 2025 at 2:43 PM Christian Brauner <brauner@kernel.org> wrote:
> Allow userspace to discover what coredump modes are supported.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/coredump.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 9a6cba233db9..1c7428c23878 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1217,6 +1217,13 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
>
> static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
> static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
> +static char core_modes[] = {
> +#ifdef CONFIG_UNIX
> + "file\npipe\nunix"
> +#else
> + "file\npipe"
> +#endif
> +};
Nit: You could do something like
static char core_modes[] =
#ifdef CONFIG_UNIX
"unix\n"
#endif
"file\npipe"
;
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 5/6] pidfs, coredump: add PIDFD_INFO_COREDUMP
2025-05-02 12:42 ` [PATCH RFC v2 5/6] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
@ 2025-05-02 14:10 ` Jann Horn
0 siblings, 0 replies; 17+ messages in thread
From: Jann Horn @ 2025-05-02 14:10 UTC (permalink / raw)
To: Christian Brauner
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 2, 2025 at 2:43 PM Christian Brauner <brauner@kernel.org> wrote:
> Let userspace know that the task coredumped and whether it was dumped as
> root or as regular user. The latter is needed so that access permissions
> to the executable are correctly handled.
>
> I don't think this requires any additional privileges checks. The
> missing exposure of the dumpability attribute of a given task is an
> issue we should fix given that we already expose whether a task is
> coredumping or not.
Yeah, it certainly isn't more sensitive than things like the exit code and UIDs.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets
2025-05-02 14:04 ` [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Jann Horn
@ 2025-05-02 19:25 ` Christian Brauner
0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 19:25 UTC (permalink / raw)
To: Jann Horn
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 02, 2025 at 04:04:28PM +0200, Jann Horn wrote:
> On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> > I need some help with the following questions:
> >
> > (i) The core_pipe_limit setting is of vital importance to userspace
> > because it allows it to a) limit the number of concurrent coredumps
> > and b) causes the kernel to wait until userspace closes the pipe and
> > thus prevents the process from being reaped, allowing userspace to
> > parse information out of /proc/<pid>/.
> >
> > Pipes already support this. I need to know from the networking
> > people (or Oleg :)) how to wait for the userspace side to shutdown
> > the socket/terminate the connection.
> >
> > I don't want to just read() because then userspace can send us
> > SCM_RIGHTS messages and it's really ugly anyway.
> >
> > (ii) The dumpability setting is of importance for userspace in order to
> > know how a given binary is dumped: as regular user or as root user.
> > This helps guard against exploits abusing set*id binaries. The
> > setting needs to be the same as used at the time of the coredump.
> >
> > I'm exposing this as part of PIDFD_GET_INFO. I would like some
> > input whether it's fine to simply expose the dumpability this way.
> > I'm pretty sure it is. But it'd be good to have @Jann give his
> > thoughts here.
>
> My only concern here is that if we expect the userspace daemon to look
> at the dumpability field and treat nondumpable tasks as "this may
> contain secret data and resources owned by various UIDs mixed
> together, only root should see the dump", we should have at least very
> clear documentation around this.
>
> [...]
> > Userspace can get a stable handle on the task generating the coredump by
> > using the SO_PEERPIDFD socket option. SO_PEERPIDFD uses the thread-group
> > leader pid stashed during connect(). Even if the task generating the
>
> Unrelated to this series: Huh, I think I haven't seen SO_PEERPIDFD
> before. I guess one interesting consequence of that feature is that if
It's very heavily used by dbus-broker, polkit and systemd to safely
authenticate clients instead of by PIDs. (Fyi, it's even supported for
bluetooth sockets so they could benefit from this as well I'm sure.)
> you get a unix domain socket whose peer is in another PID namespace,
> you can call pidfd_getfd() on that peer, which wouldn't normally be
> possible? Though of course it'll still be subject to the normal ptrace
> checks.
I think that was already possible because you could send pidfds via
SCM_RIGHTS. That's a lot more cooperative than SO_PEERPIDFD of course
but still.
But if that's an issue we could of course enforce that pidfd_getfd() may
only work if the target is within your pidns hierarchy just as we do for
the PIDFD_GET_INFO ioctl() already. But I'm not sure it's an issue.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
2025-05-02 14:04 ` Jann Horn
@ 2025-05-02 20:10 ` Christian Brauner
2025-05-02 20:23 ` Jann Horn
0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 20:10 UTC (permalink / raw)
To: Jann Horn
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 02, 2025 at 04:04:32PM +0200, Jann Horn wrote:
> On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> > diff --git a/fs/coredump.c b/fs/coredump.c
> [...]
> > @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > }
> > break;
> > }
> > + case COREDUMP_SOCK: {
> > + struct file *file __free(fput) = NULL;
> > +#ifdef CONFIG_UNIX
> > + ssize_t addr_size;
> > + struct sockaddr_un unix_addr = {
> > + .sun_family = AF_UNIX,
> > + };
> > + struct sockaddr_storage *addr;
> > +
> > + /*
> > + * TODO: We need to really support core_pipe_limit to
> > + * prevent the task from being reaped before userspace
> > + * had a chance to look at /proc/<pid>.
> > + *
> > + * I need help from the networking people (or maybe Oleg
> > + * also knows?) how to do this.
> > + *
> > + * IOW, we need to wait for the other side to shutdown
> > + * the socket/terminate the connection.
> > + *
> > + * We could just read but then userspace could sent us
> > + * SCM_RIGHTS and we just shouldn't need to deal with
> > + * any of that.
> > + */
>
> I don't think userspace can send you SCM_RIGHTS if you don't do a
> recvmsg() with a control data buffer?
Oh hm, then maybe just a regular read at the end would work. As soon as
userspace send us anything or we get a close event we just disconnect.
But btw, I think we really need a recvmsg() flag that allows a receiver
to refuse SCM_RIGHTS/file descriptors from being sent to it. IIRC, right
now this is a real issue that systemd works around by always calling its
cmsg_close_all() helper after each recvmsg() to ensure that no one sent
it file descriptors it didn't want. The problem there is that someone
could have sent it an fd to a hanging NFS server or something and then
it would hang in close() even though it never even wanted any file
descriptors in the first place.
>
> > + if (WARN_ON_ONCE(core_pipe_limit)) {
> > + retval = -EINVAL;
> > + goto close_fail;
> > + }
> > +
> > + retval = strscpy(unix_addr.sun_path, cn.corename, sizeof(unix_addr.sun_path));
> > + if (retval < 0)
> > + goto close_fail;
> > + addr_size = offsetof(struct sockaddr_un, sun_path) + retval + 1,
> > +
> > + file = __sys_socket_file(AF_UNIX, SOCK_STREAM, 0);
> > + if (IS_ERR(file))
> > + goto close_fail;
> > +
> > + /*
> > + * It is possible that the userspace process which is
> > + * supposed to handle the coredump and is listening on
> > + * the AF_UNIX socket coredumps. This should be fine
> > + * though. If this was the only process which was
> > + * listen()ing on the AF_UNIX socket for coredumps it
> > + * obviously won't be listen()ing anymore by the time it
> > + * gets here. So the __sys_connect_file() call will
> > + * often fail with ECONNREFUSED and the coredump.
>
> Why will the server not be listening anymore? Have the task's file
> descriptors already been closed by the time we get here?
No, the file descriptors are still open.
>
> (Maybe just get rid of this comment, I agree with the following
> comment saying we should let userspace deal with this.)
Good idea.
>
> > + * In general though, userspace should just mark itself
> > + * non dumpable and not do any of this nonsense. We
> > + * shouldn't work around this.
> > + */
> > + addr = (struct sockaddr_storage *)(&unix_addr);
> > + retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);
>
> Have you made an intentional decision on whether you want to connect
> to a unix domain socket with a path relative to current->fs->root (so
> that containers can do their own core dump handling) or relative to
> the root namespace root (so that core dumps always reach the init
> namespace's core dumping even if a process sandboxes itself with
> namespaces or such)? Also, I think this connection attempt will be
Fsck no. :) I just jotted this down as an RFC. Details below.
> subject to restrictions imposed by (for example) Landlock or AppArmor,
> I'm not sure if that is desired here (since this is not actually a
> connection that the process in whose context the call happens decided
> to make, it's something the system administrator decided to do, and
> especially with Landlock, policies are controlled by individual
> applications that may not know how core dumps work on the system).
>
> I guess if we keep the current behavior where the socket path is
> namespaced, then we also need to keep the security checks, since an
> unprivileged user could probably set up a namespace and chroot() to a
> place where the socket path (indirectly, through a symlink) refers to
> an arbitrary socket...
>
> An alternative design might be to directly register the server socket
> on the userns/mountns/netns or such in some magic way, and then have
> the core dumping walk up the namespace hierarchy until it finds a
> namespace that has opted in to using its own core dumping socket, and
> connect to that socket bypassing security checks. (A bit like how
> namespaced binfmt_misc works.) Like, maybe userspace with namespaced
Yeah, I namespaced that thing. :)
> CAP_SYS_ADMIN could bind() to some magic UNIX socket address, or use
> some new setsockopt() on the socket or such, to become the handler of
> core dumps? This would also have the advantage that malicious
> userspace wouldn't be able to send fake bogus core dumps to the
> server, and the server would provide clear consent to being connected
> to without security checks at connection time.
I think that's policy that I absolute don't want the kernel to get
involved in unless absolutely necessary. A few days ago I just discussed
this at length with Lennart and the issue is that systemd would want to
see all coredumps on the system independent of the namespace they're
created in. To have a per-namespace (userns/mountns/netns) coredump
socket would invalidate that one way or the other and end up hiding
coredumps from the administrator unless there's some elaborate scheme
where it doesn't.
systemd-coredump (and Apport fwiw) has infrastructure to forward
coredumps to individual services and containers and it's already based
on AF_UNIX afaict. And I really like that it's the job of userspace to
deal with this instead of the kernel having to get involved in that
mess.
So all of this should be relative to the initial namespace. I want a
separate security hook though so an LSMs can be used to prevent
processes from connecting to the coredump socket.
My idea has been that systemd-coredump could use a bpf lsm program that
would allow to abort a coredump before the crashing process connects to
the socket and again make this a userspace policy issue.
>
> > + if (retval)
> > + goto close_fail;
> > +
> > + /* The peer isn't supposed to write and we for sure won't read. */
> > + retval = __sys_shutdown_sock(sock_from_file(file), SHUT_RD);
> > + if (retval)
> > + goto close_fail;
> > +
> > + cprm.limit = RLIM_INFINITY;
> > +#endif
> > + cprm.file = no_free_ptr(file);
> > + break;
> > + }
> > default:
> > WARN_ON_ONCE(true);
> > retval = -EINVAL;
> > @@ -818,7 +925,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > * have this set to NULL.
> > */
> > if (!cprm.file) {
> > - coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > + if (cn.core_type == COREDUMP_PIPE)
> > + coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > + else
> > + coredump_report_failure("Core dump to :%s disabled", cn.corename);
> > goto close_fail;
> > }
> > if (!dump_vma_snapshot(&cprm))
> > @@ -839,8 +949,25 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > file_end_write(cprm.file);
> > free_vma_snapshot(&cprm);
> > }
> > - if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> > - wait_for_dump_helpers(cprm.file);
> > +
> > + if (core_pipe_limit) {
> > + switch (cn.core_type) {
> > + case COREDUMP_PIPE:
> > + wait_for_dump_helpers(cprm.file);
> > + break;
> > + case COREDUMP_SOCK: {
> > + /*
> > + * TODO: Wait for the coredump handler to shut
> > + * down the socket so we prevent the task from
> > + * being reaped.
> > + */
>
> Hmm, I'm no expert but maybe you could poll for the POLLRDHUP event...
> though that might require writing your own helper with a loop that
> does vfs_poll() and waits for a poll wakeup, since I don't think there
> is a kernel helper analogous to a synchronous poll() syscall yet.
>
> > + break;
> > + }
> > + default:
> > + break;
> > + }
> > + }
> > +
> > close_fail:
> > if (cprm.file)
> > filp_close(cprm.file, NULL);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 4/6] coredump: show supported coredump modes
2025-05-02 14:07 ` Jann Horn
@ 2025-05-02 20:11 ` Christian Brauner
0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-05-02 20:11 UTC (permalink / raw)
To: Jann Horn
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 02, 2025 at 04:07:23PM +0200, Jann Horn wrote:
> On Fri, May 2, 2025 at 2:43 PM Christian Brauner <brauner@kernel.org> wrote:
> > Allow userspace to discover what coredump modes are supported.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/coredump.c | 14 ++++++++++++++
> > 1 file changed, 14 insertions(+)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 9a6cba233db9..1c7428c23878 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -1217,6 +1217,13 @@ static int proc_dostring_coredump(const struct ctl_table *table, int write,
> >
> > static const unsigned int core_file_note_size_min = CORE_FILE_NOTE_SIZE_DEFAULT;
> > static const unsigned int core_file_note_size_max = CORE_FILE_NOTE_SIZE_MAX;
> > +static char core_modes[] = {
> > +#ifdef CONFIG_UNIX
> > + "file\npipe\nunix"
> > +#else
> > + "file\npipe"
> > +#endif
> > +};
>
> Nit: You could do something like
>
> static char core_modes[] =
> #ifdef CONFIG_UNIX
> "unix\n"
> #endif
> "file\npipe"
> ;
Thanks!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
2025-05-02 20:10 ` Christian Brauner
@ 2025-05-02 20:23 ` Jann Horn
2025-05-03 5:17 ` Christian Brauner
0 siblings, 1 reply; 17+ messages in thread
From: Jann Horn @ 2025-05-02 20:23 UTC (permalink / raw)
To: Christian Brauner
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 2, 2025 at 10:11 PM Christian Brauner <brauner@kernel.org> wrote:
> On Fri, May 02, 2025 at 04:04:32PM +0200, Jann Horn wrote:
> > On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > [...]
> > > @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > > }
> > > break;
> > > }
> > > + case COREDUMP_SOCK: {
> > > + struct file *file __free(fput) = NULL;
> > > +#ifdef CONFIG_UNIX
> > > + ssize_t addr_size;
> > > + struct sockaddr_un unix_addr = {
> > > + .sun_family = AF_UNIX,
> > > + };
> > > + struct sockaddr_storage *addr;
> > > +
> > > + /*
> > > + * TODO: We need to really support core_pipe_limit to
> > > + * prevent the task from being reaped before userspace
> > > + * had a chance to look at /proc/<pid>.
> > > + *
> > > + * I need help from the networking people (or maybe Oleg
> > > + * also knows?) how to do this.
> > > + *
> > > + * IOW, we need to wait for the other side to shutdown
> > > + * the socket/terminate the connection.
> > > + *
> > > + * We could just read but then userspace could sent us
> > > + * SCM_RIGHTS and we just shouldn't need to deal with
> > > + * any of that.
> > > + */
> >
> > I don't think userspace can send you SCM_RIGHTS if you don't do a
> > recvmsg() with a control data buffer?
>
> Oh hm, then maybe just a regular read at the end would work. As soon as
> userspace send us anything or we get a close event we just disconnect.
>
> But btw, I think we really need a recvmsg() flag that allows a receiver
> to refuse SCM_RIGHTS/file descriptors from being sent to it. IIRC, right
> now this is a real issue that systemd works around by always calling its
> cmsg_close_all() helper after each recvmsg() to ensure that no one sent
> it file descriptors it didn't want. The problem there is that someone
> could have sent it an fd to a hanging NFS server or something and then
> it would hang in close() even though it never even wanted any file
> descriptors in the first place.
Would a recvmsg() flag really solve that aspect of NFS hangs? By the
time you read from the socket, the file is already attached to an SKB
queued up on the socket, and cleaning up the file is your task's
responsibility either way (which will either be done by the kernel for
you if you don't read it into a control message, or by userspace if it
was handed off through a control message). The process that sent the
file to you might already be gone, it can't be on the hook for
cleaning up the file anymore.
I think the thorough fix would probably be to introduce a socket
option (controlled via setsockopt()) that already blocks the peer's
sendmsg().
> > > + * In general though, userspace should just mark itself
> > > + * non dumpable and not do any of this nonsense. We
> > > + * shouldn't work around this.
> > > + */
> > > + addr = (struct sockaddr_storage *)(&unix_addr);
> > > + retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);
> >
> > Have you made an intentional decision on whether you want to connect
> > to a unix domain socket with a path relative to current->fs->root (so
> > that containers can do their own core dump handling) or relative to
> > the root namespace root (so that core dumps always reach the init
> > namespace's core dumping even if a process sandboxes itself with
> > namespaces or such)? Also, I think this connection attempt will be
>
> Fsck no. :) I just jotted this down as an RFC. Details below.
>
> > subject to restrictions imposed by (for example) Landlock or AppArmor,
> > I'm not sure if that is desired here (since this is not actually a
> > connection that the process in whose context the call happens decided
> > to make, it's something the system administrator decided to do, and
> > especially with Landlock, policies are controlled by individual
> > applications that may not know how core dumps work on the system).
> >
> > I guess if we keep the current behavior where the socket path is
> > namespaced, then we also need to keep the security checks, since an
> > unprivileged user could probably set up a namespace and chroot() to a
> > place where the socket path (indirectly, through a symlink) refers to
> > an arbitrary socket...
> >
> > An alternative design might be to directly register the server socket
> > on the userns/mountns/netns or such in some magic way, and then have
> > the core dumping walk up the namespace hierarchy until it finds a
> > namespace that has opted in to using its own core dumping socket, and
> > connect to that socket bypassing security checks. (A bit like how
> > namespaced binfmt_misc works.) Like, maybe userspace with namespaced
>
> Yeah, I namespaced that thing. :)
Oh, hah, sorry, I forgot that was you.
> > CAP_SYS_ADMIN could bind() to some magic UNIX socket address, or use
> > some new setsockopt() on the socket or such, to become the handler of
> > core dumps? This would also have the advantage that malicious
> > userspace wouldn't be able to send fake bogus core dumps to the
> > server, and the server would provide clear consent to being connected
> > to without security checks at connection time.
>
> I think that's policy that I absolute don't want the kernel to get
> involved in unless absolutely necessary. A few days ago I just discussed
> this at length with Lennart and the issue is that systemd would want to
> see all coredumps on the system independent of the namespace they're
> created in. To have a per-namespace (userns/mountns/netns) coredump
> socket would invalidate that one way or the other and end up hiding
> coredumps from the administrator unless there's some elaborate scheme
> where it doesn't.
>
> systemd-coredump (and Apport fwiw) has infrastructure to forward
> coredumps to individual services and containers and it's already based
> on AF_UNIX afaict. And I really like that it's the job of userspace to
> deal with this instead of the kernel having to get involved in that
> mess.
>
> So all of this should be relative to the initial namespace. I want a
Ah, sounds good.
> separate security hook though so an LSMs can be used to prevent
> processes from connecting to the coredump socket.
>
> My idea has been that systemd-coredump could use a bpf lsm program that
> would allow to abort a coredump before the crashing process connects to
> the socket and again make this a userspace policy issue.
I don't understand this part. Why would you need an LSM to prevent a
crashing process from connecting, can't the coredumping server process
apply whatever filtering it wants in userspace?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
2025-05-02 20:23 ` Jann Horn
@ 2025-05-03 5:17 ` Christian Brauner
2025-05-05 19:03 ` Kuniyuki Iwashima
0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2025-05-03 5:17 UTC (permalink / raw)
To: Jann Horn
Cc: Eric Dumazet, Kuniyuki Iwashima, Oleg Nesterov, linux-fsdevel,
David S. Miller, Alexander Viro, Daan De Meyer, David Rheinsberg,
Jakub Kicinski, Jan Kara, Lennart Poettering, Luca Boccassi,
Mike Yuan, Paolo Abeni, Simon Horman,
Zbigniew Jędrzejewski-Szmek, linux-kernel, netdev
On Fri, May 02, 2025 at 10:23:44PM +0200, Jann Horn wrote:
> On Fri, May 2, 2025 at 10:11 PM Christian Brauner <brauner@kernel.org> wrote:
> > On Fri, May 02, 2025 at 04:04:32PM +0200, Jann Horn wrote:
> > > On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > [...]
> > > > @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > > > }
> > > > break;
> > > > }
> > > > + case COREDUMP_SOCK: {
> > > > + struct file *file __free(fput) = NULL;
> > > > +#ifdef CONFIG_UNIX
> > > > + ssize_t addr_size;
> > > > + struct sockaddr_un unix_addr = {
> > > > + .sun_family = AF_UNIX,
> > > > + };
> > > > + struct sockaddr_storage *addr;
> > > > +
> > > > + /*
> > > > + * TODO: We need to really support core_pipe_limit to
> > > > + * prevent the task from being reaped before userspace
> > > > + * had a chance to look at /proc/<pid>.
> > > > + *
> > > > + * I need help from the networking people (or maybe Oleg
> > > > + * also knows?) how to do this.
> > > > + *
> > > > + * IOW, we need to wait for the other side to shutdown
> > > > + * the socket/terminate the connection.
> > > > + *
> > > > + * We could just read but then userspace could sent us
> > > > + * SCM_RIGHTS and we just shouldn't need to deal with
> > > > + * any of that.
> > > > + */
> > >
> > > I don't think userspace can send you SCM_RIGHTS if you don't do a
> > > recvmsg() with a control data buffer?
> >
> > Oh hm, then maybe just a regular read at the end would work. As soon as
> > userspace send us anything or we get a close event we just disconnect.
> >
> > But btw, I think we really need a recvmsg() flag that allows a receiver
> > to refuse SCM_RIGHTS/file descriptors from being sent to it. IIRC, right
> > now this is a real issue that systemd works around by always calling its
> > cmsg_close_all() helper after each recvmsg() to ensure that no one sent
> > it file descriptors it didn't want. The problem there is that someone
> > could have sent it an fd to a hanging NFS server or something and then
> > it would hang in close() even though it never even wanted any file
> > descriptors in the first place.
>
> Would a recvmsg() flag really solve that aspect of NFS hangs? By the
> time you read from the socket, the file is already attached to an SKB
> queued up on the socket, and cleaning up the file is your task's
> responsibility either way (which will either be done by the kernel for
> you if you don't read it into a control message, or by userspace if it
> was handed off through a control message). The process that sent the
> file to you might already be gone, it can't be on the hook for
> cleaning up the file anymore.
Hm, I guess the unix_gc() runs in task context? I had thought that it
might take care of that.
>
> I think the thorough fix would probably be to introduce a socket
> option (controlled via setsockopt()) that already blocks the peer's
> sendmsg().
Yes, I had considered that as well.
>
> > > > + * In general though, userspace should just mark itself
> > > > + * non dumpable and not do any of this nonsense. We
> > > > + * shouldn't work around this.
> > > > + */
> > > > + addr = (struct sockaddr_storage *)(&unix_addr);
> > > > + retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);
> > >
> > > Have you made an intentional decision on whether you want to connect
> > > to a unix domain socket with a path relative to current->fs->root (so
> > > that containers can do their own core dump handling) or relative to
> > > the root namespace root (so that core dumps always reach the init
> > > namespace's core dumping even if a process sandboxes itself with
> > > namespaces or such)? Also, I think this connection attempt will be
> >
> > Fsck no. :) I just jotted this down as an RFC. Details below.
> >
> > > subject to restrictions imposed by (for example) Landlock or AppArmor,
> > > I'm not sure if that is desired here (since this is not actually a
> > > connection that the process in whose context the call happens decided
> > > to make, it's something the system administrator decided to do, and
> > > especially with Landlock, policies are controlled by individual
> > > applications that may not know how core dumps work on the system).
> > >
> > > I guess if we keep the current behavior where the socket path is
> > > namespaced, then we also need to keep the security checks, since an
> > > unprivileged user could probably set up a namespace and chroot() to a
> > > place where the socket path (indirectly, through a symlink) refers to
> > > an arbitrary socket...
> > >
> > > An alternative design might be to directly register the server socket
> > > on the userns/mountns/netns or such in some magic way, and then have
> > > the core dumping walk up the namespace hierarchy until it finds a
> > > namespace that has opted in to using its own core dumping socket, and
> > > connect to that socket bypassing security checks. (A bit like how
> > > namespaced binfmt_misc works.) Like, maybe userspace with namespaced
> >
> > Yeah, I namespaced that thing. :)
>
> Oh, hah, sorry, I forgot that was you.
>
> > > CAP_SYS_ADMIN could bind() to some magic UNIX socket address, or use
> > > some new setsockopt() on the socket or such, to become the handler of
> > > core dumps? This would also have the advantage that malicious
> > > userspace wouldn't be able to send fake bogus core dumps to the
> > > server, and the server would provide clear consent to being connected
> > > to without security checks at connection time.
> >
> > I think that's policy that I absolute don't want the kernel to get
> > involved in unless absolutely necessary. A few days ago I just discussed
> > this at length with Lennart and the issue is that systemd would want to
> > see all coredumps on the system independent of the namespace they're
> > created in. To have a per-namespace (userns/mountns/netns) coredump
> > socket would invalidate that one way or the other and end up hiding
> > coredumps from the administrator unless there's some elaborate scheme
> > where it doesn't.
> >
> > systemd-coredump (and Apport fwiw) has infrastructure to forward
> > coredumps to individual services and containers and it's already based
> > on AF_UNIX afaict. And I really like that it's the job of userspace to
> > deal with this instead of the kernel having to get involved in that
> > mess.
> >
> > So all of this should be relative to the initial namespace. I want a
>
> Ah, sounds good.
>
> > separate security hook though so an LSMs can be used to prevent
> > processes from connecting to the coredump socket.
> >
> > My idea has been that systemd-coredump could use a bpf lsm program that
> > would allow to abort a coredump before the crashing process connects to
> > the socket and again make this a userspace policy issue.
>
> I don't understand this part. Why would you need an LSM to prevent a
> crashing process from connecting, can't the coredumping server process
> apply whatever filtering it wants in userspace?
Coredumping is somewhat asynchronous in that the crash-dumping process
already starts writing by the time userspace could've made a decision
whether it should bother in the first place. Then userspace would need
to terminate the connection so that the kernel stops writing.
With a bpf LSM you could make a decision right when the connect happens
whether the task is even allowed to connect to the coredump socket in
the first place. This would also allow-rate limiting a reapeatedly
coredumping service/container.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
2025-05-03 5:17 ` Christian Brauner
@ 2025-05-05 19:03 ` Kuniyuki Iwashima
0 siblings, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-05 19:03 UTC (permalink / raw)
To: brauner
Cc: bluca, daan.j.demeyer, davem, david, edumazet, horms, jack, jannh,
kuba, kuniyu, lennart, linux-fsdevel, linux-kernel, me, netdev,
oleg, pabeni, viro, zbyszek
From: Christian Brauner <brauner@kernel.org>
Date: Sat, 3 May 2025 07:17:10 +0200
> On Fri, May 02, 2025 at 10:23:44PM +0200, Jann Horn wrote:
> > On Fri, May 2, 2025 at 10:11 PM Christian Brauner <brauner@kernel.org> wrote:
> > > On Fri, May 02, 2025 at 04:04:32PM +0200, Jann Horn wrote:
> > > > On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > > [...]
> > > > > @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > > > > }
> > > > > break;
> > > > > }
> > > > > + case COREDUMP_SOCK: {
> > > > > + struct file *file __free(fput) = NULL;
> > > > > +#ifdef CONFIG_UNIX
> > > > > + ssize_t addr_size;
> > > > > + struct sockaddr_un unix_addr = {
> > > > > + .sun_family = AF_UNIX,
> > > > > + };
> > > > > + struct sockaddr_storage *addr;
> > > > > +
> > > > > + /*
> > > > > + * TODO: We need to really support core_pipe_limit to
> > > > > + * prevent the task from being reaped before userspace
> > > > > + * had a chance to look at /proc/<pid>.
> > > > > + *
> > > > > + * I need help from the networking people (or maybe Oleg
> > > > > + * also knows?) how to do this.
> > > > > + *
> > > > > + * IOW, we need to wait for the other side to shutdown
> > > > > + * the socket/terminate the connection.
> > > > > + *
> > > > > + * We could just read but then userspace could sent us
> > > > > + * SCM_RIGHTS and we just shouldn't need to deal with
> > > > > + * any of that.
> > > > > + */
> > > >
> > > > I don't think userspace can send you SCM_RIGHTS if you don't do a
> > > > recvmsg() with a control data buffer?
> > >
> > > Oh hm, then maybe just a regular read at the end would work. As soon as
> > > userspace send us anything or we get a close event we just disconnect.
> > >
> > > But btw, I think we really need a recvmsg() flag that allows a receiver
> > > to refuse SCM_RIGHTS/file descriptors from being sent to it. IIRC, right
> > > now this is a real issue that systemd works around by always calling its
> > > cmsg_close_all() helper after each recvmsg() to ensure that no one sent
> > > it file descriptors it didn't want. The problem there is that someone
> > > could have sent it an fd to a hanging NFS server or something and then
> > > it would hang in close() even though it never even wanted any file
> > > descriptors in the first place.
> >
> > Would a recvmsg() flag really solve that aspect of NFS hangs? By the
> > time you read from the socket, the file is already attached to an SKB
> > queued up on the socket, and cleaning up the file is your task's
> > responsibility either way (which will either be done by the kernel for
> > you if you don't read it into a control message, or by userspace if it
> > was handed off through a control message).
Right. recvmsg() is too late. Once sendmsg() is done, the last
fput() responsibility could fall on the receiver.
Btw, I was able to implement the cmsg_close_all() equivalent at
sendmsg() with BPF LSM to completely remove the issue.
I will send a series shortly and hope you like it :)
> > The process that sent the
> > file to you might already be gone, it can't be on the hook for
> > cleaning up the file anymore.
>
> Hm, I guess the unix_gc() runs in task context? I had thought that it
> might take care of that.
Note that unix_gc() is a garbage collector only for AF_UNIX fds
that have circular dependency:
1) AF_UNIX sk1 sends its fd to itself
2) AF_UNIX sk1 sends its fd to AF_UNIX sk2 and
AF_UNIX sk2 sends its fd to AF_UNIX sk1
In these examples, file refcnts remain even after close() by all
users of fds.
So, the GC is not a mechanism to deligate fput() for fds sent
by SCM_RIGHTS.
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-05-05 19:04 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-02 12:42 [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 1/6] coredump: massage format_corname() Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 2/6] coredump: massage do_coredump() Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets Christian Brauner
2025-05-02 14:04 ` Jann Horn
2025-05-02 20:10 ` Christian Brauner
2025-05-02 20:23 ` Jann Horn
2025-05-03 5:17 ` Christian Brauner
2025-05-05 19:03 ` Kuniyuki Iwashima
2025-05-02 12:42 ` [PATCH RFC v2 4/6] coredump: show supported coredump modes Christian Brauner
2025-05-02 14:07 ` Jann Horn
2025-05-02 20:11 ` Christian Brauner
2025-05-02 12:42 ` [PATCH RFC v2 5/6] pidfs, coredump: add PIDFD_INFO_COREDUMP Christian Brauner
2025-05-02 14:10 ` Jann Horn
2025-05-02 12:42 ` [PATCH RFC v2 6/6] selftests/coredump: add tests for AF_UNIX coredumps Christian Brauner
2025-05-02 14:04 ` [PATCH RFC v2 0/6] coredump: support AF_UNIX sockets Jann Horn
2025-05-02 19:25 ` Christian Brauner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).