Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors are
closed. That specifically includes 1 (stdout) and 2 (stderr). This has
already caused bugs because userspace assumed that this cannot happen
(Whether or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly.
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe.
- A new usermode helper has to be spawned for each crashing process.
This series adds a new mode:
(3) Dumping into an abstract AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
@linuxafsk/coredump_socket
The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps.
The coredump socket uses the fixed address "linuxafsk/coredump.socket"
for now.
The coredump socket is located in the initial network namespace. To bind
the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
user namespace. Listening and reading can happen from whatever
unprivileged context is necessary to safely process coredumps.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket. For now only tasks that
are acctually coredumping are allowed to connect to the initial coredump
socket.
- The coredump server should use SO_PEERPIDFD to get a stable handle on
the connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump.
- By setting core_pipe_limit non-zero userspace can guarantee that the
crashing task cannot be reaped behind it's back and thus process all
necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
detect whether /proc/<pid> still refers to the same process.
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX socket directly.
- The pidfd for the crashing task will contain information how the task
coredumps. The PIDFD_GET_INFO ioctl gained a new flag
PIDFD_INFO_COREDUMP which can be used to retreive the coredump
information.
If the coredump gets a new coredump client connection the kernel
guarantees that PIDFD_INFO_COREDUMP information is available.
Currently the following information is provided in the new
@coredump_mask extension to struct pidfd_info:
* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server.
* IDFD_COREDUMP_ROOT is raised if the generated coredump should be
treated as sensitive and the coredump server should restrict to the
generated coredump to sufficiently privileged users.
- Since unix_stream_connect() runs bpf programs during connect it's
possible to even redirect or multiplex coredumps to other sockets.
- The coredump server should mark itself as non-dumpable.
To capture coredumps for the coredump server itself a bpf program
should be run at connect to redirect it to another socket in
userspace. This can be useful for debugging crashing coredump servers.
- A container coredump server in a separate network namespace can simply
bind to linuxafsk/coredump.socket and systemd-coredump fowards
coredumps to the container.
- Fwiw, one idea is to handle coredumps via per-user/session coredump
servers that run with that users privileges.
The coredump server listens on the coredump socket and accepts a
new coredump connection. It then retrieves SO_PEERPIDFD for the
client, inspects uid/gid and hands the accepted client to the users
own coredump handler which runs with the users privileges only.
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.
This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.
This is easy to test:
(a) coredump processing (we're using socat):
> cat coredump_socket.sh
#!/bin/bash
set -x
sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
(b) trigger a coredump:
user1@localhost:~/data/scripts$ cat crash.c
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
fprintf(stderr, "%u\n", (1 / 0));
_exit(0);
}
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 107 insertions(+), 5 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index 1779299b8c61..c60f86c473ad 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -44,7 +44,11 @@
#include <linux/sysctl.h>
#include <linux/elf.h>
#include <linux/pidfs.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+#include <net/net_namespace.h>
#include <uapi/linux/pidfd.h>
+#include <uapi/linux/un.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
enum coredump_type_t {
COREDUMP_FILE = 1,
COREDUMP_PIPE = 2,
+ COREDUMP_SOCK = 3,
};
struct core_name {
@@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
cn->corename = NULL;
if (*pat_ptr == '|')
cn->core_type = COREDUMP_PIPE;
+ else if (*pat_ptr == '@')
+ cn->core_type = COREDUMP_SOCK;
else
cn->core_type = COREDUMP_FILE;
if (expand_corename(cn, core_name_size))
return -ENOMEM;
cn->corename[0] = '\0';
- if (cn->core_type == COREDUMP_PIPE) {
+ switch (cn->core_type) {
+ case COREDUMP_PIPE: {
int argvs = sizeof(core_pattern) / 2;
(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
if (!(*argv))
@@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
++pat_ptr;
if (!(*pat_ptr))
return -ENOMEM;
+ break;
+ }
+ case COREDUMP_SOCK: {
+ err = cn_printf(cn, "%s", pat_ptr);
+ if (err)
+ return err;
+
+ /*
+ * We can potentially allow this to be changed later but
+ * I currently see no reason to.
+ */
+ if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
+ return -EINVAL;
+
+ /*
+ * Currently no need to parse any other options.
+ * Relevant information can be retrieved from the peer
+ * pidfd retrievable via SO_PEERPIDFD by the receiver or
+ * via /proc/<pid>, using the SO_PEERPIDFD to guard
+ * against pid recycling when opening /proc/<pid>.
+ */
+ return 0;
+ }
+ default:
+ WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
+ break;
}
/* Repeat as long as we have more pattern to process and more output
@@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
return 0;
}
+#ifdef CONFIG_UNIX
+struct sockaddr_un coredump_unix_socket = {
+ .sun_family = AF_UNIX,
+ .sun_path = "\0linuxafsk/coredump.socket",
+};
+/* Without trailing NUL byte. */
+#define COREDUMP_UNIX_SOCKET_ADDR_SIZE \
+ (offsetof(struct sockaddr_un, sun_path) + \
+ sizeof("\0linuxafsk/coredump.socket") - 1)
+#endif
+
void do_coredump(const kernel_siginfo_t *siginfo)
{
struct core_state core_state;
@@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
}
break;
}
+ case COREDUMP_SOCK: {
+ struct file *file __free(fput) = NULL;
+#ifdef CONFIG_UNIX
+ struct socket *socket;
+
+ /*
+ * It is possible that the userspace process which is
+ * supposed to handle the coredump and is listening on
+ * the AF_UNIX socket coredumps. Userspace should just
+ * mark itself non dumpable.
+ */
+
+ retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
+ if (retval < 0)
+ goto close_fail;
+
+ file = sock_alloc_file(socket, 0, NULL);
+ if (IS_ERR(file)) {
+ sock_release(socket);
+ retval = PTR_ERR(file);
+ goto close_fail;
+ }
+
+ retval = kernel_connect(socket,
+ (struct sockaddr *)(&coredump_unix_socket),
+ COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
+ if (retval)
+ goto close_fail;
+
+ cprm.limit = RLIM_INFINITY;
+#endif
+ cprm.file = no_free_ptr(file);
+ break;
+ }
default:
WARN_ON_ONCE(true);
retval = -EINVAL;
@@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
* have this set to NULL.
*/
if (!cprm.file) {
- coredump_report_failure("Core dump to |%s disabled", cn.corename);
+ if (cn.core_type == COREDUMP_PIPE)
+ coredump_report_failure("Core dump to |%s disabled", cn.corename);
+ else
+ coredump_report_failure("Core dump to @%s disabled", cn.corename);
goto close_fail;
}
if (!dump_vma_snapshot(&cprm))
@@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
file_end_write(cprm.file);
free_vma_snapshot(&cprm);
}
- if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
- wait_for_dump_helpers(cprm.file);
+
+ if (core_pipe_limit) {
+ switch (cn.core_type) {
+ case COREDUMP_PIPE:
+ wait_for_dump_helpers(cprm.file);
+ break;
+ case COREDUMP_SOCK: {
+ char buf[1];
+ /*
+ * We use a simple read to wait for the coredump
+ * processing to finish. Either the socket is
+ * closed or we get sent unexpected data. In
+ * both cases, we're done.
+ */
+ __kernel_read(cprm.file, buf, 1, NULL);
+ break;
+ }
+ default:
+ break;
+ }
+ }
+
close_fail:
if (cprm.file)
filp_close(cprm.file, NULL);
@@ -1070,7 +1172,7 @@ EXPORT_SYMBOL(dump_align);
void validate_coredump_safety(void)
{
if (suid_dumpable == SUID_DUMP_ROOT &&
- core_pattern[0] != '/' && core_pattern[0] != '|') {
+ core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
"pipe handler or fully qualified core dump path required. "
--
2.47.2
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 05 May 2025 13:13:42 +0200
> @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> }
> break;
> }
> + case COREDUMP_SOCK: {
> + struct file *file __free(fput) = NULL;
> +#ifdef CONFIG_UNIX
> + struct socket *socket;
> +
> + /*
> + * It is possible that the userspace process which is
> + * supposed to handle the coredump and is listening on
> + * the AF_UNIX socket coredumps. Userspace should just
> + * mark itself non dumpable.
> + */
> +
> + retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> + if (retval < 0)
> + goto close_fail;
> +
> + file = sock_alloc_file(socket, 0, NULL);
> + if (IS_ERR(file)) {
> + sock_release(socket);
> + retval = PTR_ERR(file);
> + goto close_fail;
> + }
> +
> + retval = kernel_connect(socket,
> + (struct sockaddr *)(&coredump_unix_socket),
> + COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
This blocks forever if the listener's accept() queue is full.
I think we don't want that and should pass O_NONBLOCK.
To keep the queue clean is userspace responsibility, and we don't
need to care about a weird user.
> + if (retval)
> + goto close_fail;
> +
> + cprm.limit = RLIM_INFINITY;
> +#endif
> + cprm.file = no_free_ptr(file);
> + break;
> + }
> default:
> WARN_ON_ONCE(true);
> retval = -EINVAL;
On Mon, May 05, 2025 at 11:48:43AM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Mon, 05 May 2025 13:13:42 +0200
> > @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > }
> > break;
> > }
> > + case COREDUMP_SOCK: {
> > + struct file *file __free(fput) = NULL;
> > +#ifdef CONFIG_UNIX
> > + struct socket *socket;
> > +
> > + /*
> > + * It is possible that the userspace process which is
> > + * supposed to handle the coredump and is listening on
> > + * the AF_UNIX socket coredumps. Userspace should just
> > + * mark itself non dumpable.
> > + */
> > +
> > + retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > + if (retval < 0)
> > + goto close_fail;
> > +
> > + file = sock_alloc_file(socket, 0, NULL);
> > + if (IS_ERR(file)) {
> > + sock_release(socket);
> > + retval = PTR_ERR(file);
> > + goto close_fail;
> > + }
> > +
> > + retval = kernel_connect(socket,
> > + (struct sockaddr *)(&coredump_unix_socket),
> > + COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
>
> This blocks forever if the listener's accept() queue is full.
>
> I think we don't want that and should pass O_NONBLOCK.
>
> To keep the queue clean is userspace responsibility, and we don't
> need to care about a weird user.
That seems fine to me. I've changed that.
On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> Coredumping currently supports two modes:
>
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
> spawned as a child of the system_unbound_wq or kthreadd.
>
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
>
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
>
> |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
>
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
>
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
>
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
> connected to the read-end of the pipe. All other file descriptors are
> closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> already caused bugs because userspace assumed that this cannot happen
> (Whether or not this is a sane assumption is irrelevant.).
>
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
> it is not a child of any userspace process and specifically not a
> child of PID 1. It cannot be waited upon and is in a weird hybrid
> upcall which are difficult for userspace to control correctly.
>
> - systemd-coredump is spawned with full kernel privileges. This
> necessitates all kinds of weird privilege dropping excercises in
> userspace to make this safe.
>
> - A new usermode helper has to be spawned for each crashing process.
>
> This series adds a new mode:
>
> (3) Dumping into an abstract AF_UNIX socket.
>
> Userspace can set /proc/sys/kernel/core_pattern to:
>
> @linuxafsk/coredump_socket
>
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps.
>
> The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> for now.
>
> The coredump socket is located in the initial network namespace. To bind
> the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> user namespace. Listening and reading can happen from whatever
> unprivileged context is necessary to safely process coredumps.
>
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket. For now only tasks that
> are acctually coredumping are allowed to connect to the initial coredump
> socket.
>
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
> the connected crashing task. The retrieved pidfd will provide a stable
> reference even if the crashing task gets SIGKILLed while generating
> the coredump.
>
> - By setting core_pipe_limit non-zero userspace can guarantee that the
> crashing task cannot be reaped behind it's back and thus process all
> necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> detect whether /proc/<pid> still refers to the same process.
>
> The core_pipe_limit isn't used to rate-limit connections to the
> socket. This can simply be done via AF_UNIX socket directly.
>
> - The pidfd for the crashing task will contain information how the task
> coredumps. The PIDFD_GET_INFO ioctl gained a new flag
> PIDFD_INFO_COREDUMP which can be used to retreive the coredump
> information.
>
> If the coredump gets a new coredump client connection the kernel
> guarantees that PIDFD_INFO_COREDUMP information is available.
> Currently the following information is provided in the new
> @coredump_mask extension to struct pidfd_info:
>
> * PIDFD_COREDUMPED is raised if the task did actually coredump.
> * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
> undumpable).
> * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
> doesn't need special care by the coredump server.
> * IDFD_COREDUMP_ROOT is raised if the generated coredump should be
> treated as sensitive and the coredump server should restrict to the
> generated coredump to sufficiently privileged users.
>
> - Since unix_stream_connect() runs bpf programs during connect it's
> possible to even redirect or multiplex coredumps to other sockets.
Or change the userspace protocol used for containers such that the
init-namespace coredumping helper forwards the FD it accept()ed into a
container via SCM_RIGHTS...
> - The coredump server should mark itself as non-dumpable.
> To capture coredumps for the coredump server itself a bpf program
> should be run at connect to redirect it to another socket in
> userspace. This can be useful for debugging crashing coredump servers.
>
> - A container coredump server in a separate network namespace can simply
> bind to linuxafsk/coredump.socket and systemd-coredump fowards
> coredumps to the container.
>
> - Fwiw, one idea is to handle coredumps via per-user/session coredump
> servers that run with that users privileges.
>
> The coredump server listens on the coredump socket and accepts a
> new coredump connection. It then retrieves SO_PEERPIDFD for the
> client, inspects uid/gid and hands the accepted client to the users
> own coredump handler which runs with the users privileges only.
(Though that would only be okay if it's not done for suid dumping cases.)
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
>
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.
I mean, if coredumping is a performance bottleneck, something is
probably seriously wrong with the system... I don't think we need to
optimize for execution speed in this area.
> This is easy to test:
>
> (a) coredump processing (we're using socat):
>
> > cat coredump_socket.sh
> #!/bin/bash
>
> set -x
>
> sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
> sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
>
> (b) trigger a coredump:
>
> user1@localhost:~/data/scripts$ cat crash.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
> fprintf(stderr, "%u\n", (1 / 0));
> _exit(0);
> }
This looks pretty neat overall!
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 107 insertions(+), 5 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 1779299b8c61..c60f86c473ad 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -44,7 +44,11 @@
> #include <linux/sysctl.h>
> #include <linux/elf.h>
> #include <linux/pidfs.h>
> +#include <linux/net.h>
> +#include <linux/socket.h>
> +#include <net/net_namespace.h>
> #include <uapi/linux/pidfd.h>
> +#include <uapi/linux/un.h>
>
> #include <linux/uaccess.h>
> #include <asm/mmu_context.h>
> @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> enum coredump_type_t {
> COREDUMP_FILE = 1,
> COREDUMP_PIPE = 2,
> + COREDUMP_SOCK = 3,
> };
>
> struct core_name {
> @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> cn->corename = NULL;
> if (*pat_ptr == '|')
> cn->core_type = COREDUMP_PIPE;
> + else if (*pat_ptr == '@')
> + cn->core_type = COREDUMP_SOCK;
> else
> cn->core_type = COREDUMP_FILE;
> if (expand_corename(cn, core_name_size))
> return -ENOMEM;
> cn->corename[0] = '\0';
>
> - if (cn->core_type == COREDUMP_PIPE) {
> + switch (cn->core_type) {
> + case COREDUMP_PIPE: {
> int argvs = sizeof(core_pattern) / 2;
> (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> if (!(*argv))
> @@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> ++pat_ptr;
> if (!(*pat_ptr))
> return -ENOMEM;
> + break;
> + }
> + case COREDUMP_SOCK: {
> + err = cn_printf(cn, "%s", pat_ptr);
> + if (err)
> + return err;
> +
> + /*
> + * We can potentially allow this to be changed later but
> + * I currently see no reason to.
> + */
> + if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
> + return -EINVAL;
> +
> + /*
> + * Currently no need to parse any other options.
> + * Relevant information can be retrieved from the peer
> + * pidfd retrievable via SO_PEERPIDFD by the receiver or
> + * via /proc/<pid>, using the SO_PEERPIDFD to guard
> + * against pid recycling when opening /proc/<pid>.
> + */
> + return 0;
> + }
> + default:
> + WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
> + break;
> }
>
> /* Repeat as long as we have more pattern to process and more output
I think the core_uses_pid logic at the end of this function needs to
be adjusted to also exclude COREDUMP_SOCK?
> @@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
> return 0;
> }
>
> +#ifdef CONFIG_UNIX
> +struct sockaddr_un coredump_unix_socket = {
> + .sun_family = AF_UNIX,
> + .sun_path = "\0linuxafsk/coredump.socket",
> +};
Nit: Please make that static and const.
> +/* Without trailing NUL byte. */
> +#define COREDUMP_UNIX_SOCKET_ADDR_SIZE \
> + (offsetof(struct sockaddr_un, sun_path) + \
> + sizeof("\0linuxafsk/coredump.socket") - 1)
> +#endif
> +
> void do_coredump(const kernel_siginfo_t *siginfo)
> {
> struct core_state core_state;
> @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> }
> break;
> }
> + case COREDUMP_SOCK: {
> + struct file *file __free(fput) = NULL;
> +#ifdef CONFIG_UNIX
> + struct socket *socket;
> +
> + /*
> + * It is possible that the userspace process which is
> + * supposed to handle the coredump and is listening on
> + * the AF_UNIX socket coredumps. Userspace should just
> + * mark itself non dumpable.
> + */
> +
> + retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> + if (retval < 0)
> + goto close_fail;
> +
> + file = sock_alloc_file(socket, 0, NULL);
> + if (IS_ERR(file)) {
> + sock_release(socket);
> + retval = PTR_ERR(file);
> + goto close_fail;
> + }
> +
> + retval = kernel_connect(socket,
> + (struct sockaddr *)(&coredump_unix_socket),
> + COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
> + if (retval)
> + goto close_fail;
> +
> + cprm.limit = RLIM_INFINITY;
> +#endif
The non-CONFIG_UNIX case here should probably bail out?
> + cprm.file = no_free_ptr(file);
> + break;
> + }
> default:
> WARN_ON_ONCE(true);
> retval = -EINVAL;
> @@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> * have this set to NULL.
> */
> if (!cprm.file) {
> - coredump_report_failure("Core dump to |%s disabled", cn.corename);
> + if (cn.core_type == COREDUMP_PIPE)
> + coredump_report_failure("Core dump to |%s disabled", cn.corename);
> + else
> + coredump_report_failure("Core dump to @%s disabled", cn.corename);
Are you actually truncating the initial "@" off of cn.corename, or is
this going to print two "@" characters?
> goto close_fail;
> }
> if (!dump_vma_snapshot(&cprm))
> @@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> file_end_write(cprm.file);
> free_vma_snapshot(&cprm);
> }
> - if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> - wait_for_dump_helpers(cprm.file);
> +
> + if (core_pipe_limit) {
> + switch (cn.core_type) {
> + case COREDUMP_PIPE:
> + wait_for_dump_helpers(cprm.file);
> + break;
> + case COREDUMP_SOCK: {
> + char buf[1];
> + /*
> + * We use a simple read to wait for the coredump
> + * processing to finish. Either the socket is
> + * closed or we get sent unexpected data. In
> + * both cases, we're done.
> + */
> + __kernel_read(cprm.file, buf, 1, NULL);
> + break;
> + }
> + default:
> + break;
> + }
> + }
> +
> close_fail:
> if (cprm.file)
> filp_close(cprm.file, NULL);
On Mon, May 05, 2025 at 02:55:18PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote:
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> > spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There's various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > connected to the read-end of the pipe. All other file descriptors are
> > closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > already caused bugs because userspace assumed that this cannot happen
> > (Whether or not this is a sane assumption is irrelevant.).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > it is not a child of any userspace process and specifically not a
> > child of PID 1. It cannot be waited upon and is in a weird hybrid
> > upcall which are difficult for userspace to control correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> > necessitates all kinds of weird privilege dropping excercises in
> > userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an abstract AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> > @linuxafsk/coredump_socket
> >
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> >
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> >
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> >
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are acctually coredumping are allowed to connect to the initial coredump
> > socket.
> >
> > - The coredump server should use SO_PEERPIDFD to get a stable handle on
> > the connected crashing task. The retrieved pidfd will provide a stable
> > reference even if the crashing task gets SIGKILLed while generating
> > the coredump.
> >
> > - By setting core_pipe_limit non-zero userspace can guarantee that the
> > crashing task cannot be reaped behind it's back and thus process all
> > necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> > detect whether /proc/<pid> still refers to the same process.
> >
> > The core_pipe_limit isn't used to rate-limit connections to the
> > socket. This can simply be done via AF_UNIX socket directly.
> >
> > - The pidfd for the crashing task will contain information how the task
> > coredumps. The PIDFD_GET_INFO ioctl gained a new flag
> > PIDFD_INFO_COREDUMP which can be used to retreive the coredump
> > information.
> >
> > If the coredump gets a new coredump client connection the kernel
> > guarantees that PIDFD_INFO_COREDUMP information is available.
> > Currently the following information is provided in the new
> > @coredump_mask extension to struct pidfd_info:
> >
> > * PIDFD_COREDUMPED is raised if the task did actually coredump.
> > * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
> > undumpable).
> > * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
> > doesn't need special care by the coredump server.
> > * IDFD_COREDUMP_ROOT is raised if the generated coredump should be
> > treated as sensitive and the coredump server should restrict to the
> > generated coredump to sufficiently privileged users.
> >
> > - Since unix_stream_connect() runs bpf programs during connect it's
> > possible to even redirect or multiplex coredumps to other sockets.
>
> Or change the userspace protocol used for containers such that the
> init-namespace coredumping helper forwards the FD it accept()ed into a
> container via SCM_RIGHTS...
Yeah, that would also work.
>
> > - The coredump server should mark itself as non-dumpable.
> > To capture coredumps for the coredump server itself a bpf program
> > should be run at connect to redirect it to another socket in
> > userspace. This can be useful for debugging crashing coredump servers.
> >
> > - A container coredump server in a separate network namespace can simply
> > bind to linuxafsk/coredump.socket and systemd-coredump fowards
> > coredumps to the container.
> >
> > - Fwiw, one idea is to handle coredumps via per-user/session coredump
> > servers that run with that users privileges.
> >
> > The coredump server listens on the coredump socket and accepts a
> > new coredump connection. It then retrieves SO_PEERPIDFD for the
> > client, inspects uid/gid and hands the accepted client to the users
> > own coredump handler which runs with the users privileges only.
>
> (Though that would only be okay if it's not done for suid dumping cases.)
Yes, I had considered adding a comment about only doing that when
PIDFD_COREDUMP_ROOT isn't set and wondered if anyone would comment on
it. :)
>
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > handle them instead of relying on super privileged coredumping helpers.
> >
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermodehelper is required for each crashing process. The
> > coredump server in userspace can just keep a worker pool.
>
> I mean, if coredumping is a performance bottleneck, something is
> probably seriously wrong with the system... I don't think we need to
> optimize for execution speed in this area.
>
> > This is easy to test:
> >
> > (a) coredump processing (we're using socat):
> >
> > > cat coredump_socket.sh
> > #!/bin/bash
> >
> > set -x
> >
> > sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
> > sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> >
> > (b) trigger a coredump:
> >
> > user1@localhost:~/data/scripts$ cat crash.c
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > int main(int argc, char *argv[])
> > {
> > fprintf(stderr, "%u\n", (1 / 0));
> > _exit(0);
> > }
>
> This looks pretty neat overall!
>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 107 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 1779299b8c61..c60f86c473ad 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -44,7 +44,11 @@
> > #include <linux/sysctl.h>
> > #include <linux/elf.h>
> > #include <linux/pidfs.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +#include <net/net_namespace.h>
> > #include <uapi/linux/pidfd.h>
> > +#include <uapi/linux/un.h>
> >
> > #include <linux/uaccess.h>
> > #include <asm/mmu_context.h>
> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> > enum coredump_type_t {
> > COREDUMP_FILE = 1,
> > COREDUMP_PIPE = 2,
> > + COREDUMP_SOCK = 3,
> > };
> >
> > struct core_name {
> > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> > cn->corename = NULL;
> > if (*pat_ptr == '|')
> > cn->core_type = COREDUMP_PIPE;
> > + else if (*pat_ptr == '@')
> > + cn->core_type = COREDUMP_SOCK;
> > else
> > cn->core_type = COREDUMP_FILE;
> > if (expand_corename(cn, core_name_size))
> > return -ENOMEM;
> > cn->corename[0] = '\0';
> >
> > - if (cn->core_type == COREDUMP_PIPE) {
> > + switch (cn->core_type) {
> > + case COREDUMP_PIPE: {
> > int argvs = sizeof(core_pattern) / 2;
> > (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> > if (!(*argv))
> > @@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> > ++pat_ptr;
> > if (!(*pat_ptr))
> > return -ENOMEM;
> > + break;
> > + }
> > + case COREDUMP_SOCK: {
> > + err = cn_printf(cn, "%s", pat_ptr);
> > + if (err)
> > + return err;
> > +
> > + /*
> > + * We can potentially allow this to be changed later but
> > + * I currently see no reason to.
> > + */
> > + if (strcmp(cn->corename, "@linuxafsk/coredump.socket"))
> > + return -EINVAL;
> > +
> > + /*
> > + * Currently no need to parse any other options.
> > + * Relevant information can be retrieved from the peer
> > + * pidfd retrievable via SO_PEERPIDFD by the receiver or
> > + * via /proc/<pid>, using the SO_PEERPIDFD to guard
> > + * against pid recycling when opening /proc/<pid>.
> > + */
> > + return 0;
> > + }
> > + default:
> > + WARN_ON_ONCE(cn->core_type != COREDUMP_FILE);
> > + break;
> > }
> >
> > /* Repeat as long as we have more pattern to process and more output
>
> I think the core_uses_pid logic at the end of this function needs to
> be adjusted to also exclude COREDUMP_SOCK?
Thanks! Fixed.
>
> > @@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_UNIX
> > +struct sockaddr_un coredump_unix_socket = {
> > + .sun_family = AF_UNIX,
> > + .sun_path = "\0linuxafsk/coredump.socket",
> > +};
>
> Nit: Please make that static and const.
Done.
>
> > +/* Without trailing NUL byte. */
> > +#define COREDUMP_UNIX_SOCKET_ADDR_SIZE \
> > + (offsetof(struct sockaddr_un, sun_path) + \
> > + sizeof("\0linuxafsk/coredump.socket") - 1)
> > +#endif
> > +
> > void do_coredump(const kernel_siginfo_t *siginfo)
> > {
> > struct core_state core_state;
> > @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > }
> > break;
> > }
> > + case COREDUMP_SOCK: {
> > + struct file *file __free(fput) = NULL;
> > +#ifdef CONFIG_UNIX
> > + struct socket *socket;
> > +
> > + /*
> > + * It is possible that the userspace process which is
> > + * supposed to handle the coredump and is listening on
> > + * the AF_UNIX socket coredumps. Userspace should just
> > + * mark itself non dumpable.
> > + */
> > +
> > + retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
> > + if (retval < 0)
> > + goto close_fail;
> > +
> > + file = sock_alloc_file(socket, 0, NULL);
> > + if (IS_ERR(file)) {
> > + sock_release(socket);
> > + retval = PTR_ERR(file);
> > + goto close_fail;
> > + }
> > +
> > + retval = kernel_connect(socket,
> > + (struct sockaddr *)(&coredump_unix_socket),
> > + COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0);
> > + if (retval)
> > + goto close_fail;
> > +
> > + cprm.limit = RLIM_INFINITY;
> > +#endif
>
> The non-CONFIG_UNIX case here should probably bail out?
It will bail-out later on !bprm->file where it'll report that @ support
is disabled but I think...
>
> > + cprm.file = no_free_ptr(file);
> > + break;
> > + }
> > default:
> > WARN_ON_ONCE(true);
> > retval = -EINVAL;
> > @@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > * have this set to NULL.
> > */
> > if (!cprm.file) {
> > - coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > + if (cn.core_type == COREDUMP_PIPE)
> > + coredump_report_failure("Core dump to |%s disabled", cn.corename);
> > + else
> > + coredump_report_failure("Core dump to @%s disabled", cn.corename);
>
> Are you actually truncating the initial "@" off of cn.corename, or is
> this going to print two "@" characters?
... that bailing out earlier is nicer than stripping the @off
pointlessly.
>
> > goto close_fail;
> > }
> > if (!dump_vma_snapshot(&cprm))
> > @@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > file_end_write(cprm.file);
> > free_vma_snapshot(&cprm);
> > }
> > - if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> > - wait_for_dump_helpers(cprm.file);
> > +
> > + if (core_pipe_limit) {
> > + switch (cn.core_type) {
> > + case COREDUMP_PIPE:
> > + wait_for_dump_helpers(cprm.file);
> > + break;
> > + case COREDUMP_SOCK: {
> > + char buf[1];
> > + /*
> > + * We use a simple read to wait for the coredump
> > + * processing to finish. Either the socket is
> > + * closed or we get sent unexpected data. In
> > + * both cases, we're done.
> > + */
> > + __kernel_read(cprm.file, buf, 1, NULL);
> > + break;
> > + }
> > + default:
> > + break;
> > + }
> > + }
> > +
> > close_fail:
> > if (cprm.file)
> > filp_close(cprm.file, NULL);
On Mon, 5 May 2025 at 13:55, Jann Horn <jannh@google.com> wrote: > > On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@kernel.org> wrote: > > The new coredump socket will allow userspace to not have to rely on > > usermode helpers for processing coredumps and provides a safer way to > > handle them instead of relying on super privileged coredumping helpers. > > > > This will also be significantly more lightweight since no fork()+exec() > > for the usermodehelper is required for each crashing process. The > > coredump server in userspace can just keep a worker pool. > > I mean, if coredumping is a performance bottleneck, something is > probably seriously wrong with the system... I don't think we need to > optimize for execution speed in this area. It can be a bottleneck because it blocks restarting processes until it completes, which directly impacts SLAs if the crashed process is relevant for those purposes, and it often is, we have several use cases where that's the case. The usermode helper costs hundreds of ms in latency, so any optimization in that area is very welcome and worth mentioning IMHO.
© 2016 - 2026 Red Hat, Inc.