Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors are
closed. That specifically includes 1 (stdout) and 2 (stderr). This has
already caused bugs because userspace assumed that this cannot happen
(Whether or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly.
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe.
- A new usermode helper has to be spawned for each crashing process.
This series adds a new mode:
(3) Dumping into an abstract AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
@address SO_COOKIE
The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps. The address
is given by @address and must be followed by the socket cookie of the
coredump listening socket.
The socket cookie is used to verify the socket connection. If the
coredump server restarts or crashes and someone recycles the socket
address the kernel will detect that the address has been recycled as the
socket cookie will have necessarily changed and refuse to connect.
The coredump socket is located in the initial network namespace. When a
task coredumps it opens a client socket in the initial network namespace
and connects to the coredump socket.
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump.
- By setting core_pipe_limit non-zero userspace can guarantee that the
crashing task cannot be reaped behind it's back and thus process all
necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
detect whether /proc/<pid> still refers to the same process.
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX sockets directly.
- The pidfd for the crashing task will grow new information how the task
coredumps.
- The coredump server should mark itself as non-dumpable.
- A container coredump server in a separate network namespace can simply
bind to another well-know address and systemd-coredump fowards
coredumps to the container.
- Coredumps could in the future also be handled via per-user/session
coredump servers that run only with that users privileges.
The coredump server listens on the coredump socket and accepts a
new coredump connection. It then retrieves SO_PEERPIDFD for the
client, inspects uid/gid and hands the accepted client to the users
own coredump handler which runs with the users privileges only
(It must of coure pay close attention to not forward crashing suid
binaries.).
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers
that have and continue to cause significant CVEs.
This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can e.g., just keep a worker pool.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 147 +++++++++++++++++++++++++++++++++++++++++++++++++---
include/linux/net.h | 1 +
net/socket.c | 5 +-
net/unix/af_unix.c | 24 ++++++---
4 files changed, 161 insertions(+), 16 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index a70929c3585b..15e9d9a252cd 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -44,7 +44,11 @@
#include <linux/sysctl.h>
#include <linux/elf.h>
#include <linux/pidfs.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+#include <net/net_namespace.h>
#include <uapi/linux/pidfd.h>
+#include <uapi/linux/un.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
enum coredump_type_t {
COREDUMP_FILE = 1,
COREDUMP_PIPE = 2,
+ COREDUMP_SOCK = 3,
};
struct core_name {
@@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
cn->corename = NULL;
if (*pat_ptr == '|')
cn->core_type = COREDUMP_PIPE;
+ else if (*pat_ptr == '@')
+ cn->core_type = COREDUMP_SOCK;
else
cn->core_type = COREDUMP_FILE;
if (expand_corename(cn, core_name_size))
return -ENOMEM;
cn->corename[0] = '\0';
- if (cn->core_type == COREDUMP_PIPE) {
+ switch (cn->core_type) {
+ case COREDUMP_PIPE: {
int argvs = sizeof(core_pattern) / 2;
(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
if (!(*argv))
@@ -247,6 +255,29 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
++pat_ptr;
if (!(*pat_ptr))
return -ENOMEM;
+ break;
+ }
+ case COREDUMP_SOCK: {
+ /* skip the @ */
+ pat_ptr++;
+ err = cn_printf(cn, "%s", pat_ptr);
+ if (err)
+ return err;
+
+ /*
+ * Currently no need to parse any other options.
+ * Relevant information can be retrieved from the peer
+ * pidfd retrievable via SO_PEERPIDFD by the receiver or
+ * via /proc/<pid>, using the SO_PEERPIDFD to guard
+ * against pid recycling when opening /proc/<pid>.
+ */
+ return 0;
+ }
+ case COREDUMP_FILE:
+ break;
+ default:
+ WARN_ON_ONCE(true);
+ return -EINVAL;
}
/* Repeat as long as we have more pattern to process and more output
@@ -393,11 +424,20 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
* If core_pattern does not include a %p (as is the default)
* and core_uses_pid is set, then .%pid will be appended to
* the filename. Do not do this for piped commands. */
- if (!(cn->core_type == COREDUMP_PIPE) && !pid_in_pattern && core_uses_pid) {
- err = cn_printf(cn, ".%d", task_tgid_vnr(current));
- if (err)
- return err;
+ if (!pid_in_pattern && core_uses_pid) {
+ switch (cn->core_type) {
+ case COREDUMP_FILE:
+ return cn_printf(cn, ".%d", task_tgid_vnr(current));
+ case COREDUMP_PIPE:
+ break;
+ case COREDUMP_SOCK:
+ break;
+ default:
+ WARN_ON_ONCE(true);
+ return -EINVAL;
+ }
}
+
return 0;
}
@@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
}
break;
}
+ case COREDUMP_SOCK: {
+#ifdef CONFIG_UNIX
+ struct file *file __free(fput) = NULL;
+ struct sockaddr_un addr = {
+ .sun_family = AF_UNIX,
+ };
+ unsigned int addr_len;
+ struct socket *socket;
+ char *p;
+ u64 sock_cookie;
+
+ p = strchr(cn.corename, ' ');
+ if (!p) {
+ coredump_report_failure("Missing socket cookie");
+ goto close_fail;
+ }
+ *p++ = '\0';
+
+ /* Leave room for the socket cookie. */
+ retval = strscpy(addr.sun_path + 1, cn.corename,
+ sizeof(addr.sun_path) - sizeof(sock_cookie) - 1);
+ if (retval < 0)
+ goto close_fail;
+ addr_len = offsetof(struct sockaddr_un, sun_path) + retval + 1;
+
+ retval = kstrtou64(p, 0, &sock_cookie);
+ if (retval || !sock_cookie) {
+ coredump_report_failure("Invalid socket cookie");
+ goto close_fail;
+ }
+
+ /* append socket cookie */
+ memcpy((char *)&addr + addr_len, &sock_cookie, sizeof(sock_cookie));
+
+ /*
+ * It is possible that the userspace process which is
+ * supposed to handle the coredump and is listening on
+ * the AF_UNIX socket coredumps. Userspace should just
+ * mark itself non dumpable.
+ */
+
+ retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket);
+ if (retval < 0)
+ goto close_fail;
+
+ file = sock_alloc_file(socket, 0, NULL);
+ if (IS_ERR(file)) {
+ sock_release(socket);
+ goto close_fail;
+ }
+
+ retval = kernel_connect(socket, (struct sockaddr *)(&addr),
+ addr_len, O_NONBLOCK | SOCK_COREDUMP);
+ if (retval) {
+ if (retval == -EAGAIN)
+ coredump_report_failure("Skipping as coredump socket connection %s couldn't complete immediately", cn.corename);
+ goto close_fail;
+ }
+
+ cprm.limit = RLIM_INFINITY;
+ cprm.file = no_free_ptr(file);
+#else
+ coredump_report_failure("Core dump socket support %s disabled", cn.corename);
+ goto close_fail;
+#endif
+ break;
+ }
default:
WARN_ON_ONCE(true);
goto close_fail;
@@ -838,8 +945,32 @@ void do_coredump(const kernel_siginfo_t *siginfo)
file_end_write(cprm.file);
free_vma_snapshot(&cprm);
}
- if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
- wait_for_dump_helpers(cprm.file);
+
+ /*
+ * When core_pipe_limit is set we wait for the coredump server
+ * or usermodehelper to finish before exiting so it can e.g.,
+ * inspect /proc/<pid>.
+ */
+ if (core_pipe_limit) {
+ switch (cn.core_type) {
+ case COREDUMP_PIPE:
+ wait_for_dump_helpers(cprm.file);
+ break;
+ case COREDUMP_SOCK: {
+ /*
+ * We use a simple read to wait for the coredump
+ * processing to finish. Either the socket is
+ * closed or we get sent unexpected data. In
+ * both cases, we're done.
+ */
+ __kernel_read(cprm.file, &(char){ 0 }, 1, NULL);
+ break;
+ }
+ default:
+ break;
+ }
+ }
+
close_fail:
if (cprm.file)
filp_close(cprm.file, NULL);
@@ -1069,7 +1200,7 @@ EXPORT_SYMBOL(dump_align);
void validate_coredump_safety(void)
{
if (suid_dumpable == SUID_DUMP_ROOT &&
- core_pattern[0] != '/' && core_pattern[0] != '|') {
+ core_pattern[0] != '/' && core_pattern[0] != '|' && core_pattern[0] != '@') {
coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
"pipe handler or fully qualified core dump path required. "
diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..139c85d0f2ea 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -81,6 +81,7 @@ enum sock_type {
#ifndef SOCK_NONBLOCK
#define SOCK_NONBLOCK O_NONBLOCK
#endif
+#define SOCK_COREDUMP O_NOCTTY
#endif /* ARCH_HAS_SOCKET_TYPES */
diff --git a/net/socket.c b/net/socket.c
index 9a0e720f0859..d62a57a57a28 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3603,7 +3603,10 @@ int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
{
struct sockaddr_storage address;
- memcpy(&address, addr, addrlen);
+ if (flags & SOCK_COREDUMP)
+ memcpy(&address, addr, addrlen + sizeof(sock->sk->sk_cookie));
+ else
+ memcpy(&address, addr, addrlen);
return READ_ONCE(sock->ops)->connect(sock, (struct sockaddr *)&address,
addrlen, flags);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 472f8aa9ea15..6b8a7863b41c 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -89,6 +89,8 @@
#include <linux/kernel.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/net.h>
+#include <linux/pidfs.h>
#include <linux/poll.h>
#include <linux/proc_fs.h>
#include <linux/sched/signal.h>
@@ -100,7 +102,6 @@
#include <linux/splice.h>
#include <linux/string.h>
#include <linux/uaccess.h>
-#include <linux/pidfs.h>
#include <net/af_unix.h>
#include <net/net_namespace.h>
#include <net/scm.h>
@@ -1191,7 +1192,7 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
static struct sock *unix_find_abstract(struct net *net,
struct sockaddr_un *sunaddr,
- int addr_len, int type)
+ int addr_len, int type, int flags)
{
unsigned int hash = unix_abstract_hash(sunaddr, addr_len, type);
struct dentry *dentry;
@@ -1201,6 +1202,15 @@ static struct sock *unix_find_abstract(struct net *net,
if (!sk)
return ERR_PTR(-ECONNREFUSED);
+ if (flags & SOCK_COREDUMP) {
+ u64 sock_cookie;
+
+ memcpy(&sock_cookie, (char *)sunaddr + addr_len, sizeof(sock_cookie));
+ DEBUG_NET_WARN_ON_ONCE(!sock_cookie);
+ if (sock_cookie != atomic64_read(&sk->sk_cookie))
+ return ERR_PTR(-ECONNREFUSED);
+ }
+
dentry = unix_sk(sk)->path.dentry;
if (dentry)
touch_atime(&unix_sk(sk)->path);
@@ -1210,14 +1220,14 @@ static struct sock *unix_find_abstract(struct net *net,
static struct sock *unix_find_other(struct net *net,
struct sockaddr_un *sunaddr,
- int addr_len, int type)
+ int addr_len, int type, int flags)
{
struct sock *sk;
if (sunaddr->sun_path[0])
sk = unix_find_bsd(sunaddr, addr_len, type);
else
- sk = unix_find_abstract(net, sunaddr, addr_len, type);
+ sk = unix_find_abstract(net, sunaddr, addr_len, type, flags);
return sk;
}
@@ -1473,7 +1483,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
}
restart:
- other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type);
+ other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out;
@@ -1620,7 +1630,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
restart:
/* Find listening sock. */
- other = unix_find_other(net, sunaddr, addr_len, sk->sk_type);
+ other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out_free_skb;
@@ -2089,7 +2099,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
if (msg->msg_namelen) {
lookup:
other = unix_find_other(sock_net(sk), msg->msg_name,
- msg->msg_namelen, sk->sk_type);
+ msg->msg_namelen, sk->sk_type, 0);
if (IS_ERR(other)) {
err = PTR_ERR(other);
goto out_free;
--
2.47.2
From: Christian Brauner <brauner@kernel.org> Date: Mon, 12 May 2025 10:55:23 +0200 > Coredumping currently supports two modes: > > (1) Dumping directly into a file somewhere on the filesystem. > (2) Dumping into a pipe connected to a usermode helper process > spawned as a child of the system_unbound_wq or kthreadd. > > For simplicity I'm mostly ignoring (1). There's probably still some > users of (1) out there but processing coredumps in this way can be > considered adventurous especially in the face of set*id binaries. > > The most common option should be (2) by now. It works by allowing > userspace to put a string into /proc/sys/kernel/core_pattern like: > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > The "|" at the beginning indicates to the kernel that a pipe must be > used. The path following the pipe indicator is a path to a binary that > will be spawned as a usermode helper process. Any additional parameters > pass information about the task that is generating the coredump to the > binary that processes the coredump. > > In the example core_pattern shown above systemd-coredump is spawned as a > usermode helper. There's various conceptual consequences of this > (non-exhaustive list): > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > connected to the read-end of the pipe. All other file descriptors are > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > already caused bugs because userspace assumed that this cannot happen > (Whether or not this is a sane assumption is irrelevant.). > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > it is not a child of any userspace process and specifically not a > child of PID 1. It cannot be waited upon and is in a weird hybrid > upcall which are difficult for userspace to control correctly. > > - systemd-coredump is spawned with full kernel privileges. This > necessitates all kinds of weird privilege dropping excercises in > userspace to make this safe. > > - A new usermode helper has to be spawned for each crashing process. > > This series adds a new mode: > > (3) Dumping into an abstract AF_UNIX socket. > > Userspace can set /proc/sys/kernel/core_pattern to: > > @address SO_COOKIE > > The "@" at the beginning indicates to the kernel that the abstract > AF_UNIX coredump socket will be used to process coredumps. The address > is given by @address and must be followed by the socket cookie of the > coredump listening socket. > > The socket cookie is used to verify the socket connection. If the > coredump server restarts or crashes and someone recycles the socket > address the kernel will detect that the address has been recycled as the > socket cookie will have necessarily changed and refuse to connect. > > The coredump socket is located in the initial network namespace. When a > task coredumps it opens a client socket in the initial network namespace > and connects to the coredump socket. > > - The coredump server uses SO_PEERPIDFD to get a stable handle on the > connected crashing task. The retrieved pidfd will provide a stable > reference even if the crashing task gets SIGKILLed while generating > the coredump. > > - By setting core_pipe_limit non-zero userspace can guarantee that the > crashing task cannot be reaped behind it's back and thus process all > necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to > detect whether /proc/<pid> still refers to the same process. > > The core_pipe_limit isn't used to rate-limit connections to the > socket. This can simply be done via AF_UNIX sockets directly. > > - The pidfd for the crashing task will grow new information how the task > coredumps. > > - The coredump server should mark itself as non-dumpable. > > - A container coredump server in a separate network namespace can simply > bind to another well-know address and systemd-coredump fowards > coredumps to the container. > > - Coredumps could in the future also be handled via per-user/session > coredump servers that run only with that users privileges. > > The coredump server listens on the coredump socket and accepts a > new coredump connection. It then retrieves SO_PEERPIDFD for the > client, inspects uid/gid and hands the accepted client to the users > own coredump handler which runs with the users privileges only > (It must of coure pay close attention to not forward crashing suid > binaries.). > > The new coredump socket will allow userspace to not have to rely on > usermode helpers for processing coredumps and provides a safer way to > handle them instead of relying on super privileged coredumping helpers > that have and continue to cause significant CVEs. > > This will also be significantly more lightweight since no fork()+exec() > for the usermodehelper is required for each crashing process. The > coredump server in userspace can e.g., just keep a worker pool. > > Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Thanks!
On Mon, 12 May 2025 at 09:56, Christian Brauner <brauner@kernel.org> wrote: > > Coredumping currently supports two modes: > > (1) Dumping directly into a file somewhere on the filesystem. > (2) Dumping into a pipe connected to a usermode helper process > spawned as a child of the system_unbound_wq or kthreadd. > > For simplicity I'm mostly ignoring (1). There's probably still some > users of (1) out there but processing coredumps in this way can be > considered adventurous especially in the face of set*id binaries. > > The most common option should be (2) by now. It works by allowing > userspace to put a string into /proc/sys/kernel/core_pattern like: > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > The "|" at the beginning indicates to the kernel that a pipe must be > used. The path following the pipe indicator is a path to a binary that > will be spawned as a usermode helper process. Any additional parameters > pass information about the task that is generating the coredump to the > binary that processes the coredump. > > In the example core_pattern shown above systemd-coredump is spawned as a > usermode helper. There's various conceptual consequences of this > (non-exhaustive list): > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > connected to the read-end of the pipe. All other file descriptors are > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > already caused bugs because userspace assumed that this cannot happen > (Whether or not this is a sane assumption is irrelevant.). > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > it is not a child of any userspace process and specifically not a > child of PID 1. It cannot be waited upon and is in a weird hybrid > upcall which are difficult for userspace to control correctly. > > - systemd-coredump is spawned with full kernel privileges. This > necessitates all kinds of weird privilege dropping excercises in > userspace to make this safe. > > - A new usermode helper has to be spawned for each crashing process. > > This series adds a new mode: > > (3) Dumping into an abstract AF_UNIX socket. > > Userspace can set /proc/sys/kernel/core_pattern to: > > @address SO_COOKIE > > The "@" at the beginning indicates to the kernel that the abstract > AF_UNIX coredump socket will be used to process coredumps. The address > is given by @address and must be followed by the socket cookie of the > coredump listening socket. > > The socket cookie is used to verify the socket connection. If the > coredump server restarts or crashes and someone recycles the socket > address the kernel will detect that the address has been recycled as the > socket cookie will have necessarily changed and refuse to connect. This dynamic/cookie prefix makes it impossible to use this with socket activation units. The way systemd-coredump works is that every instance is an independent templated unit, spawned when there's a connection to the private socket. If the path was fixed, we could just reuse the same mechanism, it would fit very nicely with minimal changes. But because you need a "server" to be permanently running, this means socket-based activation can no longer work, and systemd-coredump must switch to a persistently-running mode. This is a severe degradation of functionality, will continuously waste CPU/memory resources for no good reasons, and makes the whole thing more fragile and complex, as if there are any issues with this server, you start losing core files. And honestly I don't really see the point? Setting the pattern is a privileged operation anyway. systemd manages the socket with a socket unit and again that's privileged already. Could we drop this cookie prefix and go back to the previous version (v5), please? Or if there is some specific non-systemd use case in mind that I am not aware of, have both options, so that we can use the simpler and more straightforward one with systemd-coredump. Thanks!
From: Luca Boccassi <bluca@debian.org> Date: Mon, 12 May 2025 11:58:54 +0100 > On Mon, 12 May 2025 at 09:56, Christian Brauner <brauner@kernel.org> wrote: > > > > Coredumping currently supports two modes: > > > > (1) Dumping directly into a file somewhere on the filesystem. > > (2) Dumping into a pipe connected to a usermode helper process > > spawned as a child of the system_unbound_wq or kthreadd. > > > > For simplicity I'm mostly ignoring (1). There's probably still some > > users of (1) out there but processing coredumps in this way can be > > considered adventurous especially in the face of set*id binaries. > > > > The most common option should be (2) by now. It works by allowing > > userspace to put a string into /proc/sys/kernel/core_pattern like: > > > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > > > The "|" at the beginning indicates to the kernel that a pipe must be > > used. The path following the pipe indicator is a path to a binary that > > will be spawned as a usermode helper process. Any additional parameters > > pass information about the task that is generating the coredump to the > > binary that processes the coredump. > > > > In the example core_pattern shown above systemd-coredump is spawned as a > > usermode helper. There's various conceptual consequences of this > > (non-exhaustive list): > > > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > > connected to the read-end of the pipe. All other file descriptors are > > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > > already caused bugs because userspace assumed that this cannot happen > > (Whether or not this is a sane assumption is irrelevant.). > > > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > > it is not a child of any userspace process and specifically not a > > child of PID 1. It cannot be waited upon and is in a weird hybrid > > upcall which are difficult for userspace to control correctly. > > > > - systemd-coredump is spawned with full kernel privileges. This > > necessitates all kinds of weird privilege dropping excercises in > > userspace to make this safe. > > > > - A new usermode helper has to be spawned for each crashing process. > > > > This series adds a new mode: > > > > (3) Dumping into an abstract AF_UNIX socket. > > > > Userspace can set /proc/sys/kernel/core_pattern to: > > > > @address SO_COOKIE > > > > The "@" at the beginning indicates to the kernel that the abstract > > AF_UNIX coredump socket will be used to process coredumps. The address > > is given by @address and must be followed by the socket cookie of the > > coredump listening socket. > > > > The socket cookie is used to verify the socket connection. If the > > coredump server restarts or crashes and someone recycles the socket > > address the kernel will detect that the address has been recycled as the > > socket cookie will have necessarily changed and refuse to connect. > > This dynamic/cookie prefix makes it impossible to use this with socket > activation units. The way systemd-coredump works is that every > instance is an independent templated unit, spawned when there's a > connection to the private socket. If the path was fixed, we could just > reuse the same mechanism, it would fit very nicely with minimal > changes. Note this version does not use prefix. Now it requires users to just pass the socket cookie via core_pattern so that the kernel can verify the peer. > > But because you need a "server" to be permanently running, this means > socket-based activation can no longer work, and systemd-coredump must > switch to a persistently-running mode. The only thing for systemd to do is assign a cookie after socket creation. As long as systemd hold the file descriptor of the socket, you don't need a dedicated "server" running permanently, and the fd can be passed around to a spawned/activated process. > This is a severe degradation of > functionality, will continuously waste CPU/memory resources for no > good reasons, and makes the whole thing more fragile and complex, as > if there are any issues with this server, you start losing core files. > And honestly I don't really see the point? Setting the pattern is a > privileged operation anyway. systemd manages the socket with a socket > unit and again that's privileged already. > > Could we drop this cookie prefix and go back to the previous version > (v5), please? Or if there is some specific non-systemd use case in > mind that I am not aware of, have both options, so that we can use the > simpler and more straightforward one with systemd-coredump.
On Tue, 13 May 2025 at 01:18, Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Luca Boccassi <bluca@debian.org> > Date: Mon, 12 May 2025 11:58:54 +0100 > > On Mon, 12 May 2025 at 09:56, Christian Brauner <brauner@kernel.org> wrote: > > > > > > Coredumping currently supports two modes: > > > > > > (1) Dumping directly into a file somewhere on the filesystem. > > > (2) Dumping into a pipe connected to a usermode helper process > > > spawned as a child of the system_unbound_wq or kthreadd. > > > > > > For simplicity I'm mostly ignoring (1). There's probably still some > > > users of (1) out there but processing coredumps in this way can be > > > considered adventurous especially in the face of set*id binaries. > > > > > > The most common option should be (2) by now. It works by allowing > > > userspace to put a string into /proc/sys/kernel/core_pattern like: > > > > > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > > > > > The "|" at the beginning indicates to the kernel that a pipe must be > > > used. The path following the pipe indicator is a path to a binary that > > > will be spawned as a usermode helper process. Any additional parameters > > > pass information about the task that is generating the coredump to the > > > binary that processes the coredump. > > > > > > In the example core_pattern shown above systemd-coredump is spawned as a > > > usermode helper. There's various conceptual consequences of this > > > (non-exhaustive list): > > > > > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > > > connected to the read-end of the pipe. All other file descriptors are > > > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > > > already caused bugs because userspace assumed that this cannot happen > > > (Whether or not this is a sane assumption is irrelevant.). > > > > > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > > > it is not a child of any userspace process and specifically not a > > > child of PID 1. It cannot be waited upon and is in a weird hybrid > > > upcall which are difficult for userspace to control correctly. > > > > > > - systemd-coredump is spawned with full kernel privileges. This > > > necessitates all kinds of weird privilege dropping excercises in > > > userspace to make this safe. > > > > > > - A new usermode helper has to be spawned for each crashing process. > > > > > > This series adds a new mode: > > > > > > (3) Dumping into an abstract AF_UNIX socket. > > > > > > Userspace can set /proc/sys/kernel/core_pattern to: > > > > > > @address SO_COOKIE > > > > > > The "@" at the beginning indicates to the kernel that the abstract > > > AF_UNIX coredump socket will be used to process coredumps. The address > > > is given by @address and must be followed by the socket cookie of the > > > coredump listening socket. > > > > > > The socket cookie is used to verify the socket connection. If the > > > coredump server restarts or crashes and someone recycles the socket > > > address the kernel will detect that the address has been recycled as the > > > socket cookie will have necessarily changed and refuse to connect. > > > > This dynamic/cookie prefix makes it impossible to use this with socket > > activation units. The way systemd-coredump works is that every > > instance is an independent templated unit, spawned when there's a > > connection to the private socket. If the path was fixed, we could just > > reuse the same mechanism, it would fit very nicely with minimal > > changes. > > Note this version does not use prefix. Now it requires users to > just pass the socket cookie via core_pattern so that the kernel > can verify the peer. Exactly - this means the pattern cannot be static in a sysctl.d early on boot anymore, and has to be set dynamically by <something>. This is a severe degradation over the status quo. > > But because you need a "server" to be permanently running, this means > > socket-based activation can no longer work, and systemd-coredump must > > switch to a persistently-running mode. > > The only thing for systemd to do is assign a cookie after socket creation. > > As long as systemd hold the file descriptor of the socket, you don't need > a dedicated "server" running permanently, and the fd can be passed around > to a spawned/activated process. There is no such facility, a socket is just a socket and there's no infrastructure to randomly extract random information from one and write it to some other random file in procfs, and I don't see why we should add some super-special-case just for this, it sounds really messy. Also sockets can be and in fact are routinely restarted (eg: on package upgrades), which would invalidate this whole scheme, and result in a very racy setup. When packages are upgraded it's one of the most complex workflows in modern distros, and it's very likely that things start crashing exactly at that point, and with this workflow it would mean we'll lose core files due to the race between restarting the socket unit and <something> updating the pattern accordingly. Also we very much want to be able to spawn as many core handlers at the same time as needed, which I don't see how can work with a cookie that has to be unique per socket. Sorry, but this particular approach seems completely unnecessary and over-complicated, and doesn't seem to fit very well with how modern userspace is set up today, and I don't see what actual problem it would solve? If you need it for some particular use case that's absolutely fine, but please add it later as another optional mode, so that we don't have to degrade our use cases for it. That way everyone gets what they want, and everyone's happy. v5 was super nice and had everything we needed to massively improve the status quo, with easy and straightforward changes and no real drawbacks, so it would be really great if we could just go back to that version, please. Thanks.
From: Luca Boccassi <bluca@debian.org> Date: Tue, 13 May 2025 02:09:24 +0100 > On Tue, 13 May 2025 at 01:18, Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Luca Boccassi <bluca@debian.org> > > Date: Mon, 12 May 2025 11:58:54 +0100 > > > On Mon, 12 May 2025 at 09:56, Christian Brauner <brauner@kernel.org> wrote: > > > > > > > > Coredumping currently supports two modes: > > > > > > > > (1) Dumping directly into a file somewhere on the filesystem. > > > > (2) Dumping into a pipe connected to a usermode helper process > > > > spawned as a child of the system_unbound_wq or kthreadd. > > > > > > > > For simplicity I'm mostly ignoring (1). There's probably still some > > > > users of (1) out there but processing coredumps in this way can be > > > > considered adventurous especially in the face of set*id binaries. > > > > > > > > The most common option should be (2) by now. It works by allowing > > > > userspace to put a string into /proc/sys/kernel/core_pattern like: > > > > > > > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > > > > > > > The "|" at the beginning indicates to the kernel that a pipe must be > > > > used. The path following the pipe indicator is a path to a binary that > > > > will be spawned as a usermode helper process. Any additional parameters > > > > pass information about the task that is generating the coredump to the > > > > binary that processes the coredump. > > > > > > > > In the example core_pattern shown above systemd-coredump is spawned as a > > > > usermode helper. There's various conceptual consequences of this > > > > (non-exhaustive list): > > > > > > > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > > > > connected to the read-end of the pipe. All other file descriptors are > > > > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > > > > already caused bugs because userspace assumed that this cannot happen > > > > (Whether or not this is a sane assumption is irrelevant.). > > > > > > > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > > > > it is not a child of any userspace process and specifically not a > > > > child of PID 1. It cannot be waited upon and is in a weird hybrid > > > > upcall which are difficult for userspace to control correctly. > > > > > > > > - systemd-coredump is spawned with full kernel privileges. This > > > > necessitates all kinds of weird privilege dropping excercises in > > > > userspace to make this safe. > > > > > > > > - A new usermode helper has to be spawned for each crashing process. > > > > > > > > This series adds a new mode: > > > > > > > > (3) Dumping into an abstract AF_UNIX socket. > > > > > > > > Userspace can set /proc/sys/kernel/core_pattern to: > > > > > > > > @address SO_COOKIE > > > > > > > > The "@" at the beginning indicates to the kernel that the abstract > > > > AF_UNIX coredump socket will be used to process coredumps. The address > > > > is given by @address and must be followed by the socket cookie of the > > > > coredump listening socket. > > > > > > > > The socket cookie is used to verify the socket connection. If the > > > > coredump server restarts or crashes and someone recycles the socket > > > > address the kernel will detect that the address has been recycled as the > > > > socket cookie will have necessarily changed and refuse to connect. > > > > > > This dynamic/cookie prefix makes it impossible to use this with socket > > > activation units. The way systemd-coredump works is that every > > > instance is an independent templated unit, spawned when there's a > > > connection to the private socket. If the path was fixed, we could just > > > reuse the same mechanism, it would fit very nicely with minimal > > > changes. > > > > Note this version does not use prefix. Now it requires users to > > just pass the socket cookie via core_pattern so that the kernel > > can verify the peer. > > Exactly - this means the pattern cannot be static in a sysctl.d early > on boot anymore, and has to be set dynamically by <something>. You missed the socket has to be created dynamically by <something>. > This is > a severe degradation over the status quo. > > > > But because you need a "server" to be permanently running, this means > > > socket-based activation can no longer work, and systemd-coredump must > > > switch to a persistently-running mode. > > > > The only thing for systemd to do is assign a cookie after socket creation. > > > > As long as systemd hold the file descriptor of the socket, you don't need > > a dedicated "server" running permanently, and the fd can be passed around > > to a spawned/activated process. > > There is no such facility, a socket is just a socket and there's no > infrastructure to randomly extract random information from one and > write it to some other random file in procfs, As only one socket can be registered to core_pattern, the socket must not be a random. > and I don't see why we > should add some super-special-case just for this, Because this is a new special use case. > it sounds really > messy. > Also sockets can be and in fact are routinely restarted (eg: on > package upgrades), which would invalidate this whole scheme, and > result in a very racy setup. When packages are upgraded it's one of > the most complex workflows in modern distros, and it's very likely > that things start crashing exactly at that point, and with this > workflow it would mean we'll lose core files due to the race between > restarting the socket unit and <something> updating the pattern > accordingly. Looks like you misunderstood the series. As you need to specify the socket in core_pattern, there must be only one socket that can receive core data, so the problem statement is always true throughout the series. kernel_connect() does not connect() to a random one out of sockets that have the common prefix. That's why the BPF was mentioned in the previous cover letter: - Since unix_stream_connect() runs bpf programs during connect it's possible to even redirect or multiplex coredumps to other sockets. > Also we very much want to be able to spawn as many core handlers at > the same time as needed, which I don't see how can work with a cookie > that has to be unique per socket. As said, you can just pass the fd of the coredump listener or a fd accept()ed from the listener, depending on how you want to handle this in userspace.
On Mo, 12.05.25 19:14, Kuniyuki Iwashima (kuniyu@amazon.com) wrote: > > > Note this version does not use prefix. Now it requires users to > > > just pass the socket cookie via core_pattern so that the kernel > > > can verify the peer. > > > > Exactly - this means the pattern cannot be static in a sysctl.d early > > on boot anymore, and has to be set dynamically by <something>. > > You missed the socket has to be created dynamically by <something>. systemd implements socket activation: the generic code in PID 1 can bind a socket, and then generically forks off a process (or instances of processes for connection-based sockets) once traffic is seen on that socket. On a typical, current systemd system, PID 1 does this for ~40 sockets by default. The code to bind AF_UNIX or AF_INET/AF_INET6 sockets is entirely generic. Currently, in the existing systemd codebase coredumping is implemented via socket activation: the core_pattern handler binary quickly hands off the coredump fds to an AF_UNIX socket bound that way, and the service behind that does the heavy lifting. Our hope is that with Christian's work we can make the kernel deliver the coredumps directly to the socket PID1 generically binds, getting rid of one middle man. By requiring userspace to echo the SO_COOKIE value into the core_pattern sysctl in a special formatting, you define a bespoke protocol: it's not just enough to bind a socket (for which the generic code in PID1 is good enough), and to write a fixed string into a sysctl (for which the generic code in the current /etc/sysctl.d/ manager, i.e. systemd-sysctl, works fine). But you suddenly are asking from userspace, that some specific tool runs at early boot, extracts the socket cookie from PID1 somehow, and writes that into sysctl. We'd have to come up with a new tool for that, we can no longer use generic tools. And that's the part that Luca doesn't like. To a large degree I agree with Luca about this. I would much prefer Christian's earlier proposal (i.e. to simply define some prefix of AF_UNIX abstract namespace addresses as requiring privs to bind), because that would enable us to do generic handling in userspace: the existing socket binding logic in PID 1, and the existing sysctl.d handling in the systemd suite would be good enough to set up everything for the coredump handling. That said, I'd take what we can get. If enforcing privs on some abstract namespace socket address prefix is not acceptable, then we can probably make the SO_COOKIE proposal work (Luca: we'd just hook some small tool into ExecStartPost= of the .socket unit, and make PID1 pass the cookie in some env var or so to it; the tool would then just echo that env var into the sysctl with the fixed prefix). In my eyes, it's not ideal though: it would mean the sysctl data on every instance of the system system image would necessarily deviate (because the socket cookie is going to be different), which mgmt tools won't like (as you cannot compare sysctl state anymore), and we'd have a weak conflict of ownership: right now most sysctl settings are managed by /etc/sysctl.d/, but the core_pattern suddenly wouldn't be anymore. This will create conflicts because suddenly two components write to the thing, and will start fighting. Hence: I'd *much* prefer Christian's original approach as it does not have these issues. But I'll take what I can get, we can make the cookie thing work, but it's much uglier. I am not sure I understand why enforcing privs on some abstract namespace socke address prefix is such an unacceptable idea though. Lennart -- Lennart Poettering, Berlin
On Tue, May 13, 2025 at 10:56:03AM +0200, Lennart Poettering wrote: > On Mo, 12.05.25 19:14, Kuniyuki Iwashima (kuniyu@amazon.com) wrote: > > > > > Note this version does not use prefix. Now it requires users to > > > > just pass the socket cookie via core_pattern so that the kernel > > > > can verify the peer. > > > > > > Exactly - this means the pattern cannot be static in a sysctl.d early > > > on boot anymore, and has to be set dynamically by <something>. > > > > You missed the socket has to be created dynamically by <something>. > > systemd implements socket activation: the generic code in PID 1 can > bind a socket, and then generically forks off a process (or instances > of processes for connection-based sockets) once traffic is seen on > that socket. On a typical, current systemd system, PID 1 does this for > ~40 sockets by default. The code to bind AF_UNIX or AF_INET/AF_INET6 > sockets is entirely generic. > > Currently, in the existing systemd codebase coredumping is implemented > via socket activation: the core_pattern handler binary quickly hands > off the coredump fds to an AF_UNIX socket bound that way, and the > service behind that does the heavy lifting. Our hope is that with > Christian's work we can make the kernel deliver the coredumps directly > to the socket PID1 generically binds, getting rid of one middle man. > > By requiring userspace to echo the SO_COOKIE value into the > core_pattern sysctl in a special formatting, you define a bespoke > protocol: it's not just enough to bind a socket (for which the generic > code in PID1 is good enough), and to write a fixed > string into a sysctl (for which the generic code in the current > /etc/sysctl.d/ manager, i.e. systemd-sysctl, works fine). But you > suddenly are asking from userspace, that some specific tool runs at > early boot, extracts the socket cookie from PID1 somehow, and writes > that into sysctl. We'd have to come up with a new tool for that, we > can no longer use generic tools. And that's the part that Luca doesn't > like. > > To a large degree I agree with Luca about this. I would much prefer > Christian's earlier proposal (i.e. to simply define some prefix of > AF_UNIX abstract namespace addresses as requiring privs to bind), > because that would enable us to do generic handling in userspace: the > existing socket binding logic in PID 1, and the existing sysctl.d > handling in the systemd suite would be good enough to set up > everything for the coredump handling. > > That said, I'd take what we can get. If enforcing privs on some > abstract namespace socket address prefix is not acceptable, then we > can probably make the SO_COOKIE proposal work (Luca: we'd just hook > some small tool into ExecStartPost= of the .socket unit, and make PID1 > pass the cookie in some env var or so to it; the tool would then just > echo that env var into the sysctl with the fixed prefix). In my eyes, > it's not ideal though: it would mean the sysctl data on every instance > of the system system image would necessarily deviate (because the > socket cookie is going to be different), which mgmt tools won't like > (as you cannot compare sysctl state anymore), and we'd have a weak > conflict of ownership: right now most sysctl settings are managed by > /etc/sysctl.d/, but the core_pattern suddenly wouldn't be > anymore. This will create conflicts because suddenly two components > write to the thing, and will start fighting. > > Hence: I'd *much* prefer Christian's original approach as it does not > have these issues. But I'll take what I can get, we can make the > cookie thing work, but it's much uglier. > > I am not sure I understand why enforcing privs on some abstract > namespace socke address prefix is such an unacceptable idea though. I prefer the prefix approach as well. It's clean, simple and is safe by itself and elegant. And it fits into the generic socket activation and system administration models. I mainly show-cased the cookie model as an elaborate workaround. It can be done but it's ugly and more difficult to use. I do have one more idea how to solve this problem cleanly using regular socket paths that hopefully pleases everyone.
© 2016 - 2025 Red Hat, Inc.