[PATCH RFC v3 00/10] coredump: add coredump socket

Christian Brauner posted 10 patches 9 months, 1 week ago
There is a newer version of this series
fs/coredump.c                                     | 358 +++++++++++++++++-----
fs/pidfs.c                                        |  68 ++++
include/linux/coredump.h                          |  12 +
include/linux/pidfs.h                             |   4 +
include/uapi/linux/pidfd.h                        |  16 +
include/uapi/linux/un.h                           |   2 +
net/unix/af_unix.c                                |  64 +++-
tools/testing/selftests/coredump/stackdump_test.c |  71 ++++-
tools/testing/selftests/pidfd/pidfd.h             |  22 ++
9 files changed, 528 insertions(+), 89 deletions(-)
[PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Christian Brauner 9 months, 1 week ago
Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (Whether or not this is a sane assumption is irrelevant.).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is in a weird hybrid
  upcall which are difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping excercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an abstract AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @linuxafsk/coredump_socket

The "@" at the beginning indicates to the kernel that the abstract
AF_UNIX coredump socket will be used to process coredumps.

The coredump socket uses the fixed address "linuxafsk/coredump.socket"
for now.

The coredump socket is located in the initial network namespace. To bind
the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
user namespace. Listening and reading can happen from whatever
unprivileged context is necessary to safely process coredumps.

When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket. For now only tasks that
are acctually coredumping are allowed to connect to the initial coredump
socket.

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind it's back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX socket directly.

- The pidfd for the crashing task will contain information how the task
  coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retreive the coredump
  information.

  If the coredump gets a new coredump client connection the kernel
  guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP	is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER	is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * IDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict to the
    generated coredump to sufficiently privileged users.

- Since unix_stream_connect() runs bpf programs during connect it's
  possible to even redirect or multiplex coredumps to other sockets.

- The coredump server should mark itself as non-dumpable.
  To capture coredumps for the coredump server itself a bpf program
  should be run at connect to redirect it to another socket in
  userspace. This can be useful for debugging crashing coredump servers.

- A container coredump server in a separate network namespace can simply
  bind to linuxafsk/coredump.socket and systemd-coredump fowards
  coredumps to the container.

- Fwiw, one idea is to handle coredumps via per-user/session coredump
  servers that run with that users privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the users
  own coredump handler which runs with the users privileges only.

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.

This is easy to test:

(a) coredump processing (we're using socat):

    > cat coredump_socket.sh
    #!/bin/bash

    set -x

    sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
    sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc

(b) trigger a coredump:

    user1@localhost:~/data/scripts$ cat crash.c
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            fprintf(stderr, "%u\n", (1 / 0));
            _exit(0);
    }

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v3:
- Use an abstract unix socket.
- Add documentation.
- Add selftests.
- Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org

Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org

---
Christian Brauner (10):
      coredump: massage format_corname()
      coredump: massage do_coredump()
      net: reserve prefix
      coredump: add coredump socket
      coredump: validate socket name as it is written
      coredump: show supported coredump modes
      pidfs, coredump: add PIDFD_INFO_COREDUMP
      net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
      selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
      selftests/coredump: add tests for AF_UNIX coredumps

 fs/coredump.c                                     | 358 +++++++++++++++++-----
 fs/pidfs.c                                        |  68 ++++
 include/linux/coredump.h                          |  12 +
 include/linux/pidfs.h                             |   4 +
 include/uapi/linux/pidfd.h                        |  16 +
 include/uapi/linux/un.h                           |   2 +
 net/unix/af_unix.c                                |  64 +++-
 tools/testing/selftests/coredump/stackdump_test.c |  71 ++++-
 tools/testing/selftests/pidfd/pidfd.h             |  22 ++
 9 files changed, 528 insertions(+), 89 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Mickaël Salaün 9 months, 1 week ago
On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall which are difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping excercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an abstract AF_UNIX socket.
> 
> Userspace can set /proc/sys/kernel/core_pattern to:
> 
>         @linuxafsk/coredump_socket
> 
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps.
> 
> The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> for now.
> 
> The coredump socket is located in the initial network namespace. To bind
> the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> user namespace. Listening and reading can happen from whatever
> unprivileged context is necessary to safely process coredumps.
> 
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket. For now only tasks that
> are acctually coredumping are allowed to connect to the initial coredump
> socket.

I think we should avoid using abstract UNIX sockets, especially for new
interfaces, because it is hard to properly control such access.  Can we
create new dedicated AF_UNIX protocols instead?  One could be used by a
privileged process in the initial namespace to create a socket to
collect coredumps, and the other could be dedicatde to coredumped
proccesses.  Such (coredump collector) file descriptor or new (proxy)
socketpair ones could be passed to containers.

> 
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
>   the connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
> 
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind it's back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
> 
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via AF_UNIX socket directly.
> 
> - The pidfd for the crashing task will contain information how the task
>   coredumps. The PIDFD_GET_INFO ioctl gained a new flag
>   PIDFD_INFO_COREDUMP which can be used to retreive the coredump
>   information.
> 
>   If the coredump gets a new coredump client connection the kernel
>   guarantees that PIDFD_INFO_COREDUMP information is available.
>   Currently the following information is provided in the new
>   @coredump_mask extension to struct pidfd_info:
> 
>   * PIDFD_COREDUMPED is raised if the task did actually coredump.
>   * PIDFD_COREDUMP_SKIP	is raised if the task skipped coredumping (e.g.,
>     undumpable).
>   * PIDFD_COREDUMP_USER	is raised if this is a regular coredump and
>     doesn't need special care by the coredump server.
>   * IDFD_COREDUMP_ROOT is raised if the generated coredump should be
>     treated as sensitive and the coredump server should restrict to the
>     generated coredump to sufficiently privileged users.
> 
> - Since unix_stream_connect() runs bpf programs during connect it's
>   possible to even redirect or multiplex coredumps to other sockets.
> 
> - The coredump server should mark itself as non-dumpable.
>   To capture coredumps for the coredump server itself a bpf program
>   should be run at connect to redirect it to another socket in
>   userspace. This can be useful for debugging crashing coredump servers.
> 
> - A container coredump server in a separate network namespace can simply
>   bind to linuxafsk/coredump.socket and systemd-coredump fowards
>   coredumps to the container.
> 
> - Fwiw, one idea is to handle coredumps via per-user/session coredump
>   servers that run with that users privileges.
> 
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the users
>   own coredump handler which runs with the users privileges only.
> 
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
> 
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.
> 
> This is easy to test:
> 
> (a) coredump processing (we're using socat):
> 
>     > cat coredump_socket.sh
>     #!/bin/bash
> 
>     set -x
> 
>     sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
>     sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> 
> (b) trigger a coredump:
> 
>     user1@localhost:~/data/scripts$ cat crash.c
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     int main(int argc, char *argv[])
>     {
>             fprintf(stderr, "%u\n", (1 / 0));
>             _exit(0);
>     }
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Changes in v3:
> - Use an abstract unix socket.
> - Add documentation.
> - Add selftests.
> - Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org
> 
> Changes in v2:
> - Expose dumpability via PIDFD_GET_INFO.
> - Place COREDUMP_SOCK handling under CONFIG_UNIX.
> - Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org
> 
> ---
> Christian Brauner (10):
>       coredump: massage format_corname()
>       coredump: massage do_coredump()
>       net: reserve prefix
>       coredump: add coredump socket
>       coredump: validate socket name as it is written
>       coredump: show supported coredump modes
>       pidfs, coredump: add PIDFD_INFO_COREDUMP
>       net, pidfs, coredump: only allow coredumping tasks to connect to coredump socket
>       selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
>       selftests/coredump: add tests for AF_UNIX coredumps
> 
>  fs/coredump.c                                     | 358 +++++++++++++++++-----
>  fs/pidfs.c                                        |  68 ++++
>  include/linux/coredump.h                          |  12 +
>  include/linux/pidfs.h                             |   4 +
>  include/uapi/linux/pidfd.h                        |  16 +
>  include/uapi/linux/un.h                           |   2 +
>  net/unix/af_unix.c                                |  64 +++-
>  tools/testing/selftests/coredump/stackdump_test.c |  71 ++++-
>  tools/testing/selftests/pidfd/pidfd.h             |  22 ++
>  9 files changed, 528 insertions(+), 89 deletions(-)
> ---
> base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
> change-id: 20250429-work-coredump-socket-87cc0f17729c
>
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Jann Horn 9 months, 1 week ago
On Mon, May 5, 2025 at 4:41 PM Mickaël Salaün <mic@digikod.net> wrote:
> On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There's various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> >   upcall which are difficult for userspace to control correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping excercises in
> >   userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an abstract AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> >         @linuxafsk/coredump_socket
> >
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> >
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> >
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> >
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are acctually coredumping are allowed to connect to the initial coredump
> > socket.
>
> I think we should avoid using abstract UNIX sockets, especially for new
> interfaces, because it is hard to properly control such access.  Can we
> create new dedicated AF_UNIX protocols instead?  One could be used by a
> privileged process in the initial namespace to create a socket to
> collect coredumps, and the other could be dedicatde to coredumped
> proccesses.  Such (coredump collector) file descriptor or new (proxy)
> socketpair ones could be passed to containers.

I would agree with you if we were talking about designing a pure
userspace thing; but I think the limits that Christian added on bind()
and connect() to these special abstract names in this series
effectively make it behave as if they were dedicated AF_UNIX
protocols, and prevent things like random unprivileged userspace
processes bind()ing to them.
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Mickaël Salaün 9 months, 1 week ago
On Mon, May 05, 2025 at 04:59:41PM +0200, Jann Horn wrote:
> On Mon, May 5, 2025 at 4:41 PM Mickaël Salaün <mic@digikod.net> wrote:
> > On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > > Coredumping currently supports two modes:
> > >
> > > (1) Dumping directly into a file somewhere on the filesystem.
> > > (2) Dumping into a pipe connected to a usermode helper process
> > >     spawned as a child of the system_unbound_wq or kthreadd.
> > >
> > > For simplicity I'm mostly ignoring (1). There's probably still some
> > > users of (1) out there but processing coredumps in this way can be
> > > considered adventurous especially in the face of set*id binaries.
> > >
> > > The most common option should be (2) by now. It works by allowing
> > > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > >
> > >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > >
> > > The "|" at the beginning indicates to the kernel that a pipe must be
> > > used. The path following the pipe indicator is a path to a binary that
> > > will be spawned as a usermode helper process. Any additional parameters
> > > pass information about the task that is generating the coredump to the
> > > binary that processes the coredump.
> > >
> > > In the example core_pattern shown above systemd-coredump is spawned as a
> > > usermode helper. There's various conceptual consequences of this
> > > (non-exhaustive list):
> > >
> > > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > >   connected to the read-end of the pipe. All other file descriptors are
> > >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > >   already caused bugs because userspace assumed that this cannot happen
> > >   (Whether or not this is a sane assumption is irrelevant.).
> > >
> > > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > >   it is not a child of any userspace process and specifically not a
> > >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> > >   upcall which are difficult for userspace to control correctly.
> > >
> > > - systemd-coredump is spawned with full kernel privileges. This
> > >   necessitates all kinds of weird privilege dropping excercises in
> > >   userspace to make this safe.
> > >
> > > - A new usermode helper has to be spawned for each crashing process.
> > >
> > > This series adds a new mode:
> > >
> > > (3) Dumping into an abstract AF_UNIX socket.
> > >
> > > Userspace can set /proc/sys/kernel/core_pattern to:
> > >
> > >         @linuxafsk/coredump_socket
> > >
> > > The "@" at the beginning indicates to the kernel that the abstract
> > > AF_UNIX coredump socket will be used to process coredumps.
> > >
> > > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > > for now.
> > >
> > > The coredump socket is located in the initial network namespace. To bind
> > > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > > user namespace. Listening and reading can happen from whatever
> > > unprivileged context is necessary to safely process coredumps.
> > >
> > > When a task coredumps it opens a client socket in the initial network
> > > namespace and connects to the coredump socket. For now only tasks that
> > > are acctually coredumping are allowed to connect to the initial coredump
> > > socket.
> >
> > I think we should avoid using abstract UNIX sockets, especially for new
> > interfaces, because it is hard to properly control such access.  Can we
> > create new dedicated AF_UNIX protocols instead?  One could be used by a
> > privileged process in the initial namespace to create a socket to
> > collect coredumps, and the other could be dedicatde to coredumped
> > proccesses.  Such (coredump collector) file descriptor or new (proxy)
> > socketpair ones could be passed to containers.
> 
> I would agree with you if we were talking about designing a pure
> userspace thing; but I think the limits that Christian added on bind()
> and connect() to these special abstract names in this series
> effectively make it behave as if they were dedicated AF_UNIX
> protocols, and prevent things like random unprivileged userspace
> processes bind()ing to them.

OK, so why not create a proper protocol?  That should also simplify the
interface.
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Christian Brauner 9 months, 1 week ago
On Mon, May 05, 2025 at 04:41:28PM +0200, Mickaël Salaün wrote:
> On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > Coredumping currently supports two modes:
> > 
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> > 
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> > 
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > 
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > 
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> > 
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There's various conceptual consequences of this
> > (non-exhaustive list):
> > 
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> > 
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> >   upcall which are difficult for userspace to control correctly.
> > 
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping excercises in
> >   userspace to make this safe.
> > 
> > - A new usermode helper has to be spawned for each crashing process.
> > 
> > This series adds a new mode:
> > 
> > (3) Dumping into an abstract AF_UNIX socket.
> > 
> > Userspace can set /proc/sys/kernel/core_pattern to:
> > 
> >         @linuxafsk/coredump_socket
> > 
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> > 
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> > 
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> > 
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are acctually coredumping are allowed to connect to the initial coredump
> > socket.
> 
> I think we should avoid using abstract UNIX sockets, especially for new

Abstract unix sockets are at the core of a modern Linux system. During
boot alone about 100 or so are created on a modern system when I counted
during testing. Sorry, but this is a no-show argument.
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Mickaël Salaün 9 months, 1 week ago
On Mon, May 05, 2025 at 04:56:04PM +0200, Christian Brauner wrote:
> On Mon, May 05, 2025 at 04:41:28PM +0200, Mickaël Salaün wrote:
> > On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > > Coredumping currently supports two modes:
> > > 
> > > (1) Dumping directly into a file somewhere on the filesystem.
> > > (2) Dumping into a pipe connected to a usermode helper process
> > >     spawned as a child of the system_unbound_wq or kthreadd.
> > > 
> > > For simplicity I'm mostly ignoring (1). There's probably still some
> > > users of (1) out there but processing coredumps in this way can be
> > > considered adventurous especially in the face of set*id binaries.
> > > 
> > > The most common option should be (2) by now. It works by allowing
> > > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > > 
> > >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > > 
> > > The "|" at the beginning indicates to the kernel that a pipe must be
> > > used. The path following the pipe indicator is a path to a binary that
> > > will be spawned as a usermode helper process. Any additional parameters
> > > pass information about the task that is generating the coredump to the
> > > binary that processes the coredump.
> > > 
> > > In the example core_pattern shown above systemd-coredump is spawned as a
> > > usermode helper. There's various conceptual consequences of this
> > > (non-exhaustive list):
> > > 
> > > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > >   connected to the read-end of the pipe. All other file descriptors are
> > >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > >   already caused bugs because userspace assumed that this cannot happen
> > >   (Whether or not this is a sane assumption is irrelevant.).
> > > 
> > > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > >   it is not a child of any userspace process and specifically not a
> > >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> > >   upcall which are difficult for userspace to control correctly.
> > > 
> > > - systemd-coredump is spawned with full kernel privileges. This
> > >   necessitates all kinds of weird privilege dropping excercises in
> > >   userspace to make this safe.
> > > 
> > > - A new usermode helper has to be spawned for each crashing process.
> > > 
> > > This series adds a new mode:
> > > 
> > > (3) Dumping into an abstract AF_UNIX socket.
> > > 
> > > Userspace can set /proc/sys/kernel/core_pattern to:
> > > 
> > >         @linuxafsk/coredump_socket
> > > 
> > > The "@" at the beginning indicates to the kernel that the abstract
> > > AF_UNIX coredump socket will be used to process coredumps.
> > > 
> > > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > > for now.
> > > 
> > > The coredump socket is located in the initial network namespace. To bind
> > > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > > user namespace. Listening and reading can happen from whatever
> > > unprivileged context is necessary to safely process coredumps.
> > > 
> > > When a task coredumps it opens a client socket in the initial network
> > > namespace and connects to the coredump socket. For now only tasks that
> > > are acctually coredumping are allowed to connect to the initial coredump
> > > socket.
> > 
> > I think we should avoid using abstract UNIX sockets, especially for new
> 
> Abstract unix sockets are at the core of a modern Linux system. During
> boot alone about 100 or so are created on a modern system when I counted
> during testing. Sorry, but this is a no-show argument.

These kind of socket being used does not mean they should be used for
new interfaces. :)

AFAIK, these socket types are currently only used for IPC, not between a
kernel interface and user space.  This patch series changes this
assumption.

Security policies already in place can block abstract connections, and
it might not be possible to differenciate between a kernel or user space
peer with the current configuration.  Please Cc the LSM mailing list for
such new interfaces.

You cut and ignored most of my reply, which explained my reasoning and
proposed an alternative:
> > interfaces, because it is hard to properly control such access.  Can we
> > create new dedicated AF_UNIX protocols instead?  One could be used by a
> > privileged process in the initial namespace to create a socket to
> > collect coredumps, and the other could be dedicatde to coredumped
> > proccesses.  Such (coredump collector) file descriptor or new (proxy)
> > socketpair ones could be passed to containers.

Only one new "protocol" would be required though (because the client
side is created by the kernel).  That would be a backward compatible
change, and such socket type could easily be identified by other part of
the kernel, including access control mechanisms.
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Kuniyuki Iwashima 9 months, 1 week ago
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 05 May 2025 13:13:38 +0200
> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall which are difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping excercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an abstract AF_UNIX socket.
> 
> Userspace can set /proc/sys/kernel/core_pattern to:
> 
>         @linuxafsk/coredump_socket
> 
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps.
> 
> The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> for now.

What's behind this dicision from v2 ?

/proc/sys/kernel/core_pattern can only be set by Administrator
and I don't see the point in having this limitation on the
AF_UNIX side.



> 
> The coredump socket is located in the initial network namespace.

I understand this is a reasonable decision to avoid complicated
path management in the mount ns but keep connectivity from any
namespace.


> To bind
> the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> user namespace. Listening and reading can happen from whatever
> unprivileged context is necessary to safely process coredumps.
> 
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket. For now only tasks that
> are acctually coredumping are allowed to connect to the initial coredump
> socket.

This can be controlled by BPF (cgroup sockops or LSM) if a user
really cares about spam clients.

I think how to set up coredump is userspace responsibility.


> 
> - The coredump server should use SO_PEERPIDFD to get a stable handle on
>   the connected crashing task. The retrieved pidfd will provide a stable
>   reference even if the crashing task gets SIGKILLed while generating
>   the coredump.
> 
> - By setting core_pipe_limit non-zero userspace can guarantee that the
>   crashing task cannot be reaped behind it's back and thus process all
>   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
>   detect whether /proc/<pid> still refers to the same process.
> 
>   The core_pipe_limit isn't used to rate-limit connections to the
>   socket. This can simply be done via AF_UNIX socket directly.
> 
> - The pidfd for the crashing task will contain information how the task
>   coredumps. The PIDFD_GET_INFO ioctl gained a new flag
>   PIDFD_INFO_COREDUMP which can be used to retreive the coredump
>   information.
> 
>   If the coredump gets a new coredump client connection the kernel
>   guarantees that PIDFD_INFO_COREDUMP information is available.
>   Currently the following information is provided in the new
>   @coredump_mask extension to struct pidfd_info:
> 
>   * PIDFD_COREDUMPED is raised if the task did actually coredump.
>   * PIDFD_COREDUMP_SKIP	is raised if the task skipped coredumping (e.g.,
>     undumpable).
>   * PIDFD_COREDUMP_USER	is raised if this is a regular coredump and
>     doesn't need special care by the coredump server.
>   * IDFD_COREDUMP_ROOT is raised if the generated coredump should be
>     treated as sensitive and the coredump server should restrict to the
>     generated coredump to sufficiently privileged users.
> 
> - Since unix_stream_connect() runs bpf programs during connect it's
>   possible to even redirect or multiplex coredumps to other sockets.

If the socket is in a cgroup, yes, and even if not, BPF LSM can
reject some requests.


> 
> - The coredump server should mark itself as non-dumpable.
>   To capture coredumps for the coredump server itself a bpf program
>   should be run at connect to redirect it to another socket in
>   userspace. This can be useful for debugging crashing coredump servers.
> 
> - A container coredump server in a separate network namespace can simply
>   bind to linuxafsk/coredump.socket and systemd-coredump fowards
>   coredumps to the container.

I think the name should be also configurable in non-initial netns.


> 
> - Fwiw, one idea is to handle coredumps via per-user/session coredump
>   servers that run with that users privileges.
> 
>   The coredump server listens on the coredump socket and accepts a
>   new coredump connection. It then retrieves SO_PEERPIDFD for the
>   client, inspects uid/gid and hands the accepted client to the users
>   own coredump handler which runs with the users privileges only.
> 
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super privileged coredumping helpers.
> 
> This will also be significantly more lightweight since no fork()+exec()
> for the usermodehelper is required for each crashing process. The
> coredump server in userspace can just keep a worker pool.
> 
> This is easy to test:
> 
> (a) coredump processing (we're using socat):
> 
>     > cat coredump_socket.sh
>     #!/bin/bash
> 
>     set -x
> 
>     sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern"
>     sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc
> 
> (b) trigger a coredump:
> 
>     user1@localhost:~/data/scripts$ cat crash.c
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     int main(int argc, char *argv[])
>     {
>             fprintf(stderr, "%u\n", (1 / 0));
>             _exit(0);
>     }
Re: [PATCH RFC v3 00/10] coredump: add coredump socket
Posted by Christian Brauner 9 months, 1 week ago
> > The coredump socket is located in the initial network namespace.
> 
> I understand this is a reasonable decision to avoid complicated
> path management in the mount ns but keep connectivity from any
> namespace.

Yes, path lookup would not just be horrid it would also require playing
around with credentials and current->fs. The beauty in this
implementation is that its the crash dumping process itself which does
everything.

> > To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> > 
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are acctually coredumping are allowed to connect to the initial coredump
> > socket.
> 
> This can be controlled by BPF (cgroup sockops or LSM) if a user
> really cares about spam clients.
> 
> I think how to set up coredump is userspace responsibility.

I'll reply to that in the other thread so we don't have millions of
branch points.

> > - Since unix_stream_connect() runs bpf programs during connect it's
> >   possible to even redirect or multiplex coredumps to other sockets.
> 
> If the socket is in a cgroup, yes, and even if not, BPF LSM can
> reject some requests.

Indeed. I've outlined that in an earlier version as well.

> > - The coredump server should mark itself as non-dumpable.
> >   To capture coredumps for the coredump server itself a bpf program
> >   should be run at connect to redirect it to another socket in
> >   userspace. This can be useful for debugging crashing coredump servers.
> > 
> > - A container coredump server in a separate network namespace can simply
> >   bind to linuxafsk/coredump.socket and systemd-coredump fowards
> >   coredumps to the container.
> 
> I think the name should be also configurable in non-initial netns.

I don't see a good reason for this. We can always relax that later if we
have to. The fixed address keeps the coredump setup very very dumb and
simple.