[PATCH v8 0/9] coredump: add coredump socket

Christian Brauner posted 9 patches 7 months ago
fs/coredump.c                                     | 381 +++++++++++++-----
fs/pidfs.c                                        |  55 +++
include/linux/net.h                               |   1 +
include/linux/pidfs.h                             |   5 +
include/uapi/linux/pidfd.h                        |  16 +
net/unix/af_unix.c                                |  54 ++-
tools/testing/selftests/coredump/stackdump_test.c | 467 +++++++++++++++++++++-
tools/testing/selftests/pidfd/pidfd.h             |  22 +
8 files changed, 895 insertions(+), 106 deletions(-)
[PATCH v8 0/9] coredump: add coredump socket
Posted by Christian Brauner 7 months ago
Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (Whether or not this is a sane assumption is irrelevant.).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is in a weird hybrid
  upcall which are difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping excercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.

The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.

- The coredump server should use SO_PEERPIDFD to get a stable handle on
  the connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind it's back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX socket directly.

- The pidfd for the crashing task will contain information how the task
  coredumps. The PIDFD_GET_INFO ioctl gained a new flag
  PIDFD_INFO_COREDUMP which can be used to retreive the coredump
  information.

  If the coredump gets a new coredump client connection the kernel
  guarantees that PIDFD_INFO_COREDUMP information is available.
  Currently the following information is provided in the new
  @coredump_mask extension to struct pidfd_info:

  * PIDFD_COREDUMPED is raised if the task did actually coredump.
  * PIDFD_COREDUMP_SKIP	is raised if the task skipped coredumping (e.g.,
    undumpable).
  * PIDFD_COREDUMP_USER	is raised if this is a regular coredump and
    doesn't need special care by the coredump server.
  * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
    treated as sensitive and the coredump server should restrict access
    to the generated coredump to sufficiently privileged users.

- The coredump server should mark itself as non-dumpable.

- A container coredump server in a separate network namespace can simply
  bind to another well-know address and systemd-coredump fowards
  coredumps to the container.

- Coredumps could in the future also be handled via per-user/session
  coredump servers that run only with that users privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the users
  own coredump handler which runs with the users privileges only
  (It must of coure pay close attention to not forward crashing suid
  binaries.).

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v8:
- Drop coredump_cookie now that we don't need it anymore. Connections
  can just be filtered by removing the ability to connect to the socket
  path.
- Fix a few minor bugs.
- Link to v7: https://lore.kernel.org/20250515-work-coredump-socket-v7-0-0a1329496c31@kernel.org

Changes in v7:
- Use regular AF_UNIX sockets instead of abstract AF_UNIX sockets. This
  fixes the permission problems as userspace can ensure that the socket
  path cannot be rebound by arbitrary unprivileged userspace via regular
  path permissions.

  This means:
  - We don't require privilege checks on a reserved abstract AF_UNIX
    namespace
  - We don't require a fixed address for the coredump socket.
  - We don't need to use abstract unix sockets at all.
  - We don't need  special socket cookie magic in the
    /proc/sys/kernel/core_pattern handler.
  - We are able to set /proc/sys/kernel/core_pattern statically without
    having any socket bound.

  That's all complaints addressed.

  Simply massage unix_find_bsd() to be able to handle this and always
  lookup the coredump socket in the initial mount namespace with
  appropriate credentials. The same thing we do for looking up other
  parts in the kernel like this. Only the lookup happens this way.
  Actual connection credentials are obviously from the coredumping task.
- Link to v6: https://lore.kernel.org/20250512-work-coredump-socket-v6-0-c51bc3450727@kernel.org

Changes in v6:
- Use the socket cookie to verify the coredump server.
- Link to v5: https://lore.kernel.org/20250509-work-coredump-socket-v5-0-23c5b14df1bc@kernel.org

Changes in v5:
- Don't use a prefix just the specific address.
- Link to v4: https://lore.kernel.org/20250507-work-coredump-socket-v4-0-af0ef317b2d0@kernel.org

Changes in v4:
- Expose the coredump socket cookie through the pidfd. This allows the
  coredump server to easily recognize coredump socket connections.
- Link to v3: https://lore.kernel.org/20250505-work-coredump-socket-v3-0-e1832f0e1eae@kernel.org

Changes in v3:
- Use an abstract unix socket.
- Add documentation.
- Add selftests.
- Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@kernel.org

Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org

---
Christian Brauner (9):
      coredump: massage format_corename()
      coredump: massage do_coredump()
      coredump: reflow dump helpers a little
      coredump: add coredump socket
      pidfs, coredump: add PIDFD_INFO_COREDUMP
      coredump: show supported coredump modes
      coredump: validate socket name as it is written
      selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
      selftests/coredump: add tests for AF_UNIX coredumps

 fs/coredump.c                                     | 381 +++++++++++++-----
 fs/pidfs.c                                        |  55 +++
 include/linux/net.h                               |   1 +
 include/linux/pidfs.h                             |   5 +
 include/uapi/linux/pidfd.h                        |  16 +
 net/unix/af_unix.c                                |  54 ++-
 tools/testing/selftests/coredump/stackdump_test.c | 467 +++++++++++++++++++++-
 tools/testing/selftests/pidfd/pidfd.h             |  22 +
 8 files changed, 895 insertions(+), 106 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c
Re: [PATCH v8 0/9] coredump: add coredump socket
Posted by Stephen Hemminger 6 months, 4 weeks ago
On Fri, 16 May 2025 13:25:27 +0200
Christian Brauner <brauner@kernel.org> wrote:

> Coredumping currently supports two modes:
> 
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
>     spawned as a child of the system_unbound_wq or kthreadd.
> 
> For simplicity I'm mostly ignoring (1). There's probably still some
> users of (1) out there but processing coredumps in this way can be
> considered adventurous especially in the face of set*id binaries.
> 
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
> 
>         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> 
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
> 
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There's various conceptual consequences of this
> (non-exhaustive list):
> 
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
>   connected to the read-end of the pipe. All other file descriptors are
>   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
>   already caused bugs because userspace assumed that this cannot happen
>   (Whether or not this is a sane assumption is irrelevant.).
> 
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
>   it is not a child of any userspace process and specifically not a
>   child of PID 1. It cannot be waited upon and is in a weird hybrid
>   upcall which are difficult for userspace to control correctly.
> 
> - systemd-coredump is spawned with full kernel privileges. This
>   necessitates all kinds of weird privilege dropping excercises in
>   userspace to make this safe.
> 
> - A new usermode helper has to be spawned for each crashing process.
> 
> This series adds a new mode:
> 
> (3) Dumping into an AF_UNIX socket.
> 
> Userspace can set /proc/sys/kernel/core_pattern to:
> 
>         @/path/to/coredump.socket
> 
> The "@" at the beginning indicates to the kernel that an AF_UNIX
> coredump socket will be used to process coredumps.
> 
> The coredump socket must be located in the initial mount namespace.
> When a task coredumps it opens a client socket in the initial network
> namespace and connects to the coredump socket.


There is a problem with using @ as naming convention.
The starting character of @ is already used to indicate abstract
unix domain sockets in some programs like ss.
And will the new coredump socekt allow use of abstrace unix
domain sockets?
Re: [PATCH v8 0/9] coredump: add coredump socket
Posted by Christian Brauner 6 months, 3 weeks ago
On Tue, May 20, 2025 at 12:28:38PM -0700, Stephen Hemminger wrote:
> On Fri, 16 May 2025 13:25:27 +0200
> Christian Brauner <brauner@kernel.org> wrote:
> 
> > Coredumping currently supports two modes:
> > 
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> > 
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> > 
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > 
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > 
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> > 
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There's various conceptual consequences of this
> > (non-exhaustive list):
> > 
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> > 
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> >   upcall which are difficult for userspace to control correctly.
> > 
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping excercises in
> >   userspace to make this safe.
> > 
> > - A new usermode helper has to be spawned for each crashing process.
> > 
> > This series adds a new mode:
> > 
> > (3) Dumping into an AF_UNIX socket.
> > 
> > Userspace can set /proc/sys/kernel/core_pattern to:
> > 
> >         @/path/to/coredump.socket
> > 
> > The "@" at the beginning indicates to the kernel that an AF_UNIX
> > coredump socket will be used to process coredumps.
> > 
> > The coredump socket must be located in the initial mount namespace.
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket.
> 
> 
> There is a problem with using @ as naming convention.
> The starting character of @ is already used to indicate abstract
> unix domain sockets in some programs like ss.

This shouldn't be a problem. First because @ isn't part of the actual
AF_UNIX path. But mostly because ss and other network related tools have
no relationship with /proc/sys/kernel/core_pattern whatsoever. I'm not
opposed to changing it if people do care strongly about it and send a
patch. But that will happen as a fixup after the merge window.

> And will the new coredump socekt allow use of abstrace unix
> domain sockets?

No. There's no safe permission model without involving LSMs.
Unprivileged attackers can recycle the socket address and use it to get
(suid) coredumps forwarded to them when the server crashes or restarts.
Re: [PATCH v8 0/9] coredump: add coredump socket
Posted by Kuniyuki Iwashima 6 months, 4 weeks ago
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Tue, 20 May 2025 12:28:38 -0700
> On Fri, 16 May 2025 13:25:27 +0200
> Christian Brauner <brauner@kernel.org> wrote:
> 
> > Coredumping currently supports two modes:
> > 
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> > 
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> > 
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > 
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > 
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> > 
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There's various conceptual consequences of this
> > (non-exhaustive list):
> > 
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> > 
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> >   upcall which are difficult for userspace to control correctly.
> > 
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping excercises in
> >   userspace to make this safe.
> > 
> > - A new usermode helper has to be spawned for each crashing process.
> > 
> > This series adds a new mode:
> > 
> > (3) Dumping into an AF_UNIX socket.
> > 
> > Userspace can set /proc/sys/kernel/core_pattern to:
> > 
> >         @/path/to/coredump.socket
> > 
> > The "@" at the beginning indicates to the kernel that an AF_UNIX
> > coredump socket will be used to process coredumps.
> > 
> > The coredump socket must be located in the initial mount namespace.
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket.
> 
> 
> There is a problem with using @ as naming convention.
> The starting character of @ is already used to indicate abstract
> unix domain sockets in some programs like ss.
> And will the new coredump socekt allow use of abstrace unix
> domain sockets?

The coredump only works with the pathname socket, so ideally
the prefix should be '/', but it's same with the direct-file
coredump.  We can distinguish the socket by S_ISSOCK() though.
Re: [PATCH v8 0/9] coredump: add coredump socket
Posted by Jann Horn 6 months, 4 weeks ago
On Wed, May 21, 2025 at 2:42 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Tue, 20 May 2025 12:28:38 -0700
> > On Fri, 16 May 2025 13:25:27 +0200
> > Christian Brauner <brauner@kernel.org> wrote:
> >
> > > Coredumping currently supports two modes:
> > >
> > > (1) Dumping directly into a file somewhere on the filesystem.
> > > (2) Dumping into a pipe connected to a usermode helper process
> > >     spawned as a child of the system_unbound_wq or kthreadd.
> > >
> > > For simplicity I'm mostly ignoring (1). There's probably still some
> > > users of (1) out there but processing coredumps in this way can be
> > > considered adventurous especially in the face of set*id binaries.
> > >
> > > The most common option should be (2) by now. It works by allowing
> > > userspace to put a string into /proc/sys/kernel/core_pattern like:
> > >
> > >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> > >
> > > The "|" at the beginning indicates to the kernel that a pipe must be
> > > used. The path following the pipe indicator is a path to a binary that
> > > will be spawned as a usermode helper process. Any additional parameters
> > > pass information about the task that is generating the coredump to the
> > > binary that processes the coredump.
> > >
> > > In the example core_pattern shown above systemd-coredump is spawned as a
> > > usermode helper. There's various conceptual consequences of this
> > > (non-exhaustive list):
> > >
> > > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> > >   connected to the read-end of the pipe. All other file descriptors are
> > >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> > >   already caused bugs because userspace assumed that this cannot happen
> > >   (Whether or not this is a sane assumption is irrelevant.).
> > >
> > > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> > >   it is not a child of any userspace process and specifically not a
> > >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> > >   upcall which are difficult for userspace to control correctly.
> > >
> > > - systemd-coredump is spawned with full kernel privileges. This
> > >   necessitates all kinds of weird privilege dropping excercises in
> > >   userspace to make this safe.
> > >
> > > - A new usermode helper has to be spawned for each crashing process.
> > >
> > > This series adds a new mode:
> > >
> > > (3) Dumping into an AF_UNIX socket.
> > >
> > > Userspace can set /proc/sys/kernel/core_pattern to:
> > >
> > >         @/path/to/coredump.socket
> > >
> > > The "@" at the beginning indicates to the kernel that an AF_UNIX
> > > coredump socket will be used to process coredumps.
> > >
> > > The coredump socket must be located in the initial mount namespace.
> > > When a task coredumps it opens a client socket in the initial network
> > > namespace and connects to the coredump socket.
> >
> >
> > There is a problem with using @ as naming convention.
> > The starting character of @ is already used to indicate abstract
> > unix domain sockets in some programs like ss.
> > And will the new coredump socekt allow use of abstrace unix
> > domain sockets?
>
> The coredump only works with the pathname socket, so ideally
> the prefix should be '/', but it's same with the direct-file
> coredump.  We can distinguish the socket by S_ISSOCK() though.

The path lookups work very differently between COREDUMP_SOCK and
COREDUMP_FILE - they are interpreted relative to different namespaces,
and they run with different privileges, and they do different format
string interpretation. I think trying to determine dynamically whether
the path refers to a socket or to a nonexistent location at which we
should create a file (or a preexisting file we should clobber) would
not be practical, partly for these reasons.

Also, fundamentally, if we have the choice between letting userspace
be explicit about what it wants, or trying to guess userspace's intent
from the kernel, I think we should always go for being explicit.

So I guess it could be reasonable to bikeshed the prefix letter and
turn '@' into some other character that is not overloaded with another
meaning in this context, like '>'; but I don't think we should be
changing the overall approach because of this.
Re: [PATCH v8 0/9] coredump: add coredump socket
Posted by Christian Brauner 6 months, 3 weeks ago
> The path lookups work very differently between COREDUMP_SOCK and
> COREDUMP_FILE - they are interpreted relative to different namespaces,
> and they run with different privileges, and they do different format
> string interpretation. I think trying to determine dynamically whether
> the path refers to a socket or to a nonexistent location at which we
> should create a file (or a preexisting file we should clobber) would
> not be practical, partly for these reasons.

Agreed.

> 
> Also, fundamentally, if we have the choice between letting userspace
> be explicit about what it wants, or trying to guess userspace's intent
> from the kernel, I think we should always go for being explicit.

Agreed.

> 
> meaning in this context, like '>'; but I don't think we should be
> changing the overall approach because of this.

Agreed.