I need some help with the following questions:
(i) The core_pipe_limit setting is of vital importance to userspace
    because it a) allows userspace to limit the number of concurrent
    coredumps and b) causes the kernel to wait until userspace closes
    the pipe, which prevents the process from being reaped and allows
    userspace to parse information out of /proc/&lt;pid&gt;/.
    Pipes already support this. I need to know from the networking
    people (or Oleg :)) how to wait for the userspace side to shut down
    the socket/terminate the connection.
I don't want to just read() because then userspace can send us
SCM_RIGHTS messages and it's really ugly anyway.
(ii) The dumpability setting matters to userspace because it indicates
     how a given binary is dumped: as a regular user or as root. This
     helps guard against exploits abusing set*id binaries. The setting
     needs to be the same as the one in effect at the time of the
     coredump.
I'm exposing this as part of PIDFD_GET_INFO. I would like some
input whether it's fine to simply expose the dumpability this way.
I'm pretty sure it is. But it'd be good to have @Jann give his
thoughts here.
Now the actual blurb:
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There are probably still some
users of (1) out there, but processing coredumps this way can be
considered adventurous, especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In this case systemd-coredump is spawned as a usermode helper. There are
various conceptual consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor 0 (stdin) connected
  to the read-end of the pipe. All other file descriptors are closed.
  That specifically includes 1 (stdout) and 2 (stderr). This has already
  caused bugs because userspace assumed that this cannot happen (whether
  or not this is a sane assumption is irrelevant).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process, and specifically not a
  child of PID 1, so it cannot be waited upon and is in general a weird
  hybrid upcall.
- systemd-coredump is spawned highly privileged, as it is spawned with
  full kernel credentials, requiring all kinds of weird privilege
  dropping exercises in userspace.
This adds another mode:
(3) Dumping into an AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
:/run/coredump.socket
The ":" at the beginning indicates to the kernel that an AF_UNIX socket
is used to process coredumps. The task generating the coredump simply
connects to the socket and writes the coredump into the socket.
Userspace can get a stable handle on the task generating the coredump by
using the SO_PEERPIDFD socket option. SO_PEERPIDFD uses the thread-group
leader pid stashed during connect(). Even if the task generating the
coredump is a subthread in the thread-group, the pidfd of the
thread-group leader is a reliable, stable handle. Userspace that's
interested in the credentials of the specific thread that crashed can
use SCM_PIDFD to retrieve them.
The pidfd can be used to safely open and parse /proc/<pid> of the task
and it can also be used to retrieve additional meta information via the
PIDFD_GET_INFO ioctl().
This will allow userspace to stop relying on usermode helpers for
processing coredumps and thus to stop having to handle super-privileged
coredumping helpers.
This is easy to test:
(a) coredump processing (we're using socat):
> cat coredump_socket.sh
#!/bin/bash
set -x
sudo bash -c "echo ':/tmp/stream.sock' > /proc/sys/kernel/core_pattern"
socat --statistics unix-listen:/tmp/stream.sock,fork FILE:core_file,create,append,truncate
(b) trigger a coredump:
user1@localhost:~/data/scripts$ cat crash.c
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	volatile int zero = 0; /* keep the division from being folded at compile time */
	fprintf(stderr, "%u\n", (1 / zero));
	_exit(0);
}
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Expose dumpability via PIDFD_GET_INFO.
- Place COREDUMP_SOCK handling under CONFIG_UNIX.
- Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@kernel.org
---
Christian Brauner (6):
coredump: massage format_corname()
coredump: massage do_coredump()
coredump: support AF_UNIX sockets
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
selftests/coredump: add tests for AF_UNIX coredumps
fs/coredump.c | 312 ++++++++++++++++------
fs/pidfs.c | 58 ++++
include/linux/pidfs.h | 3 +
include/uapi/linux/pidfd.h | 11 +
tools/testing/selftests/coredump/stackdump_test.c | 50 ++++
5 files changed, 359 insertions(+), 75 deletions(-)
---
base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217
change-id: 20250429-work-coredump-socket-87cc0f17729c
On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@kernel.org> wrote:
[...]
> (ii) The dumpability setting is of importance for userspace in order to
>      know how a given binary is dumped: as regular user or as root user.
[...]
> I'm exposing this as part of PIDFD_GET_INFO. I would like some
> input whether it's fine to simply expose the dumpability this way.
> I'm pretty sure it is. But it'd be good to have @Jann give his
> thoughts here.

My only concern here is that if we expect the userspace daemon to look
at the dumpability field and treat nondumpable tasks as "this may
contain secret data and resources owned by various UIDs mixed
together, only root should see the dump", we should have at least very
clear documentation around this.

[...]
> Userspace can get a stable handle on the task generating the coredump by
> using the SO_PEERPIDFD socket option. SO_PEERPIDFD uses the thread-group
> leader pid stashed during connect(). Even if the task generating the

Unrelated to this series: Huh, I think I haven't seen SO_PEERPIDFD
before. I guess one interesting consequence of that feature is that if
you get a unix domain socket whose peer is in another PID namespace,
you can call pidfd_getfd() on that peer, which wouldn't normally be
possible? Though of course it'll still be subject to the normal ptrace
checks.

On Fri, May 02, 2025 at 04:04:28PM +0200, Jann Horn wrote:
[...]
> Unrelated to this series: Huh, I think I haven't seen SO_PEERPIDFD
> before.

It's very heavily used by dbus-broker, polkit and systemd to safely
authenticate clients instead of by PIDs. (Fyi, it's even supported for
bluetooth sockets so they could benefit from this as well I'm sure.)

> I guess one interesting consequence of that feature is that if
> you get a unix domain socket whose peer is in another PID namespace,
> you can call pidfd_getfd() on that peer, which wouldn't normally be
> possible? Though of course it'll still be subject to the normal ptrace
> checks.

I think that was already possible because you could send pidfds via
SCM_RIGHTS. That's a lot more cooperative than SO_PEERPIDFD of course
but still. But if that's an issue we could of course enforce that
pidfd_getfd() may only work if the target is within your pidns
hierarchy just as we do for the PIDFD_GET_INFO ioctl() already. But
I'm not sure it's an issue.