[v1] procfs: make reference pidns more user-visible

[PATCH RFC 0/4] procfs: make reference pidns more user-visible

Posted by Aleksa Sarai 6 months, 3 weeks ago

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
This has historically meant that userspace was required to do some
special dances in order to configure the pidns of a procfs mount as
desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates
   a procfs mount from an empty pidns so that user namespaced containers
   can be nested (without this, the nested containers would fail to
   mount procfs). But this requires forking off a helper process because
   you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before
   configuring its mounts, which can lead to security issues in the case
   of shared-pidns containers (a privileged process in the pidns can
   interact with your container runtime process). While
   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
   strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of. This fulfils the
requirements of container runtimes, but I suspect that this may be too
strict for some usecases.

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions.

In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.

It's not quite clear what is the correct security model for this API,
but the current approach I've taken is to:

 * Make the ioctl only valid on the root (meaning that a process without
   access to the procfs root -- such as only having an fd to a procfs
   file or some open_tree(2)-like subset -- cannot use this API).

 * Require that the process requesting either has access to
   /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
   pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
   access to it and can join it if they had a handle), or is in a pidns
   that is a direct ancestor of the target pidns (i.e. all of the pids
   are already visible in the procfs for the current process's pidns).

The security model for this is a little loose, as it seems to me that
all of the cases mentioned are valid cases to allow access, but I'm open
to suggestions for whether we need to make this stricter or looser.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add pidns= mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  10 ++
 fs/proc/root.c                            | 132 +++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   3 +
 kernel/pid_namespace.c                    |  21 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
 8 files changed, 448 insertions(+), 15 deletions(-)
---
base-commit: 4c838c7672c39ec6ec48456c6ce22d14a68f4cda
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>

Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

Posted by Andy Lutomirski 6 months, 3 weeks ago

On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> This has historically meant that userspace was required to do some
> special dances in order to configure the pidns of a procfs mount as
> desired. Examples include:
>
>  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
>    a procfs mount from an empty pidns so that user namespaced containers
>    can be nested (without this, the nested containers would fail to
>    mount procfs). But this requires forking off a helper process because
>    you cannot just one-shot this using mount(2).
>
>  * Container runtimes in general need to fork into a container before
>    configuring its mounts, which can lead to security issues in the case
>    of shared-pidns containers (a privileged process in the pidns can
>    interact with your container runtime process). While
>    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
>    strict need for this due to a minor uAPI wart is kind of unfortunate.
>
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
>
>     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
>
> or classic mount(2) / mount(8):
>
>     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
>
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of. This fulfils the
> requirements of container runtimes, but I suspect that this may be too
> strict for some usecases.
>
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions.
>
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
>
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
>
> It's not quite clear what is the correct security model for this API,
> but the current approach I've taken is to:
>
>  * Make the ioctl only valid on the root (meaning that a process without
>    access to the procfs root -- such as only having an fd to a procfs
>    file or some open_tree(2)-like subset -- cannot use this API).
>
>  * Require that the process requesting either has access to
>    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
>    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
>    access to it and can join it if they had a handle), or is in a pidns
>    that is a direct ancestor of the target pidns (i.e. all of the pids
>    are already visible in the procfs for the current process's pidns).

What's the motivation for the ptrace-read option?  While I don't see
an attack off the top of my head, it seems like creating a procfs
mount may give write-ish access to things in the pidns (because the
creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
access to namespace-wide things that aren't inherently visible to
PID1.

Even the ancestor check seems dicey.  Imagine that uid 1000 makes an
unprivileged container complete with a userns.  Then uid 1001 (outside
the container) makes its own userns and mountns but stays in the init
pidns and then mounts (and owns, with all filesystem-related
capabilities) that mount.  Is this really safe?

CAP_SYS_ADMIN seems about right.

--Andy

Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

Posted by Aleksa Sarai 6 months, 3 weeks ago

On 2025-07-21, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > Ever since the introduction of pid namespaces, procfs has had very
> > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > auto-selected based on the mounting process's active pidns, and the
> > pidns itself is basically hidden once the mount has been constructed).
> > This has historically meant that userspace was required to do some
> > special dances in order to configure the pidns of a procfs mount as
> > desired. Examples include:
> >
> >  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> >    a procfs mount from an empty pidns so that user namespaced containers
> >    can be nested (without this, the nested containers would fail to
> >    mount procfs). But this requires forking off a helper process because
> >    you cannot just one-shot this using mount(2).
> >
> >  * Container runtimes in general need to fork into a container before
> >    configuring its mounts, which can lead to security issues in the case
> >    of shared-pidns containers (a privileged process in the pidns can
> >    interact with your container runtime process). While
> >    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> >    strict need for this due to a minor uAPI wart is kind of unfortunate.
> >
> > Things would be much easier if there was a way for userspace to just
> > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > which can be set using fsconfig(2):
> >
> >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> >
> > or classic mount(2) / mount(8):
> >
> >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> >
> > The initial security model I have in this RFC is to be as conservative
> > as possible and just mirror the security model for setns(2) -- which
> > means that you can only set pidns=... to pid namespaces that your
> > current pid namespace is a direct ancestor of. This fulfils the
> > requirements of container runtimes, but I suspect that this may be too
> > strict for some usecases.
> >
> > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > what value it would make sense to show (maybe we could just use ns_dname
> > to provide an identifier for the namespace, but this number would be
> > fairly useless to userspace). I'm open to suggestions.
> >
> > In addition, being able to figure out what pid namespace is being used
> > by a procfs mount is quite useful when you have an administrative
> > process (such as a container runtime) which wants to figure out the
> > correct way of mapping PIDs between its own namespace and the namespace
> > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > alternative ways to do this, but they all rely on ancillary information
> > that third-party libraries and tools do not necessarily have access to.
> >
> > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > can be used to get a reference to the pidns that a procfs is using.
> >
> > It's not quite clear what is the correct security model for this API,
> > but the current approach I've taken is to:
> >
> >  * Make the ioctl only valid on the root (meaning that a process without
> >    access to the procfs root -- such as only having an fd to a procfs
> >    file or some open_tree(2)-like subset -- cannot use this API).
> >
> >  * Require that the process requesting either has access to
> >    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
> >    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
> >    access to it and can join it if they had a handle), or is in a pidns
> >    that is a direct ancestor of the target pidns (i.e. all of the pids
> >    are already visible in the procfs for the current process's pidns).
> 
> What's the motivation for the ptrace-read option?  While I don't see
> an attack off the top of my head, it seems like creating a procfs
> mount may give write-ish access to things in the pidns (because the
> creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
> access to namespace-wide things that aren't inherently visible to
> PID1.

This latter section is about the privilege model for
ioctl(PROCFS_GET_PID_NAMESPACE), not the pidns= mount flag. pidns=
requires CAP_SYS_ADMIN for pidns->user_ns, in addition to the same
restrictions as pidns_install() (must be a direct ancestor). Maybe I
should add some headers in this cover letter for v2...

For the ioctl -- if the user can ptrace-read pid1 in the pidns, they can
open a handle to /proc/1/ns/pid which is exactly the same thing they'd
get from PROCFS_GET_PID_NAMESPACE.

> Even the ancestor check seems dicey.  Imagine that uid 1000 makes an
> unprivileged container complete with a userns.  Then uid 1001 (outside
> the container) makes its own userns and mountns but stays in the init
> pidns and then mounts (and owns, with all filesystem-related
> capabilities) that mount.  Is this really safe?

As for the ancestor check (for the ioctl), the logic I had was that
being in an ancestor pidns means that you already can see all of the
subprocesses in your own pidns, so it seems strange to not be able to
get a handle to their pidns. Maybe this isn't quite right, idk.

Ultimately there isn't too much you can do with a pidns fd if you don't
have privileges to join it (the only thing I can think of is that you
could bind-mount it, which could maybe be used to trick an
administrative process if they trusted your mountns for some reason).

> CAP_SYS_ADMIN seems about right.

For pidns=, sure. For the ioctl, I think this is overkill.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

Posted by Aleksa Sarai 6 months, 2 weeks ago

On 2025-07-22, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2025-07-21, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > Ever since the introduction of pid namespaces, procfs has had very
> > > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > > auto-selected based on the mounting process's active pidns, and the
> > > pidns itself is basically hidden once the mount has been constructed).
> > > This has historically meant that userspace was required to do some
> > > special dances in order to configure the pidns of a procfs mount as
> > > desired. Examples include:
> > >
> > >  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> > >    a procfs mount from an empty pidns so that user namespaced containers
> > >    can be nested (without this, the nested containers would fail to
> > >    mount procfs). But this requires forking off a helper process because
> > >    you cannot just one-shot this using mount(2).
> > >
> > >  * Container runtimes in general need to fork into a container before
> > >    configuring its mounts, which can lead to security issues in the case
> > >    of shared-pidns containers (a privileged process in the pidns can
> > >    interact with your container runtime process). While
> > >    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> > >    strict need for this due to a minor uAPI wart is kind of unfortunate.
> > >
> > > Things would be much easier if there was a way for userspace to just
> > > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > > which can be set using fsconfig(2):
> > >
> > >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> > >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> > >
> > > or classic mount(2) / mount(8):
> > >
> > >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> > >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > >
> > > The initial security model I have in this RFC is to be as conservative
> > > as possible and just mirror the security model for setns(2) -- which
> > > means that you can only set pidns=... to pid namespaces that your
> > > current pid namespace is a direct ancestor of. This fulfils the
> > > requirements of container runtimes, but I suspect that this may be too
> > > strict for some usecases.
> > >
> > > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > > what value it would make sense to show (maybe we could just use ns_dname
> > > to provide an identifier for the namespace, but this number would be
> > > fairly useless to userspace). I'm open to suggestions.
> > >
> > > In addition, being able to figure out what pid namespace is being used
> > > by a procfs mount is quite useful when you have an administrative
> > > process (such as a container runtime) which wants to figure out the
> > > correct way of mapping PIDs between its own namespace and the namespace
> > > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > > alternative ways to do this, but they all rely on ancillary information
> > > that third-party libraries and tools do not necessarily have access to.
> > >
> > > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > > can be used to get a reference to the pidns that a procfs is using.
> > >
> > > It's not quite clear what is the correct security model for this API,
> > > but the current approach I've taken is to:
> > >
> > >  * Make the ioctl only valid on the root (meaning that a process without
> > >    access to the procfs root -- such as only having an fd to a procfs
> > >    file or some open_tree(2)-like subset -- cannot use this API).
> > >
> > >  * Require that the process requesting either has access to
> > >    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
> > >    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
> > >    access to it and can join it if they had a handle), or is in a pidns
> > >    that is a direct ancestor of the target pidns (i.e. all of the pids
> > >    are already visible in the procfs for the current process's pidns).
> > 
> > What's the motivation for the ptrace-read option?  While I don't see
> > an attack off the top of my head, it seems like creating a procfs
> > mount may give write-ish access to things in the pidns (because the
> > creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
> > access to namespace-wide things that aren't inherently visible to
> > PID1.
> 
> This latter section is about the privilege model for
> ioctl(PROCFS_GET_PID_NAMESPACE), not the pidns= mount flag. pidns=
> requires CAP_SYS_ADMIN for pidns->user_ns, in addition to the same
> restrictions as pidns_install() (must be a direct ancestor). Maybe I
> should add some headers in this cover letter for v2...
> 
> For the ioctl -- if the user can ptrace-read pid1 in the pidns, they can
> open a handle to /proc/1/ns/pid which is exactly the same thing they'd
> get from PROCFS_GET_PID_NAMESPACE.
> 
> > Even the ancestor check seems dicey.  Imagine that uid 1000 makes an
> > unprivileged container complete with a userns.  Then uid 1001 (outside
> > the container) makes its own userns and mountns but stays in the init
> > pidns and then mounts (and owns, with all filesystem-related
> > capabilities) that mount.  Is this really safe?
> 
> As for the ancestor check (for the ioctl), the logic I had was that
> being in an ancestor pidns means that you already can see all of the
> subprocesses in your own pidns, so it seems strange to not be able to
> get a handle to their pidns. Maybe this isn't quite right, idk.
> 
> Ultimately there isn't too much you can do with a pidns fd if you don't
> have privileges to join it (the only thing I can think of is that you
> could bind-mount it, which could maybe be used to trick an
> administrative process if they trusted your mountns for some reason).
> 
> > CAP_SYS_ADMIN seems about right.
> 
> For pidns=, sure. For the ioctl, I think this is overkill.

My bad, I forgot to add you to Cc for v2 Andy. PTAL:

 <https://lore.kernel.org/all/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cyphar.com/>

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/