[v4] procfs: make reference pidns more user-visible

[PATCH v4 0/4] procfs: make reference pidns more user-visible

Posted by Aleksa Sarai 6 months ago

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).

/* pidns mount option for procfs */

This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates
   a procfs mount from an empty pidns so that user namespaced containers
   can be nested (without this, the nested containers would fail to
   mount procfs). But this requires forking off a helper process because
   you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before
   configuring its mounts, which can lead to security issues in the case
   of shared-pidns containers (a privileged process in the pidns can
   interact with your container runtime process). While
   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
   strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 2 implements a new "pidns" argument
which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
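For reference, the fsconfig(2) variant slots into the new mount API roughly
as sketched below. This is a minimal sketch, not code from the series:
procfs_for_pidns() is a hypothetical helper name, raw syscall(2) is used in
case libc has no wrappers, and it assumes a kernel that carries the "pidns"
option (on any other kernel the FSCONFIG_SET_FD step is expected to fail).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>   /* FSOPEN_CLOEXEC, FSCONFIG_*, FSMOUNT_CLOEXEC */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a detached procfs mount whose pidns is the
 * namespace referenced by nsfd. Returns a mount fd, or -1 with errno set.
 */
static int procfs_for_pidns(int nsfd)
{
	int fsfd, mntfd = -1, saved_errno;

	fsfd = (int)syscall(SYS_fsopen, "proc", FSOPEN_CLOEXEC);
	if (fsfd < 0)
		return -1;

	/* "pidns" must be set before the superblock is created. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd) == 0 &&
	    syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) == 0)
		mntfd = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);

	/* Preserve the errno of the failing step across close(). */
	saved_errno = errno;
	close(fsfd);
	errno = saved_errno;
	return mntfd;
}
```

The returned fd can be attached somewhere with move_mount(2), or used
directly as a dirfd for openat(2) without ever being attached.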

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.
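The ancestry half of that rule can be pictured with a small userspace model.
This is only a sketch of the idea behind the is-ancestor helper, not the
kernel code: struct toy_pidns and toy_pidns_is_ancestor() are made-up names,
and treating a namespace as its own ancestor is an assumption (it matches
the pidns=/proc/self/ns/pid example above being valid).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pid namespace: only what the walk needs. */
struct toy_pidns {
	unsigned int level;              /* 0 for the initial pid namespace */
	const struct toy_pidns *parent;  /* NULL for the initial pid namespace */
};

/*
 * Walk up from child until we reach the candidate ancestor's level, then
 * compare. Assumption: a namespace counts as an ancestor of itself.
 */
static bool toy_pidns_is_ancestor(const struct toy_pidns *child,
				  const struct toy_pidns *ancestor)
{
	const struct toy_pidns *ns = child;

	while (ns && ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}
```

Because every namespace records its depth, the walk never has to go past the
candidate ancestor's level, so siblings and unrelated subtrees are rejected
without visiting the root.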

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.

Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.

/* ioctl(PROCFS_GET_PID_NAMESPACE) */

In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.
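As an illustration of the PID mapping step, the TGID variants of those nsfs
ioctls can be wrapped as below. This is a hedged sketch: the wrapper names
are hypothetical, the fallback ioctl numbers are what I believe recent
<linux/nsfs.h> defines, and the translation directions in the comments are
my understanding of the interface rather than something this series states.

```c
#include <linux/nsfs.h>   /* NSIO; NS_GET_*_PIDNS on recent headers */
#include <sys/ioctl.h>
#include <sys/types.h>

/* Fallbacks for older headers (assumed values, see lead-in). */
#ifndef NS_GET_TGID_FROM_PIDNS
#define NS_GET_TGID_FROM_PIDNS _IOR(NSIO, 0x7, int)
#endif
#ifndef NS_GET_TGID_IN_PIDNS
#define NS_GET_TGID_IN_PIDNS _IOR(NSIO, 0x9, int)
#endif

/* TGID as seen by the caller -> TGID inside the pidns behind nsfd. */
static pid_t tgid_in_pidns(int nsfd, pid_t tgid)
{
	/* The pid is passed by value; the result is the ioctl return value. */
	return ioctl(nsfd, NS_GET_TGID_IN_PIDNS, tgid);
}

/* TGID inside the pidns behind nsfd -> TGID as seen by the caller. */
static pid_t tgid_from_pidns(int nsfd, pid_t tgid)
{
	return ioctl(nsfd, NS_GET_TGID_FROM_PIDNS, tgid);
}
```

Both wrappers return -1 with errno set on failure, for example when the
process is not visible in the destination namespace or the kernel predates
these ioctls.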

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.
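From userspace, using the proposed ioctl would look roughly like this. It is
a sketch under assumptions: the request number follows the
_IO(PROCFS_IOCTL_MAGIC, 32) value mentioned in the v4 changelog, the 'f'
fallback for PROCFS_IOCTL_MAGIC assumes the value in <linux/fs.h>, the
helper name is hypothetical, and returning the new fd as the ioctl result
is assumed rather than confirmed here.

```c
#include <linux/fs.h>     /* PROCFS_IOCTL_MAGIC on recent headers */
#include <sys/ioctl.h>

#ifndef PROCFS_IOCTL_MAGIC
#define PROCFS_IOCTL_MAGIC 'f'   /* assumption: matches <linux/fs.h> */
#endif
#ifndef PROCFS_GET_PID_NAMESPACE
/* Proposed in this series; not necessarily present in any released kernel. */
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 32)
#endif

/*
 * Hypothetical helper: procfd must refer to the root of a procfs mount.
 * Returns a pidns fd on success (assumed), or -1 with errno set.
 */
static int procfs_get_pidns(int procfd)
{
	return ioctl(procfd, PROCFS_GET_PID_NAMESPACE);
}
```

On a kernel without the ioctl the call should fail (most likely with
ENOTTY), so callers need a fallback path either way.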

Rather than copying the (fairly strict) security model for setns(2),
apply a slightly looser model to better match what userspace can already
do:

 * Make the ioctl only valid on the root (meaning that a process without
   access to the procfs root -- such as only having an fd to a procfs
   file or some open_tree(2)-like subset -- cannot use this API). This
   means that the process already has some level of access to the
   /proc/$pid directories.

 * If the calling process is in an ancestor pidns, then they can already
   create pidfds for processes inside the pidns, which is morally
   equivalent to a pidns file descriptor according to setns(2). So it
   seems reasonable to just allow it in this case. (The justification
   for this model was suggested by Christian.)

 * If the process has access to /proc/1/ns/pid already (i.e. has
   ptrace-read access to the pidns pid1), then this ioctl is equivalent
   to just opening a handle to it that way.

   Ideally we would check for ptrace-read access against all processes
   in the pidns (which is very likely to be true for at least one
   process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
   by most programs), but this would obviously not scale.

I'm open to suggestions for whether we need to make this stricter (or
possibly allow more cases).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v4:
- Remove unneeded EXPORT_SYMBOL_GPL. [Christian Brauner]
- Return -EOPNOTSUPP for new APIs for CONFIG_PID_NS=n rather than
  pretending they don't exist entirely. [Christian Brauner]
- PROCFS_IOCTL_MAGIC conflicts with XSDFEC_MAGIC, so we need to allocate
  subvalues more carefully (switch to _IO(PROCFS_IOCTL_MAGIC, 32)).
- Add some more selftests for PROCFS_GET_PID_NAMESPACE.
- Reword argument for PROCFS_GET_PID_NAMESPACE security model based on
  Christian's suggestion, and remove CAP_SYS_ADMIN edge-case (in most
  cases, such a process would also have ptrace-read credentials over the
  pidns pid1).
- v3: <https://lore.kernel.org/r/20250724-procfs-pidns-api-v3-0-4c685c910923@cyphar.com>

Changes in v3:
- Disallow changing pidns for existing procfs instances, as we'd
  probably have to RCU-protect everything that touches the pinned pidns
  reference.
- Improve tests with slightly nicer ASSERT_ERRNO* macros.
- v2: <https://lore.kernel.org/r/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cyphar.com>

Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two
  separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cyphar.com>

---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add "pidns" mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  12 ++
 fs/proc/root.c                            | 166 +++++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   4 +
 kernel/pid_namespace.c                    |  22 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 315 ++++++++++++++++++++++++++++++
 8 files changed, 514 insertions(+), 16 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>

Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible

Posted by Christian Brauner 5 months, 1 week ago

On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> [...]

Thanks for the patchset. Being able to specify what pid namespace the
procfs instance is supposed to belong to is super useful and will make
things easier for userspace for sure.

The code you added contains a minor wrinkle that I disliked, so I've
changed it; tell me whether you can live with this restriction or not.

The way you've implemented it, specifying a pid namespace that the caller
holds privilege over would also silently override the user namespace the
filesystem is supposed to belong to.

Specifically, you did something like:

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);

This silently overrides the user namespace recorded at fsopen() time. I
think that's too subtle and we should just not allow that at all for
now.

Instead I've changed this to:

        if (fc->user_ns != target->user_ns)
                return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);

so we just refuse different ownership.

I've also dropped the procfs ioctl because I'm not sure how much value
it will actually add given that you can do this via /proc/1/ns/pid.

If that is something that libpathrs desperately needs I would like to
do it as a separate patch anyways.
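The /proc/1/ns/pid route amounts to a single openat(2) relative to the
procfs root. A minimal sketch, with a hypothetical helper name; note that
opening the link requires ptrace-read access to the pidns pid1:

```c
#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Hypothetical helper: procfd is an fd for the root of a procfs mount.
 * Returns an fd for the pid namespace of that mount's pid1, or -1 with
 * errno set (for instance when the caller lacks ptrace-read access to
 * pid1, or when the namespace has no pid1).
 */
static int pidns_via_pid1(int procfd)
{
	return openat(procfd, "1/ns/pid", O_RDONLY | O_CLOEXEC);
}
```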

Thanks for the excellent cover letter. This was a pleasure to merge!

Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible

Posted by Aleksa Sarai 5 months ago

On 2025-09-02, Christian Brauner <brauner@kernel.org> wrote:
> On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> > Ever since the introduction of pid namespaces, procfs has had very
> > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > auto-selected based on the mounting process's active pidns, and the
> > pidns itself is basically hidden once the mount has been constructed).
> > 
> > /* pidns mount option for procfs */
> > 
> > [...]
> 
> Thanks for the patchset. Being able to specify what pid namespace the
> procfs instance is supposed to belong to is super useful and will make
> things easier for userspace for sure.

I was going to send a new version changing the whole thing to be struct
path based (and adding FSCONFIG_SET_PATH{,_EMPTY} support) so we don't
need to allocate a file explicitly for the non-FSCONFIG_SET_FD case, but
we can do that as a follow-up I guess.

> The code you added contains a minor wrinkle that I disliked which I've
> changed and you tell me if you can live with this restriction or not.
> 
> The way you've implemented it specifying a pid namespace that the caller
> holds privilege over would silently also override the user namespace the
> filesystem is supposed to belong to.
> 
> Specifically, you did something like:
> 
>         put_pid_ns(ctx->pid_ns);
>         ctx->pid_ns = get_pid_ns(target);
>         put_user_ns(fc->user_ns);
>         fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> 
> This silently overrides the user namespace recorded at fsopen() time. I
> think that's too subtle and we should just not allow that at all for
> now.
> 
> Instead I've changed this to:
> 
>         if (fc->user_ns != target->user_ns)
>                 return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");
> 
>         put_pid_ns(ctx->pid_ns);
>         ctx->pid_ns = get_pid_ns(target);
> 
> so we just refuse different ownership.

That sounds fine, I wasn't quite sure what to do with fc->user_ns to be
honest. Being more conservative is probably the right call here.

> I've also dropped the procfs ioctl because I'm not sure how much value
> it will actually add given that you can do this via /proc/1/ns/pid.
> 
> If that is something that libpathrs desperately needs I would like to
> do it as a separate patch anyways.

The main issues are:

1. pid1 can often be non-dumpable, which can block you from doing that.
   In principle, because the dumpable flag is reset on execve, it is
   theoretically possible to get access to /proc/$pid/ns/pid if you win
   the race in a pid namespace with lots of process activity, but this
   kind of sucks.

2. This approach doesn't work for empty pid namespaces.
   pidns_for_children doesn't let you get a handle to an empty pid
   namespace either (I briefly looked at the history and it seems this
   was silently changed in v2 of the patchset based on some feedback
   that I'm not sure was entirely correct).

3. Now that you can configure the procfs mount, it seems like a
   half-baked interface to not provide diagnostic information about the
   namespace. (I suspect the criu folks would be happy to have this too
   ;).)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible

Posted by Christian Brauner 4 months, 3 weeks ago

> The main issues are:
> 
> 1. pid1 can often be non-dumpable, which can block you from doing that.
>    In principle, because the dumpable flag is reset on execve, it is
>    theoretically possible to get access to /proc/$pid/ns/pid if you win
>    the race in a pid namespace with lots of process activity, but this
>    kind of sucks.
> 
> 2. This approach doesn't work for empty pid namespaces.
>    pidns_for_children doesn't let you get a handle to an empty pid
>    namespace either (I briefly looked at the history and it seems this
>    was silently changed in v2 of the patchset based on some feedback
>    that I'm not sure was entirely correct).
> 
> 3. Now that you can configure the procfs mount, it seems like a
>    half-baked interface to not provide diagnostic information about the
>    namespace. (I suspect the criu folks would be happy to have this too
>    ;).)

I think the easiest would be to add an ioctl that returns a pid
namespace based on a procfs root if the caller is located in the pid
namespace of the procfs instance (like
current_in_namespace(proc->pid_ns)) or if the caller is privileged over
the owning ns. That would be simple and doesn't need to involve any
ptrace.

Re: (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible

Posted by Christian Brauner 5 months, 1 week ago

On Tue, 05 Aug 2025 15:45:07 +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> [...]

Applied to the vfs-6.18.procfs branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.procfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.procfs

[1/4] pidns: move is-ancestor logic to helper
      https://git.kernel.org/vfs/vfs/c/60d22c6ef41b
[2/4] procfs: add "pidns" mount option
      https://git.kernel.org/vfs/vfs/c/77e211dd1392
[4/4] selftests/proc: add tests for new pidns APIs
      https://git.kernel.org/vfs/vfs/c/568d4239002c