/proc has historically had very opaque semantics about PID namespaces,
which is a little unfortunate for container runtimes and other programs
that deal with switching namespaces very often. One common issue is that
of converting between PIDs in the process's namespace and PIDs in the
namespace of /proc.

In principle, it is possible to do this today by opening a pidfd with
pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
contain a PID value translated to the pid namespace associated with that
procfs superblock). However, allocating a new file for each PID to be
converted is less than ideal for programs that may need to scan procfs,
and it is generally useful for userspace to be able to get this
information directly from procfs.
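
For illustration, the pidfd-based workaround described above could look roughly like the following sketch. The helper names parse_fdinfo_pid() and translate_pid_via_fdinfo() are hypothetical and not part of the patch. The sketch assumes that the translated value is the one reported in the "Pid:" field of the pidfd's fdinfo, and that "self" resolves inside the target procfs instance.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

/*
 * Parse the "Pid:" field out of the text of a pidfd's fdinfo file.
 * Returns the parsed value, or -1 if the field is missing or malformed.
 * (Hypothetical helper, for illustration only.)
 */
static long parse_fdinfo_pid(const char *fdinfo)
{
	const char *line = fdinfo;

	assert(fdinfo != NULL);
	while (line && *line) {
		if (strncmp(line, "Pid:", 4) == 0) {
			char *end;
			long pid = strtol(line + 4, &end, 10);

			/* No digits consumed means the field was malformed. */
			return end == line + 4 ? -1 : pid;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Translate `pid` (as seen by the caller) into the pid namespace of the
 * procfs instance mounted at `procfs`, at the cost of one pidfd per lookup.
 * Returns -1 on any failure. (Hypothetical helper, for illustration only.)
 */
static long translate_pid_via_fdinfo(const char *procfs, pid_t pid)
{
	char path[4096], buf[4096];
	long result = -1;
	size_t len;
	FILE *f;
	int pidfd;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/self/fdinfo/%d", procfs, pidfd);
	f = fopen(path, "r");
	if (f) {
		len = fread(buf, 1, sizeof(buf) - 1, f);
		buf[len] = '\0';
		result = parse_fdinfo_pid(buf);
		fclose(f);
	}
	close(pidfd);
	return result;
}
```

Every lookup here opens a pidfd and an fdinfo file, which is the per-PID allocation cost the paragraph above objects to.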

So, add a new API for this in the form of an ioctl(2) you can call on
the root directory of procfs. The ioctl returns a file descriptor
referring to the pid namespace used by that procfs instance, and the
returned file descriptor will have O_CLOEXEC set. This acts as a sister
feature to the new "pidns" mount option, finally allowing userspace full
control of the pid namespaces associated with procfs instances.
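
From userspace, calling the ioctl described above might look like the following sketch. The wrapper name procfs_get_pidns() is hypothetical. The fallback definitions are copied from this patch on the assumption that the installed UAPI headers do not provide them yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Fallbacks copied from the patch; assumed to be absent from older headers. */
#ifndef PROCFS_IOCTL_MAGIC
#define PROCFS_IOCTL_MAGIC 'f'
#endif
#ifndef PROCFS_GET_PID_NAMESPACE
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 1)
#endif

/*
 * Open the root directory of the procfs instance at `procfs_root` and ask
 * for a handle to its pid namespace. Returns a new file descriptor on
 * success, or -1 with errno set. (Hypothetical wrapper, for illustration.)
 */
static int procfs_get_pidns(const char *procfs_root)
{
	int rootfd, nsfd, saved_errno;

	assert(procfs_root != NULL);
	rootfd = open(procfs_root, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (rootfd < 0)
		return -1;

	/* The ioctl takes no argument; its return value is the new fd. */
	nsfd = ioctl(rootfd, PROCFS_GET_PID_NAMESPACE);
	saved_errno = errno;
	close(rootfd);
	errno = saved_errno;
	return nsfd;
}
```

On a kernel without this patch the call is expected to fail with ENOTTY, so callers would probably want to treat that errno as "not supported" and fall back to another method.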

The permission model for this is a bit looser than that of the "pidns"
mount option, but this is mainly because /proc/1/ns/pid provides the
same information, so as long as you have access to that magic-link (or
something equivalently reasonable such as CAP_SYS_ADMIN privileges
or being in an ancestor pid namespace) it makes sense to allow userspace
to grab a handle. setns(2) will still apply its own permission checks,
so being able to open a pidns handle doesn't really provide too many
other capabilities.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Documentation/filesystems/proc.rst | 4 +++
fs/proc/root.c | 54 ++++++++++++++++++++++++++++++++++++--
include/uapi/linux/fs.h | 3 +++
3 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c520b9f8a3fd..506383273c9d 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
will be used by the procfs instance when translating pids. By default, procfs
will use the calling process's active pid namespace.
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
+
Chapter 5: Filesystem behavior
==============================
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 057c8a125c6e..548a57ec2152 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
#include <linux/cred.h>
#include <linux/magic.h>
#include <linux/slab.h>
+#include <linux/ptrace.h>
#include "internal.h"
+#include "../internal.h"
struct proc_fs_context {
struct pid_namespace *pid_ns;
@@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
return proc_pid_readdir(file, ctx);
}
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+#ifdef CONFIG_PID_NS
+	case PROCFS_GET_PID_NAMESPACE: {
+		struct pid_namespace *active = task_active_pid_ns(current);
+		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+		bool can_access_pidns = false;
+
+		/*
+		 * If we are in an ancestor of the pidns, or have join
+		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
+		 * would be able to grab a handle to the pidns.
+		 *
+		 * Otherwise, if there is a root process, then being able to
+		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
+		 * we should probably match the permission model. For empty
+		 * namespaces it seems unlikely for there to be a downside to
+		 * allowing unprivileged users to open a handle to it (setns
+		 * will fail for unprivileged users anyway).
+		 */
+		can_access_pidns = pidns_is_ancestor(ns, active) ||
+				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
+		if (!can_access_pidns) {
+			bool cannot_ptrace_pid1 = false;
+
+			read_lock(&tasklist_lock);
+			if (ns->child_reaper)
+				cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
+									PTRACE_MODE_READ_FSCREDS);
+			read_unlock(&tasklist_lock);
+			can_access_pidns = !cannot_ptrace_pid1;
+		}
+		if (!can_access_pidns)
+			return -EPERM;
+
+		/* open_namespace() unconditionally consumes the reference. */
+		get_pid_ns(ns);
+		return open_namespace(to_ns_common(ns));
+	}
+#endif /* CONFIG_PID_NS */
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
/*
* The root /proc directory is special, as it has the
* <pid> directories. Thus we don't use the generic
* directory handling functions for that..
*/
static const struct file_operations proc_root_operations = {
- .read = generic_read_dir,
- .iterate_shared = proc_root_readdir,
+ .read = generic_read_dir,
+ .iterate_shared = proc_root_readdir,
.llseek = generic_file_llseek,
+ .unlocked_ioctl = proc_root_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
};
/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..aa642cb48feb 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
#define PROCFS_IOCTL_MAGIC 'f'
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 1)
+
/* Pagemap ioctl */
#define PAGEMAP_SCAN _IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
--
2.50.0
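
As a reading aid for the permission check in proc_root_ioctl() above, here is a small userspace model of the decision. It is a sketch only: struct pidns_access and the function names are hypothetical, each boolean stands in for the kernel expression named in its comment, and the model follows the intent stated in the code comment (a caller that may ptrace pid1 is allowed). The second function models the stricter combination that mirrors pidns_install(), which comes up in the review of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the inputs to the permission check.
 * Each field stands in for the kernel expression named in its comment.
 */
struct pidns_access {
	bool active_is_ancestor; /* pidns_is_ancestor(ns, active) */
	bool has_sys_admin;      /* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool ns_has_pid1;        /* ns->child_reaper != NULL */
	bool can_ptrace_pid1;    /* ptrace_may_access(pid1, PTRACE_MODE_READ_FSCREDS) */
};

/* Fallback used by both variants: empty namespaces are always allowed. */
static bool ptrace_fallback(const struct pidns_access *in)
{
	assert(in != NULL);
	return !in->ns_has_pid1 || in->can_ptrace_pid1;
}

/* Looser variant: ancestry OR capability, then the ptrace fallback. */
static bool allowed_as_posted(const struct pidns_access *in)
{
	return in->active_is_ancestor || in->has_sys_admin ||
	       ptrace_fallback(in);
}

/* Stricter variant: capability AND ancestry, then the ptrace fallback. */
static bool allowed_as_suggested(const struct pidns_access *in)
{
	return (in->has_sys_admin && in->active_is_ancestor) ||
	       ptrace_fallback(in);
}
```

The two variants only disagree when the namespace has a pid1 that the caller cannot ptrace and the caller satisfies exactly one of the ancestry and capability conditions.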

On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> [...]
> +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);

This seems to imply that if @ns is a descendant of @active that the
caller holds privileges over it. Is that actually always true?

IOW, why is the check different from the previous pidns= mount option
check? I would've expected:

ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)

and then the ptrace check as a fallback.

On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > [...]
> > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
>
> This seems to imply that if @ns is a descendant of @active that the
> caller holds privileges over it. Is that actually always true?
>
> IOW, why is the check different from the previous pidns= mount option
> check? I would've expected:
>
> ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
>
> and then the ptrace check as a fallback.

That would mirror pidns_install(), and I did think about it. The primary
(mostly handwave-y) reasoning I had for making it less strict was that:

 * If you are in an ancestor pidns, then you can already see those
   processes in your own /proc. In theory that means that you will be
   able to access /proc/$pid/ns/pid for at least some subprocess there
   (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
   cleared on ).

   Though hypothetically if they are all running as a different user,
   this does not apply (and you could create scenarios where a child
   pidns is owned by a userns that you do not have privileges over -- if
   you deal with setuid binaries). Maybe that risk means we should just
   combine them, I'm not sure.

 * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
   strange to disallow access even if it is not in an ancestor
   namespace. This is distinct from pidns_install(), where you want to
   ensure you cannot escape to a parent pid namespace; this is about
   getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).

Maybe they should be combined to match pidns_install(), but then I would
expect the ptrace_may_access() check to apply to all processes in the
pidns to make it less restrictive, which is not something you can
practically do (and there is a higher chance that pid1 will have
SUID_DUMP_DISABLE than some random subprocess, which almost certainly
will not be SUID_DUMP_DISABLE).

Fundamentally, I guess I'm still trying to see what the risk is of
allowing a process to get a handle to a pidns that they have some kind
of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being
able to see and address all processes in the namespace, or by being able
to open /proc/$pidns_pid1/ns/pid anyway) but cannot join.

Then again, maybe the fact that it is kind of strange to explain is
enough of a reason to just make it simpler...

> > [...]

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> [...]
> Fundamentally, I guess I'm still trying to see what the risk is of
> allowing a process to get a handle to a pidns that they have some kind
> of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being

There shouldn't be. For example, you kinda implicitly do that with a
pidfd, no? Because you can pass the pidfd to setns() instead of a
namespace fd itself. Maybe that's the argument you're looking for?

On 2025-07-31, Christian Brauner <brauner@kernel.org> wrote:
> On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> > [...]
> > Fundamentally, I guess I'm still trying to see what the risk is of
> > allowing a process to get a handle to a pidns that they have some kind
> > of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being
>
> There shouldn't be. For example, you kinda implicitly do that with a
> pidfd, no? Because you can pass the pidfd to setns() instead of a
> namespace fd itself. Maybe that's the argument you're looking for?

That argument works for me! I'll rewrite the commit message to make sure
it sounds like I came up with it. ;)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
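
The pidfd argument from the review can be illustrated with a short sketch: a pidfd obtained with pidfd_open(2) can be handed to setns(2) in place of a namespace file descriptor (this assumes a kernel new enough to accept pidfds in setns(2)). The function name join_pidns_via_pidfd() is hypothetical, and setns(2) still applies its own permission checks, so the call is expected to fail for callers that lack the required privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

/*
 * Ask the kernel to use the pid namespace of process `pid` for children
 * created by the caller from now on, using only a pidfd as the handle.
 * Returns 0 on success, or -1 with errno set.
 * (Hypothetical helper, for illustration only.)
 */
static int join_pidns_via_pidfd(pid_t pid)
{
	int pidfd, ret, saved_errno;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* The pidfd stands in for a /proc/<pid>/ns/pid file descriptor. */
	ret = setns(pidfd, CLONE_NEWPID);
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

Since holding a pidfd already lets a caller attempt setns(2) on the target's pid namespace, handing out a pidns handle through the procfs ioctl does not appear to add a new way to join a namespace.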