[v4] procfs: make reference pidns more user-visible

[PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl

Posted by Aleksa Sarai 6 months, 1 week ago

/proc has historically had very opaque semantics about PID namespaces,
which is a little unfortunate for container runtimes and other programs
that deal with switching namespaces very often. One common issue is that
of converting between PIDs in the process's namespace and PIDs in the
namespace of /proc.

In principle, it is possible to do this today by opening a pidfd with
pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
contain a PID value translated to the pid namespace associated with that
procfs superblock). However, allocating a new file for each PID to be
converted is less than ideal for programs that may need to scan procfs,
and it is generally useful for userspace to be able to finally get this
information from procfs.
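
For reference, the existing workaround described above can be sketched in
userspace roughly as follows. This is only an illustration, not code from this
series: the helper names parse_fdinfo_pid() and translate_pid() are made up for
the example, and it assumes that the installed kernel headers define
SYS_pidfd_open and that "self" resolves for the caller in the procfs mounted at
procroot.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Parse the "Pid:" field out of the text of a pidfd's fdinfo file.
 * Returns the value of that field, or -1 if the field is missing.
 */
static long parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	while (line && *line) {
		long pid;

		/* "NSpid:" and other fields do not match the literal "Pid:". */
		if (sscanf(line, "Pid: %ld", &pid) == 1)
			return pid;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Translate @pid (as seen by the caller) into the pid namespace of the
 * procfs mounted at @procroot. Costs one pidfd plus one fdinfo open per
 * PID, which is the overhead the commit message complains about.
 */
static long translate_pid(const char *procroot, pid_t pid)
{
	char path[4096], buf[4096];
	ssize_t n;
	int pidfd, fd;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/self/fdinfo/%d", procroot, pidfd);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0) {
		close(pidfd);
		return -1;
	}
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	close(pidfd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return parse_fdinfo_pid(buf);
}
```

Something like translate_pid("/proc", getpid()) would be the typical call; a
scanner doing this for every entry in procfs pays two extra file allocations
per PID.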

So, add a new API to get the pid namespace of a procfs instance, in the
form of an ioctl(2) you can call on the root directory of said procfs.
The returned file descriptor will have O_CLOEXEC set. This acts as a
sister feature to the new "pidns" mount option, finally allowing
userspace full control of the pid namespaces associated with procfs
instances.
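
From userspace, calling the new ioctl might look roughly like the sketch below.
The PROCFS_GET_PID_NAMESPACE value is the one proposed by this patch and is
defined locally because it is assumed not to be present in installed uapi
headers; procfs_get_pidns() is a hypothetical helper name used only for this
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumption: values as proposed in this patch, not from released headers. */
#ifndef PROCFS_GET_PID_NAMESPACE
#define PROCFS_IOCTL_MAGIC 'f'
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 32)
#endif

/*
 * Open the root directory of the procfs instance at @procroot and ask
 * for a handle to its pid namespace. Returns a namespace file
 * descriptor on success, or a negative errno value on failure.
 */
static int procfs_get_pidns(const char *procroot)
{
	int procfd, nsfd, err;

	procfd = open(procroot, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (procfd < 0)
		return -errno;

	nsfd = ioctl(procfd, PROCFS_GET_PID_NAMESPACE);
	err = errno;		/* save before close() can clobber it */
	close(procfd);

	return nsfd < 0 ? -err : nsfd;
}
```

On a kernel without this patch the call should fail with -ENOTTY, and with the
patch applied it returns -EPERM when the permission check described below is
not satisfied.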

The permission model for this is a bit looser than that of the "pidns"
mount option (and also setns(2)) because /proc/1/ns/pid provides the
same information, so as long as you have access to that magic-link (or
something equivalently reasonable such as being in an ancestor pid
namespace) it makes sense to allow userspace to grab a handle. Ideally
we would check for ptrace-read access against all processes in the pidns
(which is very likely to be true for at least one process, as
SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
programs), but this would obviously not scale.
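
To illustrate the equivalence with /proc/1/ns/pid mentioned above, here is a
small userspace sketch that opens that magic-link and compares two namespace
file descriptors by device and inode number, which is the comparison documented
in namespaces(7). The helper names open_pid1_pidns() and same_namespace() are
hypothetical and exist only for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Open the pidns magic-link of PID 1 in the procfs mounted at @procroot.
 * Returns a file descriptor, or -1 with errno set (for example EACCES
 * when the caller lacks ptrace-read access to that process).
 */
static int open_pid1_pidns(const char *procroot)
{
	char path[4096];

	snprintf(path, sizeof(path), "%s/1/ns/pid", procroot);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Two namespace fds refer to the same namespace if their device and
 * inode numbers match. Returns 1 if they match, 0 if they differ, and
 * -1 if either fstat() call fails.
 */
static int same_namespace(int fd_a, int fd_b)
{
	struct stat a, b;

	if (fstat(fd_a, &a) < 0 || fstat(fd_b, &b) < 0)
		return -1;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

With the ioctl from this patch, the descriptor it returns for a given procfs
root would be expected to compare equal to open_pid1_pidns() on the same root,
as long as the namespace still has a PID 1.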

setns(2) will still have its own permission checks, so being able to
open a pidns handle doesn't really provide too many other capabilities.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  4 +++
 fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h            |  4 +++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5a157dadea0b..840f820fb467 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2400,6 +2400,10 @@ will use the calling process's active pid namespace. Note that the pid
 namespace of an existing procfs instance cannot be modified (attempting to do
 so will give an `-EBUSY` error).
 
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index fd1f1c8a939a..ac9b115fad7b 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
 #include <linux/cred.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/ptrace.h>
 
 #include "internal.h"
+#include "../internal.h"
 
 struct proc_fs_context {
 	struct pid_namespace	*pid_ns;
@@ -426,15 +428,77 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 	return proc_pid_readdir(file, ctx);
 }
 
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case PROCFS_GET_PID_NAMESPACE: {
+#ifdef CONFIG_PID_NS
+		struct pid_namespace *active = task_active_pid_ns(current);
+		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+		bool can_access_pidns = false;
+
+		/*
+		 * Having a handle to a pidns is not sufficient to do anything
+		 * particularly harmful, as setns(2) has its own separate
+		 * privilege checks. So, we can loosen the privilege
+		 * requirements here a little to make this more ergonomic.
+		 *
+		 * If we are in an ancestor pidns of the pidns, then we can
+		 * already address any process in the pidns. From a setns(2)
+		 * privileges perspective, we can create a pidfd which setns(2)
+		 * would also accept (pending any privilege checks).
+		 *
+		 * If we are not in an ancestor pidns, because this operation
+		 * is being done on the root of the /proc instance, the caller
+		 * can try to access /proc/1/ns/pid which is equivalent to this
+		 * ioctl and so we should copy the PTRACE_MODE_READ_FSCREDS
+		 * permission model used by proc_ns_get_link(). Ideally we would
+		 * check for ptrace-read access against all processes in the
+		 * pidns (which is very likely to be true for at least one
+		 * process, as SUID_DUMP_DISABLE is cleared on exec(2) and is
+		 * rarely set by most programs), but this would obviously not
+		 * scale.
+		 *
+		 * If there is no root process, then there is no real downside
+		 * to letting unprivileged users open a handle to it.
+		 */
+		can_access_pidns = pidns_is_ancestor(ns, active);
+		if (!can_access_pidns) {
+			bool cannot_ptrace_pid1 = false;
+
+			read_lock(&tasklist_lock);
+			if (ns->child_reaper)
+				cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
+									PTRACE_MODE_READ_FSCREDS);
+			read_unlock(&tasklist_lock);
+			can_access_pidns = !cannot_ptrace_pid1;
+		}
+		if (!can_access_pidns)
+			return -EPERM;
+
+		/* open_namespace() unconditionally consumes the reference. */
+		get_pid_ns(ns);
+		return open_namespace(to_ns_common(ns));
+#else
+		return -EOPNOTSUPP;
+#endif
+	}
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
 /*
  * The root /proc directory is special, as it has the
  * <pid> directories. Thus we don't use the generic
  * directory handling functions for that..
  */
 static const struct file_operations proc_root_operations = {
-	.read		 = generic_read_dir,
-	.iterate_shared	 = proc_root_readdir,
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_root_readdir,
 	.llseek		= generic_file_llseek,
+	.unlocked_ioctl = proc_root_ioctl,
+	.compat_ioctl   = compat_ptr_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..68e65e6d7d6b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
 			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
 			 RWF_DONTCACHE)
 
+/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
 #define PROCFS_IOCTL_MAGIC 'f'
 
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
+
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
 

-- 
2.50.1

Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl

Posted by Randy Dunlap 6 months, 1 week ago


On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 0bd678a4a10e..68e65e6d7d6b 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
>  			 RWF_DONTCACHE)
>  
> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
>  #define PROCFS_IOCTL_MAGIC 'f'
>  
> +/* procfs root ioctls */
> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)

Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
should be updated like:

-'f'   00-0F  linux/fs.h                                                conflict!
+'f'   00-1F  linux/fs.h                                                conflict!

(17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
have updated the Doc/rst file.)

> +
>  /* Pagemap ioctl */
>  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
>  
> 
Thanks.
-- 
~Randy

Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl

Posted by Aleksa Sarai 6 months ago

On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
> 
> 
> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 0bd678a4a10e..68e65e6d7d6b 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
> >  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
> >  			 RWF_DONTCACHE)
> >  
> > +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
> >  #define PROCFS_IOCTL_MAGIC 'f'
> >  
> > +/* procfs root ioctls */
> > +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
> 
> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
> should be updated like:
> 
> -'f'   00-0F  linux/fs.h                                                conflict!
> +'f'   00-1F  linux/fs.h                                                conflict!

Should this be 00-20 (or 00-2F) instead?

Also, is there a better value to use for this new ioctl? I'm not quite
sure what is the best practice to handle these kinds of conflicts...

> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
> have updated the Doc/rst file.)
> 
> > +
> >  /* Pagemap ioctl */
> >  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
> >  
> > 
> Thanks.
> -- 
> ~Randy
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl

Posted by Randy Dunlap 6 months ago


On 8/6/25 11:02 AM, Aleksa Sarai wrote:
> On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
>>
>>
>> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
>>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>>> index 0bd678a4a10e..68e65e6d7d6b 100644
>>> --- a/include/uapi/linux/fs.h
>>> +++ b/include/uapi/linux/fs.h
>>> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
>>>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
>>>  			 RWF_DONTCACHE)
>>>  
>>> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
>>>  #define PROCFS_IOCTL_MAGIC 'f'
>>>  
>>> +/* procfs root ioctls */
>>> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
>>
>> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
>> should be updated like:
>>
>> -'f'   00-0F  linux/fs.h                                                conflict!
>> +'f'   00-1F  linux/fs.h                                                conflict!
> 
> Should this be 00-20 (or 00-2F) instead?

Oops, yes, it should be one of those. Thanks.
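
The arithmetic behind this correction can be checked with a few lines of C:
the sequence number 32 is 0x20, so a documented range of 00-1F stops one short
of it. This is only a sketch; ioctl_nr_in_range() is a hypothetical helper
written for this example, and the macro values are copied from the patch.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Values as proposed in the patch under discussion. */
#define PROCFS_IOCTL_MAGIC 'f'
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 32)

/*
 * Return 1 if the sequence number encoded in @cmd lies within the
 * inclusive range [lo, hi], as the ranges in ioctl-number.rst are
 * written, and 0 otherwise.
 */
static int ioctl_nr_in_range(unsigned int cmd, unsigned int lo, unsigned int hi)
{
	unsigned int nr = _IOC_NR(cmd);

	return nr >= lo && nr <= hi;
}
```

Checking PROCFS_GET_PID_NAMESPACE against 0x00-0x1F gives 0, while 0x00-0x20
and 0x00-0x2F both give 1, which matches the two alternatives suggested above.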

> Also, is there a better value to use for this new ioctl? I'm not quite
> sure what is the best practice to handle these kinds of conflicts...

I wouldn't worry about it. We have *many* conflicts.
(unless Al or Christian are concerned)

>> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
>> have updated the Doc/rst file.)
>>
>>> +
>>>  /* Pagemap ioctl */
>>>  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)

-- 
~Randy

Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl

Posted by Christian Brauner 6 months ago

On Wed, Aug 06, 2025 at 11:57:42AM -0700, Randy Dunlap wrote:
> 
> 
> On 8/6/25 11:02 AM, Aleksa Sarai wrote:
> > On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
> >>
> >>
> >> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> >>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> >>> index 0bd678a4a10e..68e65e6d7d6b 100644
> >>> --- a/include/uapi/linux/fs.h
> >>> +++ b/include/uapi/linux/fs.h
> >>> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
> >>>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
> >>>  			 RWF_DONTCACHE)
> >>>  
> >>> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
> >>>  #define PROCFS_IOCTL_MAGIC 'f'
> >>>  
> >>> +/* procfs root ioctls */
> >>> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
> >>
> >> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
> >> should be updated like:
> >>
> >> -'f'   00-0F  linux/fs.h                                                conflict!
> >> +'f'   00-1F  linux/fs.h                                                conflict!
> > 
> > Should this be 00-20 (or 00-2F) instead?
> 
> Oops, yes, it should be one of those. Thanks.
> 
> > Also, is there a better value to use for this new ioctl? I'm not quite
> > sure what is the best practice to handle these kinds of conflicts...
> 
> I wouldn't worry about it. We have *many* conflicts.
> (unless Al or Christian are concerned)

We try to minimize conflicts but we unfortunately give no strong
guarantees in any way. I always defer to Arnd in such matters as he's
got a pretty good mental model of what is best to do for ioctls.

> 
> >> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
> >> have updated the Doc/rst file.)

[PATCH v4 1/4] pidns: move is-ancestor logic to helper
[PATCH v4 2/4] procfs: add "pidns" mount option
[PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
[PATCH v4 4/4] selftests/proc: add tests for new pidns APIs