[v2] pidfs: implement file handle support

[PATCH v2 0/3] pidfs: implement file handle support

Posted by Erin Shepherd 1 year, 2 months ago

Since the introduction of pidfs, we have had 64-bit process identifiers
that will not be reused for the entire uptime of the system. This greatly
facilitates process tracking in userspace.

There are two limitations at present:

 * These identifiers are currently only exposed to processes on 64-bit
   systems. On 32-bit systems, inode space is also limited to 32 bits and
   therefore is subject to the same reuse issues.
 * There is no way to go from one of these unique identifiers to a pid or
   pidfd.

This patch implements fh_export and fh_to_dentry which enables userspace to
convert PIDs to and from PID file handles. A process can convert a pidfd into
a file handle using name_to_handle_at, store it (in memory, on disk, or 
elsewhere) and then convert it back into a pidfd suing open_by_handle_at.

To support us going from a file handle to a pidfd, we have to store a pid
inside the file handle. To ensure file handles are invariant and can move
between pid namespaces, we stash a pid from the initial namespace inside
the file handle.
	
  (There has been some discussion as to whether or not it is OK to include
  the PID in the initial pid namespace, but so far there hasn't been any
  conclusive reason given as to why this would be a bad idea)	

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
Changes in v2:
- Permit filesystems to opt out of CAP_DAC_READ_SEARCH
- Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
- Squash fh_export & fh_to_dentry into one commit
- Link to v1: https://lore.kernel.org/r/2aa94713-c12a-4344-a45c-a01f26e16a0d@e43.eu

Remaining: To address the PIDFD_THREAD question
 - Do we want to stash it in file handles (IMO no but there may be merits to
   doing so)
 - If not do we want PIDFD_THREAD/O_EXCL to open_by_handle_at to work or do we
   do something else?

   (Perhaps we could just add an ioctl which lets an app change the PIDFD_THREAD flag?)

If PIDFD_SELF lands <https://lore.kernel.org/r/cover.1727644404.git.lorenzo.stoakes@oracle.com>:
 - Support for PIDFD_SELF(_THREAD) as reference fd in open_by_handle_at?
   (We can even detect that special case early there and bypass most of the file handle logic)

---
Erin Shepherd (3):
      pseudofs: add support for export_ops
      exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
      pidfs: implement file handle support

 fs/fhandle.c              | 36 +++++++++++++++------------
 fs/libfs.c                |  1 +
 fs/pidfs.c                | 62 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/exportfs.h  |  3 +++
 include/linux/pseudo_fs.h |  1 +
 5 files changed, 87 insertions(+), 16 deletions(-)
---
base-commit: 14b6320953a3f856a3f93bf9a0e423395baa593d
change-id: 20241112-pidfs_fh-fbb04b108a2b

Best regards,
-- 
Erin Shepherd <erin.shepherd@e43.eu>

Re: [PATCH v2 0/3] pidfs: implement file handle support

Posted by Amir Goldstein 1 year, 2 months ago

On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> Since the introduction of pidfs, we have had 64-bit process identifiers
> that will not be reused for the entire uptime of the system. This greatly
> facilitates process tracking in userspace.
>
> There are two limitations at present:
>
>  * These identifiers are currently only exposed to processes on 64-bit
>    systems. On 32-bit systems, inode space is also limited to 32 bits and
>    therefore is subject to the same reuse issues.
>  * There is no way to go from one of these unique identifiers to a pid or
>    pidfd.
>
> This patch implements fh_export and fh_to_dentry which enables userspace to
> convert PIDs to and from PID file handles. A process can convert a pidfd into
> a file handle using name_to_handle_at, store it (in memory, on disk, or
> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
>
> To support us going from a file handle to a pidfd, we have to store a pid
> inside the file handle. To ensure file handles are invariant and can move
> between pid namespaces, we stash a pid from the initial namespace inside
> the file handle.
>
>   (There has been some discussion as to whether or not it is OK to include
>   the PID in the initial pid namespace, but so far there hasn't been any
>   conclusive reason given as to why this would be a bad idea)

IIUC, this is already exposed as st_ino on a 64bit arch?
If that is the case, then there is certainly no new info leak in this patch.

>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---
> Changes in v2:
> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
> - Squash fh_export & fh_to_dentry into one commit

Not sure why you did that.
It was pretty nice as separate commits if you ask me. Whatever.

Thanks,
Amir.

Re: [PATCH v2 0/3] pidfs: implement file handle support

Posted by Erin Shepherd 1 year, 2 months ago

   

On 14/11/2024 08:02, Amir Goldstein wrote:
> On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>> Since the introduction of pidfs, we have had 64-bit process identifiers
>> that will not be reused for the entire uptime of the system. This greatly
>> facilitates process tracking in userspace.
>>
>> There are two limitations at present:
>>
>>  * These identifiers are currently only exposed to processes on 64-bit
>>    systems. On 32-bit systems, inode space is also limited to 32 bits and
>>    therefore is subject to the same reuse issues.
>>  * There is no way to go from one of these unique identifiers to a pid or
>>    pidfd.
>>
>> This patch implements fh_export and fh_to_dentry which enables userspace to
>> convert PIDs to and from PID file handles. A process can convert a pidfd into
>> a file handle using name_to_handle_at, store it (in memory, on disk, or
>> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
>>
>> To support us going from a file handle to a pidfd, we have to store a pid
>> inside the file handle. To ensure file handles are invariant and can move
>> between pid namespaces, we stash a pid from the initial namespace inside
>> the file handle.
>>
>>   (There has been some discussion as to whether or not it is OK to include
>>   the PID in the initial pid namespace, but so far there hasn't been any
>>   conclusive reason given as to why this would be a bad idea)
> IIUC, this is already exposed as st_ino on a 64bit arch?
> If that is the case, then there is certainly no new info leak in this patch.

pid.ino is exposed, but the init-ns pid isn't exposed anywhere to my knowledge.

>> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
>> ---
>> Changes in v2:
>> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
>> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
>> - Squash fh_export & fh_to_dentry into one commit
> Not sure why you did that.
> It was pretty nice as separate commits if you ask me. Whatever.

I can revert that if you prefer. I squashed them because there was some churn
when adding the init-ns-pid necessary to restore them, but I am happy to do
things in two steps.

Do you prefer having the final handle format in the first step, or letting it
evolve into final form over the series?

Re: [PATCH v2 0/3] pidfs: implement file handle support

Posted by Amir Goldstein 1 year, 2 months ago

On Thu, Nov 14, 2024 at 4:51 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
>
>
> On 14/11/2024 08:02, Amir Goldstein wrote:
> > On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
> >> Since the introduction of pidfs, we have had 64-bit process identifiers
> >> that will not be reused for the entire uptime of the system. This greatly
> >> facilitates process tracking in userspace.
> >>
> >> There are two limitations at present:
> >>
> >>  * These identifiers are currently only exposed to processes on 64-bit
> >>    systems. On 32-bit systems, inode space is also limited to 32 bits and
> >>    therefore is subject to the same reuse issues.
> >>  * There is no way to go from one of these unique identifiers to a pid or
> >>    pidfd.
> >>
> >> This patch implements fh_export and fh_to_dentry which enables userspace to
> >> convert PIDs to and from PID file handles. A process can convert a pidfd into
> >> a file handle using name_to_handle_at, store it (in memory, on disk, or
> >> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
> >>
> >> To support us going from a file handle to a pidfd, we have to store a pid
> >> inside the file handle. To ensure file handles are invariant and can move
> >> between pid namespaces, we stash a pid from the initial namespace inside
> >> the file handle.
> >>
> >>   (There has been some discussion as to whether or not it is OK to include
> >>   the PID in the initial pid namespace, but so far there hasn't been any
> >>   conclusive reason given as to why this would be a bad idea)
> > IIUC, this is already exposed as st_ino on a 64bit arch?
> > If that is the case, then there is certainly no new info leak in this patch.
>
> pid.ino is exposed, but the init-ns pid isn't exposed anywhere to my knowledge.
>
> >> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> >> ---
> >> Changes in v2:
> >> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
> >> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
> >> - Squash fh_export & fh_to_dentry into one commit
> > Not sure why you did that.
> > It was pretty nice as separate commits if you ask me. Whatever.
>
> I can revert that if you prefer. I squashed them because there was some churn
> when adding the init-ns-pid necessary to restore them, but I am happy to do
> things in two steps.
>
> Do you prefer having the final handle format in the first step, or letting it
> evolve into final form over the series?
>

I don't really mind. Was just wondering. Either way is fine.

Thanks,
Amir.

Re: [PATCH v2 0/3] pidfs: implement file handle support

Posted by Christian Brauner 1 year, 2 months ago

On Thu, Nov 14, 2024 at 01:48:02PM +0100, Erin Shepherd wrote:
>    
> 
> On 14/11/2024 08:02, Amir Goldstein wrote:
> > On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
> >> Since the introduction of pidfs, we have had 64-bit process identifiers
> >> that will not be reused for the entire uptime of the system. This greatly
> >> facilitates process tracking in userspace.
> >>
> >> There are two limitations at present:
> >>
> >>  * These identifiers are currently only exposed to processes on 64-bit
> >>    systems. On 32-bit systems, inode space is also limited to 32 bits and
> >>    therefore is subject to the same reuse issues.
> >>  * There is no way to go from one of these unique identifiers to a pid or
> >>    pidfd.
> >>
> >> This patch implements fh_export and fh_to_dentry which enables userspace to
> >> convert PIDs to and from PID file handles. A process can convert a pidfd into
> >> a file handle using name_to_handle_at, store it (in memory, on disk, or
> >> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
> >>
> >> To support us going from a file handle to a pidfd, we have to store a pid
> >> inside the file handle. To ensure file handles are invariant and can move
> >> between pid namespaces, we stash a pid from the initial namespace inside
> >> the file handle.
> >>
> >>   (There has been some discussion as to whether or not it is OK to include
> >>   the PID in the initial pid namespace, but so far there hasn't been any
> >>   conclusive reason given as to why this would be a bad idea)
> > IIUC, this is already exposed as st_ino on a 64bit arch?
> > If that is the case, then there is certainly no new info leak in this patch.
> 
> pid.ino is exposed, but the init-ns pid isn't exposed anywhere to my knowledge.

I see what you mean. That might be an information leak. Not a very
interesting one, I think but I need to think about it.

> 
> >> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> >> ---
> >> Changes in v2:
> >> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
> >> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
> >> - Squash fh_export & fh_to_dentry into one commit
> > Not sure why you did that.
> > It was pretty nice as separate commits if you ask me. Whatever.
> 
> I can revert that if you prefer. I squashed them because there was some churn
> when adding the init-ns-pid necessary to restore them, but I am happy to do
> things in two steps.
> 
> Do you prefer having the final handle format in the first step, or letting it
> evolve into final form over the series?
>

[PATCH RFC 0/2] pidfs: file handle preliminaries

Posted by Christian Brauner 1 year, 2 months ago

Hey,

This reworks the inode number allocation for pidfs in order to support
file handles properly.

Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the
design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

So after applying the pidfs file handle series at
https://lore.kernel.org/r/20241101135452.19359-1-erin.shepherd@e43.eu on
top of the patches here we should be able to simplify encoding and
decoding to something like:

diff --git a/fs/pidfs.c b/fs/pidfs.c
index e71294d3d607..a38b833a2d38 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -78,7 +78,7 @@ void pidfs_remove_pid(struct pid *pid)
 }
 
 /* Find a struct pid based on the inode number. */
-static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
+static struct pid *pidfs_ino_get_pid(u64 ino)
 {
 	ino_t pid_ino = pidfs_ino(ino);
 	u32 gen = pidfs_gen(ino);
@@ -475,49 +475,37 @@ static const struct dentry_operations pidfs_dentry_operations = {
 	.d_prune	= stashed_dentry_prune,
 };
 
-#define PIDFD_FID_LEN 3
-
-struct pidfd_fid {
-	u64 ino;
-	s32 pid;
-} __packed;
-
-static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
+static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
 			   struct inode *parent)
 {
 	struct pid *pid = inode->i_private;
-	struct pidfd_fid *fid = (struct pidfd_fid *)fh;
 
-	if (*max_len < PIDFD_FID_LEN) {
-		*max_len = PIDFD_FID_LEN;
+	if (*max_len < 2) {
+		*max_len = 2;
 		return FILEID_INVALID;
 	}
 
-	fid->ino = pid->ino;
-	fid->pid = pid_nr(pid);
-	*max_len = PIDFD_FID_LEN;
+	*max_len = 2;
+	*(u64 *)fh = pid->ino;
 	return FILEID_INO64_GEN;
 }
 
 static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
-					 struct fid *gen_fid,
+					 struct fid *fid,
 					 int fh_len, int fh_type)
 {
 	int ret;
 	struct path path;
-	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
 	struct pid *pid;
+	u64 pid_ino;
 
-	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
+	if (fh_type != FILEID_INO64_GEN || fh_len < 2)
 		return NULL;
 
-	scoped_guard(rcu) {
-		pid = find_pid_ns(fid->pid, &init_pid_ns);
-		if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
-			return NULL;
-
-		pid = get_pid(pid);
-	}
+	pid_ino = *(u64 *)fid;
+	pid = pidfs_ino_get_pid(pid_ino);
+	if (!pid)
+		return NULL;
 
 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
 	if (ret < 0)

Thanks!
Christian

---
Christian Brauner (2):
      pidfs: rework inode number allocation
      pidfs: remove 32bit inode number handling

 fs/pidfs.c            | 138 +++++++++++++++++++++++++++++++++++---------------
 include/linux/pidfs.h |   2 +
 kernel/pid.c          |  11 ++--
 3 files changed, 103 insertions(+), 48 deletions(-)
---
base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
change-id: 20241128-work-pidfs-2bd42c7ea772

Re: [PATCH RFC 0/2] pidfs: file handle preliminaries

Posted by Amir Goldstein 1 year, 2 months ago

On Thu, Nov 28, 2024 at 1:34 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Hey,
>
> This reworks the inode number allocation for pidfs in order to support
> file handles properly.
>
> Recently we received a patchset that aims to enable file handle encoding
> and decoding via name_to_handle_at(2) and open_by_handle_at(2).
>
> A crucical step in the patch series is how to go from inode number to
> struct pid without leaking information into unprivileged contexts. The
> issue is that in order to find a struct pid the pid number in the
> initial pid namespace must be encoded into the file handle via
> name_to_handle_at(2). This can be used by containers using a separate
> pid namespace to learn what the pid number of a given process in the
> initial pid namespace is. While this is a weak information leak it could
> be used in various exploits and in general is an ugly wart in the
> design.
>
> To solve this problem a new way is needed to lookup a struct pid based
> on the inode number allocated for that struct pid. The other part is to
> remove the custom inode number allocation on 32bit systems that is also
> an ugly wart that should go away.
>
> So, a new scheme is used that I was discusssing with Tejun some time
> back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> are used for the generation number. This gives a 64 bit inode number
> that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> recycled slowly and can be used to lookup struct pids.
>
> So after applying the pidfs file handle series at
> https://lore.kernel.org/r/20241101135452.19359-1-erin.shepherd@e43.eu on
> top of the patches here we should be able to simplify encoding and
> decoding to something like:
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index e71294d3d607..a38b833a2d38 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -78,7 +78,7 @@ void pidfs_remove_pid(struct pid *pid)
>  }
>
>  /* Find a struct pid based on the inode number. */
> -static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
> +static struct pid *pidfs_ino_get_pid(u64 ino)
>  {
>         ino_t pid_ino = pidfs_ino(ino);
>         u32 gen = pidfs_gen(ino);
> @@ -475,49 +475,37 @@ static const struct dentry_operations pidfs_dentry_operations = {
>         .d_prune        = stashed_dentry_prune,
>  };
>
> -#define PIDFD_FID_LEN 3
> -
> -struct pidfd_fid {
> -       u64 ino;
> -       s32 pid;
> -} __packed;
> -
> -static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
> +static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
>                            struct inode *parent)
>  {
>         struct pid *pid = inode->i_private;
> -       struct pidfd_fid *fid = (struct pidfd_fid *)fh;
>
> -       if (*max_len < PIDFD_FID_LEN) {
> -               *max_len = PIDFD_FID_LEN;
> +       if (*max_len < 2) {
> +               *max_len = 2;
>                 return FILEID_INVALID;
>         }
>
> -       fid->ino = pid->ino;
> -       fid->pid = pid_nr(pid);
> -       *max_len = PIDFD_FID_LEN;
> +       *max_len = 2;
> +       *(u64 *)fh = pid->ino;
>         return FILEID_INO64_GEN;

Semantic remark:
        /*
         * 64 bit inode number, 32 bit generation number.
         */
        FILEID_INO64_GEN = 0x81,

filesystems are free to abuse the constants and return whatever id they want
(e.g. shmem_encode_fh()), but if you want to play by the rules, this would be
either:
        /*
         * 64 bit unique kernfs id
         */
        FILEID_KERNFS = 0xfe,

or:
        /*
         * 32bit inode number, 32 bit generation number.
         */
        FILEID_INO32_GEN = 1,

which is at least sometimes correct.
or define:
        /*
         * 64 bit inode number.
         */
        FILEID_INO64 = 0x80,


Thanks,
Amir.

[PATCH RFC 1/2] pidfs: rework inode number allocation

Posted by Christian Brauner 1 year, 2 months ago

Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c            | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pidfs.h |  2 ++
 kernel/pid.c          | 11 +++---
 3 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 618abb1fa1b84cf31282c922374e28d60cd49d00..09a0c8ac805301927a94758b3f7d1e513826daf9 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -23,6 +23,88 @@
 #include "internal.h"
 #include "mount.h"
 
+static u32 pidfs_ino_highbits;
+static u32 pidfs_ino_last_ino_lowbits;
+
+static DEFINE_IDR(pidfs_ino_idr);
+
+static inline ino_t pidfs_ino(u64 ino)
+{
+	/* On 32 bit low 32 bits are the inode. */
+	if (sizeof(ino_t) < sizeof(u64))
+		return (u32)ino;
+
+	/* On 64 bit simply return ino. */
+	return ino;
+}
+
+static inline u32 pidfs_gen(u64 ino)
+{
+	/* On 32 bit the generation number are the upper 32 bits. */
+	if (sizeof(ino_t) < sizeof(u64))
+		return ino >> 32;
+
+	/* On 64 bit the generation number is 1. */
+	return 1;
+}
+
+/*
+ * Construct an inode number for struct pid in a way that we can use the
+ * lower 32bit to lookup struct pid independent of any pid numbers that
+ * could be leaked into userspace (e.g., via file handle encoding).
+ */
+int pidfs_add_pid(struct pid *pid)
+{
+	u32 ino_highbits;
+	int ret;
+
+	ret = idr_alloc_cyclic(&pidfs_ino_idr, pid, 1, 0, GFP_ATOMIC);
+	if (ret >= 0 && ret < pidfs_ino_last_ino_lowbits)
+		pidfs_ino_highbits++;
+	ino_highbits = pidfs_ino_highbits;
+	pidfs_ino_last_ino_lowbits = ret;
+	if (ret < 0)
+		return ret;
+
+	pid->ino = (u64)ino_highbits << 32 | ret;
+	pid->stashed = NULL;
+	return 0;
+}
+
+void pidfs_remove_pid(struct pid *pid)
+{
+	idr_remove(&pidfs_ino_idr, (u32)pidfs_ino(pid->ino));
+}
+
+/* Find a struct pid based on the inode number. */
+static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
+{
+	ino_t pid_ino = pidfs_ino(ino);
+	u32 gen = pidfs_gen(ino);
+	struct pid *pid;
+
+	guard(rcu)();
+
+	/* Handle @pid lookup carefully so there's no risk of UAF. */
+	pid = idr_find(&pidfs_ino_idr, (u32)ino);
+	if (!pid)
+		return NULL;
+
+	if (sizeof(ino_t) < sizeof(u64)) {
+		if (gen && pidfs_gen(pid->ino) != gen)
+			pid = NULL;
+	} else {
+		if (pidfs_ino(pid->ino) != pid_ino)
+			pid = NULL;
+	}
+
+	/* Within our pid namespace hierarchy? */
+	if (pid_vnr(pid) == 0)
+		pid = NULL;
+
+	return get_pid(pid);
+}
+
 #ifdef CONFIG_PROC_FS
 /**
  * pidfd_show_fdinfo - print information about a pidfd
@@ -491,6 +573,16 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 
 void __init pidfs_init(void)
 {
+	/*
+	 * On 32 bit systems the lower 32 bits are the inode number and
+	 * the higher 32 bits are the generation number. The starting
+	 * value for the inode number and the generation number is one.
+	 */
+	if (sizeof(ino_t) < sizeof(u64))
+		pidfs_ino_highbits = 1;
+	else
+		pidfs_ino_highbits = 0;
+
 	pidfs_mnt = kern_mount(&pidfs_type);
 	if (IS_ERR(pidfs_mnt))
 		panic("Failed to mount pidfs pseudo filesystem");
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 75bdf9807802a5d1a9699c99aa42648c2bd34170..2958652bb108b8a2e02128e17317be4545b40a01 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -4,5 +4,7 @@
 
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
+int pidfs_add_pid(struct pid *pid);
+void pidfs_remove_pid(struct pid *pid);
 
 #endif /* _LINUX_PID_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 115448e89c3e9e664d0d51c8d853e8167ba0540c..9f321c6456d24af705c28f0256ca4de771f5e681 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -64,11 +64,6 @@ int pid_max = PID_MAX_DEFAULT;
 
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
-/*
- * Pseudo filesystems start inode numbering after one. We use Reserved
- * PIDs as a natural offset.
- */
-static u64 pidfs_ino = RESERVED_PIDS;
 
 /*
  * PID-map pages start out as NULL, they get allocated upon
@@ -157,6 +152,7 @@ void free_pid(struct pid *pid)
 		}
 
 		idr_remove(&ns->idr, upid->nr);
+		pidfs_remove_pid(pid);
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
@@ -276,8 +272,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
-	pid->stashed = NULL;
-	pid->ino = ++pidfs_ino;
+	retval = pidfs_add_pid(pid);
+	if (retval)
+		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);

-- 
2.45.2

Re: [PATCH RFC 1/2] pidfs: rework inode number allocation

Posted by Amir Goldstein 1 year, 2 months ago

On Thu, Nov 28, 2024 at 1:34 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Recently we received a patchset that aims to enable file handle encoding
> and decoding via name_to_handle_at(2) and open_by_handle_at(2).
>
> A crucical step in the patch series is how to go from inode number to
> struct pid without leaking information into unprivileged contexts. The
> issue is that in order to find a struct pid the pid number in the
> initial pid namespace must be encoded into the file handle via
> name_to_handle_at(2). This can be used by containers using a separate
> pid namespace to learn what the pid number of a given process in the
> initial pid namespace is. While this is a weak information leak it could
> be used in various exploits and in general is an ugly wart in the design.
>
> To solve this problem a new way is needed to lookup a struct pid based
> on the inode number allocated for that struct pid. The other part is to
> remove the custom inode number allocation on 32bit systems that is also
> an ugly wart that should go away.
>
> So, a new scheme is used that I was discusssing with Tejun some time
> back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> are used for the generation number. This gives a 64 bit inode number
> that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> recycled slowly and can be used to lookup struct pids.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c            | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pidfs.h |  2 ++
>  kernel/pid.c          | 11 +++---
>  3 files changed, 98 insertions(+), 7 deletions(-)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 618abb1fa1b84cf31282c922374e28d60cd49d00..09a0c8ac805301927a94758b3f7d1e513826daf9 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -23,6 +23,88 @@
>  #include "internal.h"
>  #include "mount.h"
>
> +static u32 pidfs_ino_highbits;
> +static u32 pidfs_ino_last_ino_lowbits;
> +
> +static DEFINE_IDR(pidfs_ino_idr);
> +
> +static inline ino_t pidfs_ino(u64 ino)
> +{
> +       /* On 32 bit low 32 bits are the inode. */
> +       if (sizeof(ino_t) < sizeof(u64))
> +               return (u32)ino;
> +
> +       /* On 64 bit simply return ino. */
> +       return ino;
> +}
> +
> +static inline u32 pidfs_gen(u64 ino)
> +{
> +       /* On 32 bit the generation number are the upper 32 bits. */
> +       if (sizeof(ino_t) < sizeof(u64))
> +               return ino >> 32;
> +
> +       /* On 64 bit the generation number is 1. */
> +       return 1;
> +}
> +
> +/*
> + * Construct an inode number for struct pid in a way that we can use the
> + * lower 32bit to lookup struct pid independent of any pid numbers that
> + * could be leaked into userspace (e.g., via file handle encoding).
> + */
> +int pidfs_add_pid(struct pid *pid)
> +{
> +       u32 ino_highbits;
> +       int ret;
> +
> +       ret = idr_alloc_cyclic(&pidfs_ino_idr, pid, 1, 0, GFP_ATOMIC);
> +       if (ret >= 0 && ret < pidfs_ino_last_ino_lowbits)
> +               pidfs_ino_highbits++;
> +       ino_highbits = pidfs_ino_highbits;
> +       pidfs_ino_last_ino_lowbits = ret;
> +       if (ret < 0)
> +               return ret;
> +

This code looks "too useful" to be in a random filesystem.
Maybe work a generation counter into idr_alloc_cyclic or
just let it let the caller know that the range has been cycled?
Only if this is not too complicated to do.

> +       pid->ino = (u64)ino_highbits << 32 | ret;
> +       pid->stashed = NULL;
> +       return 0;
> +}
> +
> +void pidfs_remove_pid(struct pid *pid)
> +{
> +       idr_remove(&pidfs_ino_idr, (u32)pidfs_ino(pid->ino));
> +}
> +
> +/* Find a struct pid based on the inode number. */
> +static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
> +{
> +       ino_t pid_ino = pidfs_ino(ino);
> +       u32 gen = pidfs_gen(ino);
> +       struct pid *pid;
> +
> +       guard(rcu)();
> +
> +       /* Handle @pid lookup carefully so there's no risk of UAF. */
> +       pid = idr_find(&pidfs_ino_idr, (u32)ino);
> +       if (!pid)
> +               return NULL;
> +
> +       if (sizeof(ino_t) < sizeof(u64)) {
> +               if (gen && pidfs_gen(pid->ino) != gen)
> +                       pid = NULL;
> +       } else {
> +               if (pidfs_ino(pid->ino) != pid_ino)
> +                       pid = NULL;
> +       }
> +
> +       /* Within our pid namespace hierarchy? */
> +       if (pid_vnr(pid) == 0)
> +               pid = NULL;
> +
> +       return get_pid(pid);
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  /**
>   * pidfd_show_fdinfo - print information about a pidfd
> @@ -491,6 +573,16 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
>
>  void __init pidfs_init(void)
>  {
> +       /*
> +        * On 32 bit systems the lower 32 bits are the inode number and
> +        * the higher 32 bits are the generation number. The starting
> +        * value for the inode number and the generation number is one.
> +        */
> +       if (sizeof(ino_t) < sizeof(u64))
> +               pidfs_ino_highbits = 1;
> +       else
> +               pidfs_ino_highbits = 0;
> +
>         pidfs_mnt = kern_mount(&pidfs_type);
>         if (IS_ERR(pidfs_mnt))
>                 panic("Failed to mount pidfs pseudo filesystem");
> diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
> index 75bdf9807802a5d1a9699c99aa42648c2bd34170..2958652bb108b8a2e02128e17317be4545b40a01 100644
> --- a/include/linux/pidfs.h
> +++ b/include/linux/pidfs.h
> @@ -4,5 +4,7 @@
>
>  struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
>  void __init pidfs_init(void);
> +int pidfs_add_pid(struct pid *pid);
> +void pidfs_remove_pid(struct pid *pid);
>
>  #endif /* _LINUX_PID_FS_H */
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 115448e89c3e9e664d0d51c8d853e8167ba0540c..9f321c6456d24af705c28f0256ca4de771f5e681 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -64,11 +64,6 @@ int pid_max = PID_MAX_DEFAULT;
>
>  int pid_max_min = RESERVED_PIDS + 1;
>  int pid_max_max = PID_MAX_LIMIT;
> -/*
> - * Pseudo filesystems start inode numbering after one. We use Reserved
> - * PIDs as a natural offset.
> - */
> -static u64 pidfs_ino = RESERVED_PIDS;
>
>  /*
>   * PID-map pages start out as NULL, they get allocated upon
> @@ -157,6 +152,7 @@ void free_pid(struct pid *pid)
>                 }
>
>                 idr_remove(&ns->idr, upid->nr);
> +               pidfs_remove_pid(pid);
>         }
>         spin_unlock_irqrestore(&pidmap_lock, flags);
>
> @@ -276,8 +272,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>         spin_lock_irq(&pidmap_lock);
>         if (!(ns->pid_allocated & PIDNS_ADDING))
>                 goto out_unlock;
> -       pid->stashed = NULL;
> -       pid->ino = ++pidfs_ino;
> +       retval = pidfs_add_pid(pid);

I hope this does not create a scalability issue compared to the prev method?

> +       if (retval)
> +               goto out_unlock;
>         for ( ; upid >= pid->numbers; --upid) {
>                 /* Make the PID visible to find_pid_ns. */
>                 idr_replace(&upid->ns->idr, pid, upid->nr);
>
> --
> 2.45.2
>

Nice idea!
If there is something wrong with it, then I could not find it ;)

Thanks,
Amir.

[PATCH RFC 2/2] pidfs: remove 32bit inode number handling

Posted by Christian Brauner 1 year, 2 months ago

Now that we have a unified inode number handling model remove the custom
ida-based allocation for 32bit.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 46 +++++-----------------------------------------
 1 file changed, 5 insertions(+), 41 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 09a0c8ac805301927a94758b3f7d1e513826daf9..334d2fc3f01ee6f219dc3e52309bfe00726885ca 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -400,40 +400,6 @@ struct pid *pidfd_pid(const struct file *file)
 
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
-#if BITS_PER_LONG == 32
-/*
- * Provide a fallback mechanism for 32-bit systems so processes remain
- * reliably comparable by inode number even on those systems.
- */
-static DEFINE_IDA(pidfd_inum_ida);
-
-static int pidfs_inum(struct pid *pid, unsigned long *ino)
-{
-	int ret;
-
-	ret = ida_alloc_range(&pidfd_inum_ida, RESERVED_PIDS + 1,
-			      UINT_MAX, GFP_ATOMIC);
-	if (ret < 0)
-		return -ENOSPC;
-
-	*ino = ret;
-	return 0;
-}
-
-static inline void pidfs_free_inum(unsigned long ino)
-{
-	if (ino > 0)
-		ida_free(&pidfd_inum_ida, ino);
-}
-#else
-static inline int pidfs_inum(struct pid *pid, unsigned long *ino)
-{
-	*ino = pid->ino;
-	return 0;
-}
-#define pidfs_free_inum(ino) ((void)(ino))
-#endif
-
 /*
  * The vfs falls back to simple_setattr() if i_op->setattr() isn't
  * implemented. Let's reject it completely until we have a clean
@@ -485,7 +451,6 @@ static void pidfs_evict_inode(struct inode *inode)
 
 	clear_inode(inode);
 	put_pid(pid);
-	pidfs_free_inum(inode->i_ino);
 }
 
 static const struct super_operations pidfs_sops = {
@@ -511,17 +476,16 @@ static const struct dentry_operations pidfs_dentry_operations = {
 
 static int pidfs_init_inode(struct inode *inode, void *data)
 {
+	struct pid *pid = data;
+
 	inode->i_private = data;
 	inode->i_flags |= S_PRIVATE;
 	inode->i_mode |= S_IRWXU;
 	inode->i_op = &pidfs_inode_operations;
 	inode->i_fop = &pidfs_file_operations;
-	/*
-	 * Inode numbering for pidfs start at RESERVED_PIDS + 1. This
-	 * avoids collisions with the root inode which is 1 for pseudo
-	 * filesystems.
-	 */
-	return pidfs_inum(data, &inode->i_ino);
+	inode->i_ino = pidfs_ino(pid->ino);
+	inode->i_generation = pidfs_gen(pid->ino);
+	return 0;
 }
 
 static void pidfs_put_data(void *data)

-- 
2.45.2