[v2] pidfs: file handle preliminaries

[PATCH RFC v2 0/3] pidfs: file handle preliminaries

Posted by Christian Brauner 1 year, 2 months ago

Hey,

This reworks the inode number allocation for pidfs in order to support
file handles properly.

Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the
design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

Thanks!
Christian

---
Changes in v2:
- Remove __maybe_unused pidfd_ino_get_pid() function that was only there
  for initial illustration purposes.
- Link to v1: https://lore.kernel.org/r/20241128-work-pidfs-v1-0-80f267639d98@kernel.org

---
Christian Brauner (3):
      pidfs: rework inode number allocation
      pidfs: remove 32bit inode number handling
      pidfs: support FS_IOC_GETVERSION

 fs/pidfs.c            | 118 ++++++++++++++++++++++++++++++++------------------
 include/linux/pidfs.h |   2 +
 kernel/pid.c          |  14 +++---
 3 files changed, 86 insertions(+), 48 deletions(-)
---
base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
change-id: 20241128-work-pidfs-2bd42c7ea772

[PATCH RFC 0/6] pidfs: implement file handle support

Posted by Christian Brauner 1 year, 2 months ago

Hey,

Now that we have the preliminaries to lookup struct pid based on its
inode number alone we can implement file handle support.

This is based on custom export operation methods which allows pidfs to
implement permission checking and opening of pidfs file handles cleanly
without hacking around in the core file handle code too much.

This is lightly tested.

Thanks!
Christian

---
Christian Brauner (5):
      fhandle: simplify error handling
      exportfs: add open method
      fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()
      exportfs: add permission method
      pidfs: implement file handle support

Erin Shepherd (1):
      pseudofs: add support for export_ops

 fs/fhandle.c              | 97 ++++++++++++++++++++++-------------------------
 fs/libfs.c                |  1 +
 fs/pidfs.c                | 96 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/exportfs.h  | 20 ++++++++++
 include/linux/pseudo_fs.h |  1 +
 5 files changed, 164 insertions(+), 51 deletions(-)
---
base-commit: 94c9a56ad3521a28177610c63298d66de634cb9d
change-id: 20241129-work-pidfs-file_handle-07bdfb860a38

Re: [PATCH RFC 0/6] pidfs: implement file handle support

Posted by Amir Goldstein 1 year, 2 months ago

On Fri, Nov 29, 2024 at 2:39 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Hey,
>
> Now that we have the preliminaries to lookup struct pid based on its
> inode number alone we can implement file handle support.
>
> This is based on custom export operation methods which allows pidfs to
> implement permission checking and opening of pidfs file handles cleanly
> without hacking around in the core file handle code too much.
>
> This is lightly tested.

With my comments addressed as you pushed to vfs-6.14.pidfs branch
in your tree, you may add to the patches posted:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

HOWEVER,
IMO there is still one thing that has to be addressed before merge -
We must make sure that nfsd cannot export pidfs.

In principal, SB_NOUSER filesystems should not be accessible to
userspace paths, so exportfs should not be able to configure nfsd
export of pidfs, but maybe this limitation can be worked around by
using magic link paths?

I think it may be worth explicitly disallowing nfsd export of SB_NOUSER
filesystems and we could also consider blocking SB_KERNMOUNT,
but may there are users exporting ramfs?

Jeff has mentioned that he thinks we are blocking export of cgroupfs
by nfsd, but I really don't see where that is being enforced.
The requirement for FS_REQUIRES_DEV in check_export() is weak
because user can overrule it with manual fsid argument to exportfs.

So maybe we disallow nfsd export of kernfs and backport to stable kernels
to be on the safe side?

On top of that, we may also want to reject nfsd export of any fs
with custom ->open() or ->permission() export ops, on the grounds
that nfsd does not call these ops?

Regarding the two other kernel users of exportfs, namely,
overlayfs and fanotify -

For overlayfs, I think that in ovl_can_decode_fh() we can safely
opt-out of SB_NOUSER and SB_KERNMOUNT filesystems,
to not allow nfs exporting of overlayfs over those lower fs.

For fanotify, there is already a check in fanotify_events_supported()
to disallow sb/mount marks on SB_NOUSER and a comment that
questions the value of allowing them for SB_KERNMOUNT.
So for pidfs there is no risk wrt fanotify and it does not look like pidfs
is going to generate any fanotify events anyway.

Thanks,
Amir.

Re: [PATCH RFC 0/6] pidfs: implement file handle support

Posted by Christian Brauner 1 year, 2 months ago

On Sat, Nov 30, 2024 at 01:22:05PM +0100, Amir Goldstein wrote:
> On Fri, Nov 29, 2024 at 2:39 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > Hey,
> >
> > Now that we have the preliminaries to lookup struct pid based on its
> > inode number alone we can implement file handle support.
> >
> > This is based on custom export operation methods which allows pidfs to
> > implement permission checking and opening of pidfs file handles cleanly
> > without hacking around in the core file handle code too much.
> >
> > This is lightly tested.
> 
> With my comments addressed as you pushed to vfs-6.14.pidfs branch
> in your tree, you may add to the patches posted:
> 
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> HOWEVER,
> IMO there is still one thing that has to be addressed before merge -
> We must make sure that nfsd cannot export pidfs.
> 
> In principal, SB_NOUSER filesystems should not be accessible to
> userspace paths, so exportfs should not be able to configure nfsd
> export of pidfs, but maybe this limitation can be worked around by
> using magic link paths?

I don't see how. I might be missing details.

> I think it may be worth explicitly disallowing nfsd export of SB_NOUSER
> filesystems and we could also consider blocking SB_KERNMOUNT,
> but may there are users exporting ramfs?

No need to restrict it if it's safe, I guess.

> Jeff has mentioned that he thinks we are blocking export of cgroupfs
> by nfsd, but I really don't see where that is being enforced.
> The requirement for FS_REQUIRES_DEV in check_export() is weak
> because user can overrule it with manual fsid argument to exportfs.
> So maybe we disallow nfsd export of kernfs and backport to stable kernels
> to be on the safe side?

File handles and nfs export have become two distinct things and there
filesystems based on kernfs, and pidfs want to support file handles
without support nfs export.

So I think instead of having nfs check what filesystems may be exported
we should let the filesystems indicate that they cannot be exported and
make nfs honour that.

So something like the untested sketch:

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1..a5c75cb1c812 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -154,6 +154,7 @@ static const struct export_operations kernfs_export_ops = {
 	.fh_to_dentry	= kernfs_fh_to_dentry,
 	.fh_to_parent	= kernfs_fh_to_parent,
 	.get_parent	= kernfs_get_parent_dentry,
+	.flags		= EXPORT_OP_FILE_HANDLE,
 };
 
 /**
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index eacafe46e3b6..170c5729e7f2 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -417,6 +417,7 @@ static struct svc_export *svc_export_lookup(struct svc_export *);
 static int check_export(struct path *path, int *flags, unsigned char *uuid)
 {
 	struct inode *inode = d_inode(path->dentry);
+	const struct export_operations *nop;
 
 	/*
 	 * We currently export only dirs, regular files, and (for v4
@@ -449,11 +450,16 @@ static int check_export(struct path *path, int *flags, unsigned char *uuid)
 		return -EINVAL;
 	}
 
-	if (!exportfs_can_decode_fh(inode->i_sb->s_export_op)) {
+	if (!exportfs_can_decode_fh(nop)) {
 		dprintk("exp_export: export of invalid fs type.\n");
 		return -EINVAL;
 	}
 
+	if (nop && nop->flags & EXPORT_OP_FILE_HANDLE) {
+		dprintk("exp_export: filesystem only supports non-exportable file handles.\n");
+		return -EINVAL;
+	}
+
 	if (is_idmapped_mnt(path->mnt)) {
 		dprintk("exp_export: export of idmapped mounts not yet supported.\n");
 		return -EINVAL;
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 9aa7493b1e10..d1646c0789e1 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -83,10 +83,15 @@ void ovl_revert_creds(const struct cred *old_cred)
  */
 int ovl_can_decode_fh(struct super_block *sb)
 {
+	const struct export_operations *nop = sb->s_export_op;
+
 	if (!capable(CAP_DAC_READ_SEARCH))
 		return 0;
 
-	if (!exportfs_can_decode_fh(sb->s_export_op))
+	if (!exportfs_can_decode_fh(nop))
+		return 0;
+
+	if (nop && nop->flags & EXPORT_OP_FILE_HANDLE)
 		return 0;
 
 	return sb->s_export_op->encode_fh ? -1 : FILEID_INO32_GEN;
diff --git a/fs/pidfs.c b/fs/pidfs.c
index dde3e4e90ea9..9d98b5461dc7 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -570,6 +570,7 @@ static const struct export_operations pidfs_export_operations = {
 	.fh_to_dentry	= pidfs_fh_to_dentry,
 	.open		= pidfs_export_open,
 	.permission	= pidfs_export_permission,
+	.flags          = EXPORT_OP_FILE_HANDLE,
 };
 
 static int pidfs_init_inode(struct inode *inode, void *data)
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index a087606ace19..98f7cb17abee 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -280,6 +280,7 @@ struct export_operations {
 						*/
 #define EXPORT_OP_FLUSH_ON_CLOSE	(0x20) /* fs flushes file data on close */
 #define EXPORT_OP_ASYNC_LOCK		(0x40) /* fs can do async lock request */
+#define EXPORT_OP_FILE_HANDLE		(0x80) /* fs only supports file handles, no proper export */
 	unsigned long	flags;
 };

> On top of that, we may also want to reject nfsd export of any fs
> with custom ->open() or ->permission() export ops, on the grounds
> that nfsd does not call these ops?
> 
> Regarding the two other kernel users of exportfs, namely,
> overlayfs and fanotify -
> 
> For overlayfs, I think that in ovl_can_decode_fh() we can safely
> opt-out of SB_NOUSER and SB_KERNMOUNT filesystems,
> to not allow nfs exporting of overlayfs over those lower fs.
> 
> For fanotify, there is already a check in fanotify_events_supported()
> to disallow sb/mount marks on SB_NOUSER and a comment that
> questions the value of allowing them for SB_KERNMOUNT.
> So for pidfs there is no risk wrt fanotify and it does not look like pidfs
> is going to generate any fanotify events anyway.

Re: [PATCH RFC 0/6] pidfs: implement file handle support

Posted by Amir Goldstein 1 year, 2 months ago

On Sun, Dec 1, 2024 at 9:43 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sat, Nov 30, 2024 at 01:22:05PM +0100, Amir Goldstein wrote:
> > On Fri, Nov 29, 2024 at 2:39 PM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Hey,
> > >
> > > Now that we have the preliminaries to lookup struct pid based on its
> > > inode number alone we can implement file handle support.
> > >
> > > This is based on custom export operation methods which allows pidfs to
> > > implement permission checking and opening of pidfs file handles cleanly
> > > without hacking around in the core file handle code too much.
> > >
> > > This is lightly tested.
> >
> > With my comments addressed as you pushed to vfs-6.14.pidfs branch
> > in your tree, you may add to the patches posted:
> >
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > HOWEVER,
> > IMO there is still one thing that has to be addressed before merge -
> > We must make sure that nfsd cannot export pidfs.
> >
> > In principal, SB_NOUSER filesystems should not be accessible to
> > userspace paths, so exportfs should not be able to configure nfsd
> > export of pidfs, but maybe this limitation can be worked around by
> > using magic link paths?
>
> I don't see how. I might be missing details.

AFAIK, nfsd gets the paths to export from userspace via
svc_export_parse() =>  kern_path(buf, 0, &exp.ex_path)
afterwards check_export() validates exp.ex_path and I see that regular
files can be exported.
I suppose that a pidfs file can have a magic link path no?
The question is whether this magic link path could be passed to nfsd
via the exportfs UAPI.

>
> > I think it may be worth explicitly disallowing nfsd export of SB_NOUSER
> > filesystems and we could also consider blocking SB_KERNMOUNT,
> > but may there are users exporting ramfs?
>
> No need to restrict it if it's safe, I guess.
>
> > Jeff has mentioned that he thinks we are blocking export of cgroupfs
> > by nfsd, but I really don't see where that is being enforced.
> > The requirement for FS_REQUIRES_DEV in check_export() is weak
> > because user can overrule it with manual fsid argument to exportfs.
> > So maybe we disallow nfsd export of kernfs and backport to stable kernels
> > to be on the safe side?
>
> File handles and nfs export have become two distinct things and there
> filesystems based on kernfs, and pidfs want to support file handles
> without support nfs export.
>
> So I think instead of having nfs check what filesystems may be exported
> we should let the filesystems indicate that they cannot be exported and
> make nfs honour that.

Yes, I agree, but...

>
> So something like the untested sketch:
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index 1358c21837f1..a5c75cb1c812 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -154,6 +154,7 @@ static const struct export_operations kernfs_export_ops = {
>         .fh_to_dentry   = kernfs_fh_to_dentry,
>         .fh_to_parent   = kernfs_fh_to_parent,
>         .get_parent     = kernfs_get_parent_dentry,
> +       .flags          = EXPORT_OP_FILE_HANDLE,
>  };
>
>  /**
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index eacafe46e3b6..170c5729e7f2 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -417,6 +417,7 @@ static struct svc_export *svc_export_lookup(struct svc_export *);
>  static int check_export(struct path *path, int *flags, unsigned char *uuid)
>  {
>         struct inode *inode = d_inode(path->dentry);
> +       const struct export_operations *nop;
>
>         /*
>          * We currently export only dirs, regular files, and (for v4
> @@ -449,11 +450,16 @@ static int check_export(struct path *path, int *flags, unsigned char *uuid)
>                 return -EINVAL;
>         }
>
> -       if (!exportfs_can_decode_fh(inode->i_sb->s_export_op)) {
> +       if (!exportfs_can_decode_fh(nop)) {
>                 dprintk("exp_export: export of invalid fs type.\n");
>                 return -EINVAL;
>         }
>
> +       if (nop && nop->flags & EXPORT_OP_FILE_HANDLE) {
> +               dprintk("exp_export: filesystem only supports non-exportable file handles.\n");
> +               return -EINVAL;
> +       }
> +
>         if (is_idmapped_mnt(path->mnt)) {
>                 dprintk("exp_export: export of idmapped mounts not yet supported.\n");
>                 return -EINVAL;
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 9aa7493b1e10..d1646c0789e1 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -83,10 +83,15 @@ void ovl_revert_creds(const struct cred *old_cred)
>   */
>  int ovl_can_decode_fh(struct super_block *sb)
>  {
> +       const struct export_operations *nop = sb->s_export_op;
> +
>         if (!capable(CAP_DAC_READ_SEARCH))
>                 return 0;
>
> -       if (!exportfs_can_decode_fh(sb->s_export_op))
> +       if (!exportfs_can_decode_fh(nop))
> +               return 0;
> +
> +       if (nop && nop->flags & EXPORT_OP_FILE_HANDLE)
>                 return 0;
>
>         return sb->s_export_op->encode_fh ? -1 : FILEID_INO32_GEN;
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index dde3e4e90ea9..9d98b5461dc7 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -570,6 +570,7 @@ static const struct export_operations pidfs_export_operations = {
>         .fh_to_dentry   = pidfs_fh_to_dentry,
>         .open           = pidfs_export_open,
>         .permission     = pidfs_export_permission,
> +       .flags          = EXPORT_OP_FILE_HANDLE,
>  };
>
>  static int pidfs_init_inode(struct inode *inode, void *data)
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index a087606ace19..98f7cb17abee 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -280,6 +280,7 @@ struct export_operations {
>                                                 */
>  #define EXPORT_OP_FLUSH_ON_CLOSE       (0x20) /* fs flushes file data on close */
>  #define EXPORT_OP_ASYNC_LOCK           (0x40) /* fs can do async lock request */
> +#define EXPORT_OP_FILE_HANDLE          (0x80) /* fs only supports file handles, no proper export */

This is a bad name IMO, since pidfs clearly does support file handles
and supports the open_by_handle_at() UAPI.

I was going to suggest EXPORT_OP_NO_NFS_EXPORT, but it also
sounds silly, so maybe:

#define EXPORT_OP_LOCAL_FILE_HANDLE          (0x80) /* fs only
supports local file handles, no nfs export */

With that you may add:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

Re: [PATCH RFC 0/6] pidfs: implement file handle support

Posted by Christian Brauner 1 year, 2 months ago

On Sun, Dec 01, 2024 at 01:09:17PM +0100, Amir Goldstein wrote:
> On Sun, Dec 1, 2024 at 9:43 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Sat, Nov 30, 2024 at 01:22:05PM +0100, Amir Goldstein wrote:
> > > On Fri, Nov 29, 2024 at 2:39 PM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > Hey,
> > > >
> > > > Now that we have the preliminaries to lookup struct pid based on its
> > > > inode number alone we can implement file handle support.
> > > >
> > > > This is based on custom export operation methods which allows pidfs to
> > > > implement permission checking and opening of pidfs file handles cleanly
> > > > without hacking around in the core file handle code too much.
> > > >
> > > > This is lightly tested.
> > >
> > > With my comments addressed as you pushed to vfs-6.14.pidfs branch
> > > in your tree, you may add to the patches posted:
> > >
> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > HOWEVER,
> > > IMO there is still one thing that has to be addressed before merge -
> > > We must make sure that nfsd cannot export pidfs.
> > >
> > > In principal, SB_NOUSER filesystems should not be accessible to
> > > userspace paths, so exportfs should not be able to configure nfsd
> > > export of pidfs, but maybe this limitation can be worked around by
> > > using magic link paths?
> >
> > I don't see how. I might be missing details.
> 
> AFAIK, nfsd gets the paths to export from userspace via
> svc_export_parse() =>  kern_path(buf, 0, &exp.ex_path)
> afterwards check_export() validates exp.ex_path and I see that regular
> files can be exported.
> I suppose that a pidfs file can have a magic link path no?
> The question is whether this magic link path could be passed to nfsd
> via the exportfs UAPI.

Ah, ok. I see what you mean. You're thinking about specifying
/proc/<pid>/fd/<pidfd> in /etc/exports. Yes, that would work.

> 
> >
> > > I think it may be worth explicitly disallowing nfsd export of SB_NOUSER
> > > filesystems and we could also consider blocking SB_KERNMOUNT,
> > > but may there are users exporting ramfs?
> >
> > No need to restrict it if it's safe, I guess.
> >
> > > Jeff has mentioned that he thinks we are blocking export of cgroupfs
> > > by nfsd, but I really don't see where that is being enforced.
> > > The requirement for FS_REQUIRES_DEV in check_export() is weak
> > > because user can overrule it with manual fsid argument to exportfs.
> > > So maybe we disallow nfsd export of kernfs and backport to stable kernels
> > > to be on the safe side?
> >
> > File handles and nfs export have become two distinct things and there
> > filesystems based on kernfs, and pidfs want to support file handles
> > without support nfs export.
> >
> > So I think instead of having nfs check what filesystems may be exported
> > we should let the filesystems indicate that they cannot be exported and
> > make nfs honour that.
> 
> Yes, I agree, but...
> 
> >
> > So something like the untested sketch:
> >
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index 1358c21837f1..a5c75cb1c812 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -154,6 +154,7 @@ static const struct export_operations kernfs_export_ops = {
> >         .fh_to_dentry   = kernfs_fh_to_dentry,
> >         .fh_to_parent   = kernfs_fh_to_parent,
> >         .get_parent     = kernfs_get_parent_dentry,
> > +       .flags          = EXPORT_OP_FILE_HANDLE,
> >  };
> >
> >  /**
> > diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> > index eacafe46e3b6..170c5729e7f2 100644
> > --- a/fs/nfsd/export.c
> > +++ b/fs/nfsd/export.c
> > @@ -417,6 +417,7 @@ static struct svc_export *svc_export_lookup(struct svc_export *);
> >  static int check_export(struct path *path, int *flags, unsigned char *uuid)
> >  {
> >         struct inode *inode = d_inode(path->dentry);
> > +       const struct export_operations *nop;
> >
> >         /*
> >          * We currently export only dirs, regular files, and (for v4
> > @@ -449,11 +450,16 @@ static int check_export(struct path *path, int *flags, unsigned char *uuid)
> >                 return -EINVAL;
> >         }
> >
> > -       if (!exportfs_can_decode_fh(inode->i_sb->s_export_op)) {
> > +       if (!exportfs_can_decode_fh(nop)) {
> >                 dprintk("exp_export: export of invalid fs type.\n");
> >                 return -EINVAL;
> >         }
> >
> > +       if (nop && nop->flags & EXPORT_OP_FILE_HANDLE) {
> > +               dprintk("exp_export: filesystem only supports non-exportable file handles.\n");
> > +               return -EINVAL;
> > +       }
> > +
> >         if (is_idmapped_mnt(path->mnt)) {
> >                 dprintk("exp_export: export of idmapped mounts not yet supported.\n");
> >                 return -EINVAL;
> > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> > index 9aa7493b1e10..d1646c0789e1 100644
> > --- a/fs/overlayfs/util.c
> > +++ b/fs/overlayfs/util.c
> > @@ -83,10 +83,15 @@ void ovl_revert_creds(const struct cred *old_cred)
> >   */
> >  int ovl_can_decode_fh(struct super_block *sb)
> >  {
> > +       const struct export_operations *nop = sb->s_export_op;
> > +
> >         if (!capable(CAP_DAC_READ_SEARCH))
> >                 return 0;
> >
> > -       if (!exportfs_can_decode_fh(sb->s_export_op))
> > +       if (!exportfs_can_decode_fh(nop))
> > +               return 0;
> > +
> > +       if (nop && nop->flags & EXPORT_OP_FILE_HANDLE)
> >                 return 0;
> >
> >         return sb->s_export_op->encode_fh ? -1 : FILEID_INO32_GEN;
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index dde3e4e90ea9..9d98b5461dc7 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -570,6 +570,7 @@ static const struct export_operations pidfs_export_operations = {
> >         .fh_to_dentry   = pidfs_fh_to_dentry,
> >         .open           = pidfs_export_open,
> >         .permission     = pidfs_export_permission,
> > +       .flags          = EXPORT_OP_FILE_HANDLE,
> >  };
> >
> >  static int pidfs_init_inode(struct inode *inode, void *data)
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index a087606ace19..98f7cb17abee 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -280,6 +280,7 @@ struct export_operations {
> >                                                 */
> >  #define EXPORT_OP_FLUSH_ON_CLOSE       (0x20) /* fs flushes file data on close */
> >  #define EXPORT_OP_ASYNC_LOCK           (0x40) /* fs can do async lock request */
> > +#define EXPORT_OP_FILE_HANDLE          (0x80) /* fs only supports file handles, no proper export */
> 
> This is a bad name IMO, since pidfs clearly does support file handles
> and supports the open_by_handle_at() UAPI.
> 
> I was going to suggest EXPORT_OP_NO_NFS_EXPORT, but it also
> sounds silly, so maybe:
> 
> #define EXPORT_OP_LOCAL_FILE_HANDLE          (0x80) /* fs only
> supports local file handles, no nfs export */

Thank you. I'll send a reply with a proper patch to this thread in a second.

> With that you may add:
> 
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> Thanks,
> Amir.

[PATCH RFC 1/6] pseudofs: add support for export_ops

Posted by Christian Brauner 1 year, 2 months ago

From: Erin Shepherd <erin.shepherd@e43.eu>

Pseudo-filesystems might reasonably wish to implement the export ops
(particularly for name_to_handle_at/open_by_handle_at); plumb this
through pseudo_fs_context

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Link: https://lore.kernel.org/r/20241113-pidfs_fh-v2-1-9a4d28155a37@e43.eu
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/libfs.c                | 1 +
 include/linux/pseudo_fs.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 748ac59231547c29abcbade3fa025e3b00533d8b..2890a9c4a414b7e2be5c337e238db84743f0a30b 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -673,6 +673,7 @@ static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_blocksize_bits = PAGE_SHIFT;
 	s->s_magic = ctx->magic;
 	s->s_op = ctx->ops ?: &simple_super_operations;
+	s->s_export_op = ctx->eops;
 	s->s_xattr = ctx->xattr;
 	s->s_time_gran = 1;
 	root = new_inode(s);
diff --git a/include/linux/pseudo_fs.h b/include/linux/pseudo_fs.h
index 730f77381d55f1816ef14adf7dd2cf1d62bb912c..2503f7625d65e7b1fbe9e64d5abf06cd8f017b5f 100644
--- a/include/linux/pseudo_fs.h
+++ b/include/linux/pseudo_fs.h
@@ -5,6 +5,7 @@
 
 struct pseudo_fs_context {
 	const struct super_operations *ops;
+	const struct export_operations *eops;
 	const struct xattr_handler * const *xattr;
 	const struct dentry_operations *dops;
 	unsigned long magic;

-- 
2.45.2

Re: [PATCH RFC 1/6] pseudofs: add support for export_ops

Posted by Jan Kara 1 year, 2 months ago

On Fri 29-11-24 14:38:00, Christian Brauner wrote:
> From: Erin Shepherd <erin.shepherd@e43.eu>
> 
> Pseudo-filesystems might reasonably wish to implement the export ops
> (particularly for name_to_handle_at/open_by_handle_at); plumb this
> through pseudo_fs_context
> 
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> Link: https://lore.kernel.org/r/20241113-pidfs_fh-v2-1-9a4d28155a37@e43.eu
> Signed-off-by: Christian Brauner <brauner@kernel.org>

OK, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c                | 1 +
>  include/linux/pseudo_fs.h | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 748ac59231547c29abcbade3fa025e3b00533d8b..2890a9c4a414b7e2be5c337e238db84743f0a30b 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -673,6 +673,7 @@ static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
>  	s->s_blocksize_bits = PAGE_SHIFT;
>  	s->s_magic = ctx->magic;
>  	s->s_op = ctx->ops ?: &simple_super_operations;
> +	s->s_export_op = ctx->eops;
>  	s->s_xattr = ctx->xattr;
>  	s->s_time_gran = 1;
>  	root = new_inode(s);
> diff --git a/include/linux/pseudo_fs.h b/include/linux/pseudo_fs.h
> index 730f77381d55f1816ef14adf7dd2cf1d62bb912c..2503f7625d65e7b1fbe9e64d5abf06cd8f017b5f 100644
> --- a/include/linux/pseudo_fs.h
> +++ b/include/linux/pseudo_fs.h
> @@ -5,6 +5,7 @@
>  
>  struct pseudo_fs_context {
>  	const struct super_operations *ops;
> +	const struct export_operations *eops;
>  	const struct xattr_handler * const *xattr;
>  	const struct dentry_operations *dops;
>  	unsigned long magic;
> 
> -- 
> 2.45.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH RFC 2/6] fhandle: simplify error handling

Posted by Christian Brauner 1 year, 2 months ago

Rely on our cleanup infrastructure.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fhandle.c | 39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index ec9145047dfc9d25e109e72d210987bbf6b36a20..c00d88fb14e16654b5cbbb71760c0478eac20384 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -261,19 +261,20 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
 {
 	int handle_dwords;
 	struct vfsmount *mnt = ctx->root.mnt;
+	struct dentry *dentry;
 
 	/* change the handle size to multiple of sizeof(u32) */
 	handle_dwords = handle->handle_bytes >> 2;
-	path->dentry = exportfs_decode_fh_raw(mnt,
-					  (struct fid *)handle->f_handle,
-					  handle_dwords, handle->handle_type,
-					  ctx->fh_flags,
-					  vfs_dentry_acceptable, ctx);
-	if (IS_ERR_OR_NULL(path->dentry)) {
-		if (path->dentry == ERR_PTR(-ENOMEM))
+	dentry = exportfs_decode_fh_raw(mnt, (struct fid *)handle->f_handle,
+					handle_dwords, handle->handle_type,
+					ctx->fh_flags, vfs_dentry_acceptable,
+					ctx);
+	if (IS_ERR_OR_NULL(dentry)) {
+		if (dentry == ERR_PTR(-ENOMEM))
 			return -ENOMEM;
 		return -ESTALE;
 	}
+	path->dentry = dentry;
 	path->mnt = mntget(mnt);
 	return 0;
 }
@@ -398,29 +399,23 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
 			   int open_flag)
 {
 	long retval = 0;
-	struct path path;
+	struct path path __free(path_put) = {};
 	struct file *file;
-	int fd;
 
 	retval = handle_to_path(mountdirfd, ufh, &path, open_flag);
 	if (retval)
 		return retval;
 
-	fd = get_unused_fd_flags(open_flag);
-	if (fd < 0) {
-		path_put(&path);
+	CLASS(get_unused_fd, fd)(O_CLOEXEC);
+	if (fd < 0)
 		return fd;
-	}
+
 	file = file_open_root(&path, "", open_flag, 0);
-	if (IS_ERR(file)) {
-		put_unused_fd(fd);
-		retval =  PTR_ERR(file);
-	} else {
-		retval = fd;
-		fd_install(fd, file);
-	}
-	path_put(&path);
-	return retval;
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	fd_install(fd, file);
+	return take_fd(fd);
 }
 
 /**

-- 
2.45.2

Re: [PATCH RFC 2/6] fhandle: simplify error handling

Posted by Jan Kara 1 year, 2 months ago

On Fri 29-11-24 14:38:01, Christian Brauner wrote:
> Rely on our cleanup infrastructure.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fhandle.c | 39 +++++++++++++++++----------------------
>  1 file changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/fhandle.c b/fs/fhandle.c
> index ec9145047dfc9d25e109e72d210987bbf6b36a20..c00d88fb14e16654b5cbbb71760c0478eac20384 100644
> --- a/fs/fhandle.c
> +++ b/fs/fhandle.c
> @@ -261,19 +261,20 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
>  {
>  	int handle_dwords;
>  	struct vfsmount *mnt = ctx->root.mnt;
> +	struct dentry *dentry;
>  
>  	/* change the handle size to multiple of sizeof(u32) */
>  	handle_dwords = handle->handle_bytes >> 2;
> -	path->dentry = exportfs_decode_fh_raw(mnt,
> -					  (struct fid *)handle->f_handle,
> -					  handle_dwords, handle->handle_type,
> -					  ctx->fh_flags,
> -					  vfs_dentry_acceptable, ctx);
> -	if (IS_ERR_OR_NULL(path->dentry)) {
> -		if (path->dentry == ERR_PTR(-ENOMEM))
> +	dentry = exportfs_decode_fh_raw(mnt, (struct fid *)handle->f_handle,
> +					handle_dwords, handle->handle_type,
> +					ctx->fh_flags, vfs_dentry_acceptable,
> +					ctx);
> +	if (IS_ERR_OR_NULL(dentry)) {
> +		if (dentry == ERR_PTR(-ENOMEM))
>  			return -ENOMEM;
>  		return -ESTALE;
>  	}
> +	path->dentry = dentry;
>  	path->mnt = mntget(mnt);
>  	return 0;
>  }
> @@ -398,29 +399,23 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
>  			   int open_flag)
>  {
>  	long retval = 0;
> -	struct path path;
> +	struct path path __free(path_put) = {};
>  	struct file *file;
> -	int fd;
>  
>  	retval = handle_to_path(mountdirfd, ufh, &path, open_flag);
>  	if (retval)
>  		return retval;
>  
> -	fd = get_unused_fd_flags(open_flag);
> -	if (fd < 0) {
> -		path_put(&path);
> +	CLASS(get_unused_fd, fd)(O_CLOEXEC);
> +	if (fd < 0)
>  		return fd;
> -	}
> +
>  	file = file_open_root(&path, "", open_flag, 0);
> -	if (IS_ERR(file)) {
> -		put_unused_fd(fd);
> -		retval =  PTR_ERR(file);
> -	} else {
> -		retval = fd;
> -		fd_install(fd, file);
> -	}
> -	path_put(&path);
> -	return retval;
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	fd_install(fd, file);
> +	return take_fd(fd);
>  }
>  
>  /**
> 
> -- 
> 2.45.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH RFC 3/6] exportfs: add open method

Posted by Christian Brauner 1 year, 2 months ago

This allows filesystems such as pidfs to provide their custom open.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/exportfs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 4cc8801e50e395442f9e3ae69b6c9f04fa590ff9..c69b79b64466d5bc32ffe9b2796a388130fe72d8 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -10,6 +10,7 @@ struct inode;
 struct iomap;
 struct super_block;
 struct vfsmount;
+struct path;
 
 /* limit the handle size to NFSv4 handle size now */
 #define MAX_HANDLE_SZ 128
@@ -225,6 +226,9 @@ struct fid {
  *    is also a directory.  In the event that it cannot be found, or storage
  *    space cannot be allocated, a %ERR_PTR should be returned.
  *
+ * open:
+ *    Allow filesystems to specify a custom open function.
+ *
  * commit_metadata:
  *    @commit_metadata should commit metadata changes to stable storage.
  *
@@ -251,6 +255,7 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+	struct file * (*open)(struct path *path, unsigned int oflags);
 #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
 #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
 #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */

-- 
2.45.2

Re: [PATCH RFC 3/6] exportfs: add open method

Posted by Jan Kara 1 year, 2 months ago

On Fri 29-11-24 14:38:02, Christian Brauner wrote:
> This allows filesystems such as pidfs to provide their custom open.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/exportfs.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 4cc8801e50e395442f9e3ae69b6c9f04fa590ff9..c69b79b64466d5bc32ffe9b2796a388130fe72d8 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -10,6 +10,7 @@ struct inode;
>  struct iomap;
>  struct super_block;
>  struct vfsmount;
> +struct path;
>  
>  /* limit the handle size to NFSv4 handle size now */
>  #define MAX_HANDLE_SZ 128
> @@ -225,6 +226,9 @@ struct fid {
>   *    is also a directory.  In the event that it cannot be found, or storage
>   *    space cannot be allocated, a %ERR_PTR should be returned.
>   *
> + * open:
> + *    Allow filesystems to specify a custom open function.
> + *
>   * commit_metadata:
>   *    @commit_metadata should commit metadata changes to stable storage.
>   *
> @@ -251,6 +255,7 @@ struct export_operations {
>  			  bool write, u32 *device_generation);
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
> +	struct file * (*open)(struct path *path, unsigned int oflags);
>  #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
>  #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
>  #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
> 
> -- 
> 2.45.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH RFC 4/6] fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()

Posted by Christian Brauner 1 year, 2 months ago

There's no point in keeping it outside of that helper. This way we have
all the permission pieces in one place.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fhandle.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index c00d88fb14e16654b5cbbb71760c0478eac20384..031ad5592a0beabcc299436f037ad5fe626332e6 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -298,6 +298,9 @@ static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
 {
 	struct path *root = &ctx->root;
 
+	if (capable(CAP_DAC_READ_SEARCH))
+		return true;
+
 	/*
 	 * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
 	 * confusing api in the face of disconnected non-dir dentries.
@@ -337,7 +340,7 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
 	if (retval)
 		goto out_err;
 
-	if (!capable(CAP_DAC_READ_SEARCH) && !may_decode_fh(&ctx, o_flags)) {
+	if (!may_decode_fh(&ctx, o_flags)) {
 		retval = -EPERM;
 		goto out_path;
 	}

-- 
2.45.2

Re: [PATCH RFC 4/6] fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()

Posted by Jan Kara 1 year, 2 months ago

On Fri 29-11-24 14:38:03, Christian Brauner wrote:
> There's no point in keeping it outside of that helper. This way we have
> all the permission pieces in one place.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Nice. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fhandle.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fhandle.c b/fs/fhandle.c
> index c00d88fb14e16654b5cbbb71760c0478eac20384..031ad5592a0beabcc299436f037ad5fe626332e6 100644
> --- a/fs/fhandle.c
> +++ b/fs/fhandle.c
> @@ -298,6 +298,9 @@ static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
>  {
>  	struct path *root = &ctx->root;
>  
> +	if (capable(CAP_DAC_READ_SEARCH))
> +		return true;
> +
>  	/*
>  	 * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
>  	 * confusing api in the face of disconnected non-dir dentries.
> @@ -337,7 +340,7 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
>  	if (retval)
>  		goto out_err;
>  
> -	if (!capable(CAP_DAC_READ_SEARCH) && !may_decode_fh(&ctx, o_flags)) {
> +	if (!may_decode_fh(&ctx, o_flags)) {
>  		retval = -EPERM;
>  		goto out_path;
>  	}
> 
> -- 
> 2.45.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH RFC 5/6] exportfs: add permission method

Posted by Christian Brauner 1 year, 2 months ago

This allows filesystems such as pidfs to provide their custom permission
checks.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fhandle.c             | 21 +++++++--------------
 include/linux/exportfs.h | 17 ++++++++++++++++-
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 031ad5592a0beabcc299436f037ad5fe626332e6..23491094032ec037066a271873ea8ff794616bee 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -187,17 +187,6 @@ static int get_path_from_fd(int fd, struct path *root)
 	return 0;
 }
 
-enum handle_to_path_flags {
-	HANDLE_CHECK_PERMS   = (1 << 0),
-	HANDLE_CHECK_SUBTREE = (1 << 1),
-};
-
-struct handle_to_path_ctx {
-	struct path root;
-	enum handle_to_path_flags flags;
-	unsigned int fh_flags;
-};
-
 static int vfs_dentry_acceptable(void *context, struct dentry *dentry)
 {
 	struct handle_to_path_ctx *ctx = context;
@@ -335,15 +324,19 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
 	struct file_handle f_handle;
 	struct file_handle *handle = NULL;
 	struct handle_to_path_ctx ctx = {};
+	const struct export_operations *eops;
 
 	retval = get_path_from_fd(mountdirfd, &ctx.root);
 	if (retval)
 		goto out_err;
 
-	if (!may_decode_fh(&ctx, o_flags)) {
-		retval = -EPERM;
+	eops = ctx.root.mnt->mnt_sb->s_export_op;
+	if (eops && eops->permission)
+		retval = eops->permission(&ctx, o_flags);
+	else
+		retval = may_decode_fh(&ctx, o_flags);
+	if (retval)
 		goto out_path;
-	}
 
 	if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
 		retval = -EFAULT;
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index c69b79b64466d5bc32ffe9b2796a388130fe72d8..a087606ace194ccc9d1250f990ce55627aaf8dc5 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -3,6 +3,7 @@
 #define LINUX_EXPORTFS_H 1
 
 #include <linux/types.h>
+#include <linux/path.h>
 
 struct dentry;
 struct iattr;
@@ -10,7 +11,6 @@ struct inode;
 struct iomap;
 struct super_block;
 struct vfsmount;
-struct path;
 
 /* limit the handle size to NFSv4 handle size now */
 #define MAX_HANDLE_SZ 128
@@ -157,6 +157,17 @@ struct fid {
 	};
 };
 
+enum handle_to_path_flags {
+	HANDLE_CHECK_PERMS   = (1 << 0),
+	HANDLE_CHECK_SUBTREE = (1 << 1),
+};
+
+struct handle_to_path_ctx {
+	struct path root;
+	enum handle_to_path_flags flags;
+	unsigned int fh_flags;
+};
+
 #define EXPORT_FH_CONNECTABLE	0x1 /* Encode file handle with parent */
 #define EXPORT_FH_FID		0x2 /* File handle may be non-decodeable */
 #define EXPORT_FH_DIR_ONLY	0x4 /* Only decode file handle for a directory */
@@ -226,6 +237,9 @@ struct fid {
  *    is also a directory.  In the event that it cannot be found, or storage
  *    space cannot be allocated, a %ERR_PTR should be returned.
  *
+ * permission:
+ *    Allow filesystems to specify a custom permission function.
+ *
  * open:
  *    Allow filesystems to specify a custom open function.
  *
@@ -255,6 +269,7 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+	int (*permission)(struct handle_to_path_ctx *ctx, unsigned int oflags);
 	struct file * (*open)(struct path *path, unsigned int oflags);
 #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
 #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */

-- 
2.45.2

Re: [PATCH RFC 5/6] exportfs: add permission method

Posted by Jan Kara 1 year, 2 months ago

On Fri 29-11-24 14:38:04, Christian Brauner wrote:
> This allows filesystems such as pidfs to provide their custom permission
> checks.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fhandle.c             | 21 +++++++--------------
>  include/linux/exportfs.h | 17 ++++++++++++++++-
>  2 files changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/fhandle.c b/fs/fhandle.c
> index 031ad5592a0beabcc299436f037ad5fe626332e6..23491094032ec037066a271873ea8ff794616bee 100644
> --- a/fs/fhandle.c
> +++ b/fs/fhandle.c
> @@ -187,17 +187,6 @@ static int get_path_from_fd(int fd, struct path *root)
>  	return 0;
>  }
>  
> -enum handle_to_path_flags {
> -	HANDLE_CHECK_PERMS   = (1 << 0),
> -	HANDLE_CHECK_SUBTREE = (1 << 1),
> -};
> -
> -struct handle_to_path_ctx {
> -	struct path root;
> -	enum handle_to_path_flags flags;
> -	unsigned int fh_flags;
> -};
> -
>  static int vfs_dentry_acceptable(void *context, struct dentry *dentry)
>  {
>  	struct handle_to_path_ctx *ctx = context;
> @@ -335,15 +324,19 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
>  	struct file_handle f_handle;
>  	struct file_handle *handle = NULL;
>  	struct handle_to_path_ctx ctx = {};
> +	const struct export_operations *eops;
>  
>  	retval = get_path_from_fd(mountdirfd, &ctx.root);
>  	if (retval)
>  		goto out_err;
>  
> -	if (!may_decode_fh(&ctx, o_flags)) {
> -		retval = -EPERM;
> +	eops = ctx.root.mnt->mnt_sb->s_export_op;
> +	if (eops && eops->permission)
> +		retval = eops->permission(&ctx, o_flags);
> +	else
> +		retval = may_decode_fh(&ctx, o_flags);
> +	if (retval)
>  		goto out_path;
> -	}
>  
>  	if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
>  		retval = -EFAULT;
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index c69b79b64466d5bc32ffe9b2796a388130fe72d8..a087606ace194ccc9d1250f990ce55627aaf8dc5 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -3,6 +3,7 @@
>  #define LINUX_EXPORTFS_H 1
>  
>  #include <linux/types.h>
> +#include <linux/path.h>
>  
>  struct dentry;
>  struct iattr;
> @@ -10,7 +11,6 @@ struct inode;
>  struct iomap;
>  struct super_block;
>  struct vfsmount;
> -struct path;
>  
>  /* limit the handle size to NFSv4 handle size now */
>  #define MAX_HANDLE_SZ 128
> @@ -157,6 +157,17 @@ struct fid {
>  	};
>  };
>  
> +enum handle_to_path_flags {
> +	HANDLE_CHECK_PERMS   = (1 << 0),
> +	HANDLE_CHECK_SUBTREE = (1 << 1),
> +};
> +
> +struct handle_to_path_ctx {
> +	struct path root;
> +	enum handle_to_path_flags flags;
> +	unsigned int fh_flags;
> +};
> +
>  #define EXPORT_FH_CONNECTABLE	0x1 /* Encode file handle with parent */
>  #define EXPORT_FH_FID		0x2 /* File handle may be non-decodeable */
>  #define EXPORT_FH_DIR_ONLY	0x4 /* Only decode file handle for a directory */
> @@ -226,6 +237,9 @@ struct fid {
>   *    is also a directory.  In the event that it cannot be found, or storage
>   *    space cannot be allocated, a %ERR_PTR should be returned.
>   *
> + * permission:
> + *    Allow filesystems to specify a custom permission function.
> + *
>   * open:
>   *    Allow filesystems to specify a custom open function.
>   *
> @@ -255,6 +269,7 @@ struct export_operations {
>  			  bool write, u32 *device_generation);
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
> +	int (*permission)(struct handle_to_path_ctx *ctx, unsigned int oflags);
>  	struct file * (*open)(struct path *path, unsigned int oflags);
>  #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
>  #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
> 
> -- 
> 2.45.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH RFC 6/6] pidfs: implement file handle support

Posted by Christian Brauner 1 year, 2 months ago

On 64-bit platforms, userspace can read the pidfd's inode in order to
get a never-repeated PID identifier. On 32-bit platforms this identifier
is not exposed, as inodes are limited to 32 bits. Instead expose the
identifier via export_fh, which makes it available to userspace via
name_to_handle_at.

In addition we implement fh_to_dentry, which allows userspace to
recover a pidfd from a pidfs file handle.

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
[brauner: patch heavily rewritten]
Co-Developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fhandle.c | 34 +++++++++++----------
 fs/pidfs.c   | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+), 15 deletions(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 23491094032ec037066a271873ea8ff794616bee..4c847ca16fabe31d51ff5698b0c9c355c3e2fb67 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -268,20 +268,6 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
 	return 0;
 }
 
-/*
- * Allow relaxed permissions of file handles if the caller has the
- * ability to mount the filesystem or create a bind-mount of the
- * provided @mountdirfd.
- *
- * In both cases the caller may be able to get an unobstructed way to
- * the encoded file handle. If the caller is only able to create a
- * bind-mount we need to verify that there are no locked mounts on top
- * of it that could prevent us from getting to the encoded file.
- *
- * In principle, locked mounts can prevent the caller from mounting the
- * filesystem but that only applies to procfs and sysfs neither of which
- * support decoding file handles.
- */
 static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
 				 unsigned int o_flags)
 {
@@ -291,6 +277,19 @@ static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
 		return true;
 
 	/*
+	 * Allow relaxed permissions of file handles if the caller has the
+	 * ability to mount the filesystem or create a bind-mount of the
+	 * provided @mountdirfd.
+	 *
+	 * In both cases the caller may be able to get an unobstructed way to
+	 * the encoded file handle. If the caller is only able to create a
+	 * bind-mount we need to verify that there are no locked mounts on top
+	 * of it that could prevent us from getting to the encoded file.
+	 *
+	 * In principle, locked mounts can prevent the caller from mounting the
+	 * filesystem but that only applies to procfs and sysfs neither of which
+	 * support decoding file handles.
+	 *
 	 * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
 	 * confusing api in the face of disconnected non-dir dentries.
 	 *
@@ -397,6 +396,7 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
 	long retval = 0;
 	struct path path __free(path_put) = {};
 	struct file *file;
+	const struct export_operations *eops;
 
 	retval = handle_to_path(mountdirfd, ufh, &path, open_flag);
 	if (retval)
@@ -406,7 +406,11 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
 	if (fd < 0)
 		return fd;
 
-	file = file_open_root(&path, "", open_flag, 0);
+	eops = path.mnt->mnt_sb->s_export_op;
+	if (eops->open)
+		file = eops->open(&path, open_flag);
+	else
+		file = file_open_root(&path, "", open_flag, 0);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/pidfs.c b/fs/pidfs.c
index f73a47e1d8379df886a90a044fb887f8d06f7c0b..f09af08a4abe4a9100ed972bee8f5c5d7ab33d84 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/anon_inodes.h>
+#include <linux/exportfs.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/cgroup.h>
@@ -454,6 +455,100 @@ static const struct dentry_operations pidfs_dentry_operations = {
 	.d_prune	= stashed_dentry_prune,
 };
 
+static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
+			   struct inode *parent)
+{
+	struct pid *pid = inode->i_private;
+
+	if (*max_len < 2) {
+		*max_len = 2;
+		return FILEID_INVALID;
+	}
+
+	*max_len = 2;
+	*(u64 *)fh = pid->ino;
+	return FILEID_KERNFS;
+}
+
+/* Find a struct pid based on the inode number. */
+static struct pid *pidfs_ino_get_pid(u64 ino)
+{
+	ino_t pid_ino = pidfs_ino(ino);
+	u32 gen = pidfs_gen(ino);
+	struct pid *pid;
+
+	guard(rcu)();
+
+	/* Handle @pid lookup carefully so there's no risk of UAF. */
+	pid = idr_find(&pidfs_ino_idr, (u32)pid_ino);
+	if (!pid)
+		return NULL;
+
+	if (sizeof(ino_t) < sizeof(u64)) {
+		if (gen && pidfs_gen(pid->ino) != gen)
+			pid = NULL;
+	} else {
+		if (pidfs_ino(pid->ino) != pid_ino)
+			pid = NULL;
+	}
+
+	/* Within our pid namespace hierarchy? */
+	if (pid_vnr(pid) == 0)
+		pid = NULL;
+
+	return get_pid(pid);
+}
+
+static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	int ret;
+	u64 pid_ino;
+	struct path path;
+	struct pid *pid;
+
+	if (fh_len < 2)
+		return NULL;
+
+	switch (fh_type) {
+	case FILEID_KERNFS:
+		pid_ino = *(u64 *)fid;
+		break;
+	default:
+		return NULL;
+	}
+
+	pid = pidfs_ino_get_pid(pid_ino);
+	if (!pid)
+		return NULL;
+
+	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	mntput(path.mnt);
+	return path.dentry;
+}
+
+static int pidfs_export_permission(struct handle_to_path_ctx *ctx,
+				   unsigned int oflags)
+{
+	return 0;
+}
+
+static struct file *pidfs_export_open(struct path *path, unsigned int oflags)
+{
+	return dentry_open(path, oflags | O_RDWR, current_cred());
+}
+
+static const struct export_operations pidfs_export_operations = {
+	.encode_fh	= pidfs_encode_fh,
+	.fh_to_dentry	= pidfs_fh_to_dentry,
+	.open		= pidfs_export_open,
+	.permission	= pidfs_export_permission,
+};
+
 static int pidfs_init_inode(struct inode *inode, void *data)
 {
 	struct pid *pid = data;
@@ -488,6 +583,7 @@ static int pidfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;
 
 	ctx->ops = &pidfs_sops;
+	ctx->eops = &pidfs_export_operations;
 	ctx->dops = &pidfs_dentry_operations;
 	fc->s_fs_info = (void *)&pidfs_stashed_ops;
 	return 0;

-- 
2.45.2

Re: [PATCH RFC 6/6] pidfs: implement file handle support

Posted by Amir Goldstein 1 year, 2 months ago

On Fri, Nov 29, 2024 at 2:39 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On 64-bit platforms, userspace can read the pidfd's inode in order to
> get a never-repeated PID identifier. On 32-bit platforms this identifier
> is not exposed, as inodes are limited to 32 bits. Instead expose the
> identifier via export_fh, which makes it available to userspace via
> name_to_handle_at.
>
> In addition we implement fh_to_dentry, which allows userspace to
> recover a pidfd from a pidfs file handle.
>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> [brauner: patch heavily rewritten]
> Co-Developed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/fhandle.c | 34 +++++++++++----------
>  fs/pidfs.c   | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 115 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fhandle.c b/fs/fhandle.c
> index 23491094032ec037066a271873ea8ff794616bee..4c847ca16fabe31d51ff5698b0c9c355c3e2fb67 100644
> --- a/fs/fhandle.c
> +++ b/fs/fhandle.c
> @@ -268,20 +268,6 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
>         return 0;
>  }
>
> -/*
> - * Allow relaxed permissions of file handles if the caller has the
> - * ability to mount the filesystem or create a bind-mount of the
> - * provided @mountdirfd.
> - *
> - * In both cases the caller may be able to get an unobstructed way to
> - * the encoded file handle. If the caller is only able to create a
> - * bind-mount we need to verify that there are no locked mounts on top
> - * of it that could prevent us from getting to the encoded file.
> - *
> - * In principle, locked mounts can prevent the caller from mounting the
> - * filesystem but that only applies to procfs and sysfs neither of which
> - * support decoding file handles.
> - */
>  static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
>                                  unsigned int o_flags)
>  {
> @@ -291,6 +277,19 @@ static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
>                 return true;
>
>         /*
> +        * Allow relaxed permissions of file handles if the caller has the
> +        * ability to mount the filesystem or create a bind-mount of the
> +        * provided @mountdirfd.
> +        *
> +        * In both cases the caller may be able to get an unobstructed way to
> +        * the encoded file handle. If the caller is only able to create a
> +        * bind-mount we need to verify that there are no locked mounts on top
> +        * of it that could prevent us from getting to the encoded file.
> +        *
> +        * In principle, locked mounts can prevent the caller from mounting the
> +        * filesystem but that only applies to procfs and sysfs neither of which
> +        * support decoding file handles.
> +        *

Belongs in patch 4

>          * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
>          * confusing api in the face of disconnected non-dir dentries.
>          *
> @@ -397,6 +396,7 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
>         long retval = 0;
>         struct path path __free(path_put) = {};
>         struct file *file;
> +       const struct export_operations *eops;
>
>         retval = handle_to_path(mountdirfd, ufh, &path, open_flag);
>         if (retval)
> @@ -406,7 +406,11 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
>         if (fd < 0)
>                 return fd;
>
> -       file = file_open_root(&path, "", open_flag, 0);
> +       eops = path.mnt->mnt_sb->s_export_op;
> +       if (eops->open)
> +               file = eops->open(&path, open_flag);
> +       else
> +               file = file_open_root(&path, "", open_flag, 0);

Belongs in patch 3

>         if (IS_ERR(file))
>                 return PTR_ERR(file);
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index f73a47e1d8379df886a90a044fb887f8d06f7c0b..f09af08a4abe4a9100ed972bee8f5c5d7ab33d84 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/anon_inodes.h>
> +#include <linux/exportfs.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/cgroup.h>
> @@ -454,6 +455,100 @@ static const struct dentry_operations pidfs_dentry_operations = {
>         .d_prune        = stashed_dentry_prune,
>  };
>
> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
> +                          struct inode *parent)
> +{
> +       struct pid *pid = inode->i_private;
> +
> +       if (*max_len < 2) {
> +               *max_len = 2;
> +               return FILEID_INVALID;
> +       }
> +
> +       *max_len = 2;
> +       *(u64 *)fh = pid->ino;
> +       return FILEID_KERNFS;
> +}
> +
> +/* Find a struct pid based on the inode number. */
> +static struct pid *pidfs_ino_get_pid(u64 ino)
> +{
> +       ino_t pid_ino = pidfs_ino(ino);
> +       u32 gen = pidfs_gen(ino);
> +       struct pid *pid;
> +
> +       guard(rcu)();
> +
> +       /* Handle @pid lookup carefully so there's no risk of UAF. */
> +       pid = idr_find(&pidfs_ino_idr, (u32)pid_ino);
> +       if (!pid)
> +               return NULL;
> +
> +       if (sizeof(ino_t) < sizeof(u64)) {

Not sure why the two cases are needed. Isn't this enough?

  if (pidfs_ino(pid->ino) != pid_ino || pidfs_gen(pid->ino) != gen)
         pid = NULL;


> +               if (gen && pidfs_gen(pid->ino) != gen)
> +                       pid = NULL;
> +       } else {
> +               if (pidfs_ino(pid->ino) != pid_ino)
> +                       pid = NULL;
> +       }
> +
> +       /* Within our pid namespace hierarchy? */
> +       if (pid_vnr(pid) == 0)
> +               pid = NULL;
> +
> +       return get_pid(pid);
> +}
> +
> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> +                                        struct fid *fid, int fh_len,
> +                                        int fh_type)
> +{
> +       int ret;
> +       u64 pid_ino;
> +       struct path path;
> +       struct pid *pid;
> +
> +       if (fh_len < 2)
> +               return NULL;
> +
> +       switch (fh_type) {
> +       case FILEID_KERNFS:
> +               pid_ino = *(u64 *)fid;
> +               break;
> +       default:
> +               return NULL;
> +       }
> +
> +       pid = pidfs_ino_get_pid(pid_ino);
> +       if (!pid)
> +               return NULL;
> +
> +       ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> +       if (ret < 0)
> +               return ERR_PTR(ret);
> +
> +       mntput(path.mnt);
> +       return path.dentry;
> +}
> +
> +static int pidfs_export_permission(struct handle_to_path_ctx *ctx,
> +                                  unsigned int oflags)
> +{

This deserves a comment to explain why no permissions are required.

> +       return 0;
> +}
> +
> +static struct file *pidfs_export_open(struct path *path, unsigned int oflags)
> +{
> +       return dentry_open(path, oflags | O_RDWR, current_cred());

Why is O_RDWR needed here? perhaps a comment to explain.

Thanks,
Amir.

Re: [PATCH RFC v2 0/3] pidfs: file handle preliminaries

Posted by Jeff Layton 1 year, 2 months ago

On Fri, 2024-11-29 at 14:02 +0100, Christian Brauner wrote:
> Hey,
> 
> This reworks the inode number allocation for pidfs in order to support
> file handles properly.
> 
> Recently we received a patchset that aims to enable file handle encoding
> and decoding via name_to_handle_at(2) and open_by_handle_at(2).
> 
> A crucical step in the patch series is how to go from inode number to
> struct pid without leaking information into unprivileged contexts. The
> issue is that in order to find a struct pid the pid number in the
> initial pid namespace must be encoded into the file handle via
> name_to_handle_at(2). This can be used by containers using a separate
> pid namespace to learn what the pid number of a given process in the
> initial pid namespace is. While this is a weak information leak it could
> be used in various exploits and in general is an ugly wart in the
> design.
> 
> To solve this problem a new way is needed to lookup a struct pid based
> on the inode number allocated for that struct pid. The other part is to
> remove the custom inode number allocation on 32bit systems that is also
> an ugly wart that should go away.
> 
> So, a new scheme is used that I was discusssing with Tejun some time
> back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> are used for the generation number. This gives a 64 bit inode number
> that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> recycled slowly and can be used to lookup struct pids.
> 
> Thanks!
> Christian
> 
> ---
> Changes in v2:
> - Remove __maybe_unused pidfd_ino_get_pid() function that was only there
>   for initial illustration purposes.
> - Link to v1: https://lore.kernel.org/r/20241128-work-pidfs-v1-0-80f267639d98@kernel.org
> 
> ---
> Christian Brauner (3):
>       pidfs: rework inode number allocation
>       pidfs: remove 32bit inode number handling
>       pidfs: support FS_IOC_GETVERSION
> 
>  fs/pidfs.c            | 118 ++++++++++++++++++++++++++++++++------------------
>  include/linux/pidfs.h |   2 +
>  kernel/pid.c          |  14 +++---
>  3 files changed, 86 insertions(+), 48 deletions(-)
> ---
> base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
> change-id: 20241128-work-pidfs-2bd42c7ea772
> 
> 

This seems like a good stopgap fix until we can sort out how to get to
64-bit inode numbers internally everywhere.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Re: [PATCH RFC v2 0/3] pidfs: file handle preliminaries

Posted by Amir Goldstein 1 year, 2 months ago

On Fri, Nov 29, 2024 at 3:27 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2024-11-29 at 14:02 +0100, Christian Brauner wrote:
> > Hey,
> >
> > This reworks the inode number allocation for pidfs in order to support
> > file handles properly.
> >
> > Recently we received a patchset that aims to enable file handle encoding
> > and decoding via name_to_handle_at(2) and open_by_handle_at(2).
> >
> > A crucical step in the patch series is how to go from inode number to
> > struct pid without leaking information into unprivileged contexts. The
> > issue is that in order to find a struct pid the pid number in the
> > initial pid namespace must be encoded into the file handle via
> > name_to_handle_at(2). This can be used by containers using a separate
> > pid namespace to learn what the pid number of a given process in the
> > initial pid namespace is. While this is a weak information leak it could
> > be used in various exploits and in general is an ugly wart in the
> > design.
> >
> > To solve this problem a new way is needed to lookup a struct pid based
> > on the inode number allocated for that struct pid. The other part is to
> > remove the custom inode number allocation on 32bit systems that is also
> > an ugly wart that should go away.
> >
> > So, a new scheme is used that I was discusssing with Tejun some time
> > back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> > are used for the generation number. This gives a 64 bit inode number
> > that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> > recycled slowly and can be used to lookup struct pids.
> >
> > Thanks!
> > Christian
> >
> > ---
> > Changes in v2:
> > - Remove __maybe_unused pidfd_ino_get_pid() function that was only there
> >   for initial illustration purposes.
> > - Link to v1: https://lore.kernel.org/r/20241128-work-pidfs-v1-0-80f267639d98@kernel.org
> >
> > ---
> > Christian Brauner (3):
> >       pidfs: rework inode number allocation
> >       pidfs: remove 32bit inode number handling
> >       pidfs: support FS_IOC_GETVERSION
> >
> >  fs/pidfs.c            | 118 ++++++++++++++++++++++++++++++++------------------
> >  include/linux/pidfs.h |   2 +
> >  kernel/pid.c          |  14 +++---
> >  3 files changed, 86 insertions(+), 48 deletions(-)
> > ---
> > base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
> > change-id: 20241128-work-pidfs-2bd42c7ea772
> >
> >
>
> This seems like a good stopgap fix until we can sort out how to get to
> 64-bit inode numbers internally everywhere.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

Yep. look good

Reviewed-by: Amir Goldstein <amir73il@gmail.com>