vfs: allow mounting inside a container without FS_USERNS_MOUNT by root

[PATCH RFC] vfs: allow mounting inside a container without FS_USERNS_MOUNT by root

Posted by Jeff Layton 1 week, 2 days ago

Meta (and some other places) have an unusual process for doing an NFS
mount inside an unprivileged container. They do the fsopen() inside the
container, then pass the resulting fd over a unix socket to a privileged
daemon running outside that container, which then does the mount.

Commit e1c5ae59c0f22 ("fs: don't allow non-init s_user_ns for
filesystems without FS_USERNS_MOUNT") broke this scheme, as the fc->user_ns is
not init_user_ns, even though the daemon doing the mount has CAP_SYS_ADMIN.

Add a check for CAP_SYS_ADMIN to get it working again.

Fixes: e1c5ae59c0f22 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
We've needed to revert e1c5ae59c0f22 for the last year or so in order to
keep NFS mounts inside containers working. Does this approach seem sane,
or are there valid concerns with allowing this that I'm not aware of?

This is not well tested yet, hence the RFC.
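
For anyone unfamiliar with the scheme, the fd handoff can be sketched
roughly as below. This is only an illustration, not the actual code in
use: send_fd() and recv_fd() are hypothetical helper names, and the
assumption is that the container side calls fsopen() and then
send_fd(), while the daemon calls recv_fd() and finishes the mount
(presumably via fsconfig(FSCONFIG_CMD_CREATE), fsmount() and
move_mount()). Only the SCM_RIGHTS passing itself is shown.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * In the scheme described above, the container side would call this
 * with the fd returned by fsopen(). Returns 0 on success, -1 on error.
 */
int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* one byte of ordinary data rides along */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one fd from a connected AF_UNIX socket.
 * The privileged daemon would call this and then complete the mount
 * on the received fs context fd. Returns the new fd, or -1 on error.
 */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The relevant point is that the fs context keeps the user namespace of
the task that called fsopen(), so the daemon ends up operating on an fc
whose user_ns is not init_user_ns, no matter where the daemon runs.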
---
 fs/super.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 3d85265d14001d51524dbaec0778af8f12c048ac..d06f3e5765921a2ab341827a95dcd663c38cb594 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -738,12 +738,15 @@ struct super_block *sget_fc(struct fs_context *fc,
 	int err;
 
 	/*
-	 * Never allow s_user_ns != &init_user_ns when FS_USERNS_MOUNT is
+	 * Don't allow s_user_ns != &init_user_ns when FS_USERNS_MOUNT is
 	 * not set, as the filesystem is likely unprepared to handle it.
 	 * This can happen when fsconfig() is called from init_user_ns with
-	 * an fs_fd opened in another user namespace.
+	 * an fs_fd opened in another user namespace. If the user has
+	 * CAP_SYS_ADMIN in the init_user_ns however, allow it.
 	 */
-	if (user_ns != &init_user_ns && !(fc->fs_type->fs_flags & FS_USERNS_MOUNT)) {
+	if (user_ns != &init_user_ns &&
+	    !(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
+	    !capable(CAP_SYS_ADMIN)) {
 		errorfc(fc, "VFS: Mounting from non-initial user namespace is not allowed");
 		return ERR_PTR(-EPERM);
 	}

---
base-commit: 1f97d9dcf53649c41c33227b345a36902cbb08ad
change-id: 20260128-twmount-c29299e88464

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>

Re: [PATCH RFC] vfs: allow mounting inside a container without FS_USERNS_MOUNT by root

Posted by Jeff Layton 1 week, 2 days ago

On Wed, 2026-01-28 at 12:47 -0500, Jeff Layton wrote:
> Meta (and some other places) have an unusual process for doing an NFS
> mount inside an unprivileged container. They do the fsopen() inside the
> container, then pass the resulting fd over a unix socket to a privileged
> daemon running outside that container, which then does the mount.
> 
> Commit e1c5ae59c0f22 ("fs: don't allow non-init s_user_ns for
> filesystems without FS_USERNS_MOUNT") broke this scheme, as the fc->user_ns is
> not init_user_ns, even though the daemon doing the mount has CAP_SYS_ADMIN.
> 
> Add a check for CAP_SYS_ADMIN to get it working again.
> 
> Fixes: e1c5ae59c0f22 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT")
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> We've needed to revert e1c5ae59c0f22 for the last year or so in order to
> keep NFS mounts inside containers working. Does this approach seem sane,
> or are there valid concerns with allowing this that I'm not aware of?
> 
> This is not well tested yet, hence the RFC.
> ---
>  fs/super.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 3d85265d14001d51524dbaec0778af8f12c048ac..d06f3e5765921a2ab341827a95dcd663c38cb594 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -738,12 +738,15 @@ struct super_block *sget_fc(struct fs_context *fc,
>  	int err;
>  
>  	/*
> -	 * Never allow s_user_ns != &init_user_ns when FS_USERNS_MOUNT is
> +	 * Don't allow s_user_ns != &init_user_ns when FS_USERNS_MOUNT is
>  	 * not set, as the filesystem is likely unprepared to handle it.
>  	 * This can happen when fsconfig() is called from init_user_ns with
> -	 * an fs_fd opened in another user namespace.
> +	 * an fs_fd opened in another user namespace. If the user has
> +	 * CAP_SYS_ADMIN in the init_user_ns however, allow it.
>  	 */
> -	if (user_ns != &init_user_ns && !(fc->fs_type->fs_flags & FS_USERNS_MOUNT)) {
> +	if (user_ns != &init_user_ns &&
> +	    !(fc->fs_type->fs_flags & FS_USERNS_MOUNT) &&
> +	    !capable(CAP_SYS_ADMIN)) {
>  		errorfc(fc, "VFS: Mounting from non-initial user namespace is not allowed");
>  		return ERR_PTR(-EPERM);
>  	}
> 

Actually, the above seems wrong, and goes against what the original
patch is trying to do. That said, I think the original patch is wrong
too. FS_USERNS_MOUNT is only supposed to govern whether root inside
the userns is allowed to mount this fs_type:

#define FS_USERNS_MOUNT              8       /* Can be mounted by userns root */

...but e1c5ae59c0f22 uses this as a proxy for "filesystem can work
inside a different userns". These are two different things.

For instance, AFAICT NFS works just fine when mounted inside an
alternate userns. We don't want to add FS_USERNS_MOUNT though, since
there is too much danger of someone mounting a malicious server to
exploit a bug in the client -- we want to leave that to a privileged
daemon in the init_user_ns.

Should we split this flag into two?
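
To make the question concrete, here is a small userspace model of the
three variants of the sget_fc() check. It is only a sketch, not kernel
code: FS_WORKS_IN_USERNS_HYPOTHETICAL and the function names are
invented for illustration, the flag value is arbitrary, and treating
FS_USERNS_MOUNT as implying "works in a userns" is an assumption.

```c
#include <stdbool.h>

/* Real flag value, from the #define quoted above. */
#define FS_USERNS_MOUNT			8
/* HYPOTHETICAL second flag; name and value are made up for this sketch. */
#define FS_WORKS_IN_USERNS_HYPOTHETICAL	(1 << 20)

/* Check as it stands after e1c5ae59c0f22. */
bool allowed_current(bool fc_in_init_userns, int fs_flags)
{
	return fc_in_init_userns || (fs_flags & FS_USERNS_MOUNT);
}

/* Check from this RFC: CAP_SYS_ADMIN in init_user_ns also passes. */
bool allowed_rfc(bool fc_in_init_userns, int fs_flags, bool cap_sys_admin)
{
	return fc_in_init_userns || (fs_flags & FS_USERNS_MOUNT) ||
	       cap_sys_admin;
}

/*
 * Split-flag idea: sget_fc() only asks whether the filesystem can cope
 * with a non-init s_user_ns. Whether userns root may start the mount
 * would remain a separate FS_USERNS_MOUNT check elsewhere.
 */
bool allowed_split(bool fc_in_init_userns, int fs_flags)
{
	return fc_in_init_userns ||
	       (fs_flags & (FS_USERNS_MOUNT |
			    FS_WORKS_IN_USERNS_HYPOTHETICAL));
}
```

Under this model, a filesystem like NFS could set only the hypothetical
flag: the daemon-assisted mount would pass the sget_fc() check, while
root inside the userns still could not mount it directly.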
-- 
Jeff Layton <jlayton@kernel.org>