[PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args

Christian Brauner posted 23 patches 1 month ago
There is a newer version of this series
[PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args
Posted by Christian Brauner 1 month ago
Add a umh field to struct kernel_clone_args. When set, copy_fs() copies
from pid 1's fs_struct instead of the kthread's fs_struct. This ensures
usermodehelper threads always get init's filesystem state regardless of
their parent's (kthreadd's) fs.

Usermodehelper threads are not allowed to create mount namespaces
(CLONE_NEWNS), share filesystem state (CLONE_FS), or be started from
a non-initial mount namespace. No usermodehelper currently does this so
we don't need to worry about this restriction.

Set .umh = 1 in user_mode_thread(). At this stage pid 1's fs points to
rootfs which is the same as kthreadd's fs, so this is functionally
equivalent.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cffc9..e0c1ca8c6a18 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
+	u32 umh:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index 73f4ed82f656..c740fe2ad1ef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1590,9 +1590,25 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_fs(u64 clone_flags, struct task_struct *tsk)
+static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
 {
-	struct fs_struct *fs = current->fs;
+	struct fs_struct *fs;
+
+	/*
+	 * Usermodehelper may use userspace_init_fs filesystem state but
+	 * they don't get to create mount namespaces, share the
+	 * filesystem state, or be started from a non-initial mount
+	 * namespace.
+	 */
+	if (umh) {
+		if (clone_flags & (CLONE_NEWNS | CLONE_FS))
+			return -EINVAL;
+		if (current->nsproxy->mnt_ns != &init_mnt_ns)
+			return -EINVAL;
+		fs = userspace_init_fs;
+	} else {
+		fs = current->fs;
+	}
 
 	VFS_WARN_ON_ONCE(current->fs != current->real_fs);
 	if (clone_flags & CLONE_FS) {
@@ -2213,7 +2229,7 @@ __latent_entropy struct task_struct *copy_process(
 	retval = copy_files(clone_flags, p, args->no_files);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
-	retval = copy_fs(clone_flags, p);
+	retval = copy_fs(clone_flags, p, args->umh);
 	if (retval)
 		goto bad_fork_cleanup_files;
 	retval = copy_sighand(clone_flags, p);
@@ -2727,6 +2743,7 @@ pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
 		.exit_signal	= (flags & CSIGNAL),
 		.fn		= fn,
 		.fn_arg		= arg,
+		.umh		= 1,
 	};
 
 	return kernel_clone(&args);

-- 
2.47.3
Re: [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args
Posted by Jann Horn 1 month ago
On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> Add a umh field to struct kernel_clone_args. When set, copy_fs() copies
> from pid 1's fs_struct instead of the kthread's fs_struct. This ensures
> usermodehelper threads always get init's filesystem state regardless of
> their parent's (kthreadd's) fs.
>
> Usermodehelper threads are not allowed to create mount namespaces
> (CLONE_NEWNS), share filesystem state (CLONE_FS), or be started from
> a non-initial mount namespace. No usermodehelper currently does this so
> we don't need to worry about this restriction.
>
> Set .umh = 1 in user_mode_thread(). At this stage pid 1's fs points to
> rootfs which is the same as kthreadd's fs, so this is functionally
> equivalent.
[...]
> -static int copy_fs(u64 clone_flags, struct task_struct *tsk)
> +static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
>  {
> -       struct fs_struct *fs = current->fs;
> +       struct fs_struct *fs;
> +
> +       /*
> +        * Usermodehelper may use userspace_init_fs filesystem state but
> +        * they don't get to create mount namespaces, share the
> +        * filesystem state, or be started from a non-initial mount
> +        * namespace.
> +        */
> +       if (umh) {
> +               if (clone_flags & (CLONE_NEWNS | CLONE_FS))
> +                       return -EINVAL;
> +               if (current->nsproxy->mnt_ns != &init_mnt_ns)
> +                       return -EINVAL;
> +               fs = userspace_init_fs;
> +       } else {
> +               fs = current->fs;
> +       }
>
>         VFS_WARN_ON_ONCE(current->fs != current->real_fs);

Should that VFS_WARN_ON_ONCE() be in the else {} block?

I don't know if it could happen that a VFS operation that happens with
overwritten current->fs calls back into firmware loading or such, or
if that is anyway impossible for locking reasons or such...
Re: [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args
Posted by Christian Brauner 1 month ago
On Mon, Mar 09, 2026 at 05:06:24PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> > Add a umh field to struct kernel_clone_args. When set, copy_fs() copies
> > from pid 1's fs_struct instead of the kthread's fs_struct. This ensures
> > usermodehelper threads always get init's filesystem state regardless of
> > their parent's (kthreadd's) fs.
> >
> > Usermodehelper threads are not allowed to create mount namespaces
> > (CLONE_NEWNS), share filesystem state (CLONE_FS), or be started from
> > a non-initial mount namespace. No usermodehelper currently does this so
> > we don't need to worry about this restriction.
> >
> > Set .umh = 1 in user_mode_thread(). At this stage pid 1's fs points to
> > rootfs which is the same as kthreadd's fs, so this is functionally
> > equivalent.
> [...]
> > -static int copy_fs(u64 clone_flags, struct task_struct *tsk)
> > +static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
> >  {
> > -       struct fs_struct *fs = current->fs;
> > +       struct fs_struct *fs;
> > +
> > +       /*
> > +        * Usermodehelper may use userspace_init_fs filesystem state but
> > +        * they don't get to create mount namespaces, share the
> > +        * filesystem state, or be started from a non-initial mount
> > +        * namespace.
> > +        */
> > +       if (umh) {
> > +               if (clone_flags & (CLONE_NEWNS | CLONE_FS))
> > +                       return -EINVAL;
> > +               if (current->nsproxy->mnt_ns != &init_mnt_ns)
> > +                       return -EINVAL;
> > +               fs = userspace_init_fs;
> > +       } else {
> > +               fs = current->fs;
> > +       }
> >
> >         VFS_WARN_ON_ONCE(current->fs != current->real_fs);
> 
> Should that VFS_WARN_ON_ONCE() be in the else {} block?

I think in spirit the placement you suggest makes more sense.

> I don't know if it could happen that a VFS operation that happens with
> overwritten current->fs calls back into firmware loading or such, or
> if that is anyway impossible for locking reasons or such...

Usermodehelper are terrible hybrids that always go through workqueue
dispatch. So let's say somehow someone ends up triggering a
usermodehelper under scoped_with_init_fs() - no matter if regular task
or kthread - what would actually happen is:

INIT_WORK(&sub_info->work, call_usermodehelper_exec_work)

which means all usermodehelper creations are done from some other
kthread and never by the caller. IOW, even if the caller has overridden
current->fs the usermodehelper would be created from a pristine kthread
context.