From nobody Thu Apr  9 19:17:39 2026
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org
 [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 78FC6368976;
	Tue,  3 Mar 2026 13:49:47 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=10.30.226.201
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1772545788; cv=none;
 b=ZQDcXrCcaBUjeZ2LkfU8QXCVCRqsMVTJ1ZqHTT4r2WQKKSW578m2I2f+ARJFDBIYtAAZVWr2t9Ew7qgMIRhqlReVXXKPrQuBfIdD+ZmcgCAHXBH/sxYujpQSqPir+qLEi3o30kg9HIaKmZJWERiwAM8KNPZVBnIkDw24kDlqRQg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1772545788; c=relaxed/simple;
	bh=xWc1SYhagsFIDcFD1Ftyff0sKkkiatJdtEEz/BrQlu0=;
	h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References:
	 In-Reply-To:To:Cc;
 b=nERYc3aYMJwyk12gPZIYxWWBtSPL3Vl6pZSlR8XvQ21kh6zIWV1R0LoLo5i8JE64C676+KluLBLdQE0hVMLdMuos73BQsi1pbFew8jTRdrrTfVl4NrYdAya71hORf38Bi8eInehBkbpTW6cR+KLj3jjh/T3dP17jKw5uoOam5uU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org
 header.b=I6UJFgX3; arc=none smtp.client-ip=10.30.226.201
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org
 header.b="I6UJFgX3"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id A4117C2BCAF;
	Tue,  3 Mar 2026 13:49:45 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1772545787;
	bh=xWc1SYhagsFIDcFD1Ftyff0sKkkiatJdtEEz/BrQlu0=;
	h=From:Date:Subject:References:In-Reply-To:To:Cc:From;
	b=I6UJFgX3ETgpJRUabnKnCKeS3y3BTYQTmNPCdjeYA1TUGdjIaWFp12INNy9U0iJ6b
	 3OwaymS1YWZWJc+Yey3dZads/5t7Xaj2ozzOz95rdt17n5TZOA6Bm1L3ZT5fYx45Ti
	 E/nVWlnY/l/WkBGCl3NH6rUtMnZtAXoL3/blXOa9QkE4FSSOCh4ArUJbBiaEJvlN7L
	 vQToiagtJH+XTpkGGGZrN7j8BpmAf4lWL/MhQw5SBa25sllxCLJeqyVZUmY2sF1vx5
	 VqdMeXHFVTOvxombxAbi9hR0fOfAnIdM6tCZPophF3NMDPMis693dtFv/NIZg1RIdD
	 F/GsnxciPK2uQ==
From: Christian Brauner <brauner@kernel.org>
Date: Tue, 03 Mar 2026 14:49:22 +0100
Subject: [PATCH RFC DRAFT POC 11/11] fs: isolate all kthreads in nullfs
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <20260303-work-kthread-nullfs-v1-11-87e559b94375@kernel.org>
References: <20260303-work-kthread-nullfs-v1-0-87e559b94375@kernel.org>
In-Reply-To: <20260303-work-kthread-nullfs-v1-0-87e559b94375@kernel.org>
To: linux-fsdevel@vger.kernel.org,
 Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
 Jens Axboe <axboe@kernel.dk>, Jan Kara <jack@suse.cz>,
 Tejun Heo <tj@kernel.org>, Jann Horn <jannh@google.com>,
 Christian Brauner <brauner@kernel.org>
X-Mailer: b4 0.15-dev-47773
X-Developer-Signature: v=1; a=openpgp-sha256; l=9100; i=brauner@kernel.org;
 h=from:subject:message-id; bh=xWc1SYhagsFIDcFD1Ftyff0sKkkiatJdtEEz/BrQlu0=;
 b=owGbwMvMwCU28Zj0gdSKO4sYT6slMWQue3bP7UYr29pnt0xuprRP+CBQtHDqiyubtk7xFlPVN
 P70VNz2QEcpC4MYF4OsmCKLQ7tJuNxynorNRpkaMHNYmUCGMHBxCsBE3rEw/JU+ccosULbBSqdX
 4a1Y7OXOcieXetebbQXCtatcp67KTGRkmLFyzoQDffON2K2Cs6TcPSUkXX0uaK577SV5knv7n3x
 JbgA=
X-Developer-Key: i=brauner@kernel.org; a=openpgp;
 fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624

Leave all kthreads isolated in nullfs and move userspace init into its
separate fs_struct that any kthread can grab on demand to perform
lookup. This isolates kthreads from userspace filesystem state quite a
bit and makes it hard for anyone to mess up when performing filesystem
operations from kthreads. Without LOOKUP_IN_INIT they will just not be
able to do anything at all: no lookup or creation.

Add a new struct kernel_clone_args extension that allows to create a
task that shares init's filesystem state. This is only going to be used
by user_mode_thread() which execute stuff in init's filesystem state.
That concept should go away.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fs_struct.c             | 49 ++++++++++++++++++++++++++++++++++++++++++=
+---
 fs/namei.c                 |  4 ++--
 fs/namespace.c             |  4 ----
 include/linux/fs_struct.h  |  1 +
 include/linux/init_task.h  |  1 +
 include/linux/sched/task.h |  1 +
 init/main.c                | 10 +++++++++-
 kernel/fork.c              | 26 +++++++++++++++++++++---
 8 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 64b5840131cb..164139c27380 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -8,6 +8,7 @@
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
 #include "internal.h"
+#include "mount.h"
=20
 /*
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
@@ -160,13 +161,30 @@ EXPORT_SYMBOL_GPL(unshare_fs_struct);
  * fs_struct state. Breaking that contract sucks for both sides.
  * So just don't bother with extra work for this. No sane init
  * system should ever do this.
+ *
+ * On older kernels if PID 1 unshared its filesystem state with us the
+ * kernel simply used the stale fs_struct state implicitly pinning
+ * anything that PID 1 had last used. Even if PID 1 might've moved on to
+ * some completely different fs_struct state and might've even unmounted
+ * the old root.
+ *
+ * This has hilarious consequences: Think continuing to dump coredump
+ * state into an implicitly pinned directory somewhere. Calling random
+ * binaries in the old rootfs via usermodehelpers.
+ *
+ * Be aggressive about this: We simply reject operating on stale
+ * fs_struct state by reverting to nullfs. Every kworker that does
+ * lookups after this point will fail. Every usermodehelper call will
+ * fail. Tough luck but let's be kind and emit a warning to userspace.
  */
 static inline bool nullfs_userspace_init(void)
 {
 	struct fs_struct *fs =3D current->fs;
=20
-	if (unlikely(current->pid =3D=3D 1) && fs !=3D &init_fs) {
+	if (unlikely(current->pid =3D=3D 1) && fs !=3D &userspace_init_fs) {
 		pr_warn("VFS: Pid 1 stopped sharing filesystem state\n");
+		set_fs_root(&userspace_init_fs, &init_fs.root);
+		set_fs_pwd(&userspace_init_fs, &init_fs.root);
 		return true;
 	}
=20
@@ -186,7 +204,9 @@ struct fs_struct *switch_fs_struct(struct fs_struct *ne=
w_fs)
 		new_fs =3D fs;
 	read_sequnlock_excl(&fs->seq);
=20
-	nullfs_userspace_init();
+	/* one reference belongs to us */
+	if (nullfs_userspace_init())
+		return NULL;
 	return new_fs;
 }
=20
@@ -197,8 +217,31 @@ struct fs_struct init_fs =3D {
 	.umask		=3D 0022,
 };
=20
+struct fs_struct userspace_init_fs =3D {
+	.users		=3D 1,
+	.seq		=3D __SEQLOCK_UNLOCKED(userspace_init_fs.seq),
+	.umask		=3D 0022,
+};
+
 void init_root(struct path *root)
 {
-	get_fs_root(&init_fs, root);
+	get_fs_root(&userspace_init_fs, root);
 }
 EXPORT_SYMBOL_GPL(init_root);
+
+void __init init_userspace_fs(void)
+{
+	struct mount *m;
+	struct path root;
+
+	/* Move PID 1 from nullfs into the initramfs. */
+	m =3D topmost_overmount(current->nsproxy->mnt_ns->root);
+	root.mnt =3D &m->mnt;
+	root.dentry =3D root.mnt->mnt_root;
+
+	VFS_WARN_ON_ONCE(current->fs !=3D &init_fs);
+	VFS_WARN_ON_ONCE(current->pid !=3D 1);
+	set_fs_root(&userspace_init_fs, &root);
+	set_fs_pwd(&userspace_init_fs, &root);
+	switch_fs_struct(&userspace_init_fs);
+}
diff --git a/fs/namei.c b/fs/namei.c
index 976b1e9f7032..6cc53040e9eb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1102,7 +1102,7 @@ static int set_root(struct nameidata *nd)
 	struct fs_struct *fs;
=20
 	if (nd->flags & LOOKUP_IN_INIT)
-		fs =3D &init_fs;
+		fs =3D &userspace_init_fs;
 	else
 		fs =3D current->fs;
=20
@@ -2724,7 +2724,7 @@ static const char *path_init(struct nameidata *nd, un=
signed flags)
 		struct fs_struct *fs;
=20
 		if (nd->flags & LOOKUP_IN_INIT)
-			fs =3D &init_fs;
+			fs =3D &userspace_init_fs;
 		else
 			fs =3D current->fs;
=20
diff --git a/fs/namespace.c b/fs/namespace.c
index 854f4fc66469..10056ac1dcd2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6190,10 +6190,6 @@ static void __init init_mount_tree(void)
=20
 	init_task.nsproxy->mnt_ns =3D &init_mnt_ns;
 	get_mnt_ns(&init_mnt_ns);
-
-	/* The root and pwd always point to the mutable rootfs. */
-	root.mnt	=3D mnt;
-	root.dentry	=3D mnt->mnt_root;
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
=20
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 8ff1acd8389d..5c40fdc39550 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -50,5 +50,6 @@ static inline int current_umask(void)
 }
=20
 void init_root(struct path *root);
+void __init init_userspace_fs(void);
=20
 #endif /* _LINUX_FS_STRUCT_H */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a6cb241ea00c..f27f88598394 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -24,6 +24,7 @@
=20
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
+extern struct fs_struct userspace_init_fs;
 extern struct nsproxy init_nsproxy;
=20
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cffc9..e0c1ca8c6a18 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
+	u32 umh:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/init/main.c b/init/main.c
index 1cb395dd94e4..ca0d0914c63e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -102,6 +102,7 @@
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
 #include <linux/pidfs.h>
+#include <linux/fs_struct.h>
 #include <linux/ptdump.h>
 #include <linux/time_namespace.h>
 #include <linux/unaligned.h>
@@ -713,6 +714,11 @@ static __initdata DECLARE_COMPLETION(kthreadd_done);
=20
 static noinline void __ref __noreturn rest_init(void)
 {
+	struct kernel_clone_args init_args =3D {
+		.flags		=3D (CLONE_FS | CLONE_VM | CLONE_UNTRACED),
+		.fn		=3D kernel_init,
+		.fn_arg		=3D NULL,
+	};
 	struct task_struct *tsk;
 	int pid;
=20
@@ -722,7 +728,7 @@ static noinline void __ref __noreturn rest_init(void)
 	 * the init task will end up wanting to create kthreads, which, if
 	 * we schedule it before we create kthreadd, will OOPS.
 	 */
-	pid =3D user_mode_thread(kernel_init, NULL, CLONE_FS);
+	pid =3D kernel_clone(&init_args);
 	/*
 	 * Pin init on the boot CPU. Task migration is not properly working
 	 * until sched_init_smp() has been run. It will set the allowed
@@ -1574,6 +1580,8 @@ static int __ref kernel_init(void *unused)
 {
 	int ret;
=20
+	init_userspace_fs();
+
 	/*
 	 * Wait until kthreadd is all set-up.
 	 */
diff --git a/kernel/fork.c b/kernel/fork.c
index 583078c69bbd..121538f58272 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1590,9 +1590,28 @@ static int copy_mm(u64 clone_flags, struct task_stru=
ct *tsk)
 	return 0;
 }
=20
-static int copy_fs(u64 clone_flags, struct task_struct *tsk)
+static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
 {
-	struct fs_struct *fs =3D current->fs;
+	struct fs_struct *fs;
+
+	/*
+	 * Usermodehelper may use userspace_init_fs filesystem state but
+	 * they don't get to create mount namespaces, share the
+	 * filesystem state, or be started from a non-initial mount
+	 * namespace.
+	 */
+	if (umh) {
+		if (clone_flags & (CLONE_NEWNS | CLONE_FS))
+			return -EINVAL;
+		if (current->nsproxy->mnt_ns !=3D &init_mnt_ns)
+			return -EINVAL;
+	}
+
+	if (umh)
+		fs =3D &userspace_init_fs;
+	else
+		fs =3D current->fs;
+
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
 		read_seqlock_excl(&fs->seq);
@@ -2211,7 +2230,7 @@ __latent_entropy struct task_struct *copy_process(
 	retval =3D copy_files(clone_flags, p, args->no_files);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
-	retval =3D copy_fs(clone_flags, p);
+	retval =3D copy_fs(clone_flags, p, args->umh);
 	if (retval)
 		goto bad_fork_cleanup_files;
 	retval =3D copy_sighand(clone_flags, p);
@@ -2725,6 +2744,7 @@ pid_t user_mode_thread(int (*fn)(void *), void *arg, =
unsigned long flags)
 		.exit_signal	=3D (flags & CSIGNAL),
 		.fn		=3D fn,
 		.fn_arg		=3D arg,
+		.umh		=3D 1,
 	};
=20
 	return kernel_clone(&args);

--=20
2.47.3