[v2] fs,kthread: start all kthreads in nullfs

[PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

Posted by Christian Brauner 1 month ago

Summary:

* all kthreads are isolated in a separate SB_KERNMOUNT of nullfs.
  -> no lookup of anything else, no mounting on top of it, completely
  isolated.
* init has a separate fs_struct from all kthreads
* scoped_with_init_fs() allows a kthread to temporarily assume init's
  fs_struct for filesystem operations.

So this is a bit of a crazy series. When the kernel is started it
roughly goes like this:

init_task
==> create pid 1 (systemd etc.)
==> pid 2 (kthreadd)

After this point all kthreads and PID 1 share the same filesystem state.
That obviously already came up when we discussed pivot_root() as this
allows pivot_root() to rewrite the fs_struct of all kthreads.

This rewriting is really weird and mostly done so kthreads can use init's
filesystem state when they would like to. But this really should be
discouraged. The rewriting should also stop completely. I worked a bit
to get rid of it in a more fundamental way. Is it crazy? Yes. Is it
likely broken? Yes. Does it at least boot? Yes.

Instead of sharing fs_struct between kernel threads and pid 1, pid 1
gets a completely separate fs_struct. All kthreads continue sharing
init_fs as before and pid 1's fs_struct is isolated from the kthreads'
filesystem state. IOW, userspace init cannot affect kthreads' filesystem
state anymore and kthreads cannot affect userspace's filesystem state
anymore - without explicit opt-in.

All kthreads are anchored in a kernel internal mount of nullfs that
cannot be mounted on and that cannot be used to follow other mounts.
It's a completely private mount that insulates kthreads.
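
To make the difference between the two layouts concrete, here is a small
userspace toy model. It is only a sketch under simplifying assumptions:
model_fs, model_task and model_set_root() are made-up names and not
kernel types, and a pivot_root()-style change is reduced to rewriting a
root label through whichever fs state a task points at.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for struct fs_struct: only the root is modeled. */
struct model_fs {
	const char *root;
};

/* Toy stand-in for a task: a name plus a pointer to its fs state. */
struct model_task {
	const char *comm;
	struct model_fs *fs;
};

/* Changing the root via one task is seen by every task sharing that fs. */
void model_set_root(struct model_task *task, const char *new_root)
{
	assert(task && task->fs);
	task->fs->root = new_root;
}

/* Old layout: pid 1 and kthreadd share one fs state.
 * Returns 1 if kthreadd observes pid 1's new root. */
int model_shared_layout(void)
{
	struct model_fs init_fs = { .root = "/" };
	struct model_task pid1 = { .comm = "init", .fs = &init_fs };
	struct model_task kthreadd = { .comm = "kthreadd", .fs = &init_fs };

	model_set_root(&pid1, "/newroot");
	return strcmp(kthreadd.fs->root, "/newroot") == 0;
}

/* New layout: pid 1 has its own fs state, kthreads stay in nullfs.
 * Returns 1 if kthreadd is unaffected by pid 1's change. */
int model_split_layout(void)
{
	struct model_fs init_fs = { .root = "nullfs" };
	struct model_fs userspace_init_fs = { .root = "/" };
	struct model_task pid1 = { .comm = "init", .fs = &userspace_init_fs };
	struct model_task kthreadd = { .comm = "kthreadd", .fs = &init_fs };

	model_set_root(&pid1, "/newroot");
	return strcmp(kthreadd.fs->root, "nullfs") == 0;
}
```

Both functions return 1: in the shared layout the change leaks into
kthreadd, in the split layout it does not.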

This series makes performing mountains of filesystem work such as path
lookup and file opening and so on from kthreads hard - painfully so. I
think this is a benefit because it takes the idea of just offloading
_security sensitive_ operations to kthreads, such as running random
binaries or opening and creating files in init's filesystem state,
behind the shed... And imho it should.

The only remaining kernel tasks that actually share init's filesystem
state are usermodehelpers - as they execute random binaries in the root
filesystem. Another concept we should really show the back of the shed.

This gives a lot stronger guarantees than what we have now. This also
makes path lookup from kthreads fail by default. IOW, it won't be
possible anymore to just look up random stuff in init's filesystem state
without explicitly opting in to that.

The places that need to perform lookup in init's filesystem state may
use scoped_with_init_fs() which will temporarily override the caller's
fs_struct with init's fs_struct.
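
As a rough userspace analogue of such a scoped override, the sketch
below swaps a "current fs" pointer for the duration of one statement and
restores it on every exit path. This is not the implementation from the
series: all toy_* names are hypothetical, and the macro simply assumes
the GCC/Clang cleanup attribute combined with a run-once for loop.

```c
#include <assert.h>

/* Toy stand-in for the kernel's struct fs_struct: only a root label. */
struct toy_fs {
	const char *root;
};

struct toy_fs toy_nullfs = { .root = "nullfs" };
struct toy_fs toy_init_fs = { .root = "/" };

/* Toy stand-in for current->fs; a kthread starts out anchored in nullfs. */
struct toy_fs *toy_current_fs = &toy_nullfs;

/* Guard: remembers which fs to restore when the scope is left. */
struct toy_fs_guard {
	struct toy_fs *saved;
	int done;
};

static inline struct toy_fs_guard toy_fs_enter(struct toy_fs *tmp)
{
	struct toy_fs_guard guard = { .saved = toy_current_fs, .done = 0 };

	assert(tmp);
	toy_current_fs = tmp;
	return guard;
}

static inline void toy_fs_exit(struct toy_fs_guard *guard)
{
	toy_current_fs = guard->saved;
}

/* Run the next statement or block once with init's fs; restore on any exit. */
#define toy_scoped_with_init_fs()					\
	for (struct toy_fs_guard toy_guard_				\
		__attribute__((cleanup(toy_fs_exit))) =			\
		toy_fs_enter(&toy_init_fs);				\
	     !toy_guard_.done; toy_guard_.done = 1)

/* Toy lookup: succeeds (0) only while init's fs is the active one. */
int toy_lookup(void)
{
	return toy_current_fs == &toy_init_fs ? 0 : -1;
}

/* A "kthread" that opts in for exactly one lookup. */
int toy_worker(void)
{
	toy_scoped_with_init_fs()
		return toy_lookup();
	return -1; /* not reached; keeps the compiler happy */
}
```

In this model a bare toy_lookup() fails, toy_worker() succeeds, and the
nullfs pointer is back in place once toy_worker() has returned, even
though it returns from inside the scope.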

We now also warn and notice when pid 1 simply stops sharing filesystem
state with us, i.e., abandons its userspace_init_fs.

On older kernels, if PID 1 unshared its filesystem state with us, the
kernel simply kept using the stale fs_struct state, implicitly pinning
anything that PID 1 had last used, even if PID 1 might've moved on to
some completely different fs_struct state and might've even unmounted
the old root.

This has hilarious consequences: Think continuing to dump coredump
state into an implicitly pinned directory somewhere. Calling random
binaries in the old rootfs via usermodehelpers.

Be aggressive about this: We simply reject operating on stale
fs_struct state by reverting userspace_init_fs to nullfs. Every kworker
that does lookups after this point will fail. Every usermodehelper call
will fail. This is a lot stronger but I wouldn't know what it means for
pid 1 to simply stop sharing its fs state with the kernel. Clearly it
wanted to separate so cut all ties.
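
The policy can be pictured with another userspace toy. Again this is a
sketch and not the code from the series: the demo_* names are invented,
and returning -ENOENT for a lookup in nullfs is an assumption made only
for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for struct fs_struct: only a root label. */
struct demo_fs {
	const char *root;
};

struct demo_fs demo_nullfs = { .root = "nullfs" };
struct demo_fs demo_pid1_fs = { .root = "/" };

/* Toy stand-in for userspace_init_fs: what the kernel uses for init's fs. */
struct demo_fs *demo_userspace_init_fs = &demo_pid1_fs;

/* pid 1 stops sharing: drop the reference instead of pinning stale state. */
void demo_init_abandons_fs(void)
{
	demo_userspace_init_fs = &demo_nullfs;
}

/* Lookup on behalf of the kernel: fails once only nullfs is left. */
int demo_kernel_lookup(const char *path)
{
	assert(path);
	if (demo_userspace_init_fs == &demo_nullfs)
		return -ENOENT; /* assumed error code, for illustration only */
	return 0;
}
```

Before demo_init_abandons_fs() a lookup succeeds; afterwards every
lookup fails rather than silently resolving against the old root.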

I've gone through the kernel and looked at hopefully everything that
does path lookup from kthreads (workqueues, ...).

TL;DR:

==== PID 1 (systemd) ====

  root@localhost:~# stat --file-system /proc/1/root
    File: "/proc/1/root"
      ID: e3cb00dd533cd3d7 Namelen: 255     Type: ext2/ext3

  root@localhost:~# cat /proc/1/mountinfo | wc -l
  30

==== PID 2 (kthreadd) ====

  root@localhost:~# stat --file-system /proc/2/root
    File: "/proc/2/root"
      ID: 200000000 Namelen: 255     Type: nullfs

  root@localhost:~# cat /proc/2/mountinfo | wc -l
  0
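
The same check can be done programmatically with statfs(2), whose f_type
field is what stat --file-system decodes into the "Type" column. A
minimal sketch, assuming Linux; the helper names fs_magic_of() and
print_root_magic() are made up for this example, and reading
/proc/<pid>/root of another task normally requires root.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/vfs.h>

/* Store the filesystem magic of @path in @magic; 0 on success, -1 on error. */
int fs_magic_of(const char *path, unsigned long *magic)
{
	struct statfs st;

	assert(path && magic);
	if (statfs(path, &st) != 0)
		return -1;
	*magic = (unsigned long)st.f_type;
	return 0;
}

/* Print the magic behind /proc/<pid>/root, e.g. for pid 1 and pid 2. */
int print_root_magic(int pid)
{
	char path[64];
	unsigned long magic;

	snprintf(path, sizeof(path), "/proc/%d/root", pid);
	if (fs_magic_of(path, &magic) != 0)
		return -1;
	printf("%s: f_type=0x%lx\n", path, magic);
	return 0;
}
```

Calling print_root_magic(1) and print_root_magic(2) as root should then
report two different f_type values on a kernel with this series applied.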

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Remove LOOKUP_IN_INIT in favor of scoped_with_init_fs().
- Link to v1: https://patch.msgid.link/20260303-work-kthread-nullfs-v1-0-87e559b94375@kernel.org

---
Christian Brauner (23):
      fs: notice when init abandons fs sharing
      fs: add scoped_with_init_fs()
      rnbd: use scoped_with_init_fs() for block device open
      crypto: ccp: use scoped_with_init_fs() for SEV file access
      scsi: target: use scoped_with_init_fs() for ALUA metadata
      scsi: target: use scoped_with_init_fs() for APTPL metadata
      btrfs: use scoped_with_init_fs() for update_dev_time()
      coredump: use scoped_with_init_fs() for coredump path resolution
      fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns()
      ksmbd: use scoped_with_init_fs() for share path resolution
      ksmbd: use scoped_with_init_fs() for filesystem info path lookup
      ksmbd: use scoped_with_init_fs() for VFS path operations
      initramfs: use scoped_with_init_fs() for rootfs unpacking
      af_unix: use scoped_with_init_fs() for coredump socket lookup
      fs: add real_fs to track task's actual fs_struct
      fs: make userspace_init_fs a dynamically-initialized pointer
      fs: stop sharing fs_struct between init_task and pid 1
      fs: add umh argument to struct kernel_clone_args
      fs: add kthread_mntns()
      devtmpfs: create private mount namespace
      nullfs: make nullfs multi-instance
      fs: start all kthreads in nullfs
      fs: stop rewriting kthread fs structs

 drivers/base/devtmpfs.c           |  2 +-
 drivers/block/rnbd/rnbd-srv.c     |  4 +-
 drivers/crypto/ccp/sev-dev.c      | 12 ++---
 drivers/target/target_core_alua.c |  6 ++-
 drivers/target/target_core_pr.c   |  4 +-
 fs/btrfs/volumes.c                | 11 ++++-
 fs/coredump.c                     | 11 ++---
 fs/fs_struct.c                    | 96 ++++++++++++++++++++++++++++++++++++++-
 fs/kernel_read_file.c             |  9 +---
 fs/namespace.c                    | 40 ++++++++++++++--
 fs/nullfs.c                       |  7 +--
 fs/smb/server/mgmt/share_config.c |  4 +-
 fs/smb/server/smb2pdu.c           |  4 +-
 fs/smb/server/vfs.c               | 14 ++++--
 include/linux/fs_struct.h         | 34 ++++++++++++++
 include/linux/init_task.h         |  1 +
 include/linux/mount.h             |  1 +
 include/linux/sched.h             |  1 +
 include/linux/sched/task.h        |  1 +
 init/init_task.c                  |  1 +
 init/initramfs.c                  | 12 +++--
 init/main.c                       | 10 +++-
 kernel/fork.c                     | 41 +++++++++++------
 net/unix/af_unix.c                | 17 +++----
 24 files changed, 266 insertions(+), 77 deletions(-)
---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260303-work-kthread-nullfs-875a837f4198

Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

Posted by Askar Safin 1 month ago

Christian Brauner <brauner@kernel.org>:
> Summary:

Comment in "call_usermodehelper_exec_async" contains this:

> Initial kernel threads share ther FS with init

This is now wrong and should be changed.

-- 
Askar Safin

Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

Posted by Jann Horn 1 month ago

On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> The places that need to perform lookup in init's filesystem state may
> use scoped_with_init_fs() which will temporarily override the caller's
> fs_struct with init's fs_struct.

One small concern I have about the overall approach is that the use of
scoped_with_init_fs() in non-kernel tasks reminds me a _little_ bit of
the set_fs(KERNEL_DS) mechanism that was removed a few years ago:
There is state in the task that controls whether some argument is
interpreted as a user-supplied, untrusted value or a kernel-supplied
value that is interpreted in some more privileged scope. I think there
were occasionally security issues where userspace-supplied pointers
were accidentally accessed under KERNEL_DS, allowing userspace to
cause accesses to arbitrary kernel addresses - in particular,
performance interrupts could occur in KERNEL_DS sections and attempt
to access userspace stack memory, see
<https://project-zero.issues.chromium.org/42452355>.

I think switching task_struct::fs is much less problematic - path
walks shouldn't happen in IRQ context or such, scoped_with_init_fs()
will likely only be used when accessing paths that unprivileged
userspace has no influence over, and VFS operations normally don't
operate on multiple logically unrelated file paths; but it means we'll
have to keep in mind that filesystem handlers for some operations like
lookup/open can run with weird task_struct::fs.

To be clear, I think what you're doing is fine; it's just something to
keep in mind.

Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

Posted by Christian Brauner 1 month ago

On Mon, Mar 09, 2026 at 05:50:36PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> > The places that need to perform lookup in init's filesystem state may
> > use scoped_with_init_fs() which will temporarily override the caller's
> > fs_struct with init's fs_struct.
> 
> One small concern I have about the overall approach is that the use of
> scoped_with_init_fs() in non-kernel tasks reminds me a _little_ bit of
> the set_fs(KERNEL_DS) mechanism that was removed a few years ago:
> There is state in the task that controls whether some argument is
> interpreted as a user-supplied, untrusted value or a kernel-supplied
> value that is interpreted in some more privileged scope. I think there
> were occasionally security issues where userspace-supplied pointers
> were accidentally accessed under KERNEL_DS, allowing userspace to
> cause accesses to arbitrary kernel addresses - in particular,
> performance interrupts could occur in KERNEL_DS sections and attempt
> to access userspace stack memory, see
> <https://project-zero.issues.chromium.org/42452355>.
> 
> I think switching task_struct::fs is much less problematic - path
> walks shouldn't happen in IRQ context or such, scoped_with_init_fs()
> will likely only be used when accessing paths that unprivileged
> userspace has no influence over, and VFS operations normally don't
> operate on multiple logically unrelated file paths; but it means we'll
> have to keep in mind that filesystem handlers for some operations like
> lookup/open can run with weird task_struct::fs.
> 
> To be clear, I think what you're doing is fine; it's just something to
> keep in mind.

Just for some background. I think as it currently stands we have a 1:1
sharing between all kthreads and pid 1. So effectively a kthread is in a
permanent scoped_with_init_fs() block. Any driver can just do:

loff_t pos = 0;
struct file *file = filp_open("/proc/sys/kernel/core_pattern", O_WRONLY, 0);
const char *cmd = "/usr/bin/systemctl poweroff";
kernel_write(file, cmd, strlen(cmd), &pos);

which is ofc nonsense but still.

But my wider point is that this implicit lookup context is probably on
very few people's minds.

Some people who are aware of this then end up with brilliant ideas such
as writing kernel modules that perform mountains of actual path lookup
work from kthread context because it's just so easy to do and lets them
avoid having to do any real conceptual work to come up with a better
solution.

Offloading fs work to kthreads is really nasty... And we've relearned
that lesson not too long ago when io_uring was still based on kthreads
with custom credential overrides. It's a broken concept.

scoped_with_init_fs() forces the users that do this to acknowledge that
they are now performing lookup work within PID 1's filesystem state. We
have few of those and this will make it harder to gain more.