[v1] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

[PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Jann Horn 4 days, 13 hours ago

may_decode_fh() accesses mount::mnt_ns without holding any locks; that
means the mount can concurrently be unmounted, and the mnt_namespace can
concurrently be freed after an RCU grace period.

This race can happens as follows, assuming that the mount point was
created by open_tree(..., OPEN_TREE_CLONE):

thread 1            thread 2            RCU
                    __do_sys_open_by_handle_at
                      do_handle_open
                        handle_to_path
                          may_decode_fh
                            is_mounted
                              [mount::mnt_ns access]
                            [mount::mnt_ns access]
__do_sys_close
  fput_close_sync
    __fput
      dissolve_on_fput
        umount_tree
        class_namespace_excl_destructor
          namespace_unlock
            free_mnt_ns
              mnt_ns_tree_remove
                call_rcu(mnt_ns_release_rcu)
                                        mnt_ns_release_rcu
                                          mnt_ns_release
                                            kfree
                            [mnt_namespace::user_ns access] **UAF**

Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
in __prepend_path().
Additionally, document the semantics of mount::mnt_ns, and use WRITE_ONCE()
for writers that can race with lockless readers.

This bug is unreachable unless one of the following is set:

 - CONFIG_PREEMPTION
 - CONFIG_RCU_STRICT_GRACE_PERIOD

because it requires an RCU grace period to happen during a syscall without
an explicit preemption.

This doesn't seem to have interesting security impact; worst-case, it could
leak the result of an integer comparison to userspace (from the level
check in cap_capable()), cause an endless loop, or crash the kernel by
dereferencing an invalid address.

Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
I used custom tooling to force this race condition to occur and check
that it leads to a KASAN splat - let me know if you want me to create a
kernel patch to force the race condition and a reproducer you can run.

I remember Christian asking me for feedback on the patch that introduced
the bug, and I missed the bug because I didn't realize what the semantics
of mount::mnt_ns are...
---
 fs/fhandle.c   | 16 ++++++++++++++--
 fs/mount.h     |  8 +++++++-
 fs/namespace.c |  6 +++---
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 642e3d569497..1ca7eb3a6cb5 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -285,6 +285,19 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
 	return 0;
 }
 
+static bool capable_wrt_mount(struct mount *mount)
+{
+	struct mnt_namespace *mnt_ns;
+
+	/*
+	 * For ->mnt_ns access.
+	 * The following READ_ONCE() is semantically rcu_dereference().
+	 */
+	guard(rcu)();
+	mnt_ns = READ_ONCE(mount->mnt_ns);
+	return ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN);
+}
+
 static inline int may_decode_fh(struct handle_to_path_ctx *ctx,
 				unsigned int o_flags)
 {
@@ -320,8 +333,7 @@ static inline int may_decode_fh(struct handle_to_path_ctx *ctx,
 	if (ns_capable(root->mnt->mnt_sb->s_user_ns, CAP_SYS_ADMIN))
 		ctx->flags = HANDLE_CHECK_PERMS;
 	else if (is_mounted(root->mnt) &&
-		 ns_capable(real_mount(root->mnt)->mnt_ns->user_ns,
-			    CAP_SYS_ADMIN) &&
+		 capable_wrt_mount(real_mount(root->mnt)) &&
 		 !has_locked_children(real_mount(root->mnt), root->dentry))
 		ctx->flags = HANDLE_CHECK_PERMS | HANDLE_CHECK_SUBTREE;
 	else
diff --git a/fs/mount.h b/fs/mount.h
index e0816c11a198..f0af6d789bfc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -71,7 +71,13 @@ struct mount {
 	struct hlist_head mnt_slave_list;/* list of slave mounts */
 	struct hlist_node mnt_slave;	/* slave list entry */
 	struct mount *mnt_master;	/* slave is on master->mnt_slave_list */
-	struct mnt_namespace *mnt_ns;	/* containing namespace */
+	/*
+	 * Containing namespace.
+	 * Normally protected by namespace_sem, but there are also lockless
+	 * readers (which must use RCU to guard against the namespace being
+	 * freed).
+	 */
+	struct mnt_namespace *mnt_ns;
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	union {
 		struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
diff --git a/fs/namespace.c b/fs/namespace.c
index fe919abd2f01..f5905f4ec560 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1079,7 +1079,7 @@ static void mnt_add_to_ns(struct mnt_namespace *ns, struct mount *mnt)
 	bool mnt_first_node = true, mnt_last_node = true;
 
 	WARN_ON(mnt_ns_attached(mnt));
-	mnt->mnt_ns = ns;
+	WRITE_ONCE(mnt->mnt_ns, ns);
 	while (*link) {
 		parent = *link;
 		if (mnt->mnt_id_unique < node_to_mount(parent)->mnt_id_unique) {
@@ -1434,7 +1434,7 @@ EXPORT_SYMBOL(mntget);
 void mnt_make_shortterm(struct vfsmount *mnt)
 {
 	if (mnt)
-		real_mount(mnt)->mnt_ns = NULL;
+		WRITE_ONCE(real_mount(mnt)->mnt_ns, NULL);
 }
 
 /**
@@ -1806,7 +1806,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 			ns->nr_mounts--;
 			__touch_mnt_namespace(ns);
 		}
-		p->mnt_ns = NULL;
+		WRITE_ONCE(p->mnt_ns, NULL);
 		if (how & UMOUNT_SYNC)
 			p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
 

---
base-commit: ba3e43a9e601636f5edb54e259a74f96ca3b8fd8
change-id: 20260603-vfs-fhandle-uaf-fix-32279d5b2758

Best regards,
--  
Jann Horn <jannh@google.com>

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 13 hours ago

On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:

> Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> in __prepend_path().

> +	/*
> +	 * Containing namespace.
> +	 * Normally protected by namespace_sem, but there are also lockless
> +	 * readers (which must use RCU to guard against the namespace being
> +	 * freed).
> +	 */
> +	struct mnt_namespace *mnt_ns;

Umm...  It's somewhat subtle - at the very least you need to explain why
there will be an RCU delay between umount_tree() clearing that and
having the sucker freed.

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 13 hours ago

On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote:
> On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> 
> > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > in __prepend_path().
> 
> > +	/*
> > +	 * Containing namespace.
> > +	 * Normally protected by namespace_sem, but there are also lockless
> > +	 * readers (which must use RCU to guard against the namespace being
> > +	 * freed).
> > +	 */
> > +	struct mnt_namespace *mnt_ns;
> 
> Umm...  It's somewhat subtle - at the very least you need to explain why
> there will be an RCU delay between umount_tree() clearing that and
> having the sucker freed.

Something along the lines of "removals from namespace are serialized on
namespace_sem and guaranteed to happen no later than the active
refcount on namespace reaches zero; freeing of namespace happens only
after the passive refcount hitting zero and there's an RCU delay between
dropping the last active ref and dropping the passive one that had been
implicitly held by the fact of having actives", perhaps?  Only in
more readable form than that, please...

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Jann Horn 4 days, 12 hours ago

On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote:
> > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> >
> > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > > in __prepend_path().
> >
> > > +   /*
> > > +    * Containing namespace.
> > > +    * Normally protected by namespace_sem, but there are also lockless
> > > +    * readers (which must use RCU to guard against the namespace being
> > > +    * freed).
> > > +    */
> > > +   struct mnt_namespace *mnt_ns;
> >
> > Umm...  It's somewhat subtle - at the very least you need to explain why
> > there will be an RCU delay between umount_tree() clearing that and
> > having the sucker freed.
>
> Something along the lines of "removals from namespace are serialized on
> namespace_sem and guaranteed to happen no later than the active
> refcount on namespace reaches zero; freeing of namespace happens only
> after the passive refcount hitting zero and there's an RCU delay between
> dropping the last active ref and dropping the passive one that had been
> implicitly held by the fact of having actives", perhaps?  Only in
> more readable form than that, please...

Hm, like this?

Containing namespace (active).
Normally protected by namespace_sem.
Can also be accessed locklessly under RCU. RCU readers can't rely on
the namespace still being active, but implicitly hold a passive
reference (because an RCU delay happens between a namespace no longer
being active and the corresponding passive refcount drop).

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 12 hours ago

On Wed, Jun 03, 2026 at 08:46:07PM +0200, Jann Horn wrote:
> On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote:
> > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> > >
> > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > > > in __prepend_path().
> > >
> > > > +   /*
> > > > +    * Containing namespace.
> > > > +    * Normally protected by namespace_sem, but there are also lockless
> > > > +    * readers (which must use RCU to guard against the namespace being
> > > > +    * freed).
> > > > +    */
> > > > +   struct mnt_namespace *mnt_ns;
> > >
> > > Umm...  It's somewhat subtle - at the very least you need to explain why
> > > there will be an RCU delay between umount_tree() clearing that and
> > > having the sucker freed.
> >
> > Something along the lines of "removals from namespace are serialized on
> > namespace_sem and guaranteed to happen no later than the active
> > refcount on namespace reaches zero; freeing of namespace happens only
> > after the passive refcount hitting zero and there's an RCU delay between
> > dropping the last active ref and dropping the passive one that had been
> > implicitly held by the fact of having actives", perhaps?  Only in
> > more readable form than that, please...
> 
> Hm, like this?
> 
> Containing namespace (active).

Umm...  That's actually "active or has _just_ dropped the last active
reference and didn't get around to scheduling decrement of passive refcount
yet", unfortunately.

Hell knows - "active or deactivating", perhaps?

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 12 hours ago

On Wed, Jun 03, 2026 at 07:53:24PM +0100, Al Viro wrote:
> On Wed, Jun 03, 2026 at 08:46:07PM +0200, Jann Horn wrote:
> > On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote:
> > > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> > > >
> > > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > > > > in __prepend_path().
> > > >
> > > > > +   /*
> > > > > +    * Containing namespace.
> > > > > +    * Normally protected by namespace_sem, but there are also lockless
> > > > > +    * readers (which must use RCU to guard against the namespace being
> > > > > +    * freed).
> > > > > +    */
> > > > > +   struct mnt_namespace *mnt_ns;
> > > >
> > > > Umm...  It's somewhat subtle - at the very least you need to explain why
> > > > there will be an RCU delay between umount_tree() clearing that and
> > > > having the sucker freed.
> > >
> > > Something along the lines of "removals from namespace are serialized on
> > > namespace_sem and guaranteed to happen no later than the active
> > > refcount on namespace reaches zero; freeing of namespace happens only
> > > after the passive refcount hitting zero and there's an RCU delay between
> > > dropping the last active ref and dropping the passive one that had been
> > > implicitly held by the fact of having actives", perhaps?  Only in
> > > more readable form than that, please...
> > 
> > Hm, like this?
> > 
> > Containing namespace (active).
> 
> Umm...  That's actually "active or has _just_ dropped the last active
> reference and didn't get around to scheduling decrement of passive refcount
> yet", unfortunately.
> 
> Hell knows - "active or deactivating", perhaps?

Note that "active" in such context is easy to mistake for "active reference",
which it definitely isn't - it does not contribute to active refcount.
Mounts within a namespace do not pin it - it's the other way round; they
are guaranteed to stay live until they leave the sucker.  Anything that
hasn't left by the time the active refcount of namespace drops to zero
will get pushed out (and killed off unless there are other references to
any such mounts)

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Jann Horn 4 days, 12 hours ago

On Wed, Jun 3, 2026 at 9:02 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jun 03, 2026 at 07:53:24PM +0100, Al Viro wrote:
> > On Wed, Jun 03, 2026 at 08:46:07PM +0200, Jann Horn wrote:
> > > On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote:
> > > > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> > > > >
> > > > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > > > > > in __prepend_path().
> > > > >
> > > > > > +   /*
> > > > > > +    * Containing namespace.
> > > > > > +    * Normally protected by namespace_sem, but there are also lockless
> > > > > > +    * readers (which must use RCU to guard against the namespace being
> > > > > > +    * freed).
> > > > > > +    */
> > > > > > +   struct mnt_namespace *mnt_ns;
> > > > >
> > > > > Umm...  It's somewhat subtle - at the very least you need to explain why
> > > > > there will be an RCU delay between umount_tree() clearing that and
> > > > > having the sucker freed.
> > > >
> > > > Something along the lines of "removals from namespace are serialized on
> > > > namespace_sem and guaranteed to happen no later than the active
> > > > refcount on namespace reaches zero; freeing of namespace happens only
> > > > after the passive refcount hitting zero and there's an RCU delay between
> > > > dropping the last active ref and dropping the passive one that had been
> > > > implicitly held by the fact of having actives", perhaps?  Only in
> > > > more readable form than that, please...
> > >
> > > Hm, like this?
> > >
> > > Containing namespace (active).
> >
> > Umm...  That's actually "active or has _just_ dropped the last active
> > reference and didn't get around to scheduling decrement of passive refcount
> > yet", unfortunately.

Ah, right, I see, because the mounts of non-anonymous namespaces are
only cleared in put_mnt_ns() after the active reference drop.

> > Hell knows - "active or deactivating", perhaps?
>
> Note that "active" in such context is easy to mistake for "active reference",
> which it definitely isn't - it does not contribute to active refcount.
> Mounts within a namespace do not pin it - it's the other way round; they
> are guaranteed to stay live until they leave the sucker.  Anything that
> hasn't left by the time the active refcount of namespace drops to zero
> will get pushed out (and killed off unless there are other references to
> any such mounts)

(And there's also that weird detail of how, for anonymous namespaces,
the active refcount isn't used and AFAICS never actually drops to
zero...)

So I guess I'll write "Containing namespace (active or deactivating,
non-refcounted)."?

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 12 hours ago

On Wed, Jun 03, 2026 at 09:08:26PM +0200, Jann Horn wrote:

> (And there's also that weird detail of how, for anonymous namespaces,
> the active refcount isn't used and AFAICS never actually drops to
> zero...)

More like "is always 1 and we skip decrement when we decide to drop
that", really.

> So I guess I'll write "Containing namespace (active or deactivating,
> non-refcounted)."?

That would probably do for now...  The lifecycle for mnt_namespace
really needs to be documented; right now we have a maze of twisty
little functions around that area and it takes quite a non-trivial
amount of searching to recall the names - and I am familiar with
the area ;-/

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Jann Horn 4 days, 12 hours ago

On Wed, Jun 3, 2026 at 9:08 PM Jann Horn <jannh@google.com> wrote:
> (And there's also that weird detail of how, for anonymous namespaces,
> the active refcount isn't used and AFAICS never actually drops to
> zero...)

(Er, nevermind, I missed that anonymous namespaces just have their
active refcount set to 0 from the start already.)

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Christian Brauner 3 days, 23 hours ago

On Wed, Jun 03, 2026 at 09:14:25PM +0200, Jann Horn wrote:
> On Wed, Jun 3, 2026 at 9:08 PM Jann Horn <jannh@google.com> wrote:
> > (And there's also that weird detail of how, for anonymous namespaces,
> > the active refcount isn't used and AFAICS never actually drops to
> > zero...)
> 
> (Er, nevermind, I missed that anonymous namespaces just have their
> active refcount set to 0 from the start already.)

Let's distinguish a few things:

(1) generic reference count of namespaces in general: __ns_ref
    - for mntns: keeps the mount namespace and the mounts attached to it alive
(2) active reference count of namespaces in general: __ns_ref_active.
    - always a subset of (1)
    - only regulates userspace visibility of the namespace and has no
      lifetime implications per se. "active" just means "reachable from
      userspace". It's nothing that the mount layer itself should care
      about at all.
(3) passive reference count of struct mnt_namespace
    - keeps the mount namespace alive but not the mounts attached to it

With (3) you can grab a reference to the mount namespaces without
pinning the mounts in it. Then do other stuff that you want and then you
can grab namespace_sem which allows you to see whether the namespace is
still alive via mnt_ns_empty(). At no point does the caller need to
artificially prolong the lifetime of a mount namespaces by grabbing a
__ns_ref reference count. This is especially useful if the caller needs
to do a bunch of sleeping operations before they can actually do the
meat of the work they need.

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Jann Horn 4 days, 13 hours ago

On Wed, Jun 3, 2026 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > in __prepend_path().
>
> > +     /*
> > +      * Containing namespace.
> > +      * Normally protected by namespace_sem, but there are also lockless
> > +      * readers (which must use RCU to guard against the namespace being
> > +      * freed).
> > +      */
> > +     struct mnt_namespace *mnt_ns;
>
> Umm...  It's somewhat subtle - at the very least you need to explain why
> there will be an RCU delay between umount_tree() clearing that and
> having the sucker freed.

I guess I could write something like this instead, to make it clear
that this basically follows normal RCU rules, except that this code
isn't actually using RCU markings and accessors?

"This is like an __rcu pointer which is protected by RCU and
namespace_sem; however, because most accesses happen under
namespace_sem, it is not marked as __rcu, and RCU access is done with
READ_ONCE()."

Or we could put __rcu on this pointer, and annotate all the locked
accesses with rcu_dereference_protected(...,
lockdep_is_held(&namespace_lock)), but I guess you'd probably prefer
to not do that?

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 12 hours ago

On Wed, Jun 03, 2026 at 08:23:44PM +0200, Jann Horn wrote:
> On Wed, Jun 3, 2026 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote:
> > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
> > > in __prepend_path().
> >
> > > +     /*
> > > +      * Containing namespace.
> > > +      * Normally protected by namespace_sem, but there are also lockless
> > > +      * readers (which must use RCU to guard against the namespace being
> > > +      * freed).
> > > +      */
> > > +     struct mnt_namespace *mnt_ns;
> >
> > Umm...  It's somewhat subtle - at the very least you need to explain why
> > there will be an RCU delay between umount_tree() clearing that and
> > having the sucker freed.
> 
> I guess I could write something like this instead, to make it clear
> that this basically follows normal RCU rules, except that this code
> isn't actually using RCU markings and accessors?
> 
> "This is like an __rcu pointer which is protected by RCU and
> namespace_sem; however, because most accesses happen under
> namespace_sem, it is not marked as __rcu, and RCU access is done with
> READ_ONCE()."
> 
> Or we could put __rcu on this pointer, and annotate all the locked
> accesses with rcu_dereference_protected(...,
> lockdep_is_held(&namespace_lock)), but I guess you'd probably prefer
> to not do that?

Not the point...  What I'm talking about is the reason why RCU access
to ->mnt_ns is valid in the first place - prompt freeing of namespace
instances *is* possible; we do have a guaranteed RCU delay between
zeroing ->mnt_ns and having the instance it pointed to freed, but it's
not instantly obvious where to look for it.

Basically, the store that cleared ->mnt_ns has been done in namespace_sem
scope and that scope is either no later than the scope in put_mnt_ns()
that has dropped the active refcount of ns to zero.  At the beginning
of that scope in put_mnt_ns() we are guaranteed to have the passive
refcount positive.  Dropping the passive reference happens after an
rcu delay started in later in the same namespace_sem scope and namespace
is not freed until the passive refcount reaches zero.

Re: [PATCH] fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

Posted by Al Viro 4 days, 12 hours ago

On Wed, Jun 03, 2026 at 07:41:51PM +0100, Al Viro wrote:

> Basically, the store that cleared ->mnt_ns has been done in namespace_sem
> scope and that scope is either no later than the scope in put_mnt_ns()

argh...  s/either//

> that has dropped the active refcount of ns to zero.  At the beginning
> of that scope in put_mnt_ns() we are guaranteed to have the passive
> refcount positive.  Dropping the passive reference happens after an
> rcu delay started in later in the same namespace_sem scope and namespace
> is not freed until the passive refcount reaches zero.

TL;DR: your fix is correct, but needs a better explanation of correctness.
If nothing else, I'd like to have the above findable on lore - I've way
too many pieces of half-baked docs sitting around in local notes as it is ;-/