fs/fhandle.c | 16 ++++++++++++++-- fs/mount.h | 8 +++++++- fs/namespace.c | 6 +++--- 3 files changed, 24 insertions(+), 6 deletions(-)
may_decode_fh() accesses mount::mnt_ns without holding any locks; that
means the mount can concurrently be unmounted, and the mnt_namespace can
concurrently be freed after an RCU grace period.
This race can happens as follows, assuming that the mount point was
created by open_tree(..., OPEN_TREE_CLONE):
thread 1 thread 2 RCU
__do_sys_open_by_handle_at
do_handle_open
handle_to_path
may_decode_fh
is_mounted
[mount::mnt_ns access]
[mount::mnt_ns access]
__do_sys_close
fput_close_sync
__fput
dissolve_on_fput
umount_tree
class_namespace_excl_destructor
namespace_unlock
free_mnt_ns
mnt_ns_tree_remove
call_rcu(mnt_ns_release_rcu)
mnt_ns_release_rcu
mnt_ns_release
kfree
[mnt_namespace::user_ns access] **UAF**
Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
in __prepend_path().
Additionally, document the semantics of mount::mnt_ns, and use WRITE_ONCE()
for writers that can race with lockless readers.
This bug is unreachable unless one of the following is set:
- CONFIG_PREEMPTION
- CONFIG_RCU_STRICT_GRACE_PERIOD
because it requires an RCU grace period to happen during a syscall without
an explicit preemption.
This doesn't seem to have interesting security impact; worst-case, it could
leak the result of an integer comparison to userspace (from the level
check in cap_capable()), cause an endless loop, or crash the kernel by
dereferencing an invalid address.
Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
I used custom tooling to force this race condition to occur and check
that it leads to a KASAN splat - let me know if you want me to create a
kernel patch to force the race condition and a reproducer you can run.
I remember Christian asking me for feedback on the patch that introduced
the bug, and I missed the bug because I didn't realize what the semantics
of mount::mnt_ns are...
---
fs/fhandle.c | 16 ++++++++++++++--
fs/mount.h | 8 +++++++-
fs/namespace.c | 6 +++---
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/fs/fhandle.c b/fs/fhandle.c
index 642e3d569497..1ca7eb3a6cb5 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -285,6 +285,19 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
return 0;
}
+static bool capable_wrt_mount(struct mount *mount)
+{
+ struct mnt_namespace *mnt_ns;
+
+ /*
+ * For ->mnt_ns access.
+ * The following READ_ONCE() is semantically rcu_dereference().
+ */
+ guard(rcu)();
+ mnt_ns = READ_ONCE(mount->mnt_ns);
+ return ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN);
+}
+
static inline int may_decode_fh(struct handle_to_path_ctx *ctx,
unsigned int o_flags)
{
@@ -320,8 +333,7 @@ static inline int may_decode_fh(struct handle_to_path_ctx *ctx,
if (ns_capable(root->mnt->mnt_sb->s_user_ns, CAP_SYS_ADMIN))
ctx->flags = HANDLE_CHECK_PERMS;
else if (is_mounted(root->mnt) &&
- ns_capable(real_mount(root->mnt)->mnt_ns->user_ns,
- CAP_SYS_ADMIN) &&
+ capable_wrt_mount(real_mount(root->mnt)) &&
!has_locked_children(real_mount(root->mnt), root->dentry))
ctx->flags = HANDLE_CHECK_PERMS | HANDLE_CHECK_SUBTREE;
else
diff --git a/fs/mount.h b/fs/mount.h
index e0816c11a198..f0af6d789bfc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -71,7 +71,13 @@ struct mount {
struct hlist_head mnt_slave_list;/* list of slave mounts */
struct hlist_node mnt_slave; /* slave list entry */
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
- struct mnt_namespace *mnt_ns; /* containing namespace */
+ /*
+ * Containing namespace.
+ * Normally protected by namespace_sem, but there are also lockless
+ * readers (which must use RCU to guard against the namespace being
+ * freed).
+ */
+ struct mnt_namespace *mnt_ns;
struct mountpoint *mnt_mp; /* where is it mounted */
union {
struct hlist_node mnt_mp_list; /* list mounts with the same mountpoint */
diff --git a/fs/namespace.c b/fs/namespace.c
index fe919abd2f01..f5905f4ec560 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1079,7 +1079,7 @@ static void mnt_add_to_ns(struct mnt_namespace *ns, struct mount *mnt)
bool mnt_first_node = true, mnt_last_node = true;
WARN_ON(mnt_ns_attached(mnt));
- mnt->mnt_ns = ns;
+ WRITE_ONCE(mnt->mnt_ns, ns);
while (*link) {
parent = *link;
if (mnt->mnt_id_unique < node_to_mount(parent)->mnt_id_unique) {
@@ -1434,7 +1434,7 @@ EXPORT_SYMBOL(mntget);
void mnt_make_shortterm(struct vfsmount *mnt)
{
if (mnt)
- real_mount(mnt)->mnt_ns = NULL;
+ WRITE_ONCE(real_mount(mnt)->mnt_ns, NULL);
}
/**
@@ -1806,7 +1806,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
ns->nr_mounts--;
__touch_mnt_namespace(ns);
}
- p->mnt_ns = NULL;
+ WRITE_ONCE(p->mnt_ns, NULL);
if (how & UMOUNT_SYNC)
p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
---
base-commit: ba3e43a9e601636f5edb54e259a74f96ca3b8fd8
change-id: 20260603-vfs-fhandle-uaf-fix-32279d5b2758
Best regards,
--
Jann Horn <jannh@google.com>
On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > in __prepend_path(). > + /* > + * Containing namespace. > + * Normally protected by namespace_sem, but there are also lockless > + * readers (which must use RCU to guard against the namespace being > + * freed). > + */ > + struct mnt_namespace *mnt_ns; Umm... It's somewhat subtle - at the very least you need to explain why there will be an RCU delay between umount_tree() clearing that and having the sucker freed.
On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote: > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > in __prepend_path(). > > > + /* > > + * Containing namespace. > > + * Normally protected by namespace_sem, but there are also lockless > > + * readers (which must use RCU to guard against the namespace being > > + * freed). > > + */ > > + struct mnt_namespace *mnt_ns; > > Umm... It's somewhat subtle - at the very least you need to explain why > there will be an RCU delay between umount_tree() clearing that and > having the sucker freed. Something along the lines of "removals from namespace are serialized on namespace_sem and guaranteed to happen no later than the active refcount on namespace reaches zero; freeing of namespace happens only after the passive refcount hitting zero and there's an RCU delay between dropping the last active ref and dropping the passive one that had been implicitly held by the fact of having actives", perhaps? Only in more readable form than that, please...
On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote: > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > > in __prepend_path(). > > > > > + /* > > > + * Containing namespace. > > > + * Normally protected by namespace_sem, but there are also lockless > > > + * readers (which must use RCU to guard against the namespace being > > > + * freed). > > > + */ > > > + struct mnt_namespace *mnt_ns; > > > > Umm... It's somewhat subtle - at the very least you need to explain why > > there will be an RCU delay between umount_tree() clearing that and > > having the sucker freed. > > Something along the lines of "removals from namespace are serialized on > namespace_sem and guaranteed to happen no later than the active > refcount on namespace reaches zero; freeing of namespace happens only > after the passive refcount hitting zero and there's an RCU delay between > dropping the last active ref and dropping the passive one that had been > implicitly held by the fact of having actives", perhaps? Only in > more readable form than that, please... Hm, like this? Containing namespace (active). Normally protected by namespace_sem. Can also be accessed locklessly under RCU. RCU readers can't rely on the namespace still being active, but implicitly hold a passive reference (because an RCU delay happens between a namespace no longer being active and the corresponding passive refcount drop).
On Wed, Jun 03, 2026 at 08:46:07PM +0200, Jann Horn wrote: > On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote: > > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > > > > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > > > in __prepend_path(). > > > > > > > + /* > > > > + * Containing namespace. > > > > + * Normally protected by namespace_sem, but there are also lockless > > > > + * readers (which must use RCU to guard against the namespace being > > > > + * freed). > > > > + */ > > > > + struct mnt_namespace *mnt_ns; > > > > > > Umm... It's somewhat subtle - at the very least you need to explain why > > > there will be an RCU delay between umount_tree() clearing that and > > > having the sucker freed. > > > > Something along the lines of "removals from namespace are serialized on > > namespace_sem and guaranteed to happen no later than the active > > refcount on namespace reaches zero; freeing of namespace happens only > > after the passive refcount hitting zero and there's an RCU delay between > > dropping the last active ref and dropping the passive one that had been > > implicitly held by the fact of having actives", perhaps? Only in > > more readable form than that, please... > > Hm, like this? > > Containing namespace (active). Umm... That's actually "active or has _just_ dropped the last active reference and didn't get around to scheduling decrement of passive refcount yet", unfortunately. Hell knows - "active or deactivating", perhaps?
On Wed, Jun 03, 2026 at 07:53:24PM +0100, Al Viro wrote: > On Wed, Jun 03, 2026 at 08:46:07PM +0200, Jann Horn wrote: > > On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote: > > > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > > > > > > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > > > > in __prepend_path(). > > > > > > > > > + /* > > > > > + * Containing namespace. > > > > > + * Normally protected by namespace_sem, but there are also lockless > > > > > + * readers (which must use RCU to guard against the namespace being > > > > > + * freed). > > > > > + */ > > > > > + struct mnt_namespace *mnt_ns; > > > > > > > > Umm... It's somewhat subtle - at the very least you need to explain why > > > > there will be an RCU delay between umount_tree() clearing that and > > > > having the sucker freed. > > > > > > Something along the lines of "removals from namespace are serialized on > > > namespace_sem and guaranteed to happen no later than the active > > > refcount on namespace reaches zero; freeing of namespace happens only > > > after the passive refcount hitting zero and there's an RCU delay between > > > dropping the last active ref and dropping the passive one that had been > > > implicitly held by the fact of having actives", perhaps? Only in > > > more readable form than that, please... > > > > Hm, like this? > > > > Containing namespace (active). > > Umm... That's actually "active or has _just_ dropped the last active > reference and didn't get around to scheduling decrement of passive refcount > yet", unfortunately. > > Hell knows - "active or deactivating", perhaps? Note that "active" in such context is easy to mistake for "active reference", which it definitely isn't - it does not contribute to active refcount. Mounts within a namespace do not pin it - it's the other way round; they are guaranteed to stay live until they leave the sucker. Anything that hasn't left by the time the active refcount of namespace drops to zero will get pushed out (and killed off unless there are other references to any such mounts)
On Wed, Jun 3, 2026 at 9:02 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jun 03, 2026 at 07:53:24PM +0100, Al Viro wrote: > > On Wed, Jun 03, 2026 at 08:46:07PM +0200, Jann Horn wrote: > > > On Wed, Jun 3, 2026 at 8:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Wed, Jun 03, 2026 at 07:15:23PM +0100, Al Viro wrote: > > > > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > > > > > > > > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > > > > > in __prepend_path(). > > > > > > > > > > > + /* > > > > > > + * Containing namespace. > > > > > > + * Normally protected by namespace_sem, but there are also lockless > > > > > > + * readers (which must use RCU to guard against the namespace being > > > > > > + * freed). > > > > > > + */ > > > > > > + struct mnt_namespace *mnt_ns; > > > > > > > > > > Umm... It's somewhat subtle - at the very least you need to explain why > > > > > there will be an RCU delay between umount_tree() clearing that and > > > > > having the sucker freed. > > > > > > > > Something along the lines of "removals from namespace are serialized on > > > > namespace_sem and guaranteed to happen no later than the active > > > > refcount on namespace reaches zero; freeing of namespace happens only > > > > after the passive refcount hitting zero and there's an RCU delay between > > > > dropping the last active ref and dropping the passive one that had been > > > > implicitly held by the fact of having actives", perhaps? Only in > > > > more readable form than that, please... > > > > > > Hm, like this? > > > > > > Containing namespace (active). > > > > Umm... That's actually "active or has _just_ dropped the last active > > reference and didn't get around to scheduling decrement of passive refcount > > yet", unfortunately. Ah, right, I see, because the mounts of non-anonymous namespaces are only cleared in put_mnt_ns() after the active reference drop. > > Hell knows - "active or deactivating", perhaps? > > Note that "active" in such context is easy to mistake for "active reference", > which it definitely isn't - it does not contribute to active refcount. > Mounts within a namespace do not pin it - it's the other way round; they > are guaranteed to stay live until they leave the sucker. Anything that > hasn't left by the time the active refcount of namespace drops to zero > will get pushed out (and killed off unless there are other references to > any such mounts) (And there's also that weird detail of how, for anonymous namespaces, the active refcount isn't used and AFAICS never actually drops to zero...) So I guess I'll write "Containing namespace (active or deactivating, non-refcounted)."?
On Wed, Jun 03, 2026 at 09:08:26PM +0200, Jann Horn wrote: > (And there's also that weird detail of how, for anonymous namespaces, > the active refcount isn't used and AFAICS never actually drops to > zero...) More like "is always 1 and we skip decrement when we decide to drop that", really. > So I guess I'll write "Containing namespace (active or deactivating, > non-refcounted)."? That would probably do for now... The lifecycle for mnt_namespace really needs to be documented; right now we have a maze of twisty little functions around that area and it takes quite a non-trivial amount of searching to recall the names - and I am familiar with the area ;-/
On Wed, Jun 3, 2026 at 9:08 PM Jann Horn <jannh@google.com> wrote: > (And there's also that weird detail of how, for anonymous namespaces, > the active refcount isn't used and AFAICS never actually drops to > zero...) (Er, nevermind, I missed that anonymous namespaces just have their active refcount set to 0 from the start already.)
On Wed, Jun 03, 2026 at 09:14:25PM +0200, Jann Horn wrote:
> On Wed, Jun 3, 2026 at 9:08 PM Jann Horn <jannh@google.com> wrote:
> > (And there's also that weird detail of how, for anonymous namespaces,
> > the active refcount isn't used and AFAICS never actually drops to
> > zero...)
>
> (Er, nevermind, I missed that anonymous namespaces just have their
> active refcount set to 0 from the start already.)
Let's distinguish a few things:
(1) generic reference count of namespaces in general: __ns_ref
- for mntns: keeps the mount namespace and the mounts attached to it alive
(2) active reference count of namespaces in general: __ns_ref_active.
- always a subset of (1)
- only regulates userspace visibility of the namespace and has no
lifetime implications per se. "active" just means "reachable from
userspace". It's nothing that the mount layer itself should care
about at all.
(3) passive reference count of struct mnt_namespace
- keeps the mount namespace alive but not the mounts attached to it
With (3) you can grab a reference to the mount namespaces without
pinning the mounts in it. Then do other stuff that you want and then you
can grab namespace_sem which allows you to see whether the namespace is
still alive via mnt_ns_empty(). At no point does the caller need to
artificially prolong the lifetime of a mount namespaces by grabbing a
__ns_ref reference count. This is especially useful if the caller needs
to do a bunch of sleeping operations before they can actually do the
meat of the work they need.
On Wed, Jun 3, 2026 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > in __prepend_path(). > > > + /* > > + * Containing namespace. > > + * Normally protected by namespace_sem, but there are also lockless > > + * readers (which must use RCU to guard against the namespace being > > + * freed). > > + */ > > + struct mnt_namespace *mnt_ns; > > Umm... It's somewhat subtle - at the very least you need to explain why > there will be an RCU delay between umount_tree() clearing that and > having the sucker freed. I guess I could write something like this instead, to make it clear that this basically follows normal RCU rules, except that this code isn't actually using RCU markings and accessors? "This is like an __rcu pointer which is protected by RCU and namespace_sem; however, because most accesses happen under namespace_sem, it is not marked as __rcu, and RCU access is done with READ_ONCE()." Or we could put __rcu on this pointer, and annotate all the locked accesses with rcu_dereference_protected(..., lockdep_is_held(&namespace_lock)), but I guess you'd probably prefer to not do that?
On Wed, Jun 03, 2026 at 08:23:44PM +0200, Jann Horn wrote: > On Wed, Jun 3, 2026 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Jun 03, 2026 at 07:38:06PM +0200, Jann Horn wrote: > > > Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like > > > in __prepend_path(). > > > > > + /* > > > + * Containing namespace. > > > + * Normally protected by namespace_sem, but there are also lockless > > > + * readers (which must use RCU to guard against the namespace being > > > + * freed). > > > + */ > > > + struct mnt_namespace *mnt_ns; > > > > Umm... It's somewhat subtle - at the very least you need to explain why > > there will be an RCU delay between umount_tree() clearing that and > > having the sucker freed. > > I guess I could write something like this instead, to make it clear > that this basically follows normal RCU rules, except that this code > isn't actually using RCU markings and accessors? > > "This is like an __rcu pointer which is protected by RCU and > namespace_sem; however, because most accesses happen under > namespace_sem, it is not marked as __rcu, and RCU access is done with > READ_ONCE()." > > Or we could put __rcu on this pointer, and annotate all the locked > accesses with rcu_dereference_protected(..., > lockdep_is_held(&namespace_lock)), but I guess you'd probably prefer > to not do that? Not the point... What I'm talking about is the reason why RCU access to ->mnt_ns is valid in the first place - prompt freeing of namespace instances *is* possible; we do have a guaranteed RCU delay between zeroing ->mnt_ns and having the instance it pointed to freed, but it's not instantly obvious where to look for it. Basically, the store that cleared ->mnt_ns has been done in namespace_sem scope and that scope is either no later than the scope in put_mnt_ns() that has dropped the active refcount of ns to zero. At the beginning of that scope in put_mnt_ns() we are guaranteed to have the passive refcount positive. Dropping the passive reference happens after an rcu delay started in later in the same namespace_sem scope and namespace is not freed until the passive refcount reaches zero.
On Wed, Jun 03, 2026 at 07:41:51PM +0100, Al Viro wrote: > Basically, the store that cleared ->mnt_ns has been done in namespace_sem > scope and that scope is either no later than the scope in put_mnt_ns() argh... s/either// > that has dropped the active refcount of ns to zero. At the beginning > of that scope in put_mnt_ns() we are guaranteed to have the passive > refcount positive. Dropping the passive reference happens after an > rcu delay started in later in the same namespace_sem scope and namespace > is not freed until the passive refcount reaches zero. TL;DR: your fix is correct, but needs a better explanation of correctness. If nothing else, I'd like to have the above findable on lore - I've way too many pieces of half-baked docs sitting around in local notes as it is ;-/
© 2016 - 2026 Red Hat, Inc.