On Fri, Feb 06, 2026 at 03:04:53PM -0500, Waiman Long wrote:

[summary of subthread: there's an unpleasant corner case in unshare(2),
when we have a CLONE_NEWNS in flags and current->fs hadn't been shared
at all; in that case copy_mnt_ns() gets passed current->fs instead of
a private copy, which causes interesting warts in proof of correctness]

> I guess if private means fs->users == 1, the condition could still be true.

Unfortunately, it's worse than just a convoluted proof of correctness.
Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS
(and current->fs->users == 1).

We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and
flips current->fs->{pwd,root} to corresponding locations in the new namespace.
Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM).
We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's
destroyed and its mount tree is dissolved, but... current->fs->root and
current->fs->pwd are both left pointing to now detached mounts.

They are pinning those, so it's not a UAF, but it leaves the calling
process with unshare(2) failing with -ENOMEM _and_ leaving it with
pwd and root on detached isolated mounts. The last part is clearly a bug.

There is other fun related to that mess (races with pivot_root(), including
the one between pivot_root() and fork(), of all things), but this one
is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new
fs_struct even if it hadn't been shared in the first place". Sure, we could
go for something like "if both CLONE_NEWNS *and* one of the things that might
end up failing after copy_mnt_ns() call in create_new_namespaces() are set,
force allocation of new fs_struct", but let's keep it simple - the cost
of copy_fs_struct() is trivial.

Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets
a freshly allocated fs_struct, yet to be attached to anything. That
seriously simplifies the analysis...

FWIW, that bug had been there since the introduction of unshare(2) ;-/

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..68ccbaea7398 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3082,7 +3082,7 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 		return 0;
 
 	/* don't need lock here; in the worst case we'll do useless copy */
-	if (fs->users == 1)
+	if (!(unshare_flags & CLONE_NEWNS) && fs->users == 1)
 		return 0;
 
 	*new_fsp = copy_fs_struct(fs);
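
For readers who want to step through the failure sequence from the commit
message outside the kernel, here is a toy userspace model. It is only a
sketch: struct toy_ns, struct toy_fs and toy_unshare() are made-up names, the
"namespace" is a single flag, and nothing here is kernel code. It assumes the
sequence described above: the copy_mnt_ns() step repoints root/pwd of whatever
fs_struct it is handed, and a later failure tears the new namespace down.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration only; these are not the kernel's types. */
struct toy_ns {
	bool attached;		/* false once the mount tree has been dissolved */
};

struct toy_fs {
	int users;		/* models fs->users */
	struct toy_ns *ns;	/* namespace that root and pwd point into */
};

static struct toy_ns old_ns = { .attached = true };
static struct toy_ns new_ns;

/*
 * Models unshare(CLONE_NEWNS | CLONE_NEWCGROUP).
 * always_copy == false: pre-patch rule, reuse *cur when cur->users == 1.
 * always_copy == true:  patched rule, CLONE_NEWNS always works on a copy.
 * Returns 0 on success, -1 if the modelled copy_cgroup_ns() step fails.
 */
static int toy_unshare(struct toy_fs *cur, bool always_copy, bool cgroup_fails)
{
	struct toy_fs copy = { .users = 1, .ns = cur->ns };
	bool use_copy = always_copy || cur->users > 1;
	struct toy_fs *target = use_copy ? &copy : cur;

	/* copy_mnt_ns(): create the namespace and repoint root/pwd of target */
	new_ns.attached = true;
	target->ns = &new_ns;

	if (cgroup_fails) {
		/* put_mnt_ns(): the new namespace goes away, its tree is dissolved */
		new_ns.attached = false;
		return -1;
	}

	/* Success: the caller ends up with root/pwd in the new namespace. */
	cur->ns = target->ns;
	return 0;
}
```

Starting from a toy_fs with users == 1 and ns == &old_ns, calling
toy_unshare(&fs, false, true) returns -1 and leaves fs.ns pointing at the
dissolved namespace, which is the bug. With always_copy set to true the call
still returns -1, but fs.ns is untouched and still attached.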
On Sat, Feb 07, 2026 at 08:25:24AM +0000, Al Viro wrote:
> [...]
I'll take that through my namespace tree as a fix.
Thanks!
Christian
On Sat, 07 Feb 2026 08:25:24 +0000, Al Viro wrote:
> On Fri, Feb 06, 2026 at 03:04:53PM -0500, Waiman Long wrote:
>
> [summary of subthread: there's an unpleasant corner case in unshare(2),
> when we have a CLONE_NEWNS in flags and current->fs hadn't been shared
> at all; in that case copy_mnt_ns() gets passed current->fs instead of
> a private copy, which causes interesting warts in proof of correctness]
>
> [...]
Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/1] bug in unshare(2) failure recovery
      https://git.kernel.org/vfs/vfs/c/c11eb27d36ff
On 2/7/26 3:25 AM, Al Viro wrote:
> [...]
After booting up a vanilla 6.19.0-rc8 kernel, I found that copy_mnt_ns()
was called 13 times during the bootup process with current->fs passed
down to it. After applying this patch, the count dropped to 0.
Tested-by: Waiman Long <longman@redhat.com>
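
The drop to zero is what the changed condition predicts. As a rough sketch,
the old and new checks in unshare_fs() can be written as two pure predicates
that answer "is a private fs_struct copy made?". The function names and the
TOY_ prefixed constants are made up for this illustration; the !fs check from
the real function is left out. The model also assumes that the caller sets
CLONE_FS whenever CLONE_NEWNS is requested, which I believe ksys_unshare()
does but which is not shown in this thread.

```c
#include <stdbool.h>

/* Values match the uapi CLONE_* flags; toy names avoid clashing with <sched.h>. */
#define TOY_CLONE_FS	0x00000200UL
#define TOY_CLONE_NEWNS	0x00020000UL

/* Pre-patch check: an unshared fs_struct (users == 1) is never copied. */
static bool toy_copies_fs_before(unsigned long flags, int users)
{
	if (!(flags & TOY_CLONE_FS))
		return false;
	return users != 1;
}

/* Patched check: CLONE_NEWNS forces a copy even when users == 1. */
static bool toy_copies_fs_after(unsigned long flags, int users)
{
	if (!(flags & TOY_CLONE_FS))
		return false;
	if (!(flags & TOY_CLONE_NEWNS) && users == 1)
		return false;
	return true;
}
```

With TOY_CLONE_FS | TOY_CLONE_NEWNS and users == 1, the first predicate is
false (no copy, so copy_mnt_ns() would be handed current->fs) and the second
is true. Without TOY_CLONE_NEWNS the two agree for every users value, so the
plain CLONE_FS path is unchanged.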