[PATCH 0/8] ns: fixes for namespace iteration and active reference counting

Christian Brauner posted 8 patches 1 month, 1 week ago
[PATCH 0/8] ns: fixes for namespace iteration and active reference counting
Posted by Christian Brauner 1 month, 1 week ago
* Make sure to initialize the active reference count for the initial
  network namespace and prevent __ns_common_init() from returning too
  early.

* Make sure that passive reference counts are dropped outside of rcu
  read locks as some namespaces such as the mount namespace do in fact
  sleep when putting the last reference.

* The setns() system call supports:

  (1) namespace file descriptors (nsfd)
  (2) process file descriptors (pidfd)

  When using nsfds the namespaces will remain active because they are
  pinned by the vfs. However, when pidfds are used things are more
  complicated.

  If the target task exits and passes through exit_nsproxy_namespaces(),
  or is reaped and thus also passes through exit_cred_namespaces(), after
  the setns()'ing task has called prepare_nsset() but before it has
  called commit_nsset(), the active reference count of the set of
  namespaces it wants to setns() to might have been dropped already:

    P1                                                              P2

    pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                    pidfd = pidfd_open(pid_p1)
                                                                    setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                    prepare_nsset()

    exit(0)
    // ns->__ns_active_ref        == 1
    // parent_ns->__ns_active_ref == 1
    -> exit_nsproxy_namespaces()
    -> exit_cred_namespaces()

    // ns_active_ref_put() will also put
    // the reference on the owner of the
    // namespace. If the only reason the
    // owning namespace was alive was
    // because it was a parent of @ns
    // its active reference count now goes
    // to zero... --------------------------------
    //                                           |
    // ns->__ns_active_ref        == 0           |
    // parent_ns->__ns_active_ref == 0           |
                                                 |                  commit_nsset()
                                                 -----------------> // If setns()
                                                                    // now manages to install the namespaces
                                                                    // it will call ns_active_ref_get()
                                                                    // on them thus bumping the active reference
                                                                    // count from zero again but without also
                                                                    // taking the required reference on the owner.
                                                                    // Thus we get:
                                                                    //
                                                                    // ns->__ns_active_ref        == 1
                                                                    // parent_ns->__ns_active_ref == 0

    When someone later calls ns_active_ref_put() on @ns, it will underflow
    parent_ns->__ns_active_ref. This leads to a splat from our asserts,
    which think there are still active references when in fact the counter
    just underflowed.

  So resurrect the ownership chain if necessary as well. If the caller
  succeeded in grabbing passive references to the set of namespaces, the
  setns() should simply succeed even if the target task exits or gets
  reaped in the meantime.

  The race is rare and can only be triggered when using pidfds to setns()
  to namespaces. Also note that active references on initial namespaces
  are no-ops.

  Since we now always handle parent references directly we can drop
  ns_ref_active_get_owner() when adding a namespace to a namespace tree.
  This is now all handled uniformly in the places where the new namespaces
  actually become active.
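
  To make the counting problem above concrete, here is a rough userspace
  sketch of the race. It is a toy model only, not the kernel
  implementation: struct toy_ns, toy_active_put(), toy_active_get_naive(),
  toy_active_get() and toy_race() are hypothetical names made up for
  illustration, and the model assumes the parent namespace is kept active
  solely by its child.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model of the race; every name here is hypothetical. */
struct toy_ns {
	int active;            /* stands in for __ns_active_ref */
	struct toy_ns *owner;  /* owning (parent) namespace, NULL at the top */
};

/*
 * Drop one active reference. When a count reaches zero, the reference
 * that namespace held on its owner is dropped too. Returns -1 instead
 * of decrementing if a count is already zero (the underflow case).
 */
static int toy_active_put(struct toy_ns *ns)
{
	for (; ns; ns = ns->owner) {
		if (ns->active == 0)
			return -1;
		if (--ns->active > 0)
			break;
	}
	return 0;
}

/* Naive get: bumps only @ns, even on a 0 -> 1 transition. */
static void toy_active_get_naive(struct toy_ns *ns)
{
	ns->active++;
}

/*
 * Get that resurrects the ownership chain: every 0 -> 1 transition
 * also takes a reference on the owner, walking up until it finds a
 * namespace that was already active.
 */
static void toy_active_get(struct toy_ns *ns)
{
	for (; ns; ns = ns->owner) {
		if (ns->active++ > 0)
			break;
	}
}

/* Replay the P1/P2 interleaving; returns the result of the final put. */
int toy_race(int resurrect_owner)
{
	struct toy_ns parent = { .active = 1, .owner = NULL };
	struct toy_ns child = { .active = 1, .owner = &parent };

	/* P1 exits after P2's prepare step: both counts drop to zero. */
	if (toy_active_put(&child))
		return -2;
	assert(child.active == 0 && parent.active == 0);

	/* P2 commits and takes an active reference on the child. */
	if (resurrect_owner)
		toy_active_get(&child);
	else
		toy_active_get_naive(&child);

	/* Someone later drops that reference again. */
	return toy_active_put(&child);
}
```

  In this model toy_race(0) returns -1 because the final put finds the
  parent count already at zero, while toy_race(1) returns 0 because the
  0 -> 1 transition on the child also re-took the reference on its owner.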

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (8):
      ns: don't skip active reference count initialization
      ns: don't increment or decrement initial namespaces
      ns: make sure reference are dropped outside of rcu lock
      ns: return EFAULT on put_user() error
      ns: handle setns(pidfd, ...) cleanly
      ns: add asserts for active refcount underflow
      selftests/namespaces: add active reference count regression test
      selftests/namespaces: test for efault

 fs/nsfs.c                                          |   2 +-
 include/linux/ns_common.h                          |  49 +-
 kernel/nscommon.c                                  |  52 +-
 kernel/nstree.c                                    |  44 +-
 tools/testing/selftests/namespaces/.gitignore      |   2 +
 tools/testing/selftests/namespaces/Makefile        |   6 +-
 .../selftests/namespaces/listns_efault_test.c      | 521 +++++++++++++++++++++
 .../namespaces/regression_pidfd_setns_test.c       | 113 +++++
 8 files changed, 715 insertions(+), 74 deletions(-)
---
base-commit: 8ebfb9896c97ab609222460e705f425cb3f0aad0
change-id: 20251109-namespace-6-19-fixes-5bbff9fc6267
Re: [PATCH 0/8] ns: fixes for namespace iteration and active reference counting
Posted by Hillf Danton 1 month, 1 week ago
On Sun, 09 Nov 2025 22:11:21 +0100 Christian Brauner wrote:
> [...]
FYI namespace-6.19.fixes failed to survive the syzbot test [1].

[1] Subject: Re: [syzbot] [lsm?] WARNING in put_cred_rcu
https://lore.kernel.org/lkml/690eedba.a70a0220.22f260.0075.GAE@google.com/
Re: [PATCH 0/8] ns: fixes for namespace iteration and active reference counting
Posted by Christian Brauner 1 month, 1 week ago
On Mon, Nov 10, 2025 at 06:55:26AM +0800, Hillf Danton wrote:
> On Sun, 09 Nov 2025 22:11:21 +0100 Christian Brauner wrote:
> > [...]
> FYI namespace-6.19.fixes failed to survive the syzbot test [1].
> 
> [1] Subject: Re: [syzbot] [lsm?] WARNING in put_cred_rcu
> https://lore.kernel.org/lkml/690eedba.a70a0220.22f260.0075.GAE@google.com/

This used a stale branch that existed for testing:

Tested on:

commit:         00f5a3b5 DO NOT MERGE - This is purely for testing a b..

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

git tree:       https://github.com/brauner/linux.git namespace-6.19.fixes
console output: https://syzkaller.appspot.com/x/log.txt?x=17a46a58580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e31f5f45f87b6763
dashboard link: https://syzkaller.appspot.com/bug?extid=553c4078ab14e3cf3358
compiler:       Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8

Note: no patches were applied.
Re: [PATCH 0/8] ns: fixes for namespace iteration and active reference counting
Posted by Hillf Danton 1 month ago
On Mon, 10 Nov 2025 09:41:56 +0100 Christian Brauner wrote:
> On Mon, Nov 10, 2025 at 06:55:26AM +0800, Hillf Danton wrote:
> > FYI namespace-6.19.fixes failed to survive the syzbot test [1].
> > 
> > [1] Subject: Re: [syzbot] [lsm?] WARNING in put_cred_rcu
> > https://lore.kernel.org/lkml/690eedba.a70a0220.22f260.0075.GAE@google.com/
> 
> This used a stale branch that existed for testing:
> 
> Tested on:
> 
> commit:         00f5a3b5 DO NOT MERGE - This is purely for testing a b..
>
FYI namespace-6.19 failed to survive syzbot test [2].

[2] Subject: Re: [syzbot] [kernel?] general protection fault in put_pid_ns
https://lore.kernel.org/lkml/691658dd.a70a0220.3124cb.0033.GAE@google.com/
Re: [PATCH 0/8] ns: fixes for namespace iteration and active reference counting
Posted by Hillf Danton 1 month, 1 week ago
On Mon, 10 Nov 2025 09:41:56 +0100 Christian Brauner wrote:
> On Mon, Nov 10, 2025 at 06:55:26AM +0800, Hillf Danton wrote:
> > FYI namespace-6.19.fixes failed to survive the syzbot test [1].
> > 
> > [1] Subject: Re: [syzbot] [lsm?] WARNING in put_cred_rcu
> > https://lore.kernel.org/lkml/690eedba.a70a0220.22f260.0075.GAE@google.com/
> 
> This used a stale branch that existed for testing:
> 
> Tested on:
> 
> commit:         00f5a3b5 DO NOT MERGE - This is purely for testing a b..
>
Then feel free to show the commit id to be fed to syzbot.
	syz test https://github.com/brauner/linux.git ID-to-test