[v1] nstree: listns()

[PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Christian Brauner 3 months, 2 weeks ago

Hey,

As announced a while ago this is the next step building on the nstree
work from prior cycles. There's a bunch of fixes and semantic cleanups
in here and a ton of tests.

I need help here! Consider the following current design:

Currently listns() is relying on active namespace reference counts which
are introduced alongside this series.

The active reference count of a namespace consists of the live tasks
that make use of this namespace and any namespace file descriptors that
explicitly pin the namespace.

Once all tasks making use of this namespace have exited or been reaped,
all namespace file descriptors for that namespace have been closed, and
all bind-mounts for that namespace have been unmounted, it ceases to
appear in the listns() output.

My reason for introducing the active reference count was that namespaces
might obviously still be pinned internally for various reasons. For
example the user namespace might still be pinned because there are still
open files that have stashed the opener's credentials in file->f_cred, or
the last reference might be put with an rcu delay keeping that namespace
active on the namespace lists.

But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
which uses lazy TLB destruction.

When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. So the kernel thread
will take a reference on the struct mm_struct pinning it.

And for ptrace() based access checks struct mm_struct stashes the user
namespace of the task that struct mm_struct belonged to originally and
thus takes a reference to the user namespace and pins it.

So on an idle system such user namespaces can be persisted for pretty
arbitrary amounts of time via struct mm_struct.

Now, without the active reference count regulating visibility, all
namespaces that are still pinned in some way on the system will appear in
the listns() output and can be reopened using namespace file handles.

Of course that requires suitable privileges and it's not really a
concern per se because a task could've also persisted the namespace
recorded in struct mm_struct explicitly and then the idle task would
still reuse that struct mm_struct and another task could still happily
setns() to it afaict and reuse it for something else.

The active reference count though has drawbacks itself. Namely that
socket files break the assumption that namespaces can only be opened if
there's either live processes pinning the namespace or there are file
descriptors open that pin the namespace itself as the socket SIOCGSKNS
ioctl() can be used to open a network namespace based on a socket which
only indirectly pins a network namespace.

So that punches a hole in the active reference count tracking. So this
will have to be handled as right now socket file descriptors that pin a
network namespace that doesn't have an active reference anymore (no live
processes, no explicit persistence via namespace fds) can't be used to
issue a SIOCGSKNS ioctl() to open the associated network namespace.

So three options I see if the api is based on ids:

(1) We use the active reference count and somehow also make it work with
    sockets.
(2) The active reference count is not needed and we say that listns() is
    an introspection system call anyway so we just always list
    namespaces regardless of why they are still pinned: files,
    mm_struct, network devices, everything is fair game.
(3) Throw hands up in the air and just not do it.
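
To make the difference between options (1) and (2) concrete, here is a
toy userspace model. It is not kernel code: struct toy_ns, its two
counters and toy_list() are hypothetical names used only for
illustration, and the real series tracks these counts differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a namespace; not the kernel's data structure. */
struct toy_ns {
	uint64_t id;
	unsigned int ref;        /* every reference, internal pins included */
	unsigned int ref_active; /* tasks, namespace fds, bind-mounts only */
};

/*
 * Copy the IDs of listable namespaces into out[] and return how many
 * were written. With use_active == true only namespaces that still have
 * an active reference are listed (option 1); otherwise anything that is
 * still pinned at all is listed (option 2).
 */
static size_t toy_list(const struct toy_ns *tbl, size_t n, bool use_active,
		       uint64_t *out, size_t cap)
{
	size_t written = 0;

	for (size_t i = 0; i < n && written < cap; i++) {
		unsigned int cnt = use_active ? tbl[i].ref_active : tbl[i].ref;

		if (cnt > 0)
			out[written++] = tbl[i].id;
	}
	return written;
}
```

A user namespace that is only kept around by a lazily referenced
mm_struct would have ref > 0 and ref_active == 0 in this model, so it
shows up under option (2) but not under option (1).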

=====================================================================

Add a new listns() system call that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing existing namespace apis.

Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:

1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
   running process but are kept alive by file descriptors, bind mounts,
   or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. No ordering or ownership.
5. No filtering per namespace type: Must always iterate and check all
   namespaces.

The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.

/*
 * @req: Pointer to struct ns_id_req specifying search parameters
 * @ns_ids: User buffer to receive namespace IDs
 * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
 * @flags: Reserved for future use (must be 0)
 */
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
               size_t nr_ns_ids, unsigned int flags);

Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code

/*
 * @size: Structure size
 * @ns_id: Starting point for iteration; use 0 for first call, then
 *         use the last returned ID for subsequent calls to paginate
 * @ns_type: Bitmask of namespace types to include (from enum ns_type):
 *           0: Return all namespace types
 *           MNT_NS: Mount namespaces
 *           NET_NS: Network namespaces
 *           USER_NS: User namespaces
 *           etc. Can be OR'd together
 * @user_ns_id: Filter results to namespaces owned by this user namespace:
 *              0: Return all namespaces (subject to permission checks)
 *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
 *              Other value: Namespaces owned by the specified user namespace ID
 */
struct ns_id_req {
        __u32 size;         /* sizeof(struct ns_id_req) */
        __u32 spare;        /* Reserved, must be 0 */
        __u64 ns_id;        /* Last seen namespace ID (for pagination) */
        __u32 ns_type;      /* Filter by namespace type(s) */
        __u32 spare2;       /* Reserved, must be 0 */
        __u64 user_ns_id;   /* Filter by owning user namespace */
};
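
The examples below call listns() directly. Until libc provides a
wrapper, something along these lines could stand in for it. This is
only a sketch: the name sys_listns() is made up here, the struct is a
local copy of the layout above, and the syscall number is not stated in
this cover letter, so the code assumes the installed headers define
__NR_listns and fails with ENOSYS otherwise.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the request layout shown above. */
struct ns_id_req {
	uint32_t size;       /* sizeof(struct ns_id_req) */
	uint32_t spare;      /* Reserved, must be 0 */
	uint64_t ns_id;      /* Last seen namespace ID (for pagination) */
	uint32_t ns_type;    /* Filter by namespace type(s) */
	uint32_t spare2;     /* Reserved, must be 0 */
	uint64_t user_ns_id; /* Filter by owning user namespace */
};

/*
 * Hypothetical wrapper around the raw system call. As with other
 * syscall(2) based wrappers, errors are reported as -1 with errno set
 * rather than as a negative return value.
 */
static ssize_t sys_listns(const struct ns_id_req *req, uint64_t *ns_ids,
			  size_t nr_ns_ids, unsigned int flags)
{
#ifdef __NR_listns
	return syscall(__NR_listns, req, ns_ids, nr_ns_ids, flags);
#else
	/* Headers predate the series: behave like an unknown syscall. */
	(void)req;
	(void)ns_ids;
	(void)nr_ns_ids;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

With that in place the examples can be tried by mapping listns() onto
sys_listns(), e.g. via a one-line macro.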

Example 1: List all namespaces

void list_all_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,      /* Start from beginning */
		.ns_type = 0,    /* All types */
		.user_ns_id = 0, /* All user namespaces */
	};
	uint64_t ids[100];
	ssize_t ret;

	printf("All namespaces in the system:\n");
	do {
		ret = listns(&req, ids, 100, 0);
		if (ret < 0) {
			perror("listns");
			break;
		}

		for (ssize_t i = 0; i < ret; i++)
			printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);

		/* Continue from last seen ID */
		if (ret > 0)
			req.ns_id = ids[ret - 1];
	} while (ret == 100); /* Buffer was full, more may exist */
}

Example 2: List network namespaces only

void list_network_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = NET_NS, /* Only network namespaces */
		.user_ns_id = 0,
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Network namespaces: %zd found\n", ret);
	for (ssize_t i = 0; i < ret; i++)
		printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 3: List namespaces owned by current user namespace

void list_owned_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = 0,                      /* All types */
		.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Namespaces owned by my user namespace: %zd\n", ret);
	for (ssize_t i = 0; i < ret; i++)
		printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 4: List multiple namespace types

void list_network_and_mount_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = NET_NS | MNT_NS, /* Network and mount */
		.user_ns_id = 0,
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Network and mount namespaces: %zd found\n", ret);
}

Example 5: Pagination through large namespace sets

void list_all_with_pagination(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = 0,
		.user_ns_id = 0,
	};
	uint64_t ids[50];
	size_t total = 0;
	ssize_t ret;

	printf("Enumerating all namespaces with pagination:\n");

	while (1) {
		ret = listns(&req, ids, 50, 0);
		if (ret < 0) {
			perror("listns");
			break;
		}
		if (ret == 0)
			break; /* No more namespaces */

		total += ret;
		printf("  Batch: %zd namespaces\n", ret);

		/* Last ID in this batch becomes start of next batch */
		req.ns_id = ids[ret - 1];

		if (ret < 50)
			break; /* Partial batch = end of results */
	}

	printf("Total: %zu namespaces\n", total);
}

listns() respects namespace isolation and capabilities:

(1) Global listing (user_ns_id = 0):
    - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
    - OR the namespace must be in the caller's namespace context (e.g.,
      a namespace the caller is currently using)
    - User namespaces additionally allow listing if the caller has
      CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
    - Requires CAP_SYS_ADMIN in the specified owner user namespace
    - OR the namespace must be in the caller's namespace context
    - This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
    - Only "active" namespaces are listed
    - A namespace is active if it has a non-zero __ns_ref_active count
    - This includes namespaces used by running processes, held by open
      file descriptors, or kept active by bind mounts
    - Inactive namespaces (kept alive only by internal kernel
      references) are not visible via listns()
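
Read as a decision procedure, the global-listing rules in (1) together
with the visibility rule in (3) come out roughly as below. This is a
simplified userspace model, not the kernel implementation: struct
toy_ns_view and toy_may_list() are hypothetical, and each boolean
stands for a check the kernel would perform itself.

```c
#include <stdbool.h>

/* Hypothetical per-namespace facts as seen by one caller. */
struct toy_ns_view {
	bool active;             /* non-zero active reference count */
	bool is_user_ns;         /* the namespace is a user namespace */
	bool admin_in_owner;     /* CAP_SYS_ADMIN in the owning userns */
	bool admin_in_ns_itself; /* CAP_SYS_ADMIN in the userns itself */
	bool in_caller_context;  /* caller is currently using it */
};

/* Return true if this model would include the namespace in the output. */
static bool toy_may_list(const struct toy_ns_view *v)
{
	/* (3) Inactive namespaces are never visible. */
	if (!v->active)
		return false;

	/* (1) Privileged over the owner, or one of the caller's own. */
	if (v->admin_in_owner || v->in_caller_context)
		return true;

	/* (1) Extra rule that only applies to user namespaces. */
	return v->is_user_ns && v->admin_in_ns_itself;
}
```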

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (50):
      libfs: allow to specify s_d_flags
      nsfs: use inode_just_drop()
      nsfs: raise DCACHE_DONTCACHE explicitly
      pidfs: raise DCACHE_DONTCACHE explicitly
      nsfs: raise SB_I_NODEV and SB_I_NOEXEC
      nstree: simplify return
      ns: initialize ns_list_node for initial namespaces
      ns: add __ns_ref_read()
      ns: add active reference count
      ns: use anonymous struct to group list member
      nstree: introduce a unified tree
      nstree: allow lookup solely based on inode
      nstree: assign fixed ids to the initial namespaces
      ns: maintain list of owned namespaces
      nstree: add listns()
      arch: hookup listns() system call
      nsfs: update tools header
      selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
      selftests/namespaces: first active reference count tests
      selftests/namespaces: second active reference count tests
      selftests/namespaces: third active reference count tests
      selftests/namespaces: fourth active reference count tests
      selftests/namespaces: fifth active reference count tests
      selftests/namespaces: sixth active reference count tests
      selftests/namespaces: seventh active reference count tests
      selftests/namespaces: eigth active reference count tests
      selftests/namespaces: ninth active reference count tests
      selftests/namespaces: tenth active reference count tests
      selftests/namespaces: eleventh active reference count tests
      selftests/namespaces: twelth active reference count tests
      selftests/namespaces: thirteenth active reference count tests
      selftests/namespaces: fourteenth active reference count tests
      selftests/namespaces: fifteenth active reference count tests
      selftests/namespaces: add listns() wrapper
      selftests/namespaces: first listns() test
      selftests/namespaces: second listns() test
      selftests/namespaces: third listns() test
      selftests/namespaces: fourth listns() test
      selftests/namespaces: fifth listns() test
      selftests/namespaces: sixth listns() test
      selftests/namespaces: seventh listns() test
      selftests/namespaces: ninth listns() test
      selftests/namespaces: ninth listns() test
      selftests/namespaces: first listns() permission test
      selftests/namespaces: second listns() permission test
      selftests/namespaces: third listns() permission test
      selftests/namespaces: fourth listns() permission test
      selftests/namespaces: fifth listns() permission test
      selftests/namespaces: sixth listns() permission test
      selftests/namespaces: seventh listns() permission test

 arch/alpha/kernel/syscalls/syscall.tbl             |    1 +
 arch/arm/tools/syscall.tbl                         |    1 +
 arch/arm64/tools/syscall_32.tbl                    |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl              |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl        |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl          |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl          |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl          |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl            |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl           |    1 +
 arch/s390/kernel/syscalls/syscall.tbl              |    1 +
 arch/sh/kernel/syscalls/syscall.tbl                |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl             |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl             |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl            |    1 +
 fs/libfs.c                                         |    1 +
 fs/namespace.c                                     |    8 +-
 fs/nsfs.c                                          |   79 +-
 fs/pidfs.c                                         |    1 +
 include/linux/ns_common.h                          |  147 +-
 include/linux/nsfs.h                               |    3 +
 include/linux/nstree.h                             |   26 +-
 include/linux/pseudo_fs.h                          |    1 +
 include/linux/syscalls.h                           |    4 +
 include/uapi/asm-generic/unistd.h                  |    4 +-
 include/uapi/linux/nsfs.h                          |   58 +
 init/version-timestamp.c                           |    5 +
 ipc/msgutil.c                                      |    5 +
 ipc/namespace.c                                    |    1 +
 kernel/cgroup/cgroup.c                             |    5 +
 kernel/cgroup/namespace.c                          |    1 +
 kernel/cred.c                                      |   17 +
 kernel/exit.c                                      |    1 +
 kernel/nscommon.c                                  |   59 +-
 kernel/nsproxy.c                                   |    7 +
 kernel/nstree.c                                    |  527 ++++-
 kernel/pid.c                                       |   15 +
 kernel/pid_namespace.c                             |    1 +
 kernel/time/namespace.c                            |    6 +
 kernel/user.c                                      |    5 +
 kernel/user_namespace.c                            |    1 +
 kernel/utsname.c                                   |    1 +
 net/core/net_namespace.c                           |    3 +-
 scripts/syscall.tbl                                |    1 +
 tools/include/uapi/linux/nsfs.h                    |   70 +
 tools/testing/selftests/filesystems/utils.c        |    2 +-
 tools/testing/selftests/namespaces/.gitignore      |    3 +
 tools/testing/selftests/namespaces/Makefile        |    7 +-
 .../selftests/namespaces/listns_permissions_test.c |  777 +++++++
 tools/testing/selftests/namespaces/listns_test.c   |  656 ++++++
 .../selftests/namespaces/ns_active_ref_test.c      | 2226 ++++++++++++++++++++
 tools/testing/selftests/namespaces/wrappers.h      |   35 +
 53 files changed, 4737 insertions(+), 48 deletions(-)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20251020-work-namespace-nstree-listns-9fd71518515c

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Josef Bacik 3 months, 2 weeks ago

On Tue, Oct 21, 2025 at 01:43:06PM +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need help here! Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or been reaped,
> all namespace file descriptors for that namespace have been closed, and
> all bind-mounts for that namespace have been unmounted, it ceases to
> appear in the listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the opener's credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the user namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility, all
> namespaces that are still pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
>
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persisted the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a hole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that doesn't have an active reference anymore (no live
> processes, no explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
>
> So three options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
>

I think the active reference counts are just nice to have. If I'm not missing
something we still have to figure out which pid is using the namespace we may
want to enter, so there's already a "time of check, time of use" issue. I think
if we want to have the active count we can do it just as an advisory thing, have
a flag that says "this ns is dying and you can't do anything with it", and then
for network namespaces we can just never set the flag and let the existing
SIOCGSKNS ioctl work as is.

The bigger question (and sorry I didn't think about this before now) is how are
we going to integrate this into the rest of the NS related syscalls? Having
programmatic introspection is excellent from a usability point of view, but we
also want to be able to have an easy way to get a PID from these namespaces, and
even eventually do things like setns() based on these IDs. Those are followup
series of course, but we should at least have a plan for them. Thanks,

Josef

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Christian Brauner 3 months, 2 weeks ago

On Tue, Oct 21, 2025 at 10:34:54AM -0400, Josef Bacik wrote:
> On Tue, Oct 21, 2025 at 01:43:06PM +0200, Christian Brauner wrote:
> > Hey,
> > 
> > As announced a while ago this is the next step building on the nstree
> > work from prior cycles. There's a bunch of fixes and semantic cleanups
> > in here and a ton of tests.
> > 
> > I need help here! Consider the following current design:
> > 
> > Currently listns() is relying on active namespace reference counts which
> > are introduced alongside this series.
> > 
> > The active reference count of a namespace consists of the live tasks
> > that make use of this namespace and any namespace file descriptors that
> > explicitly pin the namespace.
> > 
> > Once all tasks making use of this namespace have exited or been reaped,
> > all namespace file descriptors for that namespace have been closed, and
> > all bind-mounts for that namespace have been unmounted, it ceases to
> > appear in the listns() output.
> > 
> > My reason for introducing the active reference count was that namespaces
> > might obviously still be pinned internally for various reasons. For
> > example the user namespace might still be pinned because there are still
> > open files that have stashed the opener's credentials in file->f_cred, or
> > the last reference might be put with an rcu delay keeping that namespace
> > active on the namespace lists.
> > 
> > But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> > Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> > which uses lazy TLB destruction.
> > 
> > When this option is set a userspace task's struct mm_struct may be used
> > for kernel threads such as the idle task and will only be destroyed once
> > the cpu's runqueue switches back to another task. So the kernel thread
> > will take a reference on the struct mm_struct pinning it.
> > 
> > And for ptrace() based access checks struct mm_struct stashes the user
> > namespace of the task that struct mm_struct belonged to originally and
> > thus takes a reference to the user namespace and pins it.
> > 
> > So on an idle system such user namespaces can be persisted for pretty
> > arbitrary amounts of time via struct mm_struct.
> > 
> > Now, without the active reference count regulating visibility, all
> > namespaces that are still pinned in some way on the system will appear in
> > the listns() output and can be reopened using namespace file handles.
> >
> > Of course that requires suitable privileges and it's not really a
> > concern per se because a task could've also persisted the namespace
> > recorded in struct mm_struct explicitly and then the idle task would
> > still reuse that struct mm_struct and another task could still happily
> > setns() to it afaict and reuse it for something else.
> > 
> > The active reference count though has drawbacks itself. Namely that
> > socket files break the assumption that namespaces can only be opened if
> > there's either live processes pinning the namespace or there are file
> > descriptors open that pin the namespace itself as the socket SIOCGSKNS
> > ioctl() can be used to open a network namespace based on a socket which
> > only indirectly pins a network namespace.
> > 
> > So that punches a hole in the active reference count tracking. So this
> > will have to be handled as right now socket file descriptors that pin a
> > network namespace that doesn't have an active reference anymore (no live
> > processes, no explicit persistence via namespace fds) can't be used to
> > issue a SIOCGSKNS ioctl() to open the associated network namespace.
> >
> > So three options I see if the api is based on ids:
> > 
> > (1) We use the active reference count and somehow also make it work with
> >     sockets.
> > (2) The active reference count is not needed and we say that listns() is
> >     an introspection system call anyway so we just always list
> >     namespaces regardless of why they are still pinned: files,
> >     mm_struct, network devices, everything is fair game.
> > (3) Throw hands up in the air and just not do it.
> >
> 
> I think the active reference counts are just nice to have. If I'm not missing
> something we still have to figure out which pid is using the namespace we may
> want to enter, so there's already a "time of check, time of use" issue. I think
> if we want to have the active count we can do it just as an advisory thing, have
> a flag that says "this ns is dying and you can't do anything with it", and then
> for network namespaces we can just never set the flag and let the existing
> SIOCGSKNS ioctl work as is.
> 
> The bigger question (and sorry I didn't think about this before now) is how are
> we going to integrate this into the rest of the NS related syscalls? Having
> programmatic introspection is excellent from a usability point of view, but we
> also want to be able to have an easy way to get a PID from these namespaces, and
> even eventually do things like setns() based on these IDs. Those are followup
> series of course, but we should at least have a plan for them. Thanks,

I don't think we even need to have separate system calls to operate
directly on the IDs; that's why I added namespace file handles.

We have listns() to iterate through namespaces in various ways.
This will be followed by statns() which will indeed operate on these
IDs to retrieve namespace specific information.

I already have that one drafted as well. That can contain all kinds of
namespace specific information like number of mounts (mntns), number of
sockets (netns), number of network devices (netns), number of processes
(pidns), what have you. Although what to expose I'm leaving to the
individual namespaces to figure out. IOW, I'm not going to figure out
what information statns() should expose for network namespaces. I'll
leave that to net/.

But to your other point: to perform traditional operations like setns()
or all of the ioctls associated with such namespaces, it's pretty easy:

  struct file_handle *net_handle;
  char net_buf[sizeof(*net_handle) + MAX_HANDLE_SZ];

  net_handle = (struct file_handle *)net_buf;
  net_handle->handle_bytes = sizeof(struct nsfs_file_handle);
  net_handle->handle_type = FILEID_NSFS;
  struct nsfs_file_handle *net_fh = (struct nsfs_file_handle *)net_handle->f_handle;
  net_fh->ns_id = netns_id;

Now obviously that should exist in a nice simple define and aligned and
not nastily open-coded like I did here but you get the point. The
namespace id is sufficient to open an fd to it which is the main api to
perform actual semantic operations on it.
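
A slightly less open-coded variant could look like the sketch below. It
is only an illustration: ns_handle_alloc() and struct
example_nsfs_handle are hypothetical names, the payload layout is an
assumption about struct nsfs_file_handle, and the handle type is passed
in by the caller (FILEID_NSFS in the snippet above) so that no constant
has to be guessed here.

```c
#define _GNU_SOURCE /* struct file_handle, MAX_HANDLE_SZ */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Assumed layout of the nsfs handle payload. Real code should use
 * struct nsfs_file_handle from the kernel headers instead.
 */
struct example_nsfs_handle {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/*
 * Allocate a file handle that refers to the namespace with @ns_id.
 * Fields other than ns_id are left zeroed. The caller frees the result.
 */
static struct file_handle *ns_handle_alloc(int handle_type, uint64_t ns_id)
{
	struct example_nsfs_handle payload = { .ns_id = ns_id };
	struct file_handle *fh;

	/* Heap memory is suitably aligned, unlike a plain char array. */
	fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return NULL;

	fh->handle_bytes = sizeof(payload);
	fh->handle_type = handle_type;
	/* memcpy() avoids casting f_handle to a wider type. */
	memcpy(fh->f_handle, &payload, sizeof(payload));
	return fh;
}
```

The returned handle would then be passed to open_by_handle_at() in the
same way as net_handle.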

  /*
   * As long as the caller has CAP_SYS_ADMIN in the owning user namespace
   * of the network namespace or is located in the network namespace they
   * can open a file descriptor to it just like with
   * /proc/<pid>/ns/<ns_type>
   */
  int netns_fd = open_by_handle_at(FD_NSFS_ROOT, net_handle, O_RDONLY);
  
  setns(netns_fd, CLONE_NEWNET);

Getting pids from pid namespaces or translating them between pid
namespaces is something I added a while ago as well, via ioctl_nsfs:

/* Translate pid from target pid namespace into the caller's pid namespace. */
#define NS_GET_PID_FROM_PIDNS	_IOR(NSIO, 0x6, int)
/* Return thread-group leader id of pid in the caller's pid namespace. */
#define NS_GET_TGID_FROM_PIDNS	_IOR(NSIO, 0x7, int)
/* Translate pid from caller's pid namespace into a target pid namespace. */
#define NS_GET_PID_IN_PIDNS	_IOR(NSIO, 0x8, int)
/* Return thread-group leader id of pid in the target pid namespace. */
#define NS_GET_TGID_IN_PIDNS	_IOR(NSIO, 0x9, int)

That also works fd-based and so is covered by open_by_handle_at().

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Ferenc Fejes 3 months, 2 weeks ago

On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need help here! Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or been reaped,
> all namespace file descriptors for that namespace have been closed, and
> all bind-mounts for that namespace have been unmounted, it ceases to
> appear in the listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the opener's credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the user namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility, all
> namespaces that are still pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
>
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persisted the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a whole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
> 
> So two options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
> 
> =====================================================================
> 
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
> 
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
> 
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
>    running process but are kept alive by file descriptors, bind mounts,
>    or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
>    namespaces.
> 
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.

I've been waiting for such an API for years; thanks for working on it. I mostly
deal with network namespaces, where points 2 and 3 are especially painful.

Recently, I've used this eBPF snippet to discover (at most 1024, because of the
verifier's halt checking) network namespaces, even if no process is attached.
But I can't do anything with it in userspace since it's not possible to pass the
inode number or netns cookie value to setns()...

extern const void net_namespace_list __ksym;
static void list_all_netns()
{
    struct list_head *nslist = 
	bpf_core_cast(&net_namespace_list, struct list_head);

    struct list_head *iter = nslist->next;

    bpf_repeat(1024) {
        const struct net *net =
		bpf_core_cast(container_of(iter, struct net, list), struct net);

        // bpf_printk("net: %p inode: %u cookie: %lu", 
	//	net, net->ns.inum, net->net_cookie);

        if (iter->next == nslist)
            break;
        iter = iter->next;
    }
}

> 
> /*
>  * @req: Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids: User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags: Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
> 
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
> 
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
>  *              Other value: Namespaces owned by the specified user namespace ID
>  */
> struct ns_id_req {
>         __u32 size;         /* sizeof(struct ns_id_req) */
>         __u32 spare;        /* Reserved, must be 0 */
>         __u64 ns_id;        /* Last seen namespace ID (for pagination) */
>         __u32 ns_type;      /* Filter by namespace type(s) */
>         __u32 spare2;       /* Reserved, must be 0 */
>         __u64 user_ns_id;   /* Filter by owning user namespace */
> };
> 
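
A minimal userspace sketch of how the proposed interface could be wrapped, assuming the struct layout quoted above. The names ns_id_req_sketch, ns_id_req_init() and listns_sketch() are made up for illustration, and the syscall number is deliberately not hard-coded: the wrapper only issues the call if the installed headers define __NR_listns, which is an assumption about a future kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the proposed struct ns_id_req (hypothetical name). */
struct ns_id_req_sketch {
	uint32_t size;       /* sizeof(struct), used for extensibility */
	uint32_t spare;      /* reserved, must be 0 */
	uint64_t ns_id;      /* last seen namespace ID, 0 on the first call */
	uint32_t ns_type;    /* bitmask of namespace types, 0 means all */
	uint32_t spare2;     /* reserved, must be 0 */
	uint64_t user_ns_id; /* owning user namespace filter, 0 means all */
};

/* Build a zeroed request so the reserved fields are guaranteed to be 0. */
static struct ns_id_req_sketch ns_id_req_init(uint32_t ns_type,
					      uint64_t user_ns_id)
{
	struct ns_id_req_sketch req;

	memset(&req, 0, sizeof(req));
	req.size = sizeof(req);
	req.ns_type = ns_type;
	req.user_ns_id = user_ns_id;
	return req;
}

/*
 * Thin wrapper. Note that syscall(2) reports failure as -1 plus errno,
 * not as a negative error code. Without __NR_listns this fails with ENOSYS.
 */
static ssize_t listns_sketch(const struct ns_id_req_sketch *req,
			     uint64_t *ns_ids, size_t nr_ns_ids,
			     unsigned int flags)
{
#ifdef __NR_listns
	return syscall(__NR_listns, req, ns_ids, nr_ns_ids, flags);
#else
	(void)req;
	(void)ns_ids;
	(void)nr_ns_ids;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

Zeroing the whole request before filling it in keeps the reserved fields at 0, which the quoted comment says is required.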

After this is merged, do you see any chance for backports? Does it rely on recent
bits that are hard or impossible to backport? I'm not aware of any backported
syscalls, but this would be really nice to see in older kernels.

Ferenc

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Christian Brauner 3 months, 2 weeks ago

> > Add a new listns() system call that allows userspace to iterate through
> > namespaces in the system. This provides a programmatic interface to
> > discover and inspect namespaces, enhancing existing namespace apis.
> > 
> > Currently, there is no direct way for userspace to enumerate namespaces
> > in the system. Applications must resort to scanning /proc/<pid>/ns/
> > across all processes, which is:
> > 
> > 1. Inefficient - requires iterating over all processes
> > 2. Incomplete - misses inactive namespaces that aren't attached to any
> >    running process but are kept alive by file descriptors, bind mounts,
> >    or parent namespace references
> > 3. Permission-heavy - requires access to /proc for many processes
> > 4. No ordering or ownership.
> > 5. No filtering per namespace type: Must always iterate and check all
> >    namespaces.
> > 
> > The list goes on. The listns() system call solves these problems by
> > providing direct kernel-level enumeration of namespaces. It is similar
> > to listmount() but obviously tailored to namespaces.
> 
> I've been waiting for such an API for years; thanks for working on it. I mostly
> deal with network namespaces, where points 2 and 3 are especially painful.
> 
> Recently, I've used this eBPF snippet to discover (at most 1024, because of the
> verifier's halt checking) network namespaces, even if no process is attached.
> But I can't do anything with it in userspace since it's not possible to pass the
> inode number or netns cookie value to setns()...

I've mentioned it in the cover letter and in my earlier reply to Josef:

On v6.18+ kernels it is possible to generate and open file handles to
namespaces. This is probably an api that people outside of fs/ proper
aren't all that familiar with.

In essence it allows you to refer to files - or, more generally, kernel
objects that may be referenced via files - via opaque handles instead of
paths.

For regular filesystems that are multi-instance (IOW, you can have
multiple btrfs or ext4 filesystems mounted) such file handles cannot be
used without providing a file descriptor to another object in the
filesystem that is used to resolve the file handle...

However, for single-instance filesystems like pidfs and nsfs that's not
required which is why I added:

FD_PIDFS_ROOT
FD_NSFS_ROOT

which means that you can open both pidfds and namespaces via
open_by_handle_at() purely based on the file handle. I call such file
handles "exhaustive file handles" because they fully describe the object
to be resolved without any further information.

They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
check that regular file handles are and so can be used even by
unprivileged code as long as the caller is sufficiently privileged over
the relevant object (pid resolvable in the caller's pid namespace for
pidfds, or caller located in the namespace or privileged over the owning
user namespace of the relevant namespace for nsfs).

File handles for namespaces have the following uapi:

struct nsfs_file_handle {
	__u64 ns_id;
	__u32 ns_type;
	__u32 ns_inum;
};

#define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
#define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */

and it is explicitly allowed to generate such file handles manually in
userspace. When the kernel generates a namespace file handle via
name_to_handle_at() it will return ns_id, ns_type, and ns_inum, but
userspace is allowed to provide the kernel with a laxer file handle
where only the ns_id is filled in and ns_type and ns_inum are zero - at
least after this patch series.

So for your case, where you even know the inode number, ns type, and ns id,
you can fill in a struct nsfs_file_handle (see my reply to Josef or the
(ugly) tests for how), call

fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);

and open the namespace (provided it is still active).
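
A hedged sketch of what filling in such a handle by hand might look like in userspace. It assumes glibc's struct file_handle and the 16-byte payload layout shown above. The helper names (ns_handle_alloc(), ns_open_by_id()) and the _sketch struct are hypothetical. The handle type and the FD_NSFS_ROOT value are taken as parameters rather than guessed: take the former from a handle the kernel returned via name_to_handle_at() and the latter from your kernel's uapi headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of struct nsfs_file_handle (hypothetical name). */
struct nsfs_file_handle_sketch {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/*
 * Allocate a struct file_handle whose payload is an nsfs handle.
 * ns_type and ns_inum may be 0 for the "lax" form described above.
 * The caller frees the result with free().
 */
static struct file_handle *ns_handle_alloc(int handle_type, uint64_t ns_id,
					   uint32_t ns_type, uint32_t ns_inum)
{
	struct nsfs_file_handle_sketch payload = {
		.ns_id = ns_id,
		.ns_type = ns_type,
		.ns_inum = ns_inum,
	};
	struct file_handle *fh = calloc(1, sizeof(*fh) + sizeof(payload));

	if (!fh)
		return NULL;
	fh->handle_bytes = sizeof(payload);
	fh->handle_type = handle_type;
	memcpy(fh->f_handle, &payload, sizeof(payload));
	return fh;
}

/*
 * Open a namespace purely by its ID. nsfs_root is expected to be
 * FD_NSFS_ROOT and handle_type the nsfs handle type, both supplied by
 * the caller from kernel headers. Returns an fd, or -1 with errno set.
 */
static int ns_open_by_id(int nsfs_root, int handle_type, uint64_t ns_id)
{
	struct file_handle *fh = ns_handle_alloc(handle_type, ns_id, 0, 0);
	int fd, saved_errno;

	if (!fh)
		return -1;
	fd = open_by_handle_at(nsfs_root, fh, O_RDONLY);
	saved_errno = errno;
	free(fh);
	errno = saved_errno; /* keep the error from open_by_handle_at() */
	return fd;
}
```

The resulting fd could then be passed to setns(), which is the piece that was missing when only the inode number or netns cookie was known.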

> 
> extern const void net_namespace_list __ksym;
> static void list_all_netns()
> {
>     struct list_head *nslist = 
> 	bpf_core_cast(&net_namespace_list, struct list_head);
> 
>     struct list_head *iter = nslist->next;
> 
>     bpf_repeat(1024) {

This isn't needed anymore. I've implemented it in a bpf-friendly way so
it's possible to add kfuncs that would allow you to iterate through the
various namespace trees (locklessly).

If this is merged then I'll likely design that bpf part myself.

> After this merged, do you see any chance for backports? Does it rely on recent
> bits which is hard/impossible to backport? I'm not aware of backported syscalls
> but this would be really nice to see in older kernels.

Uhm, what downstream entities managing kernels do is not my concern, but
for upstream it's certainly not an option. There's a lot of preparatory
work that would have to be backported.

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Ferenc Fejes 3 months, 1 week ago

On Fri, 2025-10-24 at 16:50 +0200, Christian Brauner wrote:
> > > Add a new listns() system call that allows userspace to iterate through
> > > namespaces in the system. This provides a programmatic interface to
> > > discover and inspect namespaces, enhancing existing namespace apis.
> > > 
> > > Currently, there is no direct way for userspace to enumerate namespaces
> > > in the system. Applications must resort to scanning /proc/<pid>/ns/
> > > across all processes, which is:
> > > 
> > > 1. Inefficient - requires iterating over all processes
> > > 2. Incomplete - misses inactive namespaces that aren't attached to any
> > >    running process but are kept alive by file descriptors, bind mounts,
> > >    or parent namespace references
> > > 3. Permission-heavy - requires access to /proc for many processes
> > > 4. No ordering or ownership.
> > > 5. No filtering per namespace type: Must always iterate and check all
> > >    namespaces.
> > > 
> > > The list goes on. The listns() system call solves these problems by
> > > providing direct kernel-level enumeration of namespaces. It is similar
> > > to listmount() but obviously tailored to namespaces.
> > 
> > I've been waiting for such an API for years; thanks for working on it. I
> > mostly deal with network namespaces, where points 2 and 3 are especially
> > painful.
> > 
> > Recently, I've used this eBPF snippet to discover (at most 1024, because of
> > the verifier's halt checking) network namespaces, even if no process is
> > attached. But I can't do anything with it in userspace since it's not
> > possible to pass the inode number or netns cookie value to setns()...
> 
> I've mentioned it in the cover letter and in my earlier reply to Josef:
> 
> On v6.18+ kernels it is possible to generate and open file handles to
> namespaces. This is probably an api that people outside of fs/ proper
> aren't all that familiar with.
> 
> In essence it allows you to refer to files - or more-general:
> kernel-object that may be referenced via files - via opaque handles
> instead of paths.
> 
> For regular filesystem that are multi-instance (IOW, you can have
> multiple btrfs or ext4 filesystems mounted) such file handles cannot be
> used without providing a file descriptor to another object in the
> filesystem that is used to resolve the file handle...
> 
> However, for single-instance filesystems like pidfs and nsfs that's not
> required which is why I added:
> 
> FD_PIDFS_ROOT
> FD_NSFS_ROOT
> 
> which means that you can open both pidfds and namespace via
> open_by_handle_at() purely based on the file handle. I call such file
> handles "exhaustive file handles" because they fully describe the object
> to be resolvable without any further information.
> 
> They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
> check that regular file handles are and so can be used even by
> unprivileged code as long as the caller is sufficiently privileged over
> the relevant object (pid resolvable in caller's pid namespace of pidfds,
> or caller located in namespace or privileged over the owning user
> namespace of the relevant namespace for nsfs).
> 
> File handles for namespaces have the following uapi:
> 
> struct nsfs_file_handle {
> 	__u64 ns_id;
> 	__u32 ns_type;
> 	__u32 ns_inum;
> };
> 
> #define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
> #define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
> 
> and it is explicitly allowed to generate such file handles manually in
> userspace. When the kernel generates a namespace file handle via
> name_to_handle_at() till will return: ns_id, ns_type, and ns_inum but
> userspace is allowed to provide the kernel with a laxer file handle
> where only the ns_id is filled in but ns_type and ns_inum are zero - at
> least after this patch series.
> 
> So for your case where you even know inode number, ns type, and ns id
> you can fill in a struct nsfs_file_handle and either look at my reply to
> Josef or in the (ugly) tests.
> 
> fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);
> 
> and can open the namespace (provided it is still active).
> 
> > 
> > extern const void net_namespace_list __ksym;
> > static void list_all_netns()
> > {
> >     struct list_head *nslist = 
> > 	bpf_core_cast(&net_namespace_list, struct list_head);
> > 
> >     struct list_head *iter = nslist->next;
> > 
> >     bpf_repeat(1024) {
> 
> This isn't needed anymore. I've implemented it in a bpf-friendly way so
> it's possible to add kfuncs that would allow you to iterate through the
> various namespace trees (locklessly).
> 
> If this is merged then I'll likely design that bpf part myself.

Excellent, thanks for the detailed explanation, noted! I guess I have to
keep a closer eye on recent ns changes; I was aware of pidfs but not the
helpers you just mentioned.

> 
> > After this merged, do you see any chance for backports? Does it rely on
> > recent bits which is hard/impossible to backport? I'm not aware of
> > backported syscalls but this would be really nice to see in older kernels.
> 
> Uhm, what downstream entities, managing kernels do is not my concern but
> for upstream it's certainly not an option. There's a lot of preparatory
> work that would have to be backported.

I was curious about the upstream option, but I see this isn't feasible. Anyway,
it's great we will have this in the future, thanks for doing it!

Ferenc

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Jeff Layton 3 months, 2 weeks ago

On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need helper here!: Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or reaped, all
> namespace file descriptors for that namespace have been closed and all
> bind-mounts for that namespace unmounted it ceases to appear in the
> listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the openers credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the users namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility all
> namespace that still are pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
> 
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persist the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a whole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
> 

Is this capability something we need to preserve? It seems like the
fact that SIOCGSKNS works when there are no active references left
might have been an accident. Is there a legit use-case for allowing
that?

I don't see a problem with active+passive refcounts. They're more
complicated to deal with, but we've used them elsewhere so it's a
pattern we all know (even if we don't necessarily love them).

I'll also point out that net namespaces already have two refcounts for
this exact reason. Do you plan to replace the passive refcount in
struct net with the new passive refcount you're implementing here?
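
For readers less familiar with the pattern, here is a deliberately simplified, single-threaded model of an active+passive refcount pair, the same shape as mm_users/mm_count in struct mm_struct. It is not the code from this series: the struct and function names are invented, the counters are plain integers instead of refcount_t, and there is no locking.

```c
#include <stdbool.h>

/* Toy namespace object with two counters (hypothetical model). */
struct ns_model {
	unsigned int active;  /* users that keep the namespace "live" */
	unsigned int passive; /* references that only keep memory alive */
	bool freed;           /* set when the last passive ref is dropped */
};

/* A new namespace has one active user; all active users together
 * hold a single passive reference. */
static void ns_model_init(struct ns_model *ns)
{
	ns->active = 1;
	ns->passive = 1;
	ns->freed = false;
}

static void ns_get_passive(struct ns_model *ns)
{
	ns->passive++;
}

static void ns_put_passive(struct ns_model *ns)
{
	if (--ns->passive == 0)
		ns->freed = true;
}

/* Taking an active reference fails once the count has hit zero. */
static bool ns_get_active(struct ns_model *ns)
{
	if (ns->active == 0)
		return false;
	ns->active++;
	return true;
}

/* Dropping the last active reference releases the shared passive one. */
static void ns_put_active(struct ns_model *ns)
{
	if (--ns->active == 0)
		ns_put_passive(ns);
}

/* In this model, "visible" simply means "has active references". */
static bool ns_model_visible(const struct ns_model *ns)
{
	return ns->active > 0;
}
```

In this model, a holder that only took a passive reference (think of the socket case) keeps the object allocated after the last active user is gone but cannot turn it back into an active one, which is the SIOCGSKNS behaviour being discussed.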

> So two options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
> 

Is listns() the only reason we'd need active/passive refcounts? It
seems like we might need them for other reasons (e.g. struct net).

In any case, given that this is a privileged syscall, I don't
necessarily see a problem with #2 here. Leaked namespaces can be a
problem and we don't have good visibility into them at the moment.

IMO, even if you keep the active+passive refcounts, it would be good to
be able to tell listns() to return all the namespaces, and not just the
ones that are still active. Maybe that can be the first flag for this
new syscall?
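
To make the suggestion concrete, a small userspace model of how such a flag could change what gets listed. Everything here is hypothetical: the flag name LISTNS_SKETCH_ALL, the ns_entry struct and listns_model() are invented for illustration, and the model assumes entries are sorted by ascending ID and that pagination returns IDs strictly greater than the last one seen.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: also list namespaces without active references. */
#define LISTNS_SKETCH_ALL 0x1u

/* Toy stand-in for an entry in the namespace tree. */
struct ns_entry {
	uint64_t id;         /* namespace ID */
	unsigned int active; /* active reference count */
};

/*
 * Copy up to out_cap IDs greater than after_id into out.
 * Inactive entries are skipped unless LISTNS_SKETCH_ALL is set.
 * Assumes tree[] is sorted by ascending id. Returns the number written.
 */
static size_t listns_model(const struct ns_entry *tree, size_t nr,
			   uint64_t after_id, uint64_t *out, size_t out_cap,
			   unsigned int flags)
{
	size_t n = 0;

	for (size_t i = 0; i < nr && n < out_cap; i++) {
		if (tree[i].id <= after_id)
			continue; /* already returned by an earlier call */
		if (tree[i].active == 0 && !(flags & LISTNS_SKETCH_ALL))
			continue; /* only pinned internally, hidden by default */
		out[n++] = tree[i].id;
	}
	return n;
}
```

With flags set to 0 this behaves like the active-only listing in the cover letter. Passing the hypothetical flag would expose leaked or internally pinned namespaces as well.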

> =====================================================================
> 
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
> 
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
> 
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
>    running process but are kept alive by file descriptors, bind mounts,
>    or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
>    namespaces.
> 
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.
> 
> /*
>  * @req: Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids: User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags: Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
> 
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
> 
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
>  *              Other value: Namespaces owned by the specified user namespace ID
>  */
> struct ns_id_req {
>         __u32 size;         /* sizeof(struct ns_id_req) */
>         __u32 spare;        /* Reserved, must be 0 */
>         __u64 ns_id;        /* Last seen namespace ID (for pagination) */
>         __u32 ns_type;      /* Filter by namespace type(s) */
>         __u32 spare2;       /* Reserved, must be 0 */
>         __u64 user_ns_id;   /* Filter by owning user namespace */
> };
> 
> Example 1: List all namespaces
> 
> void list_all_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,      /* Start from beginning */
> 		.ns_type = 0,    /* All types */
> 		.user_ns_id = 0, /* All user namespaces */
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	printf("All namespaces in the system:\n");
> 	do {
> 		ret = listns(&req, ids, 100, 0);
> 		if (ret < 0) {
> 			perror("listns");
> 			break;
> 		}
> 
> 		for (ssize_t i = 0; i < ret; i++)
> 			printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);
> 
> 		/* Continue from last seen ID */
> 		if (ret > 0)
> 			req.ns_id = ids[ret - 1];
> 	} while (ret == 100); /* Buffer was full, more may exist */
> }
> 
> Example 2 : List network namespaces only
> 
> void list_network_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = NET_NS, /* Only network namespaces */
> 		.user_ns_id = 0,
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	ret = listns(&req, ids, 100, 0);
> 	if (ret < 0) {
> 		perror("listns");
> 		return;
> 	}
> 
> 	printf("Network namespaces: %zd found\n", ret);
> 	for (ssize_t i = 0; i < ret; i++)
> 		printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
> }
> 
> Example 3 : List namespaces owned by current user namespace
> 
> void list_owned_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = 0,                      /* All types */
> 		.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	ret = listns(&req, ids, 100, 0);
> 	if (ret < 0) {
> 		perror("listns");
> 		return;
> 	}
> 
> 	printf("Namespaces owned by my user namespace: %zd\n", ret);
> 	for (ssize_t i = 0; i < ret; i++)
> 		printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
> }
> 
> Example 4 : List multiple namespace types
> 
> void list_network_and_mount_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = NET_NS | MNT_NS, /* Network and mount */
> 		.user_ns_id = 0,
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	ret = listns(&req, ids, 100, 0);
> 	printf("Network and mount namespaces: %zd found\n", ret);
> }
> 
> Example 5 : Pagination through large namespace sets
> 
> void list_all_with_pagination(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = 0,
> 		.user_ns_id = 0,
> 	};
> 	uint64_t ids[50];
> 	size_t total = 0;
> 	ssize_t ret;
> 
> 	printf("Enumerating all namespaces with pagination:\n");
> 
> 	while (1) {
> 		ret = listns(&req, ids, 50, 0);
> 		if (ret < 0) {
> 			perror("listns");
> 			break;
> 		}
> 		if (ret == 0)
> 			break; /* No more namespaces */
> 
> 		total += ret;
> 		printf("  Batch: %zd namespaces\n", ret);
> 
> 		/* Last ID in this batch becomes start of next batch */
> 		req.ns_id = ids[ret - 1];
> 
> 		if (ret < 50)
> 			break; /* Partial batch = end of results */
> 	}
> 
> 	printf("Total: %zu namespaces\n", total);
> }
> 
> listns() respects namespace isolation and capabilities:
> 
> (1) Global listing (user_ns_id = 0):
>     - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
>     - OR the namespace must be in the caller's namespace context (e.g.,
>       a namespace the caller is currently using)
>     - User namespaces additionally allow listing if the caller has
>       CAP_SYS_ADMIN in that user namespace itself
> (2) Owner-filtered listing (user_ns_id != 0):
>     - Requires CAP_SYS_ADMIN in the specified owner user namespace
>     - OR the namespace must be in the caller's namespace context
>     - This allows unprivileged processes to enumerate namespaces they own
> (3) Visibility:
>     - Only "active" namespaces are listed
>     - A namespace is active if it has a non-zero __ns_ref_active count
>     - This includes namespaces used by running processes, held by open
>       file descriptors, or kept active by bind mounts
>     - Inactive namespaces (kept alive only by internal kernel
>       references) are not visible via listns()
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (50):
>       libfs: allow to specify s_d_flags
>       nsfs: use inode_just_drop()
>       nsfs: raise DCACHE_DONTCACHE explicitly
>       pidfs: raise DCACHE_DONTCACHE explicitly
>       nsfs: raise SB_I_NODEV and SB_I_NOEXEC
>       nstree: simplify return
>       ns: initialize ns_list_node for initial namespaces
>       ns: add __ns_ref_read()
>       ns: add active reference count
>       ns: use anonymous struct to group list member
>       nstree: introduce a unified tree
>       nstree: allow lookup solely based on inode
>       nstree: assign fixed ids to the initial namespaces
>       ns: maintain list of owned namespaces
>       nstree: add listns()
>       arch: hookup listns() system call
>       nsfs: update tools header
>       selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
>       selftests/namespaces: first active reference count tests
>       selftests/namespaces: second active reference count tests
>       selftests/namespaces: third active reference count tests
>       selftests/namespaces: fourth active reference count tests
>       selftests/namespaces: fifth active reference count tests
>       selftests/namespaces: sixth active reference count tests
>       selftests/namespaces: seventh active reference count tests
>       selftests/namespaces: eigth active reference count tests
>       selftests/namespaces: ninth active reference count tests
>       selftests/namespaces: tenth active reference count tests
>       selftests/namespaces: eleventh active reference count tests
>       selftests/namespaces: twelth active reference count tests
>       selftests/namespaces: thirteenth active reference count tests
>       selftests/namespaces: fourteenth active reference count tests
>       selftests/namespaces: fifteenth active reference count tests
>       selftests/namespaces: add listns() wrapper
>       selftests/namespaces: first listns() test
>       selftests/namespaces: second listns() test
>       selftests/namespaces: third listns() test
>       selftests/namespaces: fourth listns() test
>       selftests/namespaces: fifth listns() test
>       selftests/namespaces: sixth listns() test
>       selftests/namespaces: seventh listns() test
>       selftests/namespaces: ninth listns() test
>       selftests/namespaces: ninth listns() test
>       selftests/namespaces: first listns() permission test
>       selftests/namespaces: second listns() permission test
>       selftests/namespaces: third listns() permission test
>       selftests/namespaces: fourth listns() permission test
>       selftests/namespaces: fifth listns() permission test
>       selftests/namespaces: sixth listns() permission test
>       selftests/namespaces: seventh listns() permission test
> 
>  arch/alpha/kernel/syscalls/syscall.tbl             |    1 +
>  arch/arm/tools/syscall.tbl                         |    1 +
>  arch/arm64/tools/syscall_32.tbl                    |    1 +
>  arch/m68k/kernel/syscalls/syscall.tbl              |    1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl        |    1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl          |    1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl          |    1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl          |    1 +
>  arch/parisc/kernel/syscalls/syscall.tbl            |    1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl           |    1 +
>  arch/s390/kernel/syscalls/syscall.tbl              |    1 +
>  arch/sh/kernel/syscalls/syscall.tbl                |    1 +
>  arch/sparc/kernel/syscalls/syscall.tbl             |    1 +
>  arch/x86/entry/syscalls/syscall_32.tbl             |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl             |    1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl            |    1 +
>  fs/libfs.c                                         |    1 +
>  fs/namespace.c                                     |    8 +-
>  fs/nsfs.c                                          |   79 +-
>  fs/pidfs.c                                         |    1 +
>  include/linux/ns_common.h                          |  147 +-
>  include/linux/nsfs.h                               |    3 +
>  include/linux/nstree.h                             |   26 +-
>  include/linux/pseudo_fs.h                          |    1 +
>  include/linux/syscalls.h                           |    4 +
>  include/uapi/asm-generic/unistd.h                  |    4 +-
>  include/uapi/linux/nsfs.h                          |   58 +
>  init/version-timestamp.c                           |    5 +
>  ipc/msgutil.c                                      |    5 +
>  ipc/namespace.c                                    |    1 +
>  kernel/cgroup/cgroup.c                             |    5 +
>  kernel/cgroup/namespace.c                          |    1 +
>  kernel/cred.c                                      |   17 +
>  kernel/exit.c                                      |    1 +
>  kernel/nscommon.c                                  |   59 +-
>  kernel/nsproxy.c                                   |    7 +
>  kernel/nstree.c                                    |  527 ++++-
>  kernel/pid.c                                       |   15 +
>  kernel/pid_namespace.c                             |    1 +
>  kernel/time/namespace.c                            |    6 +
>  kernel/user.c                                      |    5 +
>  kernel/user_namespace.c                            |    1 +
>  kernel/utsname.c                                   |    1 +
>  net/core/net_namespace.c                           |    3 +-
>  scripts/syscall.tbl                                |    1 +
>  tools/include/uapi/linux/nsfs.h                    |   70 +
>  tools/testing/selftests/filesystems/utils.c        |    2 +-
>  tools/testing/selftests/namespaces/.gitignore      |    3 +
>  tools/testing/selftests/namespaces/Makefile        |    7 +-
>  .../selftests/namespaces/listns_permissions_test.c |  777 +++++++
>  tools/testing/selftests/namespaces/listns_test.c   |  656 ++++++
>  .../selftests/namespaces/ns_active_ref_test.c      | 2226 ++++++++++++++++++++
>  tools/testing/selftests/namespaces/wrappers.h      |   35 +
>  53 files changed, 4737 insertions(+), 48 deletions(-)
> ---
> base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
> change-id: 20251020-work-namespace-nstree-listns-9fd71518515c

-- 
Jeff Layton <jlayton@kernel.org>

Re: [PATCH RFC DRAFT 00/50] nstree: listns()

Posted by Christian Brauner 3 months, 2 weeks ago

> > So that punches a hole in the active reference count tracking. This
> > will have to be handled, because right now socket file descriptors that
> > pin a network namespace that no longer has an active reference (no live
> > processes, no explicit persistence via namespace fds) can't be used to
> > issue a SIOCGSKNS ioctl() to open the associated network namespace.
> > 
> 
> Is this capability something we need to preserve? It seems like the
> fact that SIOCGSKNS works when there are no active references left
> might have been an accident. Is there a legit use-case for allowing
> that?

I've solved that use-case now and have added a large testsuite to verify
that it works.
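
For readers who haven't used it, the userspace side of SIOCGSKNS looks
roughly like the sketch below. SIOCGSKNS itself is the existing ioctl
from <linux/sockios.h>; the two helper names are made up for this
illustration, and the call is expected to fail with EPERM without
CAP_NET_ADMIN over the socket's network namespace.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/sockios.h>	/* SIOCGSKNS */

/*
 * Hypothetical helper: ask the kernel for a namespace fd referring to
 * the network namespace that @sock_fd belongs to. Returns the new fd,
 * or -1 with errno set.
 */
int netns_fd_from_socket(int sock_fd)
{
	assert(sock_fd >= 0);
	return ioctl(sock_fd, SIOCGSKNS);
}

/*
 * Hypothetical helper: create a throwaway socket in the caller's
 * network namespace and derive a namespace fd from it. errno from the
 * failing call is preserved across the close().
 */
int open_own_netns_via_socket(void)
{
	int sock_fd, ns_fd, saved_errno;

	sock_fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
	if (sock_fd < 0)
		return -1;

	ns_fd = netns_fd_from_socket(sock_fd);
	saved_errno = errno;
	close(sock_fd);
	errno = saved_errno;
	return ns_fd;
}
```

The case discussed above is the one where the socket outlives every
task and every namespace fd of its network namespace, so the socket is
the only thing left from which the namespace can be reopened.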

> 
> I don't see a problem with active+passive refcounts. They're more
> complicated to deal with, but we've used them elsewhere so it's a
> pattern we all know (even if we don't necessarily love them).

+1
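
To make the active+passive split concrete, here is a minimal userspace
model. It is only a sketch: struct ns_model and all ns_model_*() names
are invented for this mail and are not the helpers from the series.
The passive count stands for "memory may not be freed yet", the active
count for "tasks or namespace fds still use it". That this model lets
an inactive namespace be reactivated from zero is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model, not the kernel's struct ns_common. */
struct ns_model {
	atomic_int passive;	/* internal pins: keeps the object allocated */
	atomic_int active;	/* live tasks and namespace fds */
};

/* The creator starts out holding one passive and one active reference. */
void ns_model_init(struct ns_model *ns)
{
	atomic_init(&ns->passive, 1);
	atomic_init(&ns->active, 1);
}

void ns_model_get_passive(struct ns_model *ns)
{
	atomic_fetch_add(&ns->passive, 1);
}

/* Returns true when the last passive reference went away (free now). */
bool ns_model_put_passive(struct ns_model *ns)
{
	int old = atomic_fetch_sub(&ns->passive, 1);

	assert(old > 0);
	return old == 1;
}

/* Assumption: reactivating from zero is allowed in this model. */
void ns_model_get_active(struct ns_model *ns)
{
	atomic_fetch_add(&ns->active, 1);
}

/* Returns true when the namespace just became inactive. */
bool ns_model_put_active(struct ns_model *ns)
{
	int old = atomic_fetch_sub(&ns->active, 1);

	assert(old > 0);
	return old == 1;
}

/* A listns()-like interface would only report active namespaces. */
bool ns_model_listed(struct ns_model *ns)
{
	return atomic_load(&ns->active) > 0;
}
```

With this split an internal pin such as file->f_cred or a lazy
mm_struct only holds a passive reference, so the object stays
allocated but stops being listed once the last task and fd are gone.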

> I'll also point out that net namespaces already have two refcounts for
> this exact reason. Do you plan to replace the passive refcount in
> struct net with the new passive refcount you're implementing here?

Yeah, that's an option. I think that in the future it should also be
possible to completely drop the net/ internal network namespace tracking
and rely on the nstree infrastructure only. But that's work for the
future.

> 
> > So three options I see if the api is based on ids:
> > 
> > (1) We use the active reference count and somehow also make it work with
> >     sockets.
> > (2) The active reference count is not needed and we say that listns() is
> >     an introspection system call anyway so we just always list
> >     namespaces regardless of why they are still pinned: files,
> >     mm_struct, network devices, everything is fair game.
> > (3) Throw hands up in the air and just not do it.
> > 
> 
> Is listns() the only reason we'd need active/passive refcounts? It
> seems like we might need them for other reasons (e.g. struct net).

Yes.

> IMO, even if you keep the active+passive refcounts, it would be good to
> be able to tell listns() to return all the namespaces, and not just the
> ones that are still active. Maybe that can be the first flag for this
> new syscall?

Certainly possible but that would be pure introspection. But as I said
elsewhere, I have implemented the nstree infrastructure in a way that
it will allow bpf to walk the namespace trees and that would obviously
also include all namespaces that are not active anymore.
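
As a rough illustration of what such a flag would mean, here is a
userspace model of the filtering. Everything in it is hypothetical:
LISTNS_MODEL_ALL, struct ns_entry and listns_model() are names made up
for this sketch and say nothing about the real uapi.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: also report namespaces without active references. */
#define LISTNS_MODEL_ALL 0x1u

/* Hypothetical stand-in for one entry in the namespace tree. */
struct ns_entry {
	uint64_t id;
	unsigned int active;	/* active reference count */
};

/*
 * Copy up to @nr_ids namespace ids from @tree into @ids and return how
 * many were copied. Inactive entries are skipped unless
 * LISTNS_MODEL_ALL is set in @flags.
 */
size_t listns_model(const struct ns_entry *tree, size_t nr,
		    uint64_t *ids, size_t nr_ids, unsigned int flags)
{
	size_t out = 0;

	assert(tree || nr == 0);
	for (size_t i = 0; i < nr && out < nr_ids; i++) {
		if (!tree[i].active && !(flags & LISTNS_MODEL_ALL))
			continue;
		ids[out++] = tree[i].id;
	}
	return out;
}
```

Without the flag the caller only sees namespaces that are still in
use; with it the output also includes the ones that are merely pinned
internally, which is the pure introspection case.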

[syzbot ci] Re: nstree: listns()

Posted by syzbot ci 3 months, 2 weeks ago

syzbot ci has tested the following series

[v1] nstree: listns()
https://lore.kernel.org/all/20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org
* [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags
* [PATCH RFC DRAFT 02/50] nsfs: use inode_just_drop()
* [PATCH RFC DRAFT 03/50] nsfs: raise DCACHE_DONTCACHE explicitly
* [PATCH RFC DRAFT 04/50] pidfs: raise DCACHE_DONTCACHE explicitly
* [PATCH RFC DRAFT 05/50] nsfs: raise SB_I_NODEV and SB_I_NOEXEC
* [PATCH RFC DRAFT 06/50] nstree: simplify return
* [PATCH RFC DRAFT 07/50] ns: initialize ns_list_node for initial namespaces
* [PATCH RFC DRAFT 08/50] ns: add __ns_ref_read()
* [PATCH RFC DRAFT 09/50] ns: add active reference count
* [PATCH RFC DRAFT 10/50] ns: use anonymous struct to group list member
* [PATCH RFC DRAFT 11/50] nstree: introduce a unified tree
* [PATCH RFC DRAFT 12/50] nstree: allow lookup solely based on inode
* [PATCH RFC DRAFT 13/50] nstree: assign fixed ids to the initial namespaces
* [PATCH RFC DRAFT 14/50] ns: maintain list of owned namespaces
* [PATCH RFC DRAFT 15/50] nstree: add listns()
* [PATCH RFC DRAFT 16/50] arch: hookup listns() system call
* [PATCH RFC DRAFT 17/50] nsfs: update tools header
* [PATCH RFC DRAFT 18/50] selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
* [PATCH RFC DRAFT 19/50] selftests/namespaces: first active reference count tests
* [PATCH RFC DRAFT 20/50] selftests/namespaces: second active reference count tests
* [PATCH RFC DRAFT 21/50] selftests/namespaces: third active reference count tests
* [PATCH RFC DRAFT 22/50] selftests/namespaces: fourth active reference count tests
* [PATCH RFC DRAFT 23/50] selftests/namespaces: fifth active reference count tests
* [PATCH RFC DRAFT 24/50] selftests/namespaces: sixth active reference count tests
* [PATCH RFC DRAFT 25/50] selftests/namespaces: seventh active reference count tests
* [PATCH RFC DRAFT 26/50] selftests/namespaces: eigth active reference count tests
* [PATCH RFC DRAFT 27/50] selftests/namespaces: ninth active reference count tests
* [PATCH RFC DRAFT 28/50] selftests/namespaces: tenth active reference count tests
* [PATCH RFC DRAFT 29/50] selftests/namespaces: eleventh active reference count tests
* [PATCH RFC DRAFT 30/50] selftests/namespaces: twelth active reference count tests
* [PATCH RFC DRAFT 31/50] selftests/namespaces: thirteenth active reference count tests
* [PATCH RFC DRAFT 32/50] selftests/namespaces: fourteenth active reference count tests
* [PATCH RFC DRAFT 33/50] selftests/namespaces: fifteenth active reference count tests
* [PATCH RFC DRAFT 34/50] selftests/namespaces: add listns() wrapper
* [PATCH RFC DRAFT 35/50] selftests/namespaces: first listns() test
* [PATCH RFC DRAFT 36/50] selftests/namespaces: second listns() test
* [PATCH RFC DRAFT 37/50] selftests/namespaces: third listns() test
* [PATCH RFC DRAFT 38/50] selftests/namespaces: fourth listns() test
* [PATCH RFC DRAFT 39/50] selftests/namespaces: fifth listns() test
* [PATCH RFC DRAFT 40/50] selftests/namespaces: sixth listns() test
* [PATCH RFC DRAFT 41/50] selftests/namespaces: seventh listns() test
* [PATCH RFC DRAFT 42/50] selftests/namespaces: ninth listns() test
* [PATCH RFC DRAFT 43/50] selftests/namespaces: ninth listns() test
* [PATCH RFC DRAFT 44/50] selftests/namespaces: first listns() permission test
* [PATCH RFC DRAFT 45/50] selftests/namespaces: second listns() permission test
* [PATCH RFC DRAFT 46/50] selftests/namespaces: third listns() permission test
* [PATCH RFC DRAFT 47/50] selftests/namespaces: fourth listns() permission test
* [PATCH RFC DRAFT 48/50] selftests/namespaces: fifth listns() permission test
* [PATCH RFC DRAFT 49/50] selftests/namespaces: sixth listns() permission test
* [PATCH RFC DRAFT 50/50] selftests/namespaces: seventh listns() permission test

and found the following issue:
WARNING in __ns_tree_add_raw

Full report is available here:
https://ci.syzbot.org/series/03ca38c3-876c-4231-aa06-ddb0bc8a30ad

***

WARNING in __ns_tree_add_raw

tree:      bpf
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf.git
base:      5fb750e8a9ae123b2034771b864b8a21dbef65cd
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/156cf21b-68f9-423c-807a-3dd094e6aed8/config

------------[ cut here ]------------
WARNING: CPU: 1 PID: 5816 at kernel/nstree.c:189 __ns_tree_add_raw+0xa92/0xb30
Modules linked in:
CPU: 1 UID: 0 PID: 5816 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__ns_tree_add_raw+0xa92/0xb30
Code: 32 00 90 0f 0b 90 42 80 3c 23 00 0f 85 1e fc ff ff e9 21 fc ff ff e8 ed 78 32 00 90 0f 0b 90 e9 66 fc ff ff e8 df 78 32 00 90 <0f> 0b 90 e9 53 ff ff ff 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c ef
RSP: 0018:ffffc90003f27c30 EFLAGS: 00010293
RAX: ffffffff818e0051 RBX: 1ffffffff16db871 RCX: ffff88810ffe0000
RDX: 0000000000000000 RSI: ffff8881bbf5e9a8 RDI: ffff88816d0e6e00
RBP: ffff88816d0e6e00 R08: ffff88816d0e6e3f R09: 0000000000000000
R10: ffff88816d0e6e30 R11: ffffffff81b988c0 R12: dffffc0000000000
R13: ffff88816d0e6e40 R14: ffffffff8b6dc388 R15: ffff8881bbf5e9a8
FS:  000055558630f500(0000) GS:ffff8882a9d04000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5c3f03529c CR3: 000000010b5bc000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 copy_cgroup_ns+0x373/0x5f0
 create_new_namespaces+0x358/0x720
 unshare_nsproxy_namespaces+0x11c/0x170
 ksys_unshare+0x4c8/0x8c0
 __x64_sys_unshare+0x38/0x50
 do_syscall_64+0xfa/0xfa0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5c3ef907c7
Code: 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc62c39c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007ffc62c39c90 RCX: 00007f5c3ef907c7
RDX: 0000000000000000 RSI: 00007f5c3f03529c RDI: 0000000002000000
RBP: 00007ffc62c39d20 R08: 0000000000000000 R09: 00007f5c3fd1d6c0
R10: 0000000000044000 R11: 0000000000000246 R12: 00007ffc62c39d20
R13: 00007ffc62c39d28 R14: 0000000000000009 R15: 0000000000000000
 </TASK>
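
Reading the registers above: ORIG_RAX 0x110 is syscall 272, which is
unshare() on x86-64, and RDI 0x02000000 is CLONE_NEWCGROUP, which
matches copy_cgroup_ns() in the call trace. A minimal sketch of that
call follows. Whether this call alone is enough to hit the warning is
an assumption and has not been confirmed; try_unshare_cgroup_ns() is a
name made up for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <linux/sched.h>	/* CLONE_NEWCGROUP */
#include <sys/syscall.h>	/* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: call unshare(CLONE_NEWCGROUP) in a forked child
 * so that the caller's own cgroup namespace is left untouched.
 *
 * Returns 0 if unshare() succeeded, 1 if it failed with EPERM, 2 for
 * any other unshare() error, and -1 if fork() or waitpid() failed or
 * the child did not exit normally.
 */
int try_unshare_cgroup_ns(void)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		if (syscall(SYS_unshare, CLONE_NEWCGROUP) != 0)
			_exit(errno == EPERM ? 1 : 2);
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (!WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Creating a cgroup namespace needs CAP_SYS_ADMIN, so as an unprivileged
user the helper is expected to return 1.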


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.