[PATCH RFC DRAFT 00/50] nstree: listns()

* [PATCH RFC DRAFT 00/50] nstree: listns()
@ 2025-10-21 11:43 Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags Christian Brauner
                   ` (53 more replies)
  0 siblings, 54 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Hey,

As announced a while ago this is the next step building on the nstree
work from prior cycles. There's a bunch of fixes and semantic cleanups
in here and a ton of tests.

I need help here! Consider the following current design:

Currently listns() is relying on active namespace reference counts which
are introduced alongside this series.

The active reference count of a namespace consists of the live tasks
that make use of this namespace and any namespace file descriptors that
explicitly pin the namespace.

Once all tasks making use of this namespace have exited or been reaped,
all namespace file descriptors for that namespace have been closed, and
all bind-mounts for that namespace have been unmounted, it ceases to
appear in the listns() output.
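
To make the distinction concrete, here is a toy userspace model of the
two counters. It is only a sketch: the struct, the helper names, and the
assumption that every active holder also holds a regular reference are
illustrative and not taken from the series.

```c
#include <stdbool.h>

/* Toy model; field and function names are illustrative only. */
struct toy_ns {
	unsigned int ref;    /* every reference, including internal pins */
	unsigned int active; /* live tasks, namespace fds, bind-mounts */
};

/* A namespace is listed only while something actively uses it. */
static bool toy_ns_listed(const struct toy_ns *ns)
{
	return ns->active > 0;
}

/* The object itself stays around as long as anything pins it. */
static bool toy_ns_pinned(const struct toy_ns *ns)
{
	return ns->ref > 0;
}

/* Assumption: a task holds both a regular and an active reference. */
static void toy_task_enter(struct toy_ns *ns)
{
	ns->ref++;
	ns->active++;
}

static void toy_task_exit(struct toy_ns *ns)
{
	ns->ref--;
	ns->active--;
}

/* An internal pin (say, via file->f_cred) takes a regular reference only. */
static void toy_internal_pin(struct toy_ns *ns)
{
	ns->ref++;
}
```

After toy_task_enter(), toy_internal_pin() and toy_task_exit() the
namespace is still pinned but no longer listed, which is the state the
following paragraphs are about.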

My reason for introducing the active reference count was that namespaces
might obviously still be pinned internally for various reasons. For
example the user namespace might still be pinned because there are still
open files that have stashed the opener's credentials in file->f_cred, or
the last reference might be put with an rcu delay keeping that namespace
active on the namespace lists.

But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
which uses lazy TLB destruction.

When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. So the kernel thread
will take a reference on the struct mm_struct pinning it.

And for ptrace() based access checks struct mm_struct stashes the user
namespace of the task that struct mm_struct belonged to originally and
thus takes a reference to the user namespace and pins it.

So on an idle system such user namespaces can be persisted for pretty
arbitrary amounts of time via struct mm_struct.

Now, without the active reference count regulating visibility, all
namespaces that are still pinned in some way on the system will appear in
the listns() output and can be reopened using namespace file handles.

Of course that requires suitable privileges and it's not really a
concern per se because a task could've also persisted the namespace
recorded in struct mm_struct explicitly and then the idle task would
still reuse that struct mm_struct and another task could still happily
setns() to it afaict and reuse it for something else.

The active reference count though has drawbacks itself. Namely that
socket files break the assumption that namespaces can only be opened if
there's either live processes pinning the namespace or there are file
descriptors open that pin the namespace itself as the socket SIOCGSKNS
ioctl() can be used to open a network namespace based on a socket which
only indirectly pins a network namespace.

So that punches a hole in the active reference count tracking. This
will have to be handled, as right now a socket file descriptor that pins
a network namespace which doesn't have an active reference anymore (no
live processes, no explicit persistence via namespace fds) can't be used
to issue a SIOCGSKNS ioctl() to open the associated network namespace.
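
For reference, this is roughly how userspace reaches a network
namespace through a socket today. It is a minimal sketch: the helper
names are made up for illustration, and only socket(), ioctl() and
SIOCGSKNS (from <linux/sockios.h>) are existing interfaces. As far as I
can tell the ioctl() needs CAP_NET_ADMIN in the user namespace owning
the network namespace, so expect -EPERM otherwise.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/sockios.h> /* SIOCGSKNS */

/*
 * Open the network namespace a socket belongs to.
 * Returns a namespace fd on success or a negative errno value.
 */
static int netns_fd_from_socket(int sock_fd)
{
	int ns_fd = ioctl(sock_fd, SIOCGSKNS);

	return ns_fd < 0 ? -errno : ns_fd;
}

/* Create a throwaway UDP socket and resolve its network namespace. */
static int open_own_netns_via_socket(void)
{
	int sock_fd, ns_fd;

	sock_fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock_fd < 0)
		return -errno;

	/* The error, if any, is captured before close() can clobber errno. */
	ns_fd = netns_fd_from_socket(sock_fd);
	close(sock_fd);
	return ns_fd;
}
```

The socket is the only thing the caller needs to hold here: no task has
to live in the namespace and no nsfs file descriptor has to be open,
which is why this path bypasses the active reference accounting.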

So three options I see if the api is based on ids:

(1) We use the active reference count and somehow also make it work with
    sockets.
(2) The active reference count is not needed and we say that listns() is
    an introspection system call anyway so we just always list
    namespaces regardless of why they are still pinned: files,
    mm_struct, network devices, everything is fair game.
(3) Throw hands up in the air and just not do it.

=====================================================================

Add a new listns() system call that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing existing namespace apis.

Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:

1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
   running process but are kept alive by file descriptors, bind mounts,
   or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. Unstructured - provides no ordering or ownership information
5. No filtering per namespace type: Must always iterate and check all
   namespaces.

The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.

/*
 * @req: Pointer to struct ns_id_req specifying search parameters
 * @ns_ids: User buffer to receive namespace IDs
 * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
 * @flags: Reserved for future use (must be 0)
 */
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
               size_t nr_ns_ids, unsigned int flags);

Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code

/*
 * @size: Structure size
 * @ns_id: Starting point for iteration; use 0 for first call, then
 *         use the last returned ID for subsequent calls to paginate
 * @ns_type: Bitmask of namespace types to include (from enum ns_type):
 *           0: Return all namespace types
 *           MNT_NS: Mount namespaces
 *           NET_NS: Network namespaces
 *           USER_NS: User namespaces
 *           etc. Can be OR'd together
 * @user_ns_id: Filter results to namespaces owned by this user namespace:
 *              0: Return all namespaces (subject to permission checks)
 *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
 *              Other value: Namespaces owned by the specified user namespace ID
 */
struct ns_id_req {
        __u32 size;         /* sizeof(struct ns_id_req) */
        __u32 spare;        /* Reserved, must be 0 */
        __u64 ns_id;        /* Last seen namespace ID (for pagination) */
        __u32 ns_type;      /* Filter by namespace type(s) */
        __u32 spare2;       /* Reserved, must be 0 */
        __u64 user_ns_id;   /* Filter by owning user namespace */
};
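
The examples below assume a listns() wrapper is available. A minimal
sketch of one follows. The important assumption is __NR_listns: this
draft does not fix a syscall number here, so the wrapper falls back to
ENOSYS when the installed headers don't define it. It follows the usual
libc convention of returning -1 and setting errno rather than returning
the negative error code directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Mirrors the layout proposed above. */
struct ns_id_req {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t spare2;
	uint64_t user_ns_id;
};

/* The fields above add up to 32 bytes without implicit padding. */
static_assert(sizeof(struct ns_id_req) == 32, "unexpected ns_id_req size");

/*
 * Hypothetical userspace wrapper. __NR_listns is an assumption: if the
 * headers don't provide it the call fails with ENOSYS.
 */
static ssize_t listns(const struct ns_id_req *req, uint64_t *ns_ids,
		      size_t nr_ns_ids, unsigned int flags)
{
#ifdef __NR_listns
	return syscall(__NR_listns, req, ns_ids, nr_ns_ids, flags);
#else
	(void)req;
	(void)ns_ids;
	(void)nr_ns_ids;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```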

Example 1: List all namespaces

void list_all_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,      /* Start from beginning */
		.ns_type = 0,    /* All types */
		.user_ns_id = 0, /* All user namespaces */
	};
	uint64_t ids[100];
	ssize_t ret;

	printf("All namespaces in the system:\n");
	do {
		ret = listns(&req, ids, 100, 0);
		if (ret < 0) {
			perror("listns");
			break;
		}

		for (ssize_t i = 0; i < ret; i++)
			printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);

		/* Continue from last seen ID */
		if (ret > 0)
			req.ns_id = ids[ret - 1];
	} while (ret == 100); /* Buffer was full, more may exist */
}

Example 2: List network namespaces only

void list_network_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = NET_NS, /* Only network namespaces */
		.user_ns_id = 0,
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Network namespaces: %zd found\n", ret);
	for (ssize_t i = 0; i < ret; i++)
		printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 3: List namespaces owned by current user namespace

void list_owned_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = 0,                      /* All types */
		.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Namespaces owned by my user namespace: %zd\n", ret);
	for (ssize_t i = 0; i < ret; i++)
		printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 4: List multiple namespace types

void list_network_and_mount_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = NET_NS | MNT_NS, /* Network and mount */
		.user_ns_id = 0,
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Network and mount namespaces: %zd found\n", ret);
}

Example 5: Pagination through large namespace sets

void list_all_with_pagination(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = 0,
		.user_ns_id = 0,
	};
	uint64_t ids[50];
	size_t total = 0;
	ssize_t ret;

	printf("Enumerating all namespaces with pagination:\n");

	while (1) {
		ret = listns(&req, ids, 50, 0);
		if (ret < 0) {
			perror("listns");
			break;
		}
		if (ret == 0)
			break; /* No more namespaces */

		total += ret;
		printf("  Batch: %zd namespaces\n", ret);

		/* Last ID in this batch becomes start of next batch */
		req.ns_id = ids[ret - 1];

		if (ret < 50)
			break; /* Partial batch = end of results */
	}

	printf("Total: %zu namespaces\n", total);
}
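
The pagination loop can be exercised without kernel support by
plugging in a stand-in for listns(). The sketch below does that. The
stand-in (fake_listns) and the counting helper are illustrative names,
and the stand-in encodes an assumption about the semantics: each call
returns IDs strictly greater than req->ns_id, in ascending order.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

struct ns_id_req {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t spare2;
	uint64_t user_ns_id;
};

/* Signature shared by a real wrapper and the stand-in below. */
typedef ssize_t (*listns_fn)(const struct ns_id_req *req, uint64_t *ns_ids,
			     size_t nr_ns_ids, unsigned int flags);

/* Made-up, sorted namespace IDs served by the stand-in. */
static const uint64_t fake_ids[] = { 3, 7, 8, 15, 21, 34, 55 };

/* Stand-in for listns(): returns IDs greater than req->ns_id. */
static ssize_t fake_listns(const struct ns_id_req *req, uint64_t *ns_ids,
			   size_t nr_ns_ids, unsigned int flags)
{
	size_t n = 0;

	(void)flags;
	for (size_t i = 0; i < sizeof(fake_ids) / sizeof(fake_ids[0]); i++) {
		if (fake_ids[i] <= req->ns_id)
			continue;
		if (n == nr_ns_ids)
			break;
		ns_ids[n++] = fake_ids[i];
	}
	return (ssize_t)n;
}

/*
 * Count all namespaces by paging with a buffer of @batch entries.
 * Returns the total, or a negative value on error.
 */
static ssize_t count_namespaces(listns_fn list, size_t batch)
{
	struct ns_id_req req = { .size = sizeof(req) };
	uint64_t ids[16];
	ssize_t ret, total = 0;

	if (batch == 0 || batch > 16)
		return -1;

	for (;;) {
		ret = list(&req, ids, batch, 0);
		if (ret < 0)
			return ret;
		if (ret == 0)
			break; /* Nothing left. */

		total += ret;
		req.ns_id = ids[ret - 1]; /* Resume after the last ID seen. */

		if ((size_t)ret < batch)
			break; /* Partial batch: end of results. */
	}
	return total;
}
```

Note that a full final batch costs one extra call that returns 0, which
is why the loop checks for both a partial batch and an empty one.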

listns() respects namespace isolation and capabilities:

(1) Global listing (user_ns_id = 0):
    - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
    - OR the namespace must be in the caller's namespace context (e.g.,
      a namespace the caller is currently using)
    - User namespaces additionally allow listing if the caller has
      CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
    - Requires CAP_SYS_ADMIN in the specified owner user namespace
    - OR the namespace must be in the caller's namespace context
    - This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
    - Only "active" namespaces are listed
    - A namespace is active if it has a non-zero __ns_ref_active count
    - This includes namespaces used by running processes, held by open
      file descriptors, or kept active by bind mounts
    - Inactive namespaces (kept alive only by internal kernel
      references) are not visible via listns()
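
Read as a predicate, the global-listing rules in (1) combined with the
visibility rule in (3) amount to something like the sketch below. It is
a toy model, not the kernel check: each boolean stands in for a test
the kernel would perform, and all names are illustrative.

```c
#include <stdbool.h>

/* Inputs to the check; every field stands in for a kernel-side test. */
struct toy_listns_check {
	bool active;             /* non-zero active reference count */
	bool admin_in_owner;     /* CAP_SYS_ADMIN in the owning user namespace */
	bool in_caller_context;  /* one of the caller's own namespaces */
	bool is_user_ns;         /* the namespace is itself a user namespace */
	bool admin_in_ns_itself; /* CAP_SYS_ADMIN in that user namespace */
};

/* Would a global listing (user_ns_id = 0) report this namespace? */
static bool toy_may_list(const struct toy_listns_check *c)
{
	/* Inactive namespaces are never listed, whatever the privileges. */
	if (!c->active)
		return false;

	if (c->admin_in_owner || c->in_caller_context)
		return true;

	/* Extra allowance that applies to user namespaces only. */
	return c->is_user_ns && c->admin_in_ns_itself;
}
```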

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (50):
      libfs: allow to specify s_d_flags
      nsfs: use inode_just_drop()
      nsfs: raise DCACHE_DONTCACHE explicitly
      pidfs: raise DCACHE_DONTCACHE explicitly
      nsfs: raise SB_I_NODEV and SB_I_NOEXEC
      nstree: simplify return
      ns: initialize ns_list_node for initial namespaces
      ns: add __ns_ref_read()
      ns: add active reference count
      ns: use anonymous struct to group list member
      nstree: introduce a unified tree
      nstree: allow lookup solely based on inode
      nstree: assign fixed ids to the initial namespaces
      ns: maintain list of owned namespaces
      nstree: add listns()
      arch: hookup listns() system call
      nsfs: update tools header
      selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
      selftests/namespaces: first active reference count tests
      selftests/namespaces: second active reference count tests
      selftests/namespaces: third active reference count tests
      selftests/namespaces: fourth active reference count tests
      selftests/namespaces: fifth active reference count tests
      selftests/namespaces: sixth active reference count tests
      selftests/namespaces: seventh active reference count tests
      selftests/namespaces: eighth active reference count tests
      selftests/namespaces: ninth active reference count tests
      selftests/namespaces: tenth active reference count tests
      selftests/namespaces: eleventh active reference count tests
      selftests/namespaces: twelfth active reference count tests
      selftests/namespaces: thirteenth active reference count tests
      selftests/namespaces: fourteenth active reference count tests
      selftests/namespaces: fifteenth active reference count tests
      selftests/namespaces: add listns() wrapper
      selftests/namespaces: first listns() test
      selftests/namespaces: second listns() test
      selftests/namespaces: third listns() test
      selftests/namespaces: fourth listns() test
      selftests/namespaces: fifth listns() test
      selftests/namespaces: sixth listns() test
      selftests/namespaces: seventh listns() test
      selftests/namespaces: ninth listns() test
      selftests/namespaces: ninth listns() test
      selftests/namespaces: first listns() permission test
      selftests/namespaces: second listns() permission test
      selftests/namespaces: third listns() permission test
      selftests/namespaces: fourth listns() permission test
      selftests/namespaces: fifth listns() permission test
      selftests/namespaces: sixth listns() permission test
      selftests/namespaces: seventh listns() permission test

 arch/alpha/kernel/syscalls/syscall.tbl             |    1 +
 arch/arm/tools/syscall.tbl                         |    1 +
 arch/arm64/tools/syscall_32.tbl                    |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl              |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl        |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl          |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl          |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl          |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl            |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl           |    1 +
 arch/s390/kernel/syscalls/syscall.tbl              |    1 +
 arch/sh/kernel/syscalls/syscall.tbl                |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl             |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl             |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl            |    1 +
 fs/libfs.c                                         |    1 +
 fs/namespace.c                                     |    8 +-
 fs/nsfs.c                                          |   79 +-
 fs/pidfs.c                                         |    1 +
 include/linux/ns_common.h                          |  147 +-
 include/linux/nsfs.h                               |    3 +
 include/linux/nstree.h                             |   26 +-
 include/linux/pseudo_fs.h                          |    1 +
 include/linux/syscalls.h                           |    4 +
 include/uapi/asm-generic/unistd.h                  |    4 +-
 include/uapi/linux/nsfs.h                          |   58 +
 init/version-timestamp.c                           |    5 +
 ipc/msgutil.c                                      |    5 +
 ipc/namespace.c                                    |    1 +
 kernel/cgroup/cgroup.c                             |    5 +
 kernel/cgroup/namespace.c                          |    1 +
 kernel/cred.c                                      |   17 +
 kernel/exit.c                                      |    1 +
 kernel/nscommon.c                                  |   59 +-
 kernel/nsproxy.c                                   |    7 +
 kernel/nstree.c                                    |  527 ++++-
 kernel/pid.c                                       |   15 +
 kernel/pid_namespace.c                             |    1 +
 kernel/time/namespace.c                            |    6 +
 kernel/user.c                                      |    5 +
 kernel/user_namespace.c                            |    1 +
 kernel/utsname.c                                   |    1 +
 net/core/net_namespace.c                           |    3 +-
 scripts/syscall.tbl                                |    1 +
 tools/include/uapi/linux/nsfs.h                    |   70 +
 tools/testing/selftests/filesystems/utils.c        |    2 +-
 tools/testing/selftests/namespaces/.gitignore      |    3 +
 tools/testing/selftests/namespaces/Makefile        |    7 +-
 .../selftests/namespaces/listns_permissions_test.c |  777 +++++++
 tools/testing/selftests/namespaces/listns_test.c   |  656 ++++++
 .../selftests/namespaces/ns_active_ref_test.c      | 2226 ++++++++++++++++++++
 tools/testing/selftests/namespaces/wrappers.h      |   35 +
 53 files changed, 4737 insertions(+), 48 deletions(-)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20251020-work-namespace-nstree-listns-9fd71518515c



* [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 02/50] nsfs: use inode_just_drop() Christian Brauner
                   ` (52 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Make it possible for pseudo filesystems to specify default dentry flags.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/libfs.c                | 1 +
 include/linux/pseudo_fs.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index ce8c496a6940..4bb4d8a313e7 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -680,6 +680,7 @@ static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_export_op = ctx->eops;
 	s->s_xattr = ctx->xattr;
 	s->s_time_gran = 1;
+	s->s_d_flags |= ctx->s_d_flags;
 	root = new_inode(s);
 	if (!root)
 		return -ENOMEM;
diff --git a/include/linux/pseudo_fs.h b/include/linux/pseudo_fs.h
index 2503f7625d65..a651e60d9410 100644
--- a/include/linux/pseudo_fs.h
+++ b/include/linux/pseudo_fs.h
@@ -9,6 +9,7 @@ struct pseudo_fs_context {
 	const struct xattr_handler * const *xattr;
 	const struct dentry_operations *dops;
 	unsigned long magic;
+	unsigned int s_d_flags;
 };
 
 struct pseudo_fs_context *init_pseudo(struct fs_context *fc,

-- 
2.47.3



* [PATCH RFC DRAFT 02/50] nsfs: use inode_just_drop()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 03/50] nsfs: raise DCACHE_DONTCACHE explicitly Christian Brauner
                   ` (51 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Currently nsfs uses the default inode_generic_drop() fallback which
drops the inode when it's unlinked or when it's unhashed. Since nsfs
never hashes inodes that always amounts to dropping the inode.

But that's just annoying to have to reason through every time we look at
this code. Switch to inode_just_drop() which always drops the inode
explicitly. This also aligns the behavior with pidfs which does the
same.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nsfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 648dc59bef7f..4e77eba0c8fc 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -408,6 +408,7 @@ static const struct super_operations nsfs_ops = {
 	.statfs = simple_statfs,
 	.evict_inode = nsfs_evict,
 	.show_path = nsfs_show_path,
+	.drop_inode = inode_just_drop,
 };
 
 static int nsfs_init_inode(struct inode *inode, void *data)

-- 
2.47.3



* [PATCH RFC DRAFT 03/50] nsfs: raise DCACHE_DONTCACHE explicitly
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 02/50] nsfs: use inode_just_drop() Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 04/50] pidfs: " Christian Brauner
                   ` (50 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

While nsfs dentries are never hashed and thus retain_dentry() will never
consider them for placing them on the LRU it isn't great to always have
to go and remember that. Raise DCACHE_DONTCACHE explicitly as a visual
marker that dentries aren't kept but freed immediately instead.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nsfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 4e77eba0c8fc..0e3fe8fda5bf 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -589,6 +589,7 @@ static int nsfs_init_fs_context(struct fs_context *fc)
 	struct pseudo_fs_context *ctx = init_pseudo(fc, NSFS_MAGIC);
 	if (!ctx)
 		return -ENOMEM;
+	ctx->s_d_flags |= DCACHE_DONTCACHE;
 	ctx->ops = &nsfs_ops;
 	ctx->eops = &nsfs_export_operations;
 	ctx->dops = &ns_dentry_operations;

-- 
2.47.3



* [PATCH RFC DRAFT 04/50] pidfs: raise DCACHE_DONTCACHE explicitly
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (2 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 03/50] nsfs: raise DCACHE_DONTCACHE explicitly Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 05/50] nsfs: raise SB_I_NODEV and SB_I_NOEXEC Christian Brauner
                   ` (49 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

While pidfs dentries are never hashed and thus retain_dentry() will never
consider them for placing them on the LRU it isn't great to always have
to go and remember that. Raise DCACHE_DONTCACHE explicitly as a visual
marker that dentries aren't kept but freed immediately instead.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 0ef5b47d796a..db236427fc2c 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -1022,6 +1022,7 @@ static int pidfs_init_fs_context(struct fs_context *fc)
 
 	fc->s_iflags |= SB_I_NOEXEC;
 	fc->s_iflags |= SB_I_NODEV;
+	ctx->s_d_flags |= DCACHE_DONTCACHE;
 	ctx->ops = &pidfs_sops;
 	ctx->eops = &pidfs_export_operations;
 	ctx->dops = &pidfs_dentry_operations;

-- 
2.47.3



* [PATCH RFC DRAFT 05/50] nsfs: raise SB_I_NODEV and SB_I_NOEXEC
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (3 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 04/50] pidfs: " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 06/50] nstree: simplify return Christian Brauner
                   ` (48 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

There's zero need for nsfs to allow device nodes or execution.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nsfs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 0e3fe8fda5bf..363be226e357 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -589,6 +589,8 @@ static int nsfs_init_fs_context(struct fs_context *fc)
 	struct pseudo_fs_context *ctx = init_pseudo(fc, NSFS_MAGIC);
 	if (!ctx)
 		return -ENOMEM;
+	fc->s_iflags |= SB_I_NOEXEC;
+	fc->s_iflags |= SB_I_NODEV;
 	ctx->s_d_flags |= DCACHE_DONTCACHE;
 	ctx->ops = &nsfs_ops;
 	ctx->eops = &nsfs_export_operations;

-- 
2.47.3



* [PATCH RFC DRAFT 06/50] nstree: simplify return
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (4 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 05/50] nsfs: raise SB_I_NODEV and SB_I_NOEXEC Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 07/50] ns: initialize ns_list_node for initial namespaces Christian Brauner
                   ` (47 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

node_to_ns() checks for NULL and the assert isn't really helpful and
will have to be dropped later anyway.
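
The pattern being relied on can be shown with a small userspace sketch.
The toy_ types and helpers below are illustrative and are not the
kernel's definitions: once the node-to-container helper tolerates NULL,
the caller can return its result directly.

```c
#include <stddef.h>

/* Stand-in for a tree node embedded in a larger object. */
struct toy_node {
	struct toy_node *left, *right;
};

struct toy_ns {
	unsigned long long ns_id;
	struct toy_node tree_node;
};

/* NULL-tolerant conversion from an embedded node to its container. */
static struct toy_ns *toy_node_to_ns(struct toy_node *node)
{
	if (!node)
		return NULL;
	return (struct toy_ns *)((char *)node -
				 offsetof(struct toy_ns, tree_node));
}

/* No separate NULL check is needed before converting the lookup result. */
static struct toy_ns *toy_lookup_result(struct toy_node *found)
{
	return toy_node_to_ns(found);
}
```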

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/nstree.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/nstree.c b/kernel/nstree.c
index b24a320a11a6..369fd1675c6a 100644
--- a/kernel/nstree.c
+++ b/kernel/nstree.c
@@ -194,11 +194,6 @@ struct ns_common *ns_tree_lookup_rcu(u64 ns_id, int ns_type)
 			break;
 	} while (read_seqretry(&ns_tree->ns_tree_lock, seq));
 
-	if (!node)
-		return NULL;
-
-	VFS_WARN_ON_ONCE(node_to_ns(node)->ns_type != ns_type);
-
 	return node_to_ns(node);
 }
 

-- 
2.47.3



* [PATCH RFC DRAFT 07/50] ns: initialize ns_list_node for initial namespaces
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (5 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 06/50] nstree: simplify return Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 08/50] ns: add __ns_ref_read() Christian Brauner
                   ` (46 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Make sure that the list is always initialized for initial namespaces.
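
Why this matters for statically allocated objects can be seen in a
userspace sketch. The toy_ list type below only imitates the kernel's
struct list_head and LIST_HEAD_INIT(), and all names are illustrative:
a zero-initialized node has NULL pointers instead of pointing at
itself, so it does not look like an empty list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal imitation of a circular doubly linked list head. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

/* An empty list points at itself in both directions. */
#define TOY_LIST_HEAD_INIT(name) { &(name), &(name) }

static bool toy_list_empty(const struct toy_list_head *head)
{
	return head->next == head;
}

struct toy_ns {
	int inum;
	struct toy_list_head ns_list_node;
};

/* Statically allocated object with an explicitly initialized node. */
static struct toy_ns toy_init_ns = {
	.inum = 1,
	.ns_list_node = TOY_LIST_HEAD_INIT(toy_init_ns.ns_list_node),
};

/* Node left zeroed: next and prev are NULL rather than self-pointers. */
static struct toy_ns toy_uninit_ns = {
	.inum = 2,
};
```

toy_list_empty() is true for toy_init_ns but false for toy_uninit_ns,
and any list walk or deletion on the latter would follow NULL pointers.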

Fixes: 885fc8ac0a4d ("nstree: make iterator generic")
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c           | 1 +
 init/version-timestamp.c | 1 +
 ipc/msgutil.c            | 1 +
 kernel/cgroup/cgroup.c   | 1 +
 kernel/pid.c             | 1 +
 kernel/time/namespace.c  | 1 +
 kernel/user.c            | 1 +
 7 files changed, 7 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index d82910f33dc4..8ef8ba3dd316 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5993,6 +5993,7 @@ struct mnt_namespace init_mnt_ns = {
 	.passive	= REFCOUNT_INIT(1),
 	.mounts		= RB_ROOT,
 	.poll		= __WAIT_QUEUE_HEAD_INITIALIZER(init_mnt_ns.poll),
+	.ns.ns_list_node = LIST_HEAD_INIT(init_mnt_ns.ns.ns_list_node),
 };
 
 static void __init init_mount_tree(void)
diff --git a/init/version-timestamp.c b/init/version-timestamp.c
index d071835121c2..61b2405d97f9 100644
--- a/init/version-timestamp.c
+++ b/init/version-timestamp.c
@@ -20,6 +20,7 @@ struct uts_namespace init_uts_ns = {
 	},
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_uts_ns),
+	.ns.ns_list_node = LIST_HEAD_INIT(init_uts_ns.ns.ns_list_node),
 #ifdef CONFIG_UTS_NS
 	.ns.ops = &utsns_operations,
 #endif
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index 7a03f6d03de3..c9469fbce27c 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -30,6 +30,7 @@ struct ipc_namespace init_ipc_ns = {
 	.ns.__ns_ref = REFCOUNT_INIT(1),
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_ipc_ns),
+	.ns.ns_list_node = LIST_HEAD_INIT(init_ipc_ns.ns.ns_list_node),
 #ifdef CONFIG_IPC_NS
 	.ns.ops = &ipcns_operations,
 #endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6ae5f48cf64e..a82918da8bae 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -256,6 +256,7 @@ struct cgroup_namespace init_cgroup_ns = {
 	.ns.inum	= ns_init_inum(&init_cgroup_ns),
 	.root_cset	= &init_css_set,
 	.ns.ns_type	= ns_common_type(&init_cgroup_ns),
+	.ns.ns_list_node = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_list_node),
 };
 
 static struct file_system_type cgroup2_fs_type;
diff --git a/kernel/pid.c b/kernel/pid.c
index 4fffec767a63..cb7574ca00f7 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -78,6 +78,7 @@ struct pid_namespace init_pid_ns = {
 	.child_reaper = &init_task,
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_pid_ns),
+	.ns.ns_list_node = LIST_HEAD_INIT(init_pid_ns.ns.ns_list_node),
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 5b6997f4dc3d..ee05cad288da 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -484,6 +484,7 @@ struct time_namespace init_time_ns = {
 	.ns.inum	= ns_init_inum(&init_time_ns),
 	.ns.ops		= &timens_operations,
 	.frozen_offsets	= true,
+	.ns.ns_list_node = LIST_HEAD_INIT(init_time_ns.ns.ns_list_node),
 };
 
 void __init time_ns_init(void)
diff --git a/kernel/user.c b/kernel/user.c
index 0163665914c9..b9cf3b056a71 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -70,6 +70,7 @@ struct user_namespace init_user_ns = {
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.ns.inum = ns_init_inum(&init_user_ns),
+	.ns.ns_list_node = LIST_HEAD_INIT(init_user_ns.ns.ns_list_node),
 #ifdef CONFIG_USER_NS
 	.ns.ops = &userns_operations,
 #endif

-- 
2.47.3



* [PATCH RFC DRAFT 08/50] ns: add __ns_ref_read()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (6 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 07/50] ns: initialize ns_list_node for initial namespaces Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 09/50] ns: add active reference count Christian Brauner
                   ` (45 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Implement ns_ref_read() the same way as ns_ref_{get,put}().
No point in making that any more special or different from the other
helpers.
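
For readers unfamiliar with the pattern, here is a minimal userspace
sketch of a type-dispatching macro that forwards to one typed inline
helper. All demo_* names are made up for illustration and are not the
kernel's definitions; the sketch only assumes a C11 compiler with
_Generic support.

```c
#include <assert.h>

/* Hypothetical stand-ins for struct ns_common and two namespace types. */
struct demo_common {
	int ref;
};

struct demo_net {
	struct demo_common ns;
};

struct demo_pid {
	struct demo_common ns;
};

/* Map any supported namespace pointer to its embedded common part. */
#define demo_to_common(__ns)                          \
	_Generic((__ns),                              \
		struct demo_net *: &(__ns)->ns,       \
		struct demo_pid *: &(__ns)->ns)

/* Typed helper: the argument is checked once, against one prototype. */
static inline int demo_common_ref_read(const struct demo_common *ns)
{
	return ns->ref;
}

/* Thin wrapper, shaped like the get/put wrappers in the patch. */
#define demo_ref_read(__ns) demo_common_ref_read(demo_to_common((__ns)))
```

Passing a pointer type that isn't listed in the _Generic selection
fails at compile time, which is the main benefit over an open-coded
macro.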

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/ns_common.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index f5b68b8abb54..32114d5698dc 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -143,7 +143,12 @@ static __always_inline __must_check bool __ns_ref_get(struct ns_common *ns)
 	return refcount_inc_not_zero(&ns->__ns_ref);
 }
 
-#define ns_ref_read(__ns) refcount_read(&to_ns_common((__ns))->__ns_ref)
+static __always_inline __must_check int __ns_ref_read(const struct ns_common *ns)
+{
+	return refcount_read(&ns->__ns_ref);
+}
+
+#define ns_ref_read(__ns) __ns_ref_read(to_ns_common((__ns)))
 #define ns_ref_inc(__ns) refcount_inc(&to_ns_common((__ns))->__ns_ref)
 #define ns_ref_get(__ns) __ns_ref_get(to_ns_common((__ns)))
 #define ns_ref_put(__ns) __ns_ref_put(to_ns_common((__ns)))

-- 
2.47.3



* [PATCH RFC DRAFT 09/50] ns: add active reference count
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (7 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 08/50] ns: add __ns_ref_read() Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 10/50] ns: use anonymous struct to group list member Christian Brauner
                   ` (44 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

The namespace tree is, among other things, currently used to support
file handles for namespaces. When a namespace is created it is placed on
the namespace trees and when it is destroyed it is removed from the
namespace trees.

While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.

On current kernels a namespace is visible to userspace in the
following cases:

(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file, bind-mount).
    Note that (2) only cares about direct persistence of the namespace
    itself not indirectly via e.g., file->f_cred file references or
    similar.
(3) The namespace is a hierarchical namespace type and is the parent of
    a single or multiple child namespaces.

Case (3) is interesting because it is possible that a parent namespace
might not fulfill any of (1) or (2), i.e., is invisible to userspace but
it may still be resurrected through the NS_GET_PARENT ioctl().

Currently namespace file handles allow much broader access to namespaces
than what is currently possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.

For example, a user namespace may remain pinned by get_cred() calls to
stash the opener's credentials into file->f_cred. As it stands file
handles allow resurrecting such a user namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't do.

Consider the following insane case. Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the CPU's runqueue switches back to another task. The kernel thread will
take a reference on the struct mm_struct. For ptrace() based access
checks struct mm_struct stashes the user namespace of the task that
struct mm_struct belonged to originally and thus takes a reference to
the user namespace and pins it.

So on a big idle system with loads and loads of CPUs user namespaces can
be persisted for an arbitrary amount of time, which also means that they
can be resurrected using namespace file handles. That makes no sense
whatsoever.

We could of course try and fix this but this is pointless surgery and
there's no need to change the refcounting rules for the actual __ns_ref
count. Instead we introduce a proper liveness reference count
__ns_ref_active which tracks (1)-(3). This is easy to do as all of these
things are already managed centrally. Only (1)-(3) will count towards
the __ns_ref_active count and only namespaces which are active may be
opened via namespace file handles.

The __ns_ref_active reference count does not regulate the lifetime of
the namespace itself. This is still done by __ns_ref. The
__ns_ref_active count can only be elevated if the __ns_ref count is
non-zero.
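
To make the split between the two counters concrete, here is a rough
userspace sketch using C11 atomics. It is not the kernel code: the
demo_* names are hypothetical, plain atomic_int stands in for refcount_t
and atomic_t, and owner propagation and locking are left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ns_common; not the kernel type. */
struct demo_ns {
	atomic_int ref;        /* models __ns_ref: memory lifetime */
	atomic_int ref_active; /* models __ns_ref_active: visibility */
};

static void demo_ns_init(struct demo_ns *ns)
{
	atomic_init(&ns->ref, 1);        /* creator holds one reference */
	atomic_init(&ns->ref_active, 0); /* starts out inactive */
}

/* Take a lifetime reference unless the count already dropped to zero. */
static bool demo_ref_get(struct demo_ns *ns)
{
	int old = atomic_load(&ns->ref);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&ns->ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a lifetime reference; true means the last one is gone. */
static bool demo_ref_put(struct demo_ns *ns)
{
	return atomic_fetch_sub(&ns->ref, 1) == 1;
}

/* The active count may only be elevated while ref is non-zero. */
static void demo_active_get(struct demo_ns *ns)
{
	assert(atomic_load(&ns->ref) > 0);
	atomic_fetch_add(&ns->ref_active, 1);
}

/* Drop an active reference; true means the namespace went inactive. */
static bool demo_active_put(struct demo_ns *ns)
{
	return atomic_fetch_sub(&ns->ref_active, 1) == 1;
}

/* Models the file handle path: only active namespaces can be reopened. */
static bool demo_open_by_handle(struct demo_ns *ns)
{
	return atomic_load(&ns->ref_active) > 0 && demo_ref_get(ns);
}
```

In this model a namespace with ref > 0 and ref_active == 0 stays
allocated and reachable internally, but demo_open_by_handle() refuses to
hand out a new reference to it.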

Furthermore, it also doesn't regulate the presence of a namespace on the
namespace trees. Any namespace remains on the namespace trees until it
is actually destroyed. This will allow the kernel to always reach any
namespace trivially and it will also enable stuff like bpf to walk the
namespace lists on the system for tracing or general introspection
purposes to e.g., debug problems where namespaces are pinned.

Different namespaces under /proc/<pid>/ns/<ns_type> may have different
lifetimes on current kernels. While most namespaces are immediately
released when the last task using them _exits_, the user- and pid
namespace are persisted and thus both remain accessible via
/proc/<pid>/ns/<ns_type>. The user namespace lifetime is aligned with
struct cred and is only released through exit_creds(). However, it
becomes inaccessible to userspace once the last task using it is
_reaped_, i.e., when release_task() is called and all proc entries are
flushed. Similarly, the pid namespace is also visible until the last
task using it has been reaped. Both the user- and pid namespace are
marked as inactive once the task is reaped.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c            |   2 +
 fs/nsfs.c                 |  32 +++++++++++-
 include/linux/ns_common.h | 123 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/nsfs.h      |   3 ++
 init/version-timestamp.c  |   1 +
 ipc/msgutil.c             |   1 +
 ipc/namespace.c           |   1 +
 kernel/cgroup/cgroup.c    |   1 +
 kernel/cgroup/namespace.c |   1 +
 kernel/cred.c             |  17 +++++++
 kernel/exit.c             |   1 +
 kernel/nscommon.c         |  53 +++++++++++++++++++-
 kernel/nsproxy.c          |   7 +++
 kernel/pid.c              |  11 +++++
 kernel/pid_namespace.c    |   1 +
 kernel/time/namespace.c   |   2 +
 kernel/user.c             |   1 +
 kernel/user_namespace.c   |   1 +
 kernel/utsname.c          |   1 +
 net/core/net_namespace.c  |   1 +
 20 files changed, 259 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8ef8ba3dd316..87116def5ee3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4173,6 +4173,7 @@ struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
 			p = next_mnt(skip_mnt_tree(p), old);
 	}
 	ns_tree_add_raw(new_ns);
+	ns_ref_active_get_owner(new_ns);
 	return new_ns;
 }
 
@@ -5989,6 +5990,7 @@ struct mnt_namespace init_mnt_ns = {
 	.ns.ops		= &mntns_operations,
 	.user_ns	= &init_user_ns,
 	.ns.__ns_ref	= REFCOUNT_INIT(1),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.ns.ns_type	= ns_common_type(&init_mnt_ns),
 	.passive	= REFCOUNT_INIT(1),
 	.mounts		= RB_ROOT,
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 363be226e357..a190e1e38442 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -58,6 +58,8 @@ const struct dentry_operations ns_dentry_operations = {
 static void nsfs_evict(struct inode *inode)
 {
 	struct ns_common *ns = inode->i_private;
+
+	__ns_ref_active_put(ns);
 	clear_inode(inode);
 	ns->ops->put(ns);
 }
@@ -419,6 +421,10 @@ static int nsfs_init_inode(struct inode *inode, void *data)
 	inode->i_mode |= S_IRUGO;
 	inode->i_fop = &ns_file_operations;
 	inode->i_ino = ns->inum;
+
+	if (!__ns_ref_active_get_not_zero(ns))
+		return -ENOENT;
+
 	return 0;
 }
 
@@ -493,7 +499,7 @@ static struct dentry *nsfs_fh_to_dentry(struct super_block *sb, struct fid *fh,
 		VFS_WARN_ON_ONCE(ns->ns_type != fid->ns_type);
 		VFS_WARN_ON_ONCE(ns->inum != fid->ns_inum);
 
-		if (!__ns_ref_get(ns))
+		if (!ns_get(ns))
 			return NULL;
 	}
 
@@ -614,3 +620,27 @@ void __init nsfs_init(void)
 	nsfs_root_path.mnt = nsfs_mnt;
 	nsfs_root_path.dentry = nsfs_mnt->mnt_root;
 }
+
+void nsproxy_ns_active_get(struct nsproxy *ns)
+{
+	ns_ref_active_get(ns->mnt_ns);
+	ns_ref_active_get(ns->uts_ns);
+	ns_ref_active_get(ns->ipc_ns);
+	ns_ref_active_get(ns->pid_ns_for_children);
+	ns_ref_active_get(ns->cgroup_ns);
+	ns_ref_active_get(ns->net_ns);
+	ns_ref_active_get(ns->time_ns);
+	ns_ref_active_get(ns->time_ns_for_children);
+}
+
+void nsproxy_ns_active_put(struct nsproxy *ns)
+{
+	ns_ref_active_put(ns->mnt_ns);
+	ns_ref_active_put(ns->uts_ns);
+	ns_ref_active_put(ns->ipc_ns);
+	ns_ref_active_put(ns->pid_ns_for_children);
+	ns_ref_active_put(ns->cgroup_ns);
+	ns_ref_active_put(ns->net_ns);
+	ns_ref_active_put(ns->time_ns);
+	ns_ref_active_put(ns->time_ns_for_children);
+}
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 32114d5698dc..5d19471235ab 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -4,7 +4,9 @@
 
 #include <linux/refcount.h>
 #include <linux/rbtree.h>
+#include <linux/vfsdebug.h>
 #include <uapi/linux/sched.h>
+#include <uapi/linux/nsfs.h>
 
 struct proc_ns_operations;
 
@@ -37,6 +39,73 @@ extern const struct proc_ns_operations cgroupns_operations;
 extern const struct proc_ns_operations timens_operations;
 extern const struct proc_ns_operations timens_for_children_operations;
 
+/*
+ * Namespace lifetimes are managed via a two-tier reference counting model:
+ *
+ * (1) __ns_ref (refcount_t): Main reference count tracking memory lifetime.
+ *     Controls when the namespace structure itself is freed. It also
+ *     pins the namespace on the namespace trees whereas (2) only
+ *     regulates their visibility to userspace.
+ *
+ * (2) __ns_ref_active (atomic_t): Reference count tracking active users.
+ *     Controls visibility of the namespace in the namespace trees.
+ *     Any live task that uses the namespace (via nsproxy or cred) holds
+ *     an active reference. Any open file descriptor or bind-mount of
+ *     the namespace holds an active reference. Once all tasks have
+ *     called exit_task_namespaces() and all file descriptors and
+ *     bind-mounts have been released the active reference count drops
+ *     to zero and the namespace becomes inactive. IOW, the namespace
+ *     cannot be listed or opened via file handles anymore.
+ *
+ *     Note that it is valid to transition from active to inactive and
+ *     back from inactive to active e.g., when walking a hierarchical
+ *     namespace tree upwards and reopening parent namespaces via
+ *     NS_GET_PARENT that only exist because they are a parent of an
+ *     actively used namespace.
+ *
+ * Relationship and lifecycle states:
+ *
+ * - Active (__ns_ref_active > 0):
+ *   Namespace is actively used by one or more tasks. The namespace can
+ *   be reopened via /proc/<pid>/ns/<ns_type> or discovered in the
+ *   namespace tree.
+ *
+ * - Inactive (__ns_ref_active == 0, __ns_ref > 0):
+ *   No tasks are actively using the namespace and it isn't pinned by
+ *   any bind-mounts or open file descriptors anymore. But the namespace
+ *   is still kept alive by internal references. For example, the user
+ *   namespace could be pinned by an open file through file->f_cred
+ *   references when one of the now defunct tasks had opened a file and
+ *   handed the file descriptor off to another process via a UNIX
+ *   socket. Such references keep the namespace structure alive through
+ *   __ns_ref but will not take an active reference.
+ *
+ * - Destroyed (__ns_ref == 0):
+ *   No references remain. The namespace is removed from the tree and freed.
+ *
+ * State transitions:
+ *
+ * Active -> Inactive:
+ *   When the last task using the namespace exits (via
+ *   exit_task_namespaces()), it drops its active references to all
+ *   namespaces apart from the pid namespace which remains accessible
+ *   until the task has been reaped and its pid number is released.
+ *
+ * Inactive -> Active:
+ *   When walking a hierarchical namespace tree upwards and reopening
+ *   parent namespaces via NS_GET_PARENT that only exist because they
+ *   are a parent of an actively used namespace it is possible to
+ *   necrobump an inactive namespace back to the active state.
+ *
+ * Inactive -> Destroyed:
+ *   When __ns_ref drops to zero (last file descriptor closed, last bind
+ *   mount removed, parent namespace released), the namespace is removed
+ *   from the tree and the memory is freed (after RCU grace period).
+ *
+ * Initial namespaces:
+ *   Boot-time namespaces (init_net, init_pid_ns, etc.) start with
+ *   __ns_ref_active = 1 and remain active forever.
+ */
 struct ns_common {
 	u32 ns_type;
 	struct dentry *stashed;
@@ -48,6 +117,7 @@ struct ns_common {
 			u64 ns_id;
 			struct rb_node ns_tree_node;
 			struct list_head ns_list_node;
+			atomic_t __ns_ref_active; /* do not use directly */
 		};
 		struct rcu_head ns_rcu;
 	};
@@ -56,6 +126,13 @@ struct ns_common {
 int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_operations *ops, int inum);
 void __ns_common_free(struct ns_common *ns);
 
+static __always_inline bool is_initial_namespace(struct ns_common *ns)
+{
+	VFS_WARN_ON_ONCE(ns->inum == 0);
+	return unlikely(in_range(ns->inum, MNT_NS_INIT_INO,
+				 IPC_NS_INIT_INO - MNT_NS_INIT_INO + 1));
+}
+
 #define to_ns_common(__ns)                                    \
 	_Generic((__ns),                                      \
 		struct cgroup_namespace *:       &(__ns)->ns, \
@@ -133,6 +210,11 @@ void __ns_common_free(struct ns_common *ns);
 
 #define ns_common_free(__ns) __ns_common_free(to_ns_common((__ns)))
 
+static __always_inline __must_check int __ns_ref_active_read(const struct ns_common *ns)
+{
+	return atomic_read(&ns->__ns_ref_active);
+}
+
 static __always_inline __must_check bool __ns_ref_put(struct ns_common *ns)
 {
 	return refcount_dec_and_test(&ns->__ns_ref);
@@ -155,4 +237,45 @@ static __always_inline __must_check int __ns_ref_read(const struct ns_common *ns
 #define ns_ref_put_and_lock(__ns, __lock) \
 	refcount_dec_and_lock(&to_ns_common((__ns))->__ns_ref, (__lock))
 
+#define ns_ref_active_read(__ns) \
+	((__ns) ? __ns_ref_active_read(to_ns_common(__ns)) : 0)
+
+void __ns_ref_active_get_owner(struct ns_common *ns);
+
+static __always_inline void __ns_ref_active_get(struct ns_common *ns)
+{
+	WARN_ON_ONCE(atomic_add_negative(1, &ns->__ns_ref_active));
+	VFS_WARN_ON_ONCE(is_initial_namespace(ns) && __ns_ref_active_read(ns) <= 0);
+}
+#define ns_ref_active_get(__ns) \
+	do { if (__ns) __ns_ref_active_get(to_ns_common(__ns)); } while (0)
+
+static __always_inline bool __ns_ref_active_get_not_zero(struct ns_common *ns)
+{
+	return atomic_inc_not_zero(&ns->__ns_ref_active);
+}
+
+#define ns_ref_active_get_owner(__ns) \
+	do { if (__ns) __ns_ref_active_get_owner(to_ns_common(__ns)); } while (0)
+
+void __ns_ref_active_put_owner(struct ns_common *ns);
+
+static __always_inline void __ns_ref_active_put(struct ns_common *ns)
+{
+	if (atomic_dec_and_test(&ns->__ns_ref_active))
+		__ns_ref_active_put_owner(ns);
+}
+#define ns_ref_active_put(__ns) \
+	do { if (__ns) __ns_ref_active_put(to_ns_common(__ns)); } while (0)
+
+/*
+ * Grab a reference if the namespace is still active. This is
+ * intentionally racy.
+ */
+static __always_inline bool ns_get(struct ns_common *ns)
+{
+	VFS_WARN_ON_ONCE(__ns_ref_active_read(ns) && !__ns_ref_read(ns));
+	return __ns_ref_active_read(ns) && __ns_ref_get(ns);
+}
+
 #endif
diff --git a/include/linux/nsfs.h b/include/linux/nsfs.h
index e5a5fa83d36b..731b67fc2fec 100644
--- a/include/linux/nsfs.h
+++ b/include/linux/nsfs.h
@@ -37,4 +37,7 @@ void nsfs_init(void);
 
 #define current_in_namespace(__ns) (__current_namespace_from_type(__ns) == __ns)
 
+void nsproxy_ns_active_get(struct nsproxy *ns);
+void nsproxy_ns_active_put(struct nsproxy *ns);
+
 #endif /* _LINUX_NSFS_H */
diff --git a/init/version-timestamp.c b/init/version-timestamp.c
index 61b2405d97f9..c38498f94646 100644
--- a/init/version-timestamp.c
+++ b/init/version-timestamp.c
@@ -10,6 +10,7 @@
 struct uts_namespace init_uts_ns = {
 	.ns.ns_type = ns_common_type(&init_uts_ns),
 	.ns.__ns_ref = REFCOUNT_INIT(2),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.name = {
 		.sysname	= UTS_SYSNAME,
 		.nodename	= UTS_NODENAME,
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index c9469fbce27c..d7c66b430470 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -28,6 +28,7 @@ DEFINE_SPINLOCK(mq_lock);
  */
 struct ipc_namespace init_ipc_ns = {
 	.ns.__ns_ref = REFCOUNT_INIT(1),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_ipc_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_ipc_ns.ns.ns_list_node),
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 59b12fcb40bd..23390d4f9b1f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -87,6 +87,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 	sem_init_ns(ns);
 	shm_init_ns(ns);
 	ns_tree_add(ns);
+	ns_ref_active_get_owner(ns);
 
 	return ns;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a82918da8bae..a18ec090ad7e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -251,6 +251,7 @@ bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
 	.ns.__ns_ref	= REFCOUNT_INIT(2),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.user_ns	= &init_user_ns,
 	.ns.ops		= &cgroupns_operations,
 	.ns.inum	= ns_init_inum(&init_cgroup_ns),
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index fdbe57578e68..08be24baad98 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -31,6 +31,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	if (ret)
 		return ERR_PTR(ret);
 	ns_tree_add(new_ns);
+	ns_ref_active_get_owner(new_ns);
 	return no_free_ptr(new_ns);
 }
 
diff --git a/kernel/cred.c b/kernel/cred.c
index dbf6b687dc5c..25de5b76bbe4 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -343,6 +343,13 @@ int copy_creds(struct task_struct *p, u64 clone_flags)
 
 	p->cred = p->real_cred = get_cred(new);
 	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
+
+	/*
+	 * Increment active ref for user_ns. Each task gets its own active
+	 * reference, even if CLONE_THREAD shares the cred structure.
+	 */
+	ns_ref_active_get(new->user_ns);
+
 	return 0;
 
 error_put:
@@ -435,6 +442,16 @@ int commit_creds(struct cred *new)
 	 */
 	if (new->user != old->user || new->user_ns != old->user_ns)
 		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+
+	/*
+	 * Swap active refs if changing user_ns. This task is switching from
+	 * actively using old->user_ns to actively using new->user_ns.
+	 */
+	if (new->user_ns != old->user_ns) {
+		ns_ref_active_get(new->user_ns);
+		ns_ref_active_put(old->user_ns);
+	}
+
 	rcu_assign_pointer(task->real_cred, new);
 	rcu_assign_pointer(task->cred, new);
 	if (new->user != old->user || new->user_ns != old->user_ns)
diff --git a/kernel/exit.c b/kernel/exit.c
index 9f74e8f1c431..599375553c1f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -291,6 +291,7 @@ void release_task(struct task_struct *p)
 	write_unlock_irq(&tasklist_lock);
 	/* @thread_pid can't go away until free_pids() below */
 	proc_flush_pid(thread_pid);
+	ns_ref_active_put(p->real_cred->user_ns);
 	add_device_randomness(&p->se.sum_exec_runtime,
 			      sizeof(p->se.sum_exec_runtime));
 	free_pids(post.pids);
diff --git a/kernel/nscommon.c b/kernel/nscommon.c
index c1fb2bad6d72..a324a12868fc 100644
--- a/kernel/nscommon.c
+++ b/kernel/nscommon.c
@@ -2,6 +2,7 @@
 
 #include <linux/ns_common.h>
 #include <linux/proc_ns.h>
+#include <linux/user_namespace.h>
 #include <linux/vfsdebug.h>
 
 #ifdef CONFIG_DEBUG_VFS
@@ -52,6 +53,8 @@ static void ns_debug(struct ns_common *ns, const struct proc_ns_operations *ops)
 
 int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_operations *ops, int inum)
 {
+	int ret;
+
 	refcount_set(&ns->__ns_ref, 1);
 	ns->stashed = NULL;
 	ns->ops = ops;
@@ -68,10 +71,58 @@ int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_ope
 		ns->inum = inum;
 		return 0;
 	}
-	return proc_alloc_inum(&ns->inum);
+	ret = proc_alloc_inum(&ns->inum);
+	if (ret)
+		return ret;
+	/*
+	 * Tree ref starts at 0. It's incremented when namespace enters
+	 * active use (installed in nsproxy) and decremented when all
+	 * active uses are gone. Initial namespaces are always active.
+	 */
+	if (is_initial_namespace(ns))
+		atomic_set(&ns->__ns_ref_active, 1);
+	else
+		atomic_set(&ns->__ns_ref_active, 0);
+	return 0;
 }
 
 void __ns_common_free(struct ns_common *ns)
 {
 	proc_free_inum(ns->inum);
 }
+
+void __ns_ref_active_get_owner(struct ns_common *ns)
+{
+	struct user_namespace *owner;
+
+	if (unlikely(!ns->ops))
+		return;
+	VFS_WARN_ON_ONCE(!ns->ops->owner);
+	owner = ns->ops->owner(ns);
+	VFS_WARN_ON_ONCE(!owner && ns != to_ns_common(&init_user_ns));
+	if (!owner)
+		return;
+	/* Skip init_user_ns as it's always active */
+	if (owner == &init_user_ns)
+		return;
+	WARN_ON_ONCE(atomic_add_negative(1, &to_ns_common(owner)->__ns_ref_active));
+}
+
+void __ns_ref_active_put_owner(struct ns_common *ns)
+{
+	struct user_namespace *owner;
+
+	do {
+		if (unlikely(!ns->ops))
+			return;
+		VFS_WARN_ON_ONCE(!ns->ops->owner);
+		owner = ns->ops->owner(ns);
+		VFS_WARN_ON_ONCE(!owner && ns != to_ns_common(&init_user_ns));
+		if (!owner)
+			return;
+		/* Skip init_user_ns as it's always active */
+		if (owner == &init_user_ns)
+			return;
+		ns = to_ns_common(owner);
+	} while (atomic_dec_and_test(&to_ns_common(owner)->__ns_ref_active));
+}
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 19aa64ab08c8..3324d827f6bc 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -26,6 +26,7 @@
 #include <linux/syscalls.h>
 #include <linux/cgroup.h>
 #include <linux/perf_event.h>
+#include <linux/nstree.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -179,12 +180,15 @@ int copy_namespaces(u64 flags, struct task_struct *tsk)
 	if ((flags & CLONE_VM) == 0)
 		timens_on_fork(new_ns, tsk);
 
+	nsproxy_ns_active_get(new_ns);
 	tsk->nsproxy = new_ns;
 	return 0;
 }
 
 void free_nsproxy(struct nsproxy *ns)
 {
+	nsproxy_ns_active_put(ns);
+
 	put_mnt_ns(ns->mnt_ns);
 	put_uts_ns(ns->uts_ns);
 	put_ipc_ns(ns->ipc_ns);
@@ -232,6 +236,9 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 
 	might_sleep();
 
+	if (new)
+		nsproxy_ns_active_get(new);
+
 	task_lock(p);
 	ns = p->nsproxy;
 	p->nsproxy = new;
diff --git a/kernel/pid.c b/kernel/pid.c
index cb7574ca00f7..4f7b5054e23d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -72,6 +72,7 @@ static int pid_max_max = PID_MAX_LIMIT;
  */
 struct pid_namespace init_pid_ns = {
 	.ns.__ns_ref = REFCOUNT_INIT(2),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.idr = IDR_INIT(init_pid_ns.idr),
 	.pid_allocated = PIDNS_ADDING,
 	.level = 0,
@@ -118,9 +119,13 @@ static void delayed_put_pid(struct rcu_head *rhp)
 void free_pid(struct pid *pid)
 {
 	int i;
+	struct pid_namespace *active_ns;
 
 	lockdep_assert_not_held(&tasklist_lock);
 
+	active_ns = pid->numbers[pid->level].ns;
+	ns_ref_active_put(active_ns);
+
 	spin_lock(&pidmap_lock);
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
@@ -285,6 +290,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	spin_unlock(&pidmap_lock);
 	idr_preload_end();
 
+	/*
+	 * Increment active ref for the task's active PID namespace.
+	 * This marks the namespace as actively in use by a running task.
+	 */
+	ns_ref_active_get(ns);
+
 	return pid;
 
 out_unlock:
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 650be58d8d18..2e678338c6d1 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -124,6 +124,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 #endif
 
 	ns_tree_add(ns);
+	ns_ref_active_get_owner(ns);
 	return ns;
 
 out_free_inum:
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index ee05cad288da..2e7c110bd13f 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -106,6 +106,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 	ns->offsets = old_ns->offsets;
 	ns->frozen_offsets = false;
 	ns_tree_add(ns);
+	ns_ref_active_get_owner(ns);
 	return ns;
 
 fail_free_page:
@@ -480,6 +481,7 @@ const struct proc_ns_operations timens_for_children_operations = {
 struct time_namespace init_time_ns = {
 	.ns.ns_type	= ns_common_type(&init_time_ns),
 	.ns.__ns_ref	= REFCOUNT_INIT(3),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.user_ns	= &init_user_ns,
 	.ns.inum	= ns_init_inum(&init_time_ns),
 	.ns.ops		= &timens_operations,
diff --git a/kernel/user.c b/kernel/user.c
index b9cf3b056a71..bf60532856db 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -67,6 +67,7 @@ struct user_namespace init_user_ns = {
 	},
 	.ns.ns_type = ns_common_type(&init_user_ns),
 	.ns.__ns_ref = REFCOUNT_INIT(3),
+	.ns.__ns_ref_active = ATOMIC_INIT(1),
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.ns.inum = ns_init_inum(&init_user_ns),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 03cb63883d04..9da3ee9e2eb1 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -160,6 +160,7 @@ int create_user_ns(struct cred *new)
 
 	set_cred_user_ns(new, ns);
 	ns_tree_add(ns);
+	ns_ref_active_get_owner(ns);
 	return 0;
 fail_keyring:
 #ifdef CONFIG_PERSISTENT_KEYRINGS
diff --git a/kernel/utsname.c b/kernel/utsname.c
index ebbfc578a9d3..18f55a05ad5b 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -60,6 +60,7 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
 	ns->user_ns = get_user_ns(user_ns);
 	up_read(&uts_sem);
 	ns_tree_add(ns);
+	ns_ref_active_get_owner(ns);
 	return ns;
 
 fail_free:
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b0e0f22d7b21..f30fb78f020c 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -450,6 +450,7 @@ static __net_init int setup_net(struct net *net)
 	list_add_tail_rcu(&net->list, &net_namespace_list);
 	up_write(&net_rwsem);
 	ns_tree_add_raw(net);
+	ns_ref_active_get_owner(net);
 out:
 	return error;
 

-- 
2.47.3



* [PATCH RFC DRAFT 10/50] ns: use anonymous struct to group list member
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (8 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 09/50] ns: add active reference count Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 11/50] nstree: introduce a unified tree Christian Brauner
                   ` (43 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Make it easier to spot that they belong together conceptually.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/ns_common.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 5d19471235ab..34e072986955 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -115,8 +115,10 @@ struct ns_common {
 	union {
 		struct {
 			u64 ns_id;
-			struct rb_node ns_tree_node;
-			struct list_head ns_list_node;
+			struct /* per type rbtree and list */ {
+				struct rb_node ns_tree_node;
+				struct list_head ns_list_node;
+			};
 			atomic_t __ns_ref_active; /* do not use directly */
 		};
 		struct rcu_head ns_rcu;

-- 
2.47.3



* [PATCH RFC DRAFT 11/50] nstree: introduce a unified tree
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (9 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 10/50] ns: use anonymous struct to group list member Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 12/50] nstree: allow lookup solely based on inode Christian Brauner
                   ` (42 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

This will allow userspace to lookup and stat a namespace simply by its
identifier without having to know what type of namespace it is.
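
As a rough illustration of what a lookup keyed only on the identifier
looks like, here is a userspace sketch. It assumes a plain unbalanced
binary search tree instead of the kernel's rbtree and ignores locking
(the patch protects the real tree with ns_tree_lock). demo_node,
demo_unified_add() and demo_unified_lookup() are hypothetical names and
not part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a namespace entry; not struct ns_common. */
struct demo_node {
	uint64_t ns_id;
	int ns_type; /* e.g. a CLONE_NEW* value */
	struct demo_node *left;
	struct demo_node *right;
};

/* Insert into a binary search tree keyed on ns_id only. */
static void demo_unified_add(struct demo_node **root, struct demo_node *ns)
{
	while (*root)
		root = ns->ns_id < (*root)->ns_id ? &(*root)->left
						  : &(*root)->right;
	ns->left = NULL;
	ns->right = NULL;
	*root = ns;
}

/* Find an entry by id; the caller learns the type from the result. */
static struct demo_node *demo_unified_lookup(struct demo_node *root,
					     uint64_t ns_id)
{
	while (root && root->ns_id != ns_id)
		root = ns_id < root->ns_id ? root->left : root->right;
	return root;
}
```

With per-type trees the caller has to pick the right tree before it can
search. With a single tree the type falls out of the lookup result.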

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/ns_common.h |  4 ++
 kernel/nscommon.c         |  1 +
 kernel/nstree.c           | 94 ++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 77 insertions(+), 22 deletions(-)

diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 34e072986955..ad65005d3371 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -115,6 +115,10 @@ struct ns_common {
 	union {
 		struct {
 			u64 ns_id;
+			struct /* global namespace rbtree and list */ {
+				struct rb_node ns_unified_tree_node;
+				struct list_head ns_unified_list_node;
+			};
 			struct /* per type rbtree and list */ {
 				struct rb_node ns_tree_node;
 				struct list_head ns_list_node;
diff --git a/kernel/nscommon.c b/kernel/nscommon.c
index a324a12868fc..c97a7bbb7d76 100644
--- a/kernel/nscommon.c
+++ b/kernel/nscommon.c
@@ -61,6 +61,7 @@ int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_ope
 	ns->ns_id = 0;
 	ns->ns_type = ns_type;
 	RB_CLEAR_NODE(&ns->ns_tree_node);
+	RB_CLEAR_NODE(&ns->ns_unified_tree_node);
 	INIT_LIST_HEAD(&ns->ns_list_node);
 
 #ifdef CONFIG_DEBUG_VFS
diff --git a/kernel/nstree.c b/kernel/nstree.c
index 369fd1675c6a..d21df06b6747 100644
--- a/kernel/nstree.c
+++ b/kernel/nstree.c
@@ -4,31 +4,30 @@
 #include <linux/proc_ns.h>
 #include <linux/vfsdebug.h>
 
+__cacheline_aligned_in_smp DEFINE_SEQLOCK(ns_tree_lock);
+static struct rb_root ns_unified_tree = RB_ROOT; /* protected by ns_tree_lock */
+
 /**
  * struct ns_tree - Namespace tree
  * @ns_tree: Rbtree of namespaces of a particular type
  * @ns_list: Sequentially walkable list of all namespaces of this type
- * @ns_tree_lock: Seqlock to protect the tree and list
  * @type: type of namespaces in this tree
  */
 struct ns_tree {
-       struct rb_root ns_tree;
-       struct list_head ns_list;
-       seqlock_t ns_tree_lock;
-       int type;
+	struct rb_root ns_tree;
+	struct list_head ns_list;
+	int type;
 };
 
 struct ns_tree mnt_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(mnt_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(mnt_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWNS,
 };
 
 struct ns_tree net_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(net_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(net_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWNET,
 };
 EXPORT_SYMBOL_GPL(net_ns_tree);
@@ -36,42 +35,36 @@ EXPORT_SYMBOL_GPL(net_ns_tree);
 struct ns_tree uts_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(uts_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(uts_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWUTS,
 };
 
 struct ns_tree user_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(user_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(user_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWUSER,
 };
 
 struct ns_tree ipc_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(ipc_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(ipc_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWIPC,
 };
 
 struct ns_tree pid_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(pid_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(pid_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWPID,
 };
 
 struct ns_tree cgroup_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(cgroup_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(cgroup_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWCGROUP,
 };
 
 struct ns_tree time_ns_tree = {
 	.ns_tree = RB_ROOT,
 	.ns_list = LIST_HEAD_INIT(time_ns_tree.ns_list),
-	.ns_tree_lock = __SEQLOCK_UNLOCKED(time_ns_tree.ns_tree_lock),
 	.type = CLONE_NEWTIME,
 };
 
@@ -84,6 +77,13 @@ static inline struct ns_common *node_to_ns(const struct rb_node *node)
 	return rb_entry(node, struct ns_common, ns_tree_node);
 }
 
+static inline struct ns_common *node_to_ns_unified(const struct rb_node *node)
+{
+	if (!node)
+		return NULL;
+	return rb_entry(node, struct ns_common, ns_unified_tree_node);
+}
+
 static inline int ns_cmp(struct rb_node *a, const struct rb_node *b)
 {
 	struct ns_common *ns_a = node_to_ns(a);
@@ -98,13 +98,27 @@ static inline int ns_cmp(struct rb_node *a, const struct rb_node *b)
 	return 0;
 }
 
+static inline int ns_cmp_unified(struct rb_node *a, const struct rb_node *b)
+{
+	struct ns_common *ns_a = node_to_ns_unified(a);
+	struct ns_common *ns_b = node_to_ns_unified(b);
+	u64 ns_id_a = ns_a->ns_id;
+	u64 ns_id_b = ns_b->ns_id;
+
+	if (ns_id_a < ns_id_b)
+		return -1;
+	if (ns_id_a > ns_id_b)
+		return 1;
+	return 0;
+}
+
 void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 {
 	struct rb_node *node, *prev;
 
 	VFS_WARN_ON_ONCE(!ns->ns_id);
 
-	write_seqlock(&ns_tree->ns_tree_lock);
+	write_seqlock(&ns_tree_lock);
 
 	VFS_WARN_ON_ONCE(ns->ns_type != ns_tree->type);
 
@@ -119,7 +133,8 @@ void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 	else
 		list_add_rcu(&ns->ns_list_node, &node_to_ns(prev)->ns_list_node);
 
-	write_sequnlock(&ns_tree->ns_tree_lock);
+	rb_find_add_rcu(&ns->ns_unified_tree_node, &ns_unified_tree, ns_cmp_unified);
+	write_sequnlock(&ns_tree_lock);
 
 	VFS_WARN_ON_ONCE(node);
 }
@@ -130,11 +145,12 @@ void __ns_tree_remove(struct ns_common *ns, struct ns_tree *ns_tree)
 	VFS_WARN_ON_ONCE(list_empty(&ns->ns_list_node));
 	VFS_WARN_ON_ONCE(ns->ns_type != ns_tree->type);
 
-	write_seqlock(&ns_tree->ns_tree_lock);
+	write_seqlock(&ns_tree_lock);
 	rb_erase(&ns->ns_tree_node, &ns_tree->ns_tree);
+	rb_erase(&ns->ns_unified_tree_node, &ns_unified_tree);
 	list_bidir_del_rcu(&ns->ns_list_node);
 	RB_CLEAR_NODE(&ns->ns_tree_node);
-	write_sequnlock(&ns_tree->ns_tree_lock);
+	write_sequnlock(&ns_tree_lock);
 }
 EXPORT_SYMBOL_GPL(__ns_tree_remove);
 
@@ -150,6 +166,17 @@ static int ns_find(const void *key, const struct rb_node *node)
 	return 0;
 }
 
+static int ns_find_unified(const void *key, const struct rb_node *node)
+{
+	const u64 ns_id = *(u64 *)key;
+	const struct ns_common *ns = node_to_ns_unified(node);
+
+	if (ns_id < ns->ns_id)
+		return -1;
+	if (ns_id > ns->ns_id)
+		return 1;
+	return 0;
+}
 
 static struct ns_tree *ns_tree_from_type(int ns_type)
 {
@@ -175,28 +202,51 @@ static struct ns_tree *ns_tree_from_type(int ns_type)
 	return NULL;
 }
 
-struct ns_common *ns_tree_lookup_rcu(u64 ns_id, int ns_type)
+static struct ns_common *__ns_unified_tree_lookup_rcu(u64 ns_id)
 {
-	struct ns_tree *ns_tree;
 	struct rb_node *node;
 	unsigned int seq;
 
-	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "suspicious ns_tree_lookup_rcu() usage");
+	do {
+		seq = read_seqbegin(&ns_tree_lock);
+		node = rb_find_rcu(&ns_id, &ns_unified_tree, ns_find_unified);
+		if (node)
+			break;
+	} while (read_seqretry(&ns_tree_lock, seq));
+
+	return node_to_ns_unified(node);
+}
+
+static struct ns_common *__ns_tree_lookup_rcu(u64 ns_id, int ns_type)
+{
+	struct ns_tree *ns_tree;
+	struct rb_node *node;
+	unsigned int seq;
 
 	ns_tree = ns_tree_from_type(ns_type);
 	if (!ns_tree)
 		return NULL;
 
 	do {
-		seq = read_seqbegin(&ns_tree->ns_tree_lock);
+		seq = read_seqbegin(&ns_tree_lock);
 		node = rb_find_rcu(&ns_id, &ns_tree->ns_tree, ns_find);
 		if (node)
 			break;
-	} while (read_seqretry(&ns_tree->ns_tree_lock, seq));
+	} while (read_seqretry(&ns_tree_lock, seq));
 
 	return node_to_ns(node);
 }
 
+struct ns_common *ns_tree_lookup_rcu(u64 ns_id, int ns_type)
+{
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "suspicious ns_tree_lookup_rcu() usage");
+
+	if (ns_type)
+		return __ns_tree_lookup_rcu(ns_id, ns_type);
+
+	return __ns_unified_tree_lookup_rcu(ns_id);
+}
+
 /**
  * ns_tree_adjoined_rcu - find the next/previous namespace in the same
  * tree

-- 
2.47.3



* [PATCH RFC DRAFT 12/50] nstree: allow lookup solely based on inode
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (10 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 11/50] nstree: introduce a unified tree Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 13/50] nstree: assign fixed ids to the initial namespaces Christian Brauner
                   ` (41 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

The namespace file handle struct nsfs_file_handle is uapi and userspace
is expressly allowed to generate file handles without going through
name_to_handle_at().

Allow userspace to generate a file handle where both the inode number
and the namespace type are zero and just pass in the unique namespace
id. The kernel uses the unified namespace tree to find the namespace and
open the file handle.

When the kernel creates a file handle via name_to_handle_at() it will
always fill in the type and the inode number allowing userspace to
retrieve core information.
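
A rough sketch of what an id-only handle payload could look like from
userspace. struct demo_nsfs_handle is a local mirror of the uapi struct
and its layout is an assumption here; demo_inum_matches() restates the
relaxed inode check from the hunk below. All demo_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of the uapi struct nsfs_file_handle (assumed layout:
 * 64-bit id followed by 32-bit type and 32-bit inode number).
 */
struct demo_nsfs_handle {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/* Build a payload that names a namespace by its unique id alone. */
static struct demo_nsfs_handle demo_handle_from_id(uint64_t ns_id)
{
	struct demo_nsfs_handle h;

	assert(ns_id != 0);
	memset(&h, 0, sizeof(h));	/* type and inode number stay zero */
	h.ns_id = ns_id;
	return h;
}

/* A zero inode number in the handle means "do not check the inode". */
static bool demo_inum_matches(uint32_t fid_inum, uint32_t ns_inum)
{
	return fid_inum == 0 || fid_inum == ns_inum;
}
```

The payload would then be copied into the f_handle bytes of a struct
file_handle and passed to open_by_handle_at(); the handle type and the
mount fd to use are not shown here.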

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nsfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index a190e1e38442..ba5863ee4150 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -496,8 +496,8 @@ static struct dentry *nsfs_fh_to_dentry(struct super_block *sb, struct fid *fh,
 			return NULL;
 
 		VFS_WARN_ON_ONCE(ns->ns_id != fid->ns_id);
-		VFS_WARN_ON_ONCE(ns->ns_type != fid->ns_type);
-		VFS_WARN_ON_ONCE(ns->inum != fid->ns_inum);
+		if (fid->ns_inum && (fid->ns_inum != ns->inum))
+			return NULL;
 
 		if (!ns_get(ns))
 			return NULL;

-- 
2.47.3



* [PATCH RFC DRAFT 13/50] nstree: assign fixed ids to the initial namespaces
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (11 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 12/50] nstree: allow lookup solely based on inode Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 14/50] ns: maintain list of owned namespaces Christian Brauner
                   ` (40 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

The initial set of namespaces comes with fixed inode numbers, making it
easy for userspace to identify them solely based on that information.
These fixed inode numbers long predate this series.

Similarly, let's assign fixed namespace ids for the initial namespaces.

Kill the cookie and use a sequentially increasing number. This has the
nice side-effect that the owning user namespace will always have a
namespace id that is smaller than any of its descendant namespaces.
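
The id scheme can be sketched in userspace with C11 atomics. This is an
illustration under assumptions, not the kernel code: DEMO_LAST_INIT_ID
mirrors NS_LAST_INIT_ID from the patch, and the demo_* names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_LAST_INIT_ID 8ULL	/* mirrors NS_LAST_INIT_ID in the patch */

/* Starts one past the reserved range and is incremented before use. */
static _Atomic uint64_t demo_next_id = DEMO_LAST_INIT_ID + 1;

/*
 * fixed_id != 0: an initial namespace, which keeps its reserved id.
 * fixed_id == 0: any other namespace, which gets the next counter value.
 */
static uint64_t demo_gen_id(uint64_t fixed_id)
{
	assert(fixed_id <= DEMO_LAST_INIT_ID);

	if (fixed_id)
		return fixed_id;
	/* fetch_add returns the old value; add 1 to model inc_return. */
	return atomic_fetch_add(&demo_next_id, 1) + 1;
}
```

Since an owning user namespace has to exist before anything it owns can
be created, a monotonically increasing counter gives it the smaller id.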

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c            |  2 +-
 include/linux/nstree.h    | 26 ++++++++++++++++++++++----
 include/uapi/linux/nsfs.h | 14 ++++++++++++++
 kernel/nstree.c           | 13 ++++++++-----
 net/core/net_namespace.c  |  2 +-
 5 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 87116def5ee3..5d8a80e1e944 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4094,7 +4094,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 		return ERR_PTR(ret);
 	}
 	if (!anon)
-		ns_tree_gen_id(&new_ns->ns);
+		ns_tree_gen_id(new_ns);
 	refcount_set(&new_ns->passive, 1);
 	new_ns->mounts = RB_ROOT;
 	init_waitqueue_head(&new_ns->poll);
diff --git a/include/linux/nstree.h b/include/linux/nstree.h
index 8b8636690473..96ee71622517 100644
--- a/include/linux/nstree.h
+++ b/include/linux/nstree.h
@@ -8,6 +8,7 @@
 #include <linux/seqlock.h>
 #include <linux/rculist.h>
 #include <linux/cookie.h>
+#include <uapi/linux/nsfs.h>
 
 extern struct ns_tree cgroup_ns_tree;
 extern struct ns_tree ipc_ns_tree;
@@ -29,7 +30,22 @@ extern struct ns_tree uts_ns_tree;
 		struct user_namespace *:   &(user_ns_tree),	\
 		struct uts_namespace *:    &(uts_ns_tree))
 
-u64 ns_tree_gen_id(struct ns_common *ns);
+#define ns_init_id(__ns)				      \
+	_Generic((__ns),                                      \
+		struct cgroup_namespace *: CGROUP_NS_INIT_ID, \
+		struct ipc_namespace *:    IPC_NS_INIT_ID,    \
+		struct mnt_namespace *:    MNT_NS_INIT_ID,    \
+		struct net *:              NET_NS_INIT_ID,    \
+		struct pid_namespace *:    PID_NS_INIT_ID,    \
+		struct time_namespace *:   TIME_NS_INIT_ID,   \
+		struct user_namespace *:   USER_NS_INIT_ID,   \
+		struct uts_namespace *:    UTS_NS_INIT_ID)
+
+#define ns_tree_gen_id(__ns)                 \
+	__ns_tree_gen_id(to_ns_common(__ns), \
+			 (((__ns) == ns_init_ns(__ns)) ? ns_init_id(__ns) : 0))
+
+u64 __ns_tree_gen_id(struct ns_common *ns, u64 id);
 void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree);
 void __ns_tree_remove(struct ns_common *ns, struct ns_tree *ns_tree);
 struct ns_common *ns_tree_lookup_rcu(u64 ns_id, int ns_type);
@@ -37,9 +53,9 @@ struct ns_common *__ns_tree_adjoined_rcu(struct ns_common *ns,
 					 struct ns_tree *ns_tree,
 					 bool previous);
 
-static inline void __ns_tree_add(struct ns_common *ns, struct ns_tree *ns_tree)
+static inline void __ns_tree_add(struct ns_common *ns, struct ns_tree *ns_tree, u64 id)
 {
-	ns_tree_gen_id(ns);
+	__ns_tree_gen_id(ns, id);
 	__ns_tree_add_raw(ns, ns_tree);
 }
 
@@ -59,7 +75,9 @@ static inline void __ns_tree_add(struct ns_common *ns, struct ns_tree *ns_tree)
  * This function assigns a new id to the namespace and adds it to the
  * appropriate namespace tree and list.
  */
-#define ns_tree_add(__ns) __ns_tree_add(to_ns_common(__ns), to_ns_tree(__ns))
+#define ns_tree_add(__ns)                                   \
+	__ns_tree_add(to_ns_common(__ns), to_ns_tree(__ns), \
+		      (((__ns) == ns_init_ns(__ns)) ? ns_init_id(__ns) : 0))
 
 /**
  * ns_tree_remove - Remove a namespace from a namespace tree
diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
index e098759ec917..f8bc2aad74d6 100644
--- a/include/uapi/linux/nsfs.h
+++ b/include/uapi/linux/nsfs.h
@@ -67,4 +67,18 @@ struct nsfs_file_handle {
 #define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
 #define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
 
+enum init_ns_id {
+	IPC_NS_INIT_ID		= 1ULL,
+	UTS_NS_INIT_ID		= 2ULL,
+	USER_NS_INIT_ID		= 3ULL,
+	PID_NS_INIT_ID		= 4ULL,
+	CGROUP_NS_INIT_ID	= 5ULL,
+	TIME_NS_INIT_ID		= 6ULL,
+	NET_NS_INIT_ID		= 7ULL,
+	MNT_NS_INIT_ID		= 8ULL,
+#ifdef __KERNEL__
+	NS_LAST_INIT_ID		= MNT_NS_INIT_ID,
+#endif
+};
+
 #endif /* __LINUX_NSFS_H */
diff --git a/kernel/nstree.c b/kernel/nstree.c
index d21df06b6747..de5ceda44637 100644
--- a/kernel/nstree.c
+++ b/kernel/nstree.c
@@ -68,8 +68,6 @@ struct ns_tree time_ns_tree = {
 	.type = CLONE_NEWTIME,
 };
 
-DEFINE_COOKIE(namespace_cookie);
-
 static inline struct ns_common *node_to_ns(const struct rb_node *node)
 {
 	if (!node)
@@ -278,15 +276,20 @@ struct ns_common *__ns_tree_adjoined_rcu(struct ns_common *ns,
 /**
  * ns_tree_gen_id - generate a new namespace id
  * @ns: namespace to generate id for
+ * @id: if non-zero, this is the initial namespace and this is a fixed id
  *
  * Generates a new namespace id and assigns it to the namespace. All
  * namespaces types share the same id space and thus can be compared
  * directly. IOW, when two ids of two namespace are equal, they are
  * identical.
  */
-u64 ns_tree_gen_id(struct ns_common *ns)
+u64 __ns_tree_gen_id(struct ns_common *ns, u64 id)
 {
-	guard(preempt)();
-	ns->ns_id = gen_cookie_next(&namespace_cookie);
+	static atomic64_t namespace_cookie = ATOMIC64_INIT(NS_LAST_INIT_ID + 1);
+
+	if (id)
+		ns->ns_id = id;
+	else
+		ns->ns_id = atomic64_inc_return(&namespace_cookie);
 	return ns->ns_id;
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f30fb78f020c..a76b9b9709d6 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -439,7 +439,7 @@ static __net_init int setup_net(struct net *net)
 	LIST_HEAD(net_exit_list);
 	int error = 0;
 
-	net->net_cookie = ns_tree_gen_id(&net->ns);
+	net->net_cookie = ns_tree_gen_id(net);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);

-- 
2.47.3



* [PATCH RFC DRAFT 14/50] ns: maintain list of owned namespaces
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (12 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 13/50] nstree: assign fixed ids to the initial namespaces Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 15/50] nstree: add listns() Christian Brauner
                   ` (39 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

The namespace tree doesn't express the ownership concept of namespaces
appropriately. Maintain a list of directly owned namespaces per user
namespace. This will allow userspace and the kernel to use the listns()
system call to walk the namespace tree by owning user namespace.
Read-only list operations are completely lockless.
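
A simplified userspace model of the ownership list, assuming a plain
circular doubly linked list with no RCU and no locking. It only shows
how each namespace carries both a list head (what it owns) and an entry
(its link in the owner's list). All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_list {
	struct demo_list *next, *prev;
};

struct demo_ns {
	struct demo_list owned;		/* head: namespaces this one owns */
	struct demo_list owner_entry;	/* link in the owner's owned list */
};

static void demo_list_init(struct demo_list *l)
{
	l->next = l->prev = l;
}

static void demo_ns_init(struct demo_ns *ns)
{
	demo_list_init(&ns->owned);
	demo_list_init(&ns->owner_entry);
}

/* Append ns to the tail of owner's list of owned namespaces. */
static void demo_set_owner(struct demo_ns *ns, struct demo_ns *owner)
{
	struct demo_list *e = &ns->owner_entry, *h = &owner->owned;

	assert(e->next == e);	/* must not be on a list yet */
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

/* Unlink ns from its owner's list. */
static void demo_clear_owner(struct demo_ns *ns)
{
	struct demo_list *e = &ns->owner_entry;

	e->prev->next = e->next;
	e->next->prev = e->prev;
	demo_list_init(e);
}

static size_t demo_count_owned(const struct demo_ns *owner)
{
	const struct demo_list *p;
	size_t n = 0;

	for (p = owner->owned.next; p != &owner->owned; p = p->next)
		n++;
	return n;
}
```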

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c            |  2 ++
 include/linux/ns_common.h |  4 ++++
 init/version-timestamp.c  |  2 ++
 ipc/msgutil.c             |  2 ++
 kernel/cgroup/cgroup.c    |  2 ++
 kernel/nscommon.c         |  2 ++
 kernel/nstree.c           | 19 +++++++++++++++++++
 kernel/pid.c              |  2 ++
 kernel/time/namespace.c   |  2 ++
 kernel/user.c             |  2 ++
 10 files changed, 39 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5d8a80e1e944..d460ca79f0e7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5996,6 +5996,8 @@ struct mnt_namespace init_mnt_ns = {
 	.mounts		= RB_ROOT,
 	.poll		= __WAIT_QUEUE_HEAD_INITIALIZER(init_mnt_ns.poll),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_mnt_ns.ns.ns_list_node),
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_mnt_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_mnt_ns.ns.ns_owner),
 };
 
 static void __init init_mount_tree(void)
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index ad65005d3371..1afeaa93d319 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -123,6 +123,10 @@ struct ns_common {
 				struct rb_node ns_tree_node;
 				struct list_head ns_list_node;
 			};
+			struct /* namespace ownership list */ {
+				struct list_head ns_owner; /* list of namespaces owned by this namespace */
+				struct list_head ns_owner_entry; /* node in the owner namespace's ns_owned list */
+			};
 			atomic_t __ns_ref_active; /* do not use directly */
 		};
 		struct rcu_head ns_rcu;
diff --git a/init/version-timestamp.c b/init/version-timestamp.c
index c38498f94646..e5c278dabecf 100644
--- a/init/version-timestamp.c
+++ b/init/version-timestamp.c
@@ -22,6 +22,8 @@ struct uts_namespace init_uts_ns = {
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_uts_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_uts_ns.ns.ns_list_node),
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_uts_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_uts_ns.ns.ns_owner),
 #ifdef CONFIG_UTS_NS
 	.ns.ops = &utsns_operations,
 #endif
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index d7c66b430470..ce1de73725c0 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -32,6 +32,8 @@ struct ipc_namespace init_ipc_ns = {
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_ipc_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_ipc_ns.ns.ns_list_node),
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_ipc_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_ipc_ns.ns.ns_owner),
 #ifdef CONFIG_IPC_NS
 	.ns.ops = &ipcns_operations,
 #endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a18ec090ad7e..41a1fc562ed0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -258,6 +258,8 @@ struct cgroup_namespace init_cgroup_ns = {
 	.root_cset	= &init_css_set,
 	.ns.ns_type	= ns_common_type(&init_cgroup_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_list_node),
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_owner),
 };
 
 static struct file_system_type cgroup2_fs_type;
diff --git a/kernel/nscommon.c b/kernel/nscommon.c
index c97a7bbb7d76..0ef2939daf33 100644
--- a/kernel/nscommon.c
+++ b/kernel/nscommon.c
@@ -63,6 +63,8 @@ int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_ope
 	RB_CLEAR_NODE(&ns->ns_tree_node);
 	RB_CLEAR_NODE(&ns->ns_unified_tree_node);
 	INIT_LIST_HEAD(&ns->ns_list_node);
+	INIT_LIST_HEAD(&ns->ns_owner);
+	INIT_LIST_HEAD(&ns->ns_owner_entry);
 
 #ifdef CONFIG_DEBUG_VFS
 	ns_debug(ns, ops);
diff --git a/kernel/nstree.c b/kernel/nstree.c
index de5ceda44637..829682bb04a1 100644
--- a/kernel/nstree.c
+++ b/kernel/nstree.c
@@ -3,6 +3,7 @@
 #include <linux/nstree.h>
 #include <linux/proc_ns.h>
 #include <linux/vfsdebug.h>
+#include <linux/user_namespace.h>
 
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(ns_tree_lock);
 static struct rb_root ns_unified_tree = RB_ROOT; /* protected by ns_tree_lock */
@@ -113,8 +114,10 @@ static inline int ns_cmp_unified(struct rb_node *a, const struct rb_node *b)
 void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 {
 	struct rb_node *node, *prev;
+	const struct proc_ns_operations *ops = ns->ops;
 
 	VFS_WARN_ON_ONCE(!ns->ns_id);
+	VFS_WARN_ON_ONCE(ns->ns_type != ns_tree->type);
 
 	write_seqlock(&ns_tree_lock);
 
@@ -132,6 +135,21 @@ void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 		list_add_rcu(&ns->ns_list_node, &node_to_ns(prev)->ns_list_node);
 
 	rb_find_add_rcu(&ns->ns_unified_tree_node, &ns_unified_tree, ns_cmp_unified);
+
+	if (ops) {
+		struct user_namespace *user_ns;
+
+		VFS_WARN_ON_ONCE(!ops->owner);
+		user_ns = ops->owner(ns);
+		if (user_ns) {
+			struct ns_common *owner = &user_ns->ns;
+			VFS_WARN_ON_ONCE(owner->ns_type != CLONE_NEWUSER);
+			list_add_tail_rcu(&ns->ns_owner_entry, &owner->ns_owner);
+		} else {
+			/* Only the initial user namespace doesn't have an owner. */
+			VFS_WARN_ON_ONCE(ns != to_ns_common(&init_user_ns));
+		}
+	}
 	write_sequnlock(&ns_tree_lock);
 
 	VFS_WARN_ON_ONCE(node);
@@ -148,6 +166,7 @@ void __ns_tree_remove(struct ns_common *ns, struct ns_tree *ns_tree)
 	rb_erase(&ns->ns_unified_tree_node, &ns_unified_tree);
 	list_bidir_del_rcu(&ns->ns_list_node);
 	RB_CLEAR_NODE(&ns->ns_tree_node);
+	list_bidir_del_rcu(&ns->ns_owner_entry);
 	write_sequnlock(&ns_tree_lock);
 }
 EXPORT_SYMBOL_GPL(__ns_tree_remove);
diff --git a/kernel/pid.c b/kernel/pid.c
index 4f7b5054e23d..f82dab348540 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -80,6 +80,8 @@ struct pid_namespace init_pid_ns = {
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_pid_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_pid_ns.ns.ns_list_node),
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_pid_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_pid_ns.ns.ns_owner),
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 2e7c110bd13f..15cb74267c75 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -485,6 +485,8 @@ struct time_namespace init_time_ns = {
 	.user_ns	= &init_user_ns,
 	.ns.inum	= ns_init_inum(&init_time_ns),
 	.ns.ops		= &timens_operations,
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_time_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_time_ns.ns.ns_owner),
 	.frozen_offsets	= true,
 	.ns.ns_list_node = LIST_HEAD_INIT(init_time_ns.ns.ns_list_node),
 };
diff --git a/kernel/user.c b/kernel/user.c
index bf60532856db..e392768ccd44 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -72,6 +72,8 @@ struct user_namespace init_user_ns = {
 	.group = GLOBAL_ROOT_GID,
 	.ns.inum = ns_init_inum(&init_user_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_user_ns.ns.ns_list_node),
+	.ns.ns_owner_entry = LIST_HEAD_INIT(init_user_ns.ns.ns_owner_entry),
+	.ns.ns_owner = LIST_HEAD_INIT(init_user_ns.ns.ns_owner),
 #ifdef CONFIG_USER_NS
 	.ns.ops = &userns_operations,
 #endif

-- 
2.47.3



* [PATCH RFC DRAFT 15/50] nstree: add listns()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (13 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 14/50] ns: maintain list of owned namespaces Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 16/50] arch: hookup listns() system call Christian Brauner
                   ` (38 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Add a new listns() system call that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing existing namespace APIs.

Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:

1. Inefficient - requires iterating over all processes
2. Incomplete - misses namespaces that aren't attached to any running
   process but are kept alive by file descriptors, bind mounts, or
   parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. Unordered - provides no ordering or ownership information
5. Unfiltered - cannot filter by namespace type, so callers must always
   iterate over and check all namespaces

The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.

/*
 * @req: Pointer to struct ns_id_req specifying search parameters
 * @ns_ids: User buffer to receive namespace IDs
 * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
 * @flags: Reserved for future use (must be 0)
 */
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
               size_t nr_ns_ids, unsigned int flags);

Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code

/*
 * @size: Structure size
 * @ns_id: Starting point for iteration; use 0 for first call, then
 *         use the last returned ID for subsequent calls to paginate
 * @ns_type: Bitmask of namespace types to include (from enum ns_type):
 *           0: Return all namespace types
 *           MNT_NS: Mount namespaces
 *           NET_NS: Network namespaces
 *           USER_NS: User namespaces
 *           etc. Can be OR'd together
 * @user_ns_id: Filter results to namespaces owned by this user namespace:
 *              0: Return all namespaces (subject to permission checks)
 *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
 *              Other value: Namespaces owned by the specified user namespace ID
 */
struct ns_id_req {
        __u32 size;         /* sizeof(struct ns_id_req) */
        __u32 spare;        /* Reserved, must be 0 */
        __u64 ns_id;        /* Last seen namespace ID (for pagination) */
        __u32 ns_type;      /* Filter by namespace type(s) */
        __u32 spare2;       /* Reserved, must be 0 */
        __u64 user_ns_id;   /* Filter by owning user namespace */
};
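
The examples below call listns() directly. Until a libc wrapper exists,
something along these lines could stand in for it. This is a sketch
under assumptions: struct demo_ns_id_req is a local mirror of the
layout above, demo_listns() is a hypothetical name, and the wrapper
reports ENOSYS when the installed headers do not define __NR_listns.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of struct ns_id_req as described above. */
struct demo_ns_id_req {
	uint32_t size;		/* sizeof(struct demo_ns_id_req) */
	uint32_t spare;		/* reserved, must be 0 */
	uint64_t ns_id;		/* last seen namespace id */
	uint32_t ns_type;	/* namespace type filter */
	uint32_t spare2;	/* reserved, must be 0 */
	uint64_t user_ns_id;	/* owning user namespace filter */
};

/* Thin wrapper: returns the number of ids written, or -1 with errno set. */
static ssize_t demo_listns(const struct demo_ns_id_req *req, uint64_t *ids,
			   size_t nr_ids, unsigned int flags)
{
	assert(req != NULL);
#ifdef __NR_listns
	return syscall(__NR_listns, req, ids, nr_ids, flags);
#else
	(void)ids;
	(void)nr_ids;
	(void)flags;
	errno = ENOSYS;	/* headers predate the system call */
	return -1;
#endif
}
```

With such a wrapper the return convention matches what the examples
expect: a negative return value and an errno that perror() can print.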

Example 1: List all namespaces

void list_all_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,          /* Start from beginning */
        .ns_type = 0,        /* All types */
        .user_ns_id = 0,     /* All user namespaces */
    };
    uint64_t ids[100];
    ssize_t ret;

    printf("All namespaces in the system:\n");
    do {
        ret = listns(&req, ids, 100, 0);
        if (ret < 0) {
            perror("listns");
            break;
        }

        for (ssize_t i = 0; i < ret; i++)
            printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);

        /* Continue from last seen ID */
        if (ret > 0)
            req.ns_id = ids[ret - 1];
    } while (ret == 100);  /* Buffer was full, more may exist */
}

Example 2: List network namespaces only

void list_network_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = NET_NS,   /* Only network namespaces */
        .user_ns_id = 0,
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0) {
        perror("listns");
        return;
    }

    printf("Network namespaces: %zd found\n", ret);
    for (ssize_t i = 0; i < ret; i++)
        printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 3: List namespaces owned by current user namespace

void list_owned_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = 0,                      /* All types */
        .user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0) {
        perror("listns");
        return;
    }

    printf("Namespaces owned by my user namespace: %zd\n", ret);
    for (ssize_t i = 0; i < ret; i++)
        printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 4: List multiple namespace types

void list_network_and_mount_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = NET_NS | MNT_NS,  /* Network and mount */
        .user_ns_id = 0,
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0) {
        perror("listns");
        return;
    }

    printf("Network and mount namespaces: %zd found\n", ret);
}

Example 5: Pagination through large namespace sets

void list_all_with_pagination(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = 0,
        .user_ns_id = 0,
    };
    uint64_t ids[50];
    size_t total = 0;
    ssize_t ret;

    printf("Enumerating all namespaces with pagination:\n");

    while (1) {
        ret = listns(&req, ids, 50, 0);
        if (ret < 0) {
            perror("listns");
            break;
        }
        if (ret == 0)
            break;  /* No more namespaces */

        total += ret;
        printf("  Batch: %zd namespaces\n", ret);

        /* Last ID in this batch becomes start of next batch */
        req.ns_id = ids[ret - 1];

        if (ret < 50)
            break;  /* Partial batch = end of results */
    }

    printf("Total: %zu namespaces\n", total);
}

Permission Model

listns() respects namespace isolation and capabilities:

(1) Global listing (user_ns_id = 0):
    - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
    - OR the namespace must be in the caller's namespace context (e.g.,
      a namespace the caller is currently using)
    - User namespaces additionally allow listing if the caller has
      CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
    - Requires CAP_SYS_ADMIN in the specified owner user namespace
    - OR the namespace must be in the caller's namespace context
    - This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
    - Only "active" namespaces are listed
    - A namespace is active if it has a non-zero __ns_ref_active count
    - This includes namespaces used by running processes, held by open
      file descriptors, or kept active by bind mounts
    - Inactive namespaces (kept alive only by internal kernel
      references) are not visible via listns()
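
The rules above can be condensed into a single predicate. The following
is a minimal userspace sketch, not kernel code: struct ns_facts and
listns_visible() are hypothetical names used only for illustration, and
each field stands in for a check the kernel performs per namespace.

```c
#include <stdbool.h>

/* Hypothetical per-namespace facts as seen by one caller. */
struct ns_facts {
    bool active;           /* __ns_ref_active count is non-zero */
    bool admin_in_owner;   /* CAP_SYS_ADMIN in the owning user namespace */
    bool caller_is_member; /* caller is currently using this namespace */
    bool is_user_ns;       /* the namespace is itself a user namespace */
    bool admin_in_self;    /* CAP_SYS_ADMIN in that user namespace */
};

/* Returns true if the namespace would be reported to the caller. */
static inline bool listns_visible(const struct ns_facts *f)
{
    /* (3) Inactive namespaces are never listed. */
    if (!f->active)
        return false;
    /* (1)/(2) Privilege over the owner, or membership, is sufficient. */
    if (f->admin_in_owner || f->caller_is_member)
        return true;
    /* User namespaces are also listed to callers privileged inside them. */
    return f->is_user_ns && f->admin_in_self;
}
```

In the owner-filtered case the "owner" is the user namespace named by
user_ns_id, so the first capability check is the same for every entry.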

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c            |   1 +
 fs/nsfs.c                 |  39 +++++
 include/linux/ns_common.h |   5 +-
 include/linux/syscalls.h  |   4 +
 include/uapi/linux/nsfs.h |  44 +++++
 init/version-timestamp.c  |   1 +
 ipc/msgutil.c             |   1 +
 kernel/cgroup/cgroup.c    |   1 +
 kernel/nscommon.c         |   3 +
 kernel/nstree.c           | 404 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/pid.c              |   1 +
 kernel/time/namespace.c   |   1 +
 kernel/user.c             |   1 +
 13 files changed, 501 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d460ca79f0e7..980296b0ec86 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5996,6 +5996,7 @@ struct mnt_namespace init_mnt_ns = {
 	.mounts		= RB_ROOT,
 	.poll		= __WAIT_QUEUE_HEAD_INITIALIZER(init_mnt_ns.poll),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_mnt_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_mnt_ns.ns.ns_unified_list_node),
 	.ns.ns_owner_entry = LIST_HEAD_INIT(init_mnt_ns.ns.ns_owner_entry),
 	.ns.ns_owner = LIST_HEAD_INIT(init_mnt_ns.ns.ns_owner),
 };
diff --git a/fs/nsfs.c b/fs/nsfs.c
index ba5863ee4150..629139a25d65 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -465,6 +465,45 @@ static int nsfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
 	return FILEID_NSFS;
 }
 
+bool is_current_namespace(struct ns_common *ns)
+{
+	switch (ns->ns_type) {
+#ifdef CONFIG_CGROUPS
+	case CLONE_NEWCGROUP:
+		return current_in_namespace(to_cg_ns(ns));
+#endif
+#ifdef CONFIG_IPC_NS
+	case CLONE_NEWIPC:
+		return current_in_namespace(to_ipc_ns(ns));
+#endif
+	case CLONE_NEWNS:
+		return current_in_namespace(to_mnt_ns(ns));
+#ifdef CONFIG_NET_NS
+	case CLONE_NEWNET:
+		return current_in_namespace(to_net_ns(ns));
+#endif
+#ifdef CONFIG_PID_NS
+	case CLONE_NEWPID:
+		return current_in_namespace(to_pid_ns(ns));
+#endif
+#ifdef CONFIG_TIME_NS
+	case CLONE_NEWTIME:
+		return current_in_namespace(to_time_ns(ns));
+#endif
+#ifdef CONFIG_USER_NS
+	case CLONE_NEWUSER:
+		return current_in_namespace(to_user_ns(ns));
+#endif
+#ifdef CONFIG_UTS_NS
+	case CLONE_NEWUTS:
+		return current_in_namespace(to_uts_ns(ns));
+#endif
+	default:
+		VFS_WARN_ON_ONCE(true);
+		return false;
+	}
+}
+
 static struct dentry *nsfs_fh_to_dentry(struct super_block *sb, struct fid *fh,
 					int fh_len, int fh_type)
 {
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 1afeaa93d319..151d9e594500 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -123,8 +123,10 @@ struct ns_common {
 				struct rb_node ns_tree_node;
 				struct list_head ns_list_node;
 			};
-			struct /* namespace ownership list */ {
+			struct /* namespace ownership rbtree and list */ {
+				struct rb_root ns_owner_tree; /* rbtree of namespaces owned by this namespace */
 				struct list_head ns_owner; /* list of namespaces owned by this namespace */
+				struct rb_node ns_owner_tree_node; /* node in the owner namespace's rbtree */
 				struct list_head ns_owner_entry; /* node in the owner namespace's ns_owned list */
 			};
 			atomic_t __ns_ref_active; /* do not use directly */
@@ -133,6 +135,7 @@ struct ns_common {
 	};
 };
 
+bool is_current_namespace(struct ns_common *ns);
 int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_operations *ops, int inum);
 void __ns_common_free(struct ns_common *ns);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 66c06fcdfe19..cf84d98964b2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -77,6 +77,7 @@ struct cachestat_range;
 struct cachestat;
 struct statmount;
 struct mnt_id_req;
+struct ns_id_req;
 struct xattr_args;
 struct file_attr;
 
@@ -437,6 +438,9 @@ asmlinkage long sys_statmount(const struct mnt_id_req __user *req,
 asmlinkage long sys_listmount(const struct mnt_id_req __user *req,
 			      u64 __user *mnt_ids, size_t nr_mnt_ids,
 			      unsigned int flags);
+asmlinkage long sys_listns(const struct ns_id_req __user *req,
+			   u64 __user *ns_ids, size_t nr_ns_ids,
+			   unsigned int flags);
 asmlinkage long sys_truncate(const char __user *path, long length);
 asmlinkage long sys_ftruncate(unsigned int fd, off_t length);
 #if BITS_PER_LONG == 32
diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
index f8bc2aad74d6..a25e38d1c874 100644
--- a/include/uapi/linux/nsfs.h
+++ b/include/uapi/linux/nsfs.h
@@ -81,4 +81,48 @@ enum init_ns_id {
 #endif
 };
 
+enum ns_type {
+	TIME_NS    = (1ULL << 7),  /* CLONE_NEWTIME */
+	MNT_NS     = (1ULL << 17), /* CLONE_NEWNS */
+	CGROUP_NS  = (1ULL << 25), /* CLONE_NEWCGROUP */
+	UTS_NS     = (1ULL << 26), /* CLONE_NEWUTS */
+	IPC_NS     = (1ULL << 27), /* CLONE_NEWIPC */
+	USER_NS    = (1ULL << 28), /* CLONE_NEWUSER */
+	PID_NS     = (1ULL << 29), /* CLONE_NEWPID */
+	NET_NS     = (1ULL << 30), /* CLONE_NEWNET */
+};
+
+/**
+ * struct ns_id_req - namespace ID request structure
+ * @size: size of this structure
+ * @spare: reserved for future use
+ * @ns_id: last namespace id
+ * @ns_type: mask of namespace types to list (zero means all)
+ * @user_ns_id: owning user namespace ID
+ *
+ * Structure for passing namespace ID and miscellaneous parameters to
+ * statns(2) and listns(2).
+ *
+ * For statns(2) @param represents the request mask.
+ * For listns(2) @ns_id represents the last listed namespace id (or zero).
+ */
+struct ns_id_req {
+	__u32 size;
+	__u32 spare;
+	__u64 ns_id;
+	struct /* listns */ {
+		__u32 ns_type;
+		__u32 spare2;
+		__u64 user_ns_id;
+	};
+};
+
+/*
+ * Special @user_ns_id value that can be passed to listns()
+ */
+#define LISTNS_CURRENT_USER 0xffffffffffffffff /* Caller's userns */
+
+/* List of all ns_id_req versions. */
+#define NS_ID_REQ_SIZE_VER0 32 /* sizeof first published struct */
+
 #endif /* __LINUX_NSFS_H */
diff --git a/init/version-timestamp.c b/init/version-timestamp.c
index e5c278dabecf..cd6f435d5fde 100644
--- a/init/version-timestamp.c
+++ b/init/version-timestamp.c
@@ -22,6 +22,7 @@ struct uts_namespace init_uts_ns = {
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_uts_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_uts_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_uts_ns.ns.ns_unified_list_node),
 	.ns.ns_owner_entry = LIST_HEAD_INIT(init_uts_ns.ns.ns_owner_entry),
 	.ns.ns_owner = LIST_HEAD_INIT(init_uts_ns.ns.ns_owner),
 #ifdef CONFIG_UTS_NS
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index ce1de73725c0..3708f325228d 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -32,6 +32,7 @@ struct ipc_namespace init_ipc_ns = {
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_ipc_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_ipc_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_ipc_ns.ns.ns_unified_list_node),
 	.ns.ns_owner_entry = LIST_HEAD_INIT(init_ipc_ns.ns.ns_owner_entry),
 	.ns.ns_owner = LIST_HEAD_INIT(init_ipc_ns.ns.ns_owner),
 #ifdef CONFIG_IPC_NS
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 41a1fc562ed0..9ab0e7d7b54d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -258,6 +258,7 @@ struct cgroup_namespace init_cgroup_ns = {
 	.root_cset	= &init_css_set,
 	.ns.ns_type	= ns_common_type(&init_cgroup_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_unified_list_node),
 	.ns.ns_owner_entry = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_owner_entry),
 	.ns.ns_owner = LIST_HEAD_INIT(init_cgroup_ns.ns.ns_owner),
 };
diff --git a/kernel/nscommon.c b/kernel/nscommon.c
index 0ef2939daf33..bf9a30fa82d6 100644
--- a/kernel/nscommon.c
+++ b/kernel/nscommon.c
@@ -62,7 +62,10 @@ int __ns_common_init(struct ns_common *ns, u32 ns_type, const struct proc_ns_ope
 	ns->ns_type = ns_type;
 	RB_CLEAR_NODE(&ns->ns_tree_node);
 	RB_CLEAR_NODE(&ns->ns_unified_tree_node);
+	RB_CLEAR_NODE(&ns->ns_owner_tree_node);
 	INIT_LIST_HEAD(&ns->ns_list_node);
+	INIT_LIST_HEAD(&ns->ns_unified_list_node);
+	ns->ns_owner_tree = RB_ROOT;
 	INIT_LIST_HEAD(&ns->ns_owner);
 	INIT_LIST_HEAD(&ns->ns_owner_entry);
 
diff --git a/kernel/nstree.c b/kernel/nstree.c
index 829682bb04a1..62d7571aced4 100644
--- a/kernel/nstree.c
+++ b/kernel/nstree.c
@@ -2,11 +2,14 @@
 
 #include <linux/nstree.h>
 #include <linux/proc_ns.h>
+#include <linux/rculist.h>
+#include <linux/syscalls.h>
 #include <linux/vfsdebug.h>
 #include <linux/user_namespace.h>
 
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(ns_tree_lock);
 static struct rb_root ns_unified_tree = RB_ROOT; /* protected by ns_tree_lock */
+static LIST_HEAD(ns_unified_list); /* protected by ns_tree_lock */
 
 /**
  * struct ns_tree - Namespace tree
@@ -83,6 +86,13 @@ static inline struct ns_common *node_to_ns_unified(const struct rb_node *node)
 	return rb_entry(node, struct ns_common, ns_unified_tree_node);
 }
 
+static inline struct ns_common *node_to_ns_owner(const struct rb_node *node)
+{
+	if (!node)
+		return NULL;
+	return rb_entry(node, struct ns_common, ns_owner_tree_node);
+}
+
 static inline int ns_cmp(struct rb_node *a, const struct rb_node *b)
 {
 	struct ns_common *ns_a = node_to_ns(a);
@@ -111,6 +121,20 @@ static inline int ns_cmp_unified(struct rb_node *a, const struct rb_node *b)
 	return 0;
 }
 
+static inline int ns_cmp_owner(struct rb_node *a, const struct rb_node *b)
+{
+	struct ns_common *ns_a = node_to_ns_owner(a);
+	struct ns_common *ns_b = node_to_ns_owner(b);
+	u64 ns_id_a = ns_a->ns_id;
+	u64 ns_id_b = ns_b->ns_id;
+
+	if (ns_id_a < ns_id_b)
+		return -1;
+	if (ns_id_a > ns_id_b)
+		return 1;
+	return 0;
+}
+
 void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 {
 	struct rb_node *node, *prev;
@@ -134,7 +158,13 @@ void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 	else
 		list_add_rcu(&ns->ns_list_node, &node_to_ns(prev)->ns_list_node);
 
+	/* Add to unified tree and list */
 	rb_find_add_rcu(&ns->ns_unified_tree_node, &ns_unified_tree, ns_cmp_unified);
+	prev = rb_prev(&ns->ns_unified_tree_node);
+	if (!prev)
+		list_add_rcu(&ns->ns_unified_list_node, &ns_unified_list);
+	else
+		list_add_rcu(&ns->ns_unified_list_node, &node_to_ns_unified(prev)->ns_unified_list_node);
 
 	if (ops) {
 		struct user_namespace *user_ns;
@@ -144,7 +174,16 @@ void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 		if (user_ns) {
 			struct ns_common *owner = &user_ns->ns;
 			VFS_WARN_ON_ONCE(owner->ns_type != CLONE_NEWUSER);
-			list_add_tail_rcu(&ns->ns_owner_entry, &owner->ns_owner);
+
+			/* Insert into owner's rbtree */
+			rb_find_add_rcu(&ns->ns_owner_tree_node, &owner->ns_owner_tree, ns_cmp_owner);
+
+			/* Insert into owner's list in sorted order */
+			prev = rb_prev(&ns->ns_owner_tree_node);
+			if (!prev)
+				list_add_rcu(&ns->ns_owner_entry, &owner->ns_owner);
+			else
+				list_add_rcu(&ns->ns_owner_entry, &node_to_ns_owner(prev)->ns_owner_entry);
 		} else {
 			/* Only the initial user namespace doesn't have an owner. */
 			VFS_WARN_ON_ONCE(ns != to_ns_common(&init_user_ns));
@@ -157,16 +196,36 @@ void __ns_tree_add_raw(struct ns_common *ns, struct ns_tree *ns_tree)
 
 void __ns_tree_remove(struct ns_common *ns, struct ns_tree *ns_tree)
 {
+	const struct proc_ns_operations *ops = ns->ops;
+	struct user_namespace *user_ns;
+
 	VFS_WARN_ON_ONCE(RB_EMPTY_NODE(&ns->ns_tree_node));
 	VFS_WARN_ON_ONCE(list_empty(&ns->ns_list_node));
 	VFS_WARN_ON_ONCE(ns->ns_type != ns_tree->type);
 
 	write_seqlock(&ns_tree_lock);
 	rb_erase(&ns->ns_tree_node, &ns_tree->ns_tree);
-	rb_erase(&ns->ns_unified_tree_node, &ns_unified_tree);
-	list_bidir_del_rcu(&ns->ns_list_node);
 	RB_CLEAR_NODE(&ns->ns_tree_node);
-	list_bidir_del_rcu(&ns->ns_owner_entry);
+
+	list_bidir_del_rcu(&ns->ns_list_node);
+
+	rb_erase(&ns->ns_unified_tree_node, &ns_unified_tree);
+	RB_CLEAR_NODE(&ns->ns_unified_tree_node);
+
+	list_bidir_del_rcu(&ns->ns_unified_list_node);
+
+	/* Remove from owner's rbtree if this namespace has an owner */
+	if (ops) {
+		user_ns = ops->owner(ns);
+		if (user_ns) {
+			struct ns_common *owner = &user_ns->ns;
+			rb_erase(&ns->ns_owner_tree_node, &owner->ns_owner_tree);
+			RB_CLEAR_NODE(&ns->ns_owner_tree_node);
+		}
+
+		list_bidir_del_rcu(&ns->ns_owner_entry);
+	}
+
 	write_sequnlock(&ns_tree_lock);
 }
 EXPORT_SYMBOL_GPL(__ns_tree_remove);
@@ -312,3 +371,340 @@ u64 __ns_tree_gen_id(struct ns_common *ns, u64 id)
 		ns->ns_id = atomic64_inc_return(&namespace_cookie);
 	return ns->ns_id;
 }
+
+struct klistns {
+	u64 *kns_ids;
+	u32 nr_ns_ids;
+	u64 last_ns_id;
+	u64 user_ns_id;
+	u32 ns_type;
+	struct user_namespace *user_ns;
+	struct ns_common *first_ns;
+};
+
+static void __free_klistns_free(const struct klistns *kls)
+{
+	if (kls->user_ns_id != LISTNS_CURRENT_USER)
+		put_user_ns(kls->user_ns);
+	if (kls->first_ns)
+		kls->first_ns->ops->put(kls->first_ns);
+	kvfree(kls->kns_ids);
+}
+
+#define NS_ALL (PID_NS | USER_NS | MNT_NS | UTS_NS | IPC_NS | NET_NS | CGROUP_NS | TIME_NS)
+
+static int copy_ns_id_req(const struct ns_id_req __user *req,
+			  struct ns_id_req *kreq)
+{
+	int ret;
+	size_t usize;
+
+	BUILD_BUG_ON(sizeof(struct ns_id_req) != NS_ID_REQ_SIZE_VER0);
+
+	ret = get_user(usize, &req->size);
+	if (ret)
+		return -EFAULT;
+	if (unlikely(usize > PAGE_SIZE))
+		return -E2BIG;
+	if (unlikely(usize < NS_ID_REQ_SIZE_VER0))
+		return -EINVAL;
+	memset(kreq, 0, sizeof(*kreq));
+	ret = copy_struct_from_user(kreq, sizeof(*kreq), req, usize);
+	if (ret)
+		return ret;
+	if (kreq->spare != 0)
+		return -EINVAL;
+	if (kreq->ns_type & ~NS_ALL)
+		return -EOPNOTSUPP;
+	return 0;
+}
+
+static inline int prepare_klistns(struct klistns *kls, struct ns_id_req *kreq,
+				  size_t nr_ns_ids)
+{
+	kls->last_ns_id = kreq->ns_id;
+	kls->user_ns_id = kreq->user_ns_id;
+	kls->nr_ns_ids = nr_ns_ids;
+	kls->ns_type = kreq->ns_type;
+
+	kls->kns_ids = kvmalloc_array(nr_ns_ids, sizeof(*kls->kns_ids),
+				      GFP_KERNEL_ACCOUNT);
+	if (!kls->kns_ids)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/*
+ * Lookup a namespace owned by owner with id >= ns_id.
+ * Returns the namespace with the smallest id that is >= ns_id.
+ */
+static struct ns_common *lookup_ns_owner_at(u64 ns_id, struct ns_common *owner)
+{
+	struct ns_common *ret = NULL;
+	struct rb_node *node;
+
+	VFS_WARN_ON_ONCE(owner->ns_type != CLONE_NEWUSER);
+
+	read_seqlock_excl(&ns_tree_lock);
+	node = owner->ns_owner_tree.rb_node;
+
+	while (node) {
+		struct ns_common *ns = node_to_ns_owner(node);
+
+		if (ns_id <= ns->ns_id) {
+			ret = ns;
+			if (ns_id == ns->ns_id)
+				break;
+			node = node->rb_left;
+		} else {
+			node = node->rb_right;
+		}
+	}
+
+	if (ret && !ns_get(ret))
+		ret = NULL;
+	read_sequnlock_excl(&ns_tree_lock);
+	return ret;
+}
+
+static struct ns_common *lookup_ns_id(u64 ns_id, int ns_type)
+{
+	struct ns_common *ns;
+
+	guard(rcu)();
+	ns = ns_tree_lookup_rcu(ns_id, ns_type);
+	if (!ns)
+		return NULL;
+
+	if (!ns_get(ns))
+		return NULL;
+
+	return ns;
+}
+
+static ssize_t do_listns_userns(struct klistns *kls)
+{
+	u64 *ns_ids = kls->kns_ids;
+	size_t nr_ns_ids = kls->nr_ns_ids;
+	struct ns_common *ns, *first_ns = NULL;
+	const struct list_head *head;
+	bool userns_capable;
+	ssize_t ret;
+
+	VFS_WARN_ON_ONCE(!kls->user_ns_id);
+
+	if (kls->user_ns_id == LISTNS_CURRENT_USER)
+		ns = to_ns_common(current_user_ns());
+	else if (kls->user_ns_id)
+		ns = lookup_ns_id(kls->user_ns_id, CLONE_NEWUSER);
+	if (!ns)
+		return -EINVAL;
+	kls->user_ns = to_user_ns(ns);
+
+	/*
+	 * Use the rbtree to find the first namespace we care about and
+	 * then use its list entry to iterate from there.
+	 */
+	if (kls->last_ns_id) {
+		kls->first_ns = lookup_ns_owner_at(kls->last_ns_id + 1, ns);
+		if (!kls->first_ns)
+			return -ENOENT;
+		first_ns = kls->first_ns;
+	}
+
+	ret = 0;
+	head = &to_ns_common(kls->user_ns)->ns_owner;
+	userns_capable = ns_capable_noaudit(kls->user_ns, CAP_SYS_ADMIN);
+	guard(rcu)();
+	if (!first_ns)
+		first_ns = list_entry_rcu(head->next, typeof(*ns), ns_owner_entry);
+	for (ns = first_ns; &ns->ns_owner_entry != head && nr_ns_ids;
+	     ns = list_entry_rcu(ns->ns_owner_entry.next, typeof(*ns), ns_owner_entry)) {
+		if (kls->ns_type && !(kls->ns_type & ns->ns_type))
+			continue;
+		if (!__ns_ref_active_read(ns))
+			continue;
+		if (!userns_capable && !is_current_namespace(ns) &&
+		    ((ns->ns_type != CLONE_NEWUSER) ||
+		     !ns_capable_noaudit(to_user_ns(ns), CAP_SYS_ADMIN)))
+			continue;
+		*ns_ids = ns->ns_id;
+		ns_ids++;
+		nr_ns_ids--;
+		ret++;
+	}
+
+	return ret;
+}
+
+/*
+ * Lookup a namespace with id >= ns_id in either the unified tree or a type-specific tree.
+ * Returns the namespace with the smallest id that is >= ns_id.
+ */
+static struct ns_common *lookup_ns_id_at(u64 ns_id, int ns_type)
+{
+	struct ns_common *ret = NULL;
+	struct ns_tree *ns_tree = NULL;
+	struct rb_node *node;
+
+	if (ns_type) {
+		ns_tree = ns_tree_from_type(ns_type);
+		if (!ns_tree)
+			return NULL;
+	}
+
+	read_seqlock_excl(&ns_tree_lock);
+	if (ns_tree)
+		node = ns_tree->ns_tree.rb_node;
+	else
+		node = ns_unified_tree.rb_node;
+
+	while (node) {
+		struct ns_common *ns;
+
+		if (ns_type)
+			ns = node_to_ns(node);
+		else
+			ns = node_to_ns_unified(node);
+
+		if (ns_id <= ns->ns_id) {
+			if (ns_type)
+				ret = node_to_ns(node);
+			else
+				ret = node_to_ns_unified(node);
+			if (ns_id == ns->ns_id)
+				break;
+			node = node->rb_left;
+		} else {
+			node = node->rb_right;
+		}
+	}
+
+	if (ret && !ns_get(ret))
+		ret = NULL;
+	read_sequnlock_excl(&ns_tree_lock);
+	return ret;
+}
+
+static inline struct ns_common *first_ns_common(const struct list_head *head,
+						struct ns_tree *ns_tree)
+{
+	if (ns_tree)
+		return list_entry_rcu(head->next, struct ns_common, ns_list_node);
+	return list_entry_rcu(head->next, struct ns_common, ns_unified_list_node);
+}
+
+static inline struct ns_common *next_ns_common(struct ns_common *ns,
+					       struct ns_tree *ns_tree)
+{
+	if (ns_tree)
+		return list_entry_rcu(ns->ns_list_node.next, struct ns_common, ns_list_node);
+	return list_entry_rcu(ns->ns_unified_list_node.next, struct ns_common, ns_unified_list_node);
+}
+
+static inline bool ns_common_is_head(struct ns_common *ns,
+				     const struct list_head *head,
+				     struct ns_tree *ns_tree)
+{
+	if (ns_tree)
+		return &ns->ns_list_node == head;
+	return &ns->ns_unified_list_node == head;
+}
+
+static ssize_t do_listns(struct klistns *kls)
+{
+	u64 *ns_ids = kls->kns_ids;
+	size_t nr_ns_ids = kls->nr_ns_ids;
+	struct ns_common *ns, *first_ns = NULL;
+	struct ns_tree *ns_tree = NULL;
+	const struct list_head *head;
+	struct user_namespace *user_ns;
+	ssize_t ret;
+
+	if (hweight32(kls->ns_type) == 1) {
+		ns_tree = ns_tree_from_type(kls->ns_type);
+		if (!ns_tree)
+			return -EINVAL;
+	}
+
+	if (kls->last_ns_id) {
+		kls->first_ns = lookup_ns_id_at(kls->last_ns_id + 1, ns_tree ? kls->ns_type : 0);
+		if (!kls->first_ns)
+			return -ENOENT;
+		first_ns = kls->first_ns;
+	}
+
+	ret = 0;
+	if (ns_tree)
+		head = &ns_tree->ns_list;
+	else
+		head = &ns_unified_list;
+
+	guard(rcu)();
+	if (!first_ns)
+		first_ns = first_ns_common(head, ns_tree);
+
+	for (ns = first_ns; !ns_common_is_head(ns, head, ns_tree) && nr_ns_ids;
+	     ns = next_ns_common(ns, ns_tree)) {
+		if (kls->ns_type && !(kls->ns_type & ns->ns_type))
+			continue;
+		if (!__ns_ref_active_read(ns))
+			continue;
+		/* Check permissions */
+		if (!ns->ops)
+			user_ns = NULL;
+		else
+			user_ns = ns->ops->owner(ns);
+		if (!user_ns)
+			user_ns = &init_user_ns;
+		if (!ns_capable_noaudit(user_ns, CAP_SYS_ADMIN) &&
+		    !is_current_namespace(ns) &&
+		    ((ns->ns_type != CLONE_NEWUSER) ||
+		     !ns_capable_noaudit(to_user_ns(ns), CAP_SYS_ADMIN)))
+			continue;
+		*ns_ids++ = ns->ns_id;
+		nr_ns_ids--;
+		ret++;
+	}
+
+	return ret;
+}
+
+SYSCALL_DEFINE4(listns, const struct ns_id_req __user *, req,
+		u64 __user *, ns_ids, size_t, nr_ns_ids, unsigned int, flags)
+{
+	struct klistns klns __free(klistns_free) = {};
+	const size_t maxcount = 1000000;
+	struct ns_id_req kreq;
+	ssize_t ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (unlikely(nr_ns_ids > maxcount))
+		return -EOVERFLOW;
+
+	if (!access_ok(ns_ids, nr_ns_ids * sizeof(*ns_ids)))
+		return -EFAULT;
+
+	ret = copy_ns_id_req(req, &kreq);
+	if (ret)
+		return ret;
+
+	ret = prepare_klistns(&klns, &kreq, nr_ns_ids);
+	if (ret)
+		return ret;
+
+	if (kreq.user_ns_id)
+		ret = do_listns_userns(&klns);
+	else
+		ret = do_listns(&klns);
+	if (ret <= 0)
+		return ret;
+
+	if (copy_to_user(ns_ids, klns.kns_ids, ret * sizeof(*ns_ids)))
+		return -EFAULT;
+
+	return ret;
+}
diff --git a/kernel/pid.c b/kernel/pid.c
index f82dab348540..a1bfb2924476 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -80,6 +80,7 @@ struct pid_namespace init_pid_ns = {
 	.user_ns = &init_user_ns,
 	.ns.inum = ns_init_inum(&init_pid_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_pid_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_pid_ns.ns.ns_unified_list_node),
 	.ns.ns_owner_entry = LIST_HEAD_INIT(init_pid_ns.ns.ns_owner_entry),
 	.ns.ns_owner = LIST_HEAD_INIT(init_pid_ns.ns.ns_owner),
 #ifdef CONFIG_PID_NS
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 15cb74267c75..acbeec049263 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -489,6 +489,7 @@ struct time_namespace init_time_ns = {
 	.ns.ns_owner = LIST_HEAD_INIT(init_time_ns.ns.ns_owner),
 	.frozen_offsets	= true,
 	.ns.ns_list_node = LIST_HEAD_INIT(init_time_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_time_ns.ns.ns_unified_list_node),
 };
 
 void __init time_ns_init(void)
diff --git a/kernel/user.c b/kernel/user.c
index e392768ccd44..68fe16617d38 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -72,6 +72,7 @@ struct user_namespace init_user_ns = {
 	.group = GLOBAL_ROOT_GID,
 	.ns.inum = ns_init_inum(&init_user_ns),
 	.ns.ns_list_node = LIST_HEAD_INIT(init_user_ns.ns.ns_list_node),
+	.ns.ns_unified_list_node = LIST_HEAD_INIT(init_user_ns.ns.ns_unified_list_node),
 	.ns.ns_owner_entry = LIST_HEAD_INIT(init_user_ns.ns.ns_owner_entry),
 	.ns.ns_owner = LIST_HEAD_INIT(init_user_ns.ns.ns_owner),
 #ifdef CONFIG_USER_NS

-- 
2.47.3



* [PATCH RFC DRAFT 16/50] arch: hookup listns() system call
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Add the listns() system call to all architectures.
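
Until a libc wrapper exists, userspace would presumably reach the new
system call through syscall(2). The following is a minimal sketch under
stated assumptions: it uses the number 470 added by this patch (580 on
alpha), and a local mirror of struct ns_id_req from the previous patch.
The sketch_ names are hypothetical and not part of the uapi.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_listns
#define __NR_listns 470 /* value from this patch; alpha uses 580 */
#endif

/* Local mirror of struct ns_id_req as laid out in this series. */
struct sketch_ns_id_req {
    uint32_t size;       /* sizeof(struct), for extensibility */
    uint32_t spare;      /* must be zero */
    uint64_t ns_id;      /* last listed namespace id, or zero */
    uint32_t ns_type;    /* mask of namespace types, zero for all */
    uint32_t spare2;
    uint64_t user_ns_id; /* owning user namespace id, or zero */
};

/* Matches NS_ID_REQ_SIZE_VER0 from the series. */
static_assert(sizeof(struct sketch_ns_id_req) == 32,
              "layout must match the first published struct size");

/* Thin wrapper: returns the number of ids written, or -1 with errno set. */
static inline ssize_t sketch_listns(const struct sketch_ns_id_req *req,
                                    uint64_t *ids, size_t nr_ids,
                                    unsigned int flags)
{
    return syscall(__NR_listns, req, ids, nr_ids, flags);
}
```

On a kernel without this series the call is expected to fail with
ENOSYS, so callers probing for support should handle that case.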

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/tools/syscall_32.tbl             | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 4 +++-
 scripts/syscall.tbl                         | 1 +
 18 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 16dca28ebf17..3fed97478058 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
 577	common	open_tree_attr			sys_open_tree_attr
 578	common	file_getattr			sys_file_getattr
 579	common	file_setattr			sys_file_setattr
+580	common	listns				sys_listns
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index b07e699aaa3c..fd09afae72a2 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -484,3 +484,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 8d9088bc577d..8cdfe5d4dac9 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -481,3 +481,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f41d38dfbf13..871a5d67bf41 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 580af574fe73..022fc85d94b3 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index d824ffe9a014..8cedc83c3266 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -408,3 +408,4 @@
 467	n32	open_tree_attr			sys_open_tree_attr
 468	n32	file_getattr			sys_file_getattr
 469	n32	file_setattr			sys_file_setattr
+470	n32	listns				sys_listns
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 7a7049c2c307..9b92bddf06b5 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -384,3 +384,4 @@
 467	n64	open_tree_attr			sys_open_tree_attr
 468	n64	file_getattr			sys_file_getattr
 469	n64	file_setattr			sys_file_setattr
+470	n64	listns				sys_listns
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index d330274f0601..f810b8a55716 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -457,3 +457,4 @@
 467	o32	open_tree_attr			sys_open_tree_attr
 468	o32	file_getattr			sys_file_getattr
 469	o32	file_setattr			sys_file_setattr
+470	o32	listns				sys_listns
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 88a788a7b18d..39bdacaa530b 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -468,3 +468,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index b453e80dfc00..ec4458cdb97b 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -560,3 +560,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 8a6744d658db..5863787ab036 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -472,3 +472,4 @@
 467  common	open_tree_attr		sys_open_tree_attr		sys_open_tree_attr
 468  common	file_getattr		sys_file_getattr		sys_file_getattr
 469  common	file_setattr		sys_file_setattr		sys_file_setattr
+470  common	listns			sys_listns			sys_listns
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 5e9c9eff5539..969c11325ade 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index ebb7d06d1044..39aa26b6a50b 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4877e16da69a..b3b2658cb8d1 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -475,3 +475,4 @@
 467	i386	open_tree_attr		sys_open_tree_attr
 468	i386	file_getattr		sys_file_getattr
 469	i386	file_setattr		sys_file_setattr
+470	common	listns			sys_listns
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ced2a1deecd7..8a4ac4841be6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -394,6 +394,7 @@
 467	common	open_tree_attr		sys_open_tree_attr
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
+470	common	listns			sys_listns
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 374e4cb788d8..438a3b170402 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 04e0077fb4c9..942370b3f5d2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_open_tree_attr, sys_open_tree_attr)
 __SYSCALL(__NR_file_getattr, sys_file_getattr)
 #define __NR_file_setattr 469
 __SYSCALL(__NR_file_setattr, sys_file_setattr)
+#define __NR_listns 470
+__SYSCALL(__NR_listns, sys_listns)
 
 #undef __NR_syscalls
-#define __NR_syscalls 470
+#define __NR_syscalls 471
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index d1ae5e92c615..e74868be513c 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -410,3 +410,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	listns				sys_listns

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 17/50] nsfs: update tools header
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (15 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 16/50] arch: hookup listns() system call Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 18/50] selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper Christian Brauner
                   ` (36 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Ensure all the new uapi bits are visible for the selftests.
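
For anyone who wants to sanity-check the size constants in this header
from userspace, a minimal sketch follows. The two structs are local
mirrors of the definitions added below, with <stdint.h> types standing
in for __u32/__u64; the *_mirror names and the offset helper are
illustrative only and are not part of the uapi header.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct nsfs_file_handle (illustrative, not the uapi header). */
struct nsfs_file_handle_mirror {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/* Local mirror of struct ns_id_req, including the anonymous listns struct. */
struct ns_id_req_mirror {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	struct /* listns */ {
		uint32_t ns_type;
		uint32_t spare2;
		uint64_t user_ns_id;
	};
};

/* These mirror NSFS_FILE_HANDLE_SIZE_VER0 and NS_ID_REQ_SIZE_VER0. */
_Static_assert(sizeof(struct nsfs_file_handle_mirror) == 16,
	       "nsfs_file_handle VER0 size");
_Static_assert(sizeof(struct ns_id_req_mirror) == 32,
	       "ns_id_req VER0 size");

/* Byte offset of user_ns_id inside the request structure. */
static size_t ns_id_req_user_ns_id_offset(void)
{
	return offsetof(struct ns_id_req_mirror, user_ns_id);
}
```

All members are naturally aligned, so neither struct contains implicit
padding and the VER0 sizes match the sum of the member sizes.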

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/include/uapi/linux/nsfs.h | 70 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/tools/include/uapi/linux/nsfs.h b/tools/include/uapi/linux/nsfs.h
index 33c9b578b3b2..a25e38d1c874 100644
--- a/tools/include/uapi/linux/nsfs.h
+++ b/tools/include/uapi/linux/nsfs.h
@@ -53,6 +53,76 @@ enum init_ns_ino {
 	TIME_NS_INIT_INO	= 0xEFFFFFFAU,
 	NET_NS_INIT_INO		= 0xEFFFFFF9U,
 	MNT_NS_INIT_INO		= 0xEFFFFFF8U,
+#ifdef __KERNEL__
+	MNT_NS_ANON_INO		= 0xEFFFFFF7U,
+#endif
 };
 
+struct nsfs_file_handle {
+	__u64 ns_id;
+	__u32 ns_type;
+	__u32 ns_inum;
+};
+
+#define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
+#define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
+
+enum init_ns_id {
+	IPC_NS_INIT_ID		= 1ULL,
+	UTS_NS_INIT_ID		= 2ULL,
+	USER_NS_INIT_ID		= 3ULL,
+	PID_NS_INIT_ID		= 4ULL,
+	CGROUP_NS_INIT_ID	= 5ULL,
+	TIME_NS_INIT_ID		= 6ULL,
+	NET_NS_INIT_ID		= 7ULL,
+	MNT_NS_INIT_ID		= 8ULL,
+#ifdef __KERNEL__
+	NS_LAST_INIT_ID		= MNT_NS_INIT_ID,
+#endif
+};
+
+enum ns_type {
+	TIME_NS    = (1ULL << 7),  /* CLONE_NEWTIME */
+	MNT_NS     = (1ULL << 17), /* CLONE_NEWNS */
+	CGROUP_NS  = (1ULL << 25), /* CLONE_NEWCGROUP */
+	UTS_NS     = (1ULL << 26), /* CLONE_NEWUTS */
+	IPC_NS     = (1ULL << 27), /* CLONE_NEWIPC */
+	USER_NS    = (1ULL << 28), /* CLONE_NEWUSER */
+	PID_NS     = (1ULL << 29), /* CLONE_NEWPID */
+	NET_NS     = (1ULL << 30), /* CLONE_NEWNET */
+};
+
+/**
+ * struct ns_id_req - namespace ID request structure
+ * @size: size of this structure
+ * @spare: reserved for future use
+ * @ns_id: last listed namespace ID (or zero)
+ * @ns_type: filter mask of namespace types
+ * @spare2: reserved for future use
+ * @user_ns_id: owning user namespace ID
+ *
+ * Structure for passing namespace ID and miscellaneous parameters to
+ * statns(2) and listns(2).
+ *
+ * For listns(2) @ns_id represents the last listed namespace ID (or zero).
+ */
+struct ns_id_req {
+	__u32 size;
+	__u32 spare;
+	__u64 ns_id;
+	struct /* listns */ {
+		__u32 ns_type;
+		__u32 spare2;
+		__u64 user_ns_id;
+	};
+};
+
+/*
+ * Special @user_ns_id value that can be passed to listns()
+ */
+#define LISTNS_CURRENT_USER 0xffffffffffffffff /* Caller's userns */
+
+/* List of all ns_id_req versions. */
+#define NS_ID_REQ_SIZE_VER0 32 /* sizeof first published struct */
+
 #endif /* __LINUX_NSFS_H */

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 18/50] selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (16 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 17/50] nsfs: update tools header Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 19/50] selftests/namespaces: first active reference count tests Christian Brauner
                   ` (35 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

After reviewing all of the tests that rely on this helper, CLONE_NEWPID
is effectively unused and doesn't serve any purpose.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/filesystems/utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/filesystems/utils.c b/tools/testing/selftests/filesystems/utils.c
index c43a69dffd83..a0c64f415a7f 100644
--- a/tools/testing/selftests/filesystems/utils.c
+++ b/tools/testing/selftests/filesystems/utils.c
@@ -487,7 +487,7 @@ int setup_userns(void)
 	uid_t uid = getuid();
 	gid_t gid = getgid();
 
-	ret = unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWPID);
+	ret = unshare(CLONE_NEWNS|CLONE_NEWUSER);
 	if (ret) {
 		ksft_exit_fail_msg("unsharing mountns and userns: %s\n",
 				   strerror(errno));

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 19/50] selftests/namespaces: first active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (17 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 18/50] selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 20/50] selftests/namespaces: second " Christian Brauner
                   ` (34 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that initial namespaces can be reopened via file handle. Initial
namespaces should always have a ref count of one from boot.
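
The test confirms that the reopened fd refers to the same namespace by
comparing inode numbers from fstat(). A minimal sketch of that check as
a standalone helper is shown here. It also compares st_dev, which is an
assumption on my side and slightly stricter than the test, and it is
exercised on ordinary files rather than nsfs so that it runs anywhere.
The helper and demo names are hypothetical.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both fds refer to the same inode on the same filesystem. */
static bool fds_refer_to_same_inode(int fd1, int fd2)
{
	struct stat st1, st2;

	if (fstat(fd1, &st1) < 0 || fstat(fd2, &st2) < 0)
		return false;

	return st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
}

/* Demo on ordinary files; returns 0 if the helper behaves as expected. */
static int demo_same_inode(void)
{
	int a = open("/", O_RDONLY);
	int b = open("/", O_RDONLY);
	int c = open("/dev/null", O_RDONLY);
	int ret = -1;

	if (a >= 0 && b >= 0 && c >= 0 &&
	    fds_refer_to_same_inode(a, b) &&
	    !fds_refer_to_same_inode(a, c))
		ret = 0;

	if (a >= 0)
		close(a);
	if (b >= 0)
		close(b);
	if (c >= 0)
		close(c);
	return ret;
}
```

In the selftest the two fds would come from open("/proc/1/ns/net") and
open_by_handle_at(FD_NSFS_ROOT, ...) instead.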

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/.gitignore      |  1 +
 tools/testing/selftests/namespaces/Makefile        |  5 +-
 .../selftests/namespaces/ns_active_ref_test.c      | 74 ++++++++++++++++++++++
 3 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/namespaces/.gitignore b/tools/testing/selftests/namespaces/.gitignore
index ccfb40837a73..100cc5bfef04 100644
--- a/tools/testing/selftests/namespaces/.gitignore
+++ b/tools/testing/selftests/namespaces/.gitignore
@@ -1,3 +1,4 @@
 nsid_test
 file_handle_test
 init_ino_test
+ns_active_ref_test
diff --git a/tools/testing/selftests/namespaces/Makefile b/tools/testing/selftests/namespaces/Makefile
index 5fe4b3dc07d3..5cea938cdde8 100644
--- a/tools/testing/selftests/namespaces/Makefile
+++ b/tools/testing/selftests/namespaces/Makefile
@@ -1,7 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -Wall -O0 -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := nsid_test file_handle_test init_ino_test
+TEST_GEN_PROGS := nsid_test file_handle_test init_ino_test ns_active_ref_test
 
 include ../lib.mk
 
+$(OUTPUT)/ns_active_ref_test: ../filesystems/utils.c
+
diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
new file mode 100644
index 000000000000..21514a537b26
--- /dev/null
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/nsfs.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "../kselftest_harness.h"
+#include "../filesystems/utils.h"
+
+#ifndef FD_NSFS_ROOT
+#define FD_NSFS_ROOT -10003 /* Root of the nsfs filesystem */
+#endif
+
+/*
+ * Test that initial namespaces can be reopened via file handle.
+ * Initial namespaces should have active ref count of 1 from boot.
+ */
+TEST(init_ns_always_active)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int fd1, fd2;
+	struct stat st1, st2;
+
+	handle = malloc(sizeof(*handle) + MAX_HANDLE_SZ);
+	ASSERT_NE(handle, NULL);
+
+	/* Open initial network namespace */
+	fd1 = open("/proc/1/ns/net", O_RDONLY);
+	ASSERT_GE(fd1, 0);
+
+	/* Get file handle for initial namespace */
+	handle->handle_bytes = MAX_HANDLE_SZ;
+	ret = name_to_handle_at(fd1, "", handle, &mount_id, AT_EMPTY_PATH);
+	if (ret < 0 && errno == EOPNOTSUPP) {
+		SKIP(free(handle); close(fd1);
+		     return, "nsfs doesn't support file handles");
+	}
+	ASSERT_EQ(ret, 0);
+
+	/* Close the namespace fd */
+	close(fd1);
+
+	/* Try to reopen via file handle - should succeed since init ns is always active */
+	fd2 = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd2 < 0 && (errno == EINVAL || errno == EOPNOTSUPP)) {
+		SKIP(free(handle);
+		     return, "open_by_handle_at with FD_NSFS_ROOT not supported");
+	}
+	ASSERT_GE(fd2, 0);
+
+	/* Verify we opened the same namespace */
+	fd1 = open("/proc/1/ns/net", O_RDONLY);
+	ASSERT_GE(fd1, 0);
+	ASSERT_EQ(fstat(fd1, &st1), 0);
+	ASSERT_EQ(fstat(fd2, &st2), 0);
+	ASSERT_EQ(st1.st_ino, st2.st_ino);
+
+	close(fd1);
+	close(fd2);
+	free(handle);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 20/50] selftests/namespaces: second active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (18 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 19/50] selftests/namespaces: first active reference count tests Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 21/50] selftests/namespaces: third " Christian Brauner
                   ` (33 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test namespace lifecycle: create a namespace in a child process, get a
file handle while it's active, then try to reopen after the process
exits (namespace becomes inactive).
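
The child passes the file handle to the parent over a pipe, and the
parent picks it up with a single read(). For a payload this small that
is normally enough, but a loop that tolerates short reads and EINTR is
a possible hardening. The sketch below is not what the test does; the
read_all() and demo_read_all() names are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read until EOF or until len bytes arrived; returns bytes read or -1. */
static ssize_t read_all(int fd, void *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = read(fd, (char *)buf + off, len - off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break; /* writer closed its end */
		off += (size_t)n;
	}
	return (ssize_t)off;
}

/* Demo: two separate writes stand in for a payload arriving in pieces. */
static int demo_read_all(void)
{
	int pipefd[2];
	char buf[16] = { 0 };
	ssize_t n;

	if (pipe(pipefd) < 0)
		return -1;

	if (write(pipefd[1], "hand", 4) != 4 ||
	    write(pipefd[1], "le", 2) != 2) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}
	close(pipefd[1]);

	n = read_all(pipefd[0], buf, sizeof(buf));
	close(pipefd[0]);

	return (n == 6 && memcmp(buf, "handle", 6) == 0) ? 0 : -1;
}
```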

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 89 ++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 21514a537b26..f628b4a4a927 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -71,4 +71,93 @@ TEST(init_ns_always_active)
 	free(handle);
 }
 
+/*
+ * Test namespace lifecycle: create a namespace in a child process,
+ * get a file handle while it's active, then try to reopen after
+ * the process exits (namespace becomes inactive).
+ */
+TEST(ns_inactive_after_exit)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int fd;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	char buf[sizeof(*handle) + MAX_HANDLE_SZ];
+
+	/* Create pipe for passing file handle from child */
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child process */
+		close(pipefd[0]);
+
+		/* Create new network namespace */
+		ret = unshare(CLONE_NEWNET);
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Open our new namespace */
+		fd = open("/proc/self/ns/net", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get file handle for the namespace */
+		handle = (struct file_handle *)buf;
+		handle->handle_bytes = MAX_HANDLE_SZ;
+		ret = name_to_handle_at(fd, "", handle, &mount_id, AT_EMPTY_PATH);
+		close(fd);
+
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Send handle to parent */
+		write(pipefd[1], buf, sizeof(*handle) + handle->handle_bytes);
+		close(pipefd[1]);
+
+		/* Exit - namespace should become inactive */
+		exit(0);
+	}
+
+	/* Parent process */
+	close(pipefd[1]);
+
+	/* Read file handle from child */
+	ret = read(pipefd[0], buf, sizeof(buf));
+	close(pipefd[0]);
+
+	/* Wait for child to exit */
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to create namespace or get handle");
+	}
+
+	ASSERT_GT(ret, 0);
+	handle = (struct file_handle *)buf;
+
+	/* Try to reopen namespace - should fail with ENOENT since it's inactive */
+	fd = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd >= 0) {
+		/* Namespace is still active - this could happen if cleanup is slow */
+		TH_LOG("Warning: Namespace still active after process exit");
+		close(fd);
+	} else {
+		/* Should fail with ENOENT (namespace inactive) or ESTALE */
+		ASSERT_TRUE(errno == ENOENT || errno == ESTALE);
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 21/50] selftests/namespaces: third active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (19 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 20/50] selftests/namespaces: second " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 22/50] selftests/namespaces: fourth " Christian Brauner
                   ` (32 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that a namespace remains active while a process is using it,
even after the creating process exits.
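
The test gives the second child 100ms to enter the namespace before the
first child is told to exit. A readiness pipe would make that ordering
deterministic. The following is only a sketch of such a handshake under
my own assumptions: demo_ready_handshake() is a hypothetical name, and
the child exits right after signalling, whereas in the real test it
would call setns() first and then stay alive.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the parent observed the child's readiness byte. */
static int demo_ready_handshake(void)
{
	int readypipe[2];
	char byte = 0;
	int status;
	pid_t pid;

	if (pipe(readypipe) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(readypipe[0]);
		close(readypipe[1]);
		return -1;
	}

	if (pid == 0) {
		close(readypipe[0]);
		/* The real test would join the namespace here. */
		if (write(readypipe[1], "R", 1) != 1)
			_exit(1);
		close(readypipe[1]);
		_exit(0);
	}

	close(readypipe[1]);
	/* Blocks until the child reports readiness or exits (EOF). */
	if (read(readypipe[0], &byte, 1) != 1)
		byte = 0;
	close(readypipe[0]);

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return (byte == 'R' && WIFEXITED(status) &&
		WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With a handshake like this the parent could treat a failed reopen as a
hard error instead of logging a warning.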

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 128 +++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index f628b4a4a927..63233f22517a 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -160,4 +160,132 @@ TEST(ns_inactive_after_exit)
 	}
 }
 
+/*
+ * Test that a namespace remains active while a process is using it,
+ * even after the creating process exits.
+ */
+TEST(ns_active_with_multiple_processes)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int fd;
+	int pipefd[2];
+	int syncpipe[2];
+	pid_t pid1, pid2;
+	int status;
+	char buf[sizeof(*handle) + MAX_HANDLE_SZ];
+	char sync_byte;
+
+	/* Create pipes for communication */
+	ASSERT_EQ(pipe(pipefd), 0);
+	ASSERT_EQ(pipe(syncpipe), 0);
+
+	pid1 = fork();
+	ASSERT_GE(pid1, 0);
+
+	if (pid1 == 0) {
+		/* First child - creates namespace */
+		close(pipefd[0]);
+		close(syncpipe[1]);
+
+		/* Create new network namespace */
+		ret = unshare(CLONE_NEWNET);
+		if (ret < 0) {
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+
+		/* Open and get handle */
+		fd = open("/proc/self/ns/net", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+
+		handle = (struct file_handle *)buf;
+		handle->handle_bytes = MAX_HANDLE_SZ;
+		ret = name_to_handle_at(fd, "", handle, &mount_id, AT_EMPTY_PATH);
+		close(fd);
+
+		if (ret < 0) {
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+
+		/* Send handle to parent */
+		write(pipefd[1], buf, sizeof(*handle) + handle->handle_bytes);
+		close(pipefd[1]);
+
+		/* Wait for signal before exiting */
+		read(syncpipe[0], &sync_byte, 1);
+		close(syncpipe[0]);
+		exit(0);
+	}
+
+	/* Parent reads handle */
+	close(pipefd[1]);
+	ret = read(pipefd[0], buf, sizeof(buf));
+	close(pipefd[0]);
+	ASSERT_GT(ret, 0);
+
+	handle = (struct file_handle *)buf;
+
+	/* Create second child that will keep namespace active */
+	pid2 = fork();
+	ASSERT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		/* Second child - reopens the namespace */
+		close(syncpipe[0]);
+		close(syncpipe[1]);
+
+		/* Open the namespace via handle */
+		fd = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+		if (fd < 0) {
+			exit(1);
+		}
+
+		/* Join the namespace */
+		ret = setns(fd, CLONE_NEWNET);
+		close(fd);
+		if (ret < 0) {
+			exit(1);
+		}
+
+		/* Sleep to keep namespace active */
+		sleep(1);
+		exit(0);
+	}
+
+	/* Let second child enter the namespace */
+	usleep(100000); /* 100ms */
+
+	/* Signal first child to exit */
+	close(syncpipe[0]);
+	sync_byte = 'X';
+	write(syncpipe[1], &sync_byte, 1);
+	close(syncpipe[1]);
+
+	/* Wait for first child */
+	waitpid(pid1, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	/* Namespace should still be active because second child is using it */
+	fd = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd < 0) {
+		/* If this fails, second child might not have entered namespace yet */
+		TH_LOG("Warning: Could not reopen namespace (errno=%d)", errno);
+	} else {
+		close(fd);
+	}
+
+	/* Wait for second child */
+	waitpid(pid2, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 22/50] selftests/namespaces: fourth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (20 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 21/50] selftests/namespaces: third " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 23/50] selftests/namespaces: fifth " Christian Brauner
                   ` (31 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test user namespace active ref tracking via credential lifecycle.
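
The child writes a single-entry mapping of the form "0 <id> 1" to
uid_map and gid_map, which maps ID 0 inside the new user namespace to
the caller's ID outside of it with a range of one. A small sketch of
building that string with a truncation check is given below. The
function names are hypothetical and the test itself formats the string
inline with snprintf().

```c
#include <stdio.h>
#include <string.h>

/*
 * Format "0 <outer_id> 1": inner ID 0, outer ID outer_id, range 1.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_single_id_map(char *buf, size_t size, unsigned int outer_id)
{
	int n = snprintf(buf, size, "0 %u 1", outer_id);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}

/* Demo: checks one successful format and one truncated one. */
static int demo_id_map(void)
{
	char mapping[64];
	char tiny[4];

	if (format_single_id_map(mapping, sizeof(mapping), 1000) != 8)
		return -1;
	if (strcmp(mapping, "0 1000 1") != 0)
		return -1;
	if (format_single_id_map(tiny, sizeof(tiny), 1000) != -1)
		return -1;
	return 0;
}
```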

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 97 ++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 63233f22517a..b9836693f5ec 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -288,4 +288,101 @@ TEST(ns_active_with_multiple_processes)
 	ASSERT_TRUE(WIFEXITED(status));
 }
 
+/*
+ * Test user namespace active ref tracking via credential lifecycle
+ */
+TEST(userns_active_ref_lifecycle)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int fd;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	char buf[sizeof(*handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child process */
+		close(pipefd[0]);
+
+		/* Create new user namespace */
+		ret = unshare(CLONE_NEWUSER);
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Set up uid/gid mappings */
+		int uid_map_fd = open("/proc/self/uid_map", O_WRONLY);
+		int gid_map_fd = open("/proc/self/gid_map", O_WRONLY);
+		int setgroups_fd = open("/proc/self/setgroups", O_WRONLY);
+
+		if (uid_map_fd >= 0 && gid_map_fd >= 0 && setgroups_fd >= 0) {
+			write(setgroups_fd, "deny", 4);
+			close(setgroups_fd);
+
+			char mapping[64];
+			snprintf(mapping, sizeof(mapping), "0 %d 1", getuid());
+			write(uid_map_fd, mapping, strlen(mapping));
+			close(uid_map_fd);
+
+			snprintf(mapping, sizeof(mapping), "0 %d 1", getgid());
+			write(gid_map_fd, mapping, strlen(mapping));
+			close(gid_map_fd);
+		}
+
+		/* Get file handle */
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		handle = (struct file_handle *)buf;
+		handle->handle_bytes = MAX_HANDLE_SZ;
+		ret = name_to_handle_at(fd, "", handle, &mount_id, AT_EMPTY_PATH);
+		close(fd);
+
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Send handle to parent */
+		write(pipefd[1], buf, sizeof(*handle) + handle->handle_bytes);
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+	ret = read(pipefd[0], buf, sizeof(buf));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to create user namespace");
+	}
+
+	ASSERT_GT(ret, 0);
+	handle = (struct file_handle *)buf;
+
+	/* Namespace should be inactive after all tasks exit */
+	fd = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd >= 0) {
+		TH_LOG("Warning: User namespace still active after process exit");
+		close(fd);
+	} else {
+		ASSERT_TRUE(errno == ENOENT || errno == ESTALE);
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 23/50] selftests/namespaces: fifth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (21 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 22/50] selftests/namespaces: fourth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 24/50] selftests/namespaces: sixth " Christian Brauner
                   ` (30 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test PID namespace active ref tracking.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 88 ++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index b9836693f5ec..66665bd39e9b 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -385,4 +385,92 @@ TEST(userns_active_ref_lifecycle)
 	}
 }
 
+/*
+ * Test PID namespace active ref tracking
+ */
+TEST(pidns_active_ref_lifecycle)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int fd;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	char buf[sizeof(*handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child process */
+		close(pipefd[0]);
+
+		/* Create new PID namespace */
+		ret = unshare(CLONE_NEWPID);
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Fork to actually enter the PID namespace */
+		pid_t child = fork();
+		if (child < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (child == 0) {
+			/* Grandchild - in new PID namespace */
+			fd = open("/proc/self/ns/pid", O_RDONLY);
+			if (fd < 0) {
+				exit(1);
+			}
+
+			handle = (struct file_handle *)buf;
+			handle->handle_bytes = MAX_HANDLE_SZ;
+			ret = name_to_handle_at(fd, "", handle, &mount_id, AT_EMPTY_PATH);
+			close(fd);
+
+			if (ret < 0) {
+				exit(1);
+			}
+
+			/* Send handle to grandparent */
+			write(pipefd[1], buf, sizeof(*handle) + handle->handle_bytes);
+			close(pipefd[1]);
+			exit(0);
+		}
+
+		/* Wait for grandchild */
+		waitpid(child, NULL, 0);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+	ret = read(pipefd[0], buf, sizeof(buf));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (ret <= 0) {
+		SKIP(return, "Child failed to create PID namespace or get handle");
+	}
+
+	handle = (struct file_handle *)buf;
+
+	/* Namespace should be inactive after all processes exit */
+	fd = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd >= 0) {
+		TH_LOG("Warning: PID namespace still active after process exit");
+		close(fd);
+	} else {
+		ASSERT_TRUE(errno == ENOENT || errno == ESTALE);
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH RFC DRAFT 24/50] selftests/namespaces: sixth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (22 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 23/50] selftests/namespaces: fifth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 25/50] selftests/namespaces: seventh " Christian Brauner
                   ` (29 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that an open file descriptor keeps a namespace active.
Even after the creating process exits, the namespace should remain
active as long as an fd is held open.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 168 +++++++++++++++++++++
 1 file changed, 168 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 66665bd39e9b..dbb1eb8a04b2 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -473,4 +473,172 @@ TEST(pidns_active_ref_lifecycle)
 	}
 }
 
+/*
+ * Test that an open file descriptor keeps a namespace active.
+ * Even after the creating process exits, the namespace should remain
+ * active as long as an fd is held open.
+ */
+TEST(ns_fd_keeps_active)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int nsfd;
+	int pipe_child_ready[2];
+	int pipe_parent_ready[2];
+	pid_t pid;
+	int status;
+	char buf[sizeof(*handle) + MAX_HANDLE_SZ];
+	char sync_byte;
+	char proc_path[64];
+
+	ASSERT_EQ(pipe(pipe_child_ready), 0);
+	ASSERT_EQ(pipe(pipe_parent_ready), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child process */
+		close(pipe_child_ready[0]);
+		close(pipe_parent_ready[1]);
+
+		TH_LOG("Child: creating new network namespace");
+
+		/* Create new network namespace */
+		ret = unshare(CLONE_NEWNET);
+		if (ret < 0) {
+			TH_LOG("Child: unshare(CLONE_NEWNET) failed: %s", strerror(errno));
+			close(pipe_child_ready[1]);
+			close(pipe_parent_ready[0]);
+			exit(1);
+		}
+
+		TH_LOG("Child: network namespace created successfully");
+
+		/* Get file handle for the namespace */
+		nsfd = open("/proc/self/ns/net", O_RDONLY);
+		if (nsfd < 0) {
+			TH_LOG("Child: failed to open /proc/self/ns/net: %s", strerror(errno));
+			close(pipe_child_ready[1]);
+			close(pipe_parent_ready[0]);
+			exit(1);
+		}
+
+		TH_LOG("Child: opened namespace fd %d", nsfd);
+
+		handle = (struct file_handle *)buf;
+		handle->handle_bytes = MAX_HANDLE_SZ;
+		ret = name_to_handle_at(nsfd, "", handle, &mount_id, AT_EMPTY_PATH);
+		close(nsfd);
+
+		if (ret < 0) {
+			TH_LOG("Child: name_to_handle_at failed: %s", strerror(errno));
+			close(pipe_child_ready[1]);
+			close(pipe_parent_ready[0]);
+			exit(1);
+		}
+
+		TH_LOG("Child: got file handle (bytes=%u)", handle->handle_bytes);
+
+		/* Send file handle to parent */
+		ret = write(pipe_child_ready[1], buf, sizeof(*handle) + handle->handle_bytes);
+		TH_LOG("Child: sent %d bytes of file handle to parent", ret);
+		close(pipe_child_ready[1]);
+
+		/* Wait for parent to open the fd */
+		TH_LOG("Child: waiting for parent to open fd");
+		ret = read(pipe_parent_ready[0], &sync_byte, 1);
+		close(pipe_parent_ready[0]);
+
+		TH_LOG("Child: parent signaled (read %d bytes), exiting now", ret);
+		/* Exit - namespace should stay active because parent holds fd */
+		exit(0);
+	}
+
+	/* Parent process */
+	close(pipe_child_ready[1]);
+	close(pipe_parent_ready[0]);
+
+	TH_LOG("Parent: reading file handle from child");
+
+	/* Read file handle from child */
+	ret = read(pipe_child_ready[0], buf, sizeof(buf));
+	close(pipe_child_ready[0]);
+	ASSERT_GT(ret, 0);
+	handle = (struct file_handle *)buf;
+
+	TH_LOG("Parent: received %d bytes, handle size=%u", ret, handle->handle_bytes);
+
+	/* Open the child's namespace while it's still alive */
+	snprintf(proc_path, sizeof(proc_path), "/proc/%d/ns/net", pid);
+	TH_LOG("Parent: opening child's namespace at %s", proc_path);
+	nsfd = open(proc_path, O_RDONLY);
+	if (nsfd < 0) {
+		TH_LOG("Parent: failed to open %s: %s", proc_path, strerror(errno));
+		close(pipe_parent_ready[1]);
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open child's namespace");
+	}
+
+	TH_LOG("Parent: opened child's namespace, got fd %d", nsfd);
+
+	/* Signal child that we have the fd */
+	sync_byte = 'G';
+	write(pipe_parent_ready[1], &sync_byte, 1);
+	close(pipe_parent_ready[1]);
+	TH_LOG("Parent: signaled child that we have the fd");
+
+	/* Wait for child to exit */
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		close(nsfd);
+		SKIP(return, "Child failed to create namespace");
+	}
+
+	TH_LOG("Child exited, parent holds fd %d to namespace", nsfd);
+
+	/*
+	 * Namespace should still be ACTIVE because we hold an fd.
+	 * We should be able to reopen it via file handle.
+	 */
+	TH_LOG("Attempting to reopen namespace via file handle (should succeed - fd held)");
+	int fd2 = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd2 < 0) {
+		close(nsfd);
+		TH_LOG("ERROR: Failed to reopen active namespace (fd held): %s (errno=%d)",
+		       strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+
+	TH_LOG("Successfully reopened namespace via file handle, got fd %d", fd2);
+
+	/* Verify it's the same namespace */
+	struct stat st1, st2;
+	ASSERT_EQ(fstat(nsfd, &st1), 0);
+	ASSERT_EQ(fstat(fd2, &st2), 0);
+	TH_LOG("Namespace inodes: nsfd=%lu, fd2=%lu", st1.st_ino, st2.st_ino);
+	ASSERT_EQ(st1.st_ino, st2.st_ino);
+	close(fd2);
+
+	/* Now close the fd - namespace should become inactive */
+	TH_LOG("Closing fd %d - namespace should become inactive", nsfd);
+	close(nsfd);
+
+	/* Now reopening should fail - namespace is inactive */
+	TH_LOG("Attempting to reopen namespace via file handle (should fail - inactive)");
+	fd2 = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd2 >= 0) {
+		close(fd2);
+		TH_LOG("ERROR: Namespace still active after closing all fds (fd=%d)", fd2);
+	} else {
+		/* Should fail with ENOENT (inactive) or ESTALE (gone) */
+		TH_LOG("Reopen failed as expected: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(errno == ENOENT || errno == ESTALE);
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 25/50] selftests/namespaces: seventh active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (23 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 24/50] selftests/namespaces: sixth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 26/50] selftests/namespaces: eighth " Christian Brauner
                   ` (28 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test hierarchical active reference propagation.
When a child namespace is active, its owning user namespace should also
be active automatically due to hierarchical active reference propagation.
This ensures parents are always reachable when children are active.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 242 +++++++++++++++++++++
 1 file changed, 242 insertions(+)
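
Note for readers: the test below builds nsfs file handles directly from
namespace IDs instead of calling name_to_handle_at(). A minimal sketch of
that step follows. The helper name nsfs_handle_from_id() and the local
struct nsfs_fh_sketch are hypothetical; the struct is assumed to mirror
the three fields the test fills in (ns_id, ns_type, ns_inum), and the
FILEID_NSFS value is the fallback defined in this patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>

#ifndef MAX_HANDLE_SZ
/* Fallback in case libc did not expose the handle definitions. */
#define MAX_HANDLE_SZ 128
struct file_handle {
	unsigned int handle_bytes;
	int handle_type;
	unsigned char f_handle[];
};
#endif

#ifndef FILEID_NSFS
#define FILEID_NSFS 0xf1 /* same fallback value as in the patch */
#endif

/* Assumed layout of the nsfs handle payload (illustration only). */
struct nsfs_fh_sketch {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/*
 * Hypothetical helper: fill @buf with an nsfs file handle for @ns_id.
 * ns_type and ns_inum are left at zero, as the tests do.
 */
static struct file_handle *nsfs_handle_from_id(char *buf, size_t len,
					       uint64_t ns_id)
{
	struct file_handle *fh = (struct file_handle *)buf;
	struct nsfs_fh_sketch payload = { .ns_id = ns_id };

	assert(len >= sizeof(*fh) + sizeof(payload));
	memset(buf, 0, len);
	fh->handle_bytes = sizeof(payload);
	fh->handle_type = FILEID_NSFS;
	memcpy(fh->f_handle, &payload, sizeof(payload));
	return fh;
}
```

The returned pointer can then be passed to
open_by_handle_at(FD_NSFS_ROOT, fh, O_RDONLY) the way the test does.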

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index dbb1eb8a04b2..6377f5d72ed9 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -20,6 +20,10 @@
 #define FD_NSFS_ROOT -10003 /* Root of the nsfs filesystem */
 #endif
 
+#ifndef FILEID_NSFS
+#define FILEID_NSFS 0xf1
+#endif
+
 /*
  * Test that initial namespaces can be reopened via file handle.
  * Initial namespaces should have active ref count of 1 from boot.
@@ -641,4 +645,242 @@ TEST(ns_fd_keeps_active)
 	}
 }
 
+/*
+ * Test hierarchical active reference propagation.
+ * When a child namespace is active, its owning user namespace should also
+ * be active automatically due to hierarchical active reference propagation.
+ * This ensures parents are always reachable when children are active.
+ */
+TEST(ns_parent_always_reachable)
+{
+	struct file_handle *parent_handle, *child_handle;
+	int ret;
+	int child_nsfd;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 parent_id, child_id;
+	char parent_buf[sizeof(*parent_handle) + MAX_HANDLE_SZ];
+	char child_buf[sizeof(*child_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child process */
+		close(pipefd[0]);
+
+		TH_LOG("Child: creating parent user namespace and setting up mappings");
+
+		/* Create parent user namespace with mappings */
+		ret = setup_userns();
+		if (ret < 0) {
+			TH_LOG("Child: setup_userns() for parent failed: %s", strerror(errno));
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		TH_LOG("Child: parent user namespace created, now uid=%d gid=%d", getuid(), getgid());
+
+		/* Get namespace ID for parent user namespace */
+		int parent_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (parent_fd < 0) {
+			TH_LOG("Child: failed to open parent /proc/self/ns/user: %s", strerror(errno));
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		TH_LOG("Child: opened parent userns fd %d", parent_fd);
+
+		if (ioctl(parent_fd, NS_GET_ID, &parent_id) < 0) {
+			TH_LOG("Child: NS_GET_ID for parent failed: %s", strerror(errno));
+			close(parent_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(parent_fd);
+
+		TH_LOG("Child: got parent namespace ID %llu", (unsigned long long)parent_id);
+
+		/* Create child user namespace within parent */
+		TH_LOG("Child: creating nested child user namespace");
+		ret = setup_userns();
+		if (ret < 0) {
+			TH_LOG("Child: setup_userns() for child failed: %s", strerror(errno));
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		TH_LOG("Child: nested child user namespace created, uid=%d gid=%d", getuid(), getgid());
+
+		/* Get namespace ID for child user namespace */
+		int child_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (child_fd < 0) {
+			TH_LOG("Child: failed to open child /proc/self/ns/user: %s", strerror(errno));
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		TH_LOG("Child: opened child userns fd %d", child_fd);
+
+		if (ioctl(child_fd, NS_GET_ID, &child_id) < 0) {
+			TH_LOG("Child: NS_GET_ID for child failed: %s", strerror(errno));
+			close(child_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(child_fd);
+
+		TH_LOG("Child: got child namespace ID %llu", (unsigned long long)child_id);
+
+		/* Send both namespace IDs to parent */
+		TH_LOG("Child: sending both namespace IDs to parent");
+		write(pipefd[1], &parent_id, sizeof(parent_id));
+		write(pipefd[1], &child_id, sizeof(child_id));
+		close(pipefd[1]);
+
+		TH_LOG("Child: exiting - parent userns should become inactive");
+		/* Exit - parent user namespace should become inactive */
+		exit(0);
+	}
+
+	/* Parent process */
+	close(pipefd[1]);
+
+	TH_LOG("Parent: reading both namespace IDs from child");
+
+	/* Read both namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &parent_id, sizeof(parent_id));
+	if (ret != sizeof(parent_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read parent namespace ID from child");
+	}
+
+	ret = read(pipefd[0], &child_id, sizeof(child_id));
+	close(pipefd[0]);
+	if (ret != sizeof(child_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read child namespace ID from child");
+	}
+
+	TH_LOG("Parent: received parent_id=%llu, child_id=%llu",
+	       (unsigned long long)parent_id, (unsigned long long)child_id);
+
+	/* Construct file handles from namespace IDs */
+	parent_handle = (struct file_handle *)parent_buf;
+	parent_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	parent_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *parent_fh = (struct nsfs_file_handle *)parent_handle->f_handle;
+	parent_fh->ns_id = parent_id;
+	parent_fh->ns_type = 0;
+	parent_fh->ns_inum = 0;
+
+	child_handle = (struct file_handle *)child_buf;
+	child_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	child_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *child_fh = (struct nsfs_file_handle *)child_handle->f_handle;
+	child_fh->ns_id = child_id;
+	child_fh->ns_type = 0;
+	child_fh->ns_inum = 0;
+
+	TH_LOG("Parent: opening child namespace BEFORE child exits");
+
+	/* Open child namespace while child is still alive to keep it active */
+	child_nsfd = open_by_handle_at(FD_NSFS_ROOT, child_handle, O_RDONLY);
+	if (child_nsfd < 0) {
+		TH_LOG("Failed to open child namespace: %s (errno=%d)", strerror(errno), errno);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open child namespace");
+	}
+
+	TH_LOG("Opened child namespace fd %d", child_nsfd);
+
+	/* Now wait for child to exit */
+	TH_LOG("Parent: waiting for child to exit");
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		close(child_nsfd);
+		SKIP(return, "Child failed to create namespaces");
+	}
+
+	TH_LOG("Child process exited, parent holds fd to child namespace");
+
+	/*
+	 * With hierarchical active reference propagation:
+	 * Since the child namespace is active (parent process holds fd),
+	 * the parent user namespace should ALSO be active automatically.
+	 * This is because when we took an active reference on the child,
+	 * it propagated up to the owning user namespace.
+	 */
+	TH_LOG("Attempting to reopen parent namespace (should SUCCEED - hierarchical propagation)");
+	int parent_fd = open_by_handle_at(FD_NSFS_ROOT, parent_handle, O_RDONLY);
+	if (parent_fd < 0) {
+		close(child_nsfd);
+		TH_LOG("ERROR: Parent namespace inactive despite active child: %s (errno=%d)",
+		       strerror(errno), errno);
+		TH_LOG("This indicates hierarchical active reference propagation is not working!");
+		ASSERT_TRUE(false);
+	}
+
+	TH_LOG("SUCCESS: Parent namespace is active (fd=%d) due to active child", parent_fd);
+
+	/* Verify we can also get parent via NS_GET_USERNS */
+	TH_LOG("Verifying NS_GET_USERNS also works");
+	int parent_fd2 = ioctl(child_nsfd, NS_GET_USERNS);
+	if (parent_fd2 < 0) {
+		close(parent_fd);
+		close(child_nsfd);
+		TH_LOG("NS_GET_USERNS failed: %s (errno=%d)", strerror(errno), errno);
+		SKIP(return, "NS_GET_USERNS not supported or failed");
+	}
+
+	TH_LOG("NS_GET_USERNS succeeded, got parent fd %d", parent_fd2);
+
+	/* Verify both methods give us the same namespace */
+	struct stat st1, st2;
+	ASSERT_EQ(fstat(parent_fd, &st1), 0);
+	ASSERT_EQ(fstat(parent_fd2, &st2), 0);
+	TH_LOG("Parent namespace inodes: parent_fd=%lu, parent_fd2=%lu", st1.st_ino, st2.st_ino);
+	ASSERT_EQ(st1.st_ino, st2.st_ino);
+
+	/*
+	 * Close child fd - parent should remain active because we still
+	 * hold direct references to it (parent_fd and parent_fd2).
+	 */
+	TH_LOG("Closing child fd - parent should remain active (direct refs held)");
+	close(child_nsfd);
+
+	/* Parent should still be openable */
+	TH_LOG("Verifying parent still active via file handle");
+	int parent_fd3 = open_by_handle_at(FD_NSFS_ROOT, parent_handle, O_RDONLY);
+	if (parent_fd3 < 0) {
+		close(parent_fd);
+		close(parent_fd2);
+		TH_LOG("ERROR: Parent became inactive despite holding fds: %s (errno=%d)",
+		       strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+	close(parent_fd3);
+
+	TH_LOG("Closing all fds to parent namespace");
+	close(parent_fd);
+	close(parent_fd2);
+
+	/* Both should now be inactive */
+	TH_LOG("Attempting to reopen parent (should fail - inactive, no refs)");
+	parent_fd = open_by_handle_at(FD_NSFS_ROOT, parent_handle, O_RDONLY);
+	if (parent_fd >= 0) {
+		close(parent_fd);
+		TH_LOG("ERROR: Parent namespace still active after closing all fds (fd=%d)", parent_fd);
+	} else {
+		TH_LOG("Parent inactive as expected: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(errno == ENOENT || errno == ESTALE);
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 26/50] selftests/namespaces: eighth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (24 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 25/50] selftests/namespaces: seventh " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 27/50] selftests/namespaces: ninth " Christian Brauner
                   ` (27 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that bind mounts keep namespaces in the tree even when inactive.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 126 +++++++++++++++++++++
 1 file changed, 126 insertions(+)
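
Note for readers: the tests in this file distinguish two failure modes
when reopening a namespace by handle. ENOENT means the namespace is
inactive, ESTALE means it is gone from the tree. A minimal sketch of
that interpretation follows; the enum and the classify_reopen() helper
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>

enum ns_reopen_state {
	NS_REOPEN_ACTIVE,	/* open_by_handle_at() returned an fd */
	NS_REOPEN_INACTIVE,	/* ENOENT: no active references left */
	NS_REOPEN_GONE,		/* ESTALE: no longer in the tree */
	NS_REOPEN_UNEXPECTED,	/* any other error */
};

/*
 * Hypothetical helper: interpret the result of open_by_handle_at().
 * @fd is the return value, @err is errno captured right after the call.
 */
static enum ns_reopen_state classify_reopen(int fd, int err)
{
	if (fd >= 0)
		return NS_REOPEN_ACTIVE;

	/* A failed call must carry a positive errno value. */
	assert(err > 0);

	switch (err) {
	case ENOENT:
		return NS_REOPEN_INACTIVE;
	case ESTALE:
		return NS_REOPEN_GONE;
	default:
		return NS_REOPEN_UNEXPECTED;
	}
}
```

For the bind-mount case the comments in the test expect the inactive
(ENOENT) state, since the mount is meant to keep the namespace in the
tree.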

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 6377f5d72ed9..66c9908d4977 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -883,4 +883,130 @@ TEST(ns_parent_always_reachable)
 	}
 }
 
+/*
+ * Test that bind mounts keep namespaces in the tree even when inactive
+ */
+TEST(ns_bind_mount_keeps_in_tree)
+{
+	struct file_handle *handle;
+	int mount_id;
+	int ret;
+	int fd;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	char buf[sizeof(*handle) + MAX_HANDLE_SZ];
+	char tmpfile[] = "/tmp/ns-test-XXXXXX";
+	int tmpfd;
+
+	/* Create temporary file for bind mount */
+	tmpfd = mkstemp(tmpfile);
+	if (tmpfd < 0) {
+		SKIP(return, "Cannot create temporary file");
+	}
+	close(tmpfd);
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child process */
+		close(pipefd[0]);
+
+		/* Unshare mount namespace and make mounts private to avoid propagation */
+		ret = unshare(CLONE_NEWNS);
+		if (ret < 0) {
+			close(pipefd[1]);
+			unlink(tmpfile);
+			exit(1);
+		}
+		ret = mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL);
+		if (ret < 0) {
+			close(pipefd[1]);
+			unlink(tmpfile);
+			exit(1);
+		}
+
+		/* Create new network namespace */
+		ret = unshare(CLONE_NEWNET);
+		if (ret < 0) {
+			close(pipefd[1]);
+			unlink(tmpfile);
+			exit(1);
+		}
+
+		/* Bind mount the namespace */
+		ret = mount("/proc/self/ns/net", tmpfile, NULL, MS_BIND, NULL);
+		if (ret < 0) {
+			close(pipefd[1]);
+			unlink(tmpfile);
+			exit(1);
+		}
+
+		/* Get file handle */
+		fd = open("/proc/self/ns/net", O_RDONLY);
+		if (fd < 0) {
+			umount(tmpfile);
+			close(pipefd[1]);
+			unlink(tmpfile);
+			exit(1);
+		}
+
+		handle = (struct file_handle *)buf;
+		handle->handle_bytes = MAX_HANDLE_SZ;
+		ret = name_to_handle_at(fd, "", handle, &mount_id, AT_EMPTY_PATH);
+		close(fd);
+
+		if (ret < 0) {
+			umount(tmpfile);
+			close(pipefd[1]);
+			unlink(tmpfile);
+			exit(1);
+		}
+
+		/* Send handle to parent */
+		write(pipefd[1], buf, sizeof(*handle) + handle->handle_bytes);
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+	ret = read(pipefd[0], buf, sizeof(buf));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		unlink(tmpfile);
+		SKIP(return, "Child failed to create namespace or bind mount");
+	}
+
+	ASSERT_GT(ret, 0);
+	handle = (struct file_handle *)buf;
+
+	/*
+	 * Namespace should be inactive but still in tree due to bind mount.
+	 * Reopening should fail with ENOENT (inactive) not ESTALE (not in tree).
+	 */
+	fd = open_by_handle_at(FD_NSFS_ROOT, handle, O_RDONLY);
+	if (fd >= 0) {
+		/* Unexpected - namespace shouldn't be active */
+		close(fd);
+		TH_LOG("Warning: Namespace still active");
+	} else {
+		/* Should be ENOENT (inactive) since bind mount keeps it in tree */
+		if (errno != ENOENT && errno != ESTALE) {
+			TH_LOG("Unexpected error: %d", errno);
+		}
+	}
+
+	/* Cleanup */
+	umount(tmpfile);
+	unlink(tmpfile);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 27/50] selftests/namespaces: ninth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (25 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 26/50] selftests/namespaces: eighth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 28/50] selftests/namespaces: tenth " Christian Brauner
                   ` (26 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test multi-level hierarchy (3+ levels deep).
Grandparent → Parent → Child
When child is active, both parent AND grandparent should be active.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 167 +++++++++++++++++++++
 1 file changed, 167 insertions(+)
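
Note for readers: the behaviour this test expects can be pictured with
a small userspace model. This is a toy sketch, not the kernel's
implementation. It assumes that the first active reference on a
namespace takes one active reference on its owner, and that dropping
the last one releases it again. All toy_ns_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ns {
	struct toy_ns *owner;	/* owning user namespace, NULL at the top */
	unsigned int active;	/* active reference count */
};

/* Take an active reference; a 0 -> 1 transition also pins the owner. */
static void toy_ns_get_active(struct toy_ns *ns)
{
	for (; ns; ns = ns->owner) {
		if (ns->active++ != 0)
			break;	/* was already active, owner already pinned */
	}
}

/* Drop an active reference; a 1 -> 0 transition also unpins the owner. */
static void toy_ns_put_active(struct toy_ns *ns)
{
	for (; ns; ns = ns->owner) {
		assert(ns->active > 0);
		if (--ns->active != 0)
			break;	/* still active, owner stays pinned */
	}
}

static int toy_ns_is_active(const struct toy_ns *ns)
{
	return ns->active > 0;
}
```

In this model a single reference on the child makes the parent and the
grandparent active, which is the property the test asserts through
open_by_handle_at().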

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 66c9908d4977..87b435b64b45 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -1009,4 +1009,171 @@ TEST(ns_bind_mount_keeps_in_tree)
 	unlink(tmpfile);
 }
 
+/*
+ * Test multi-level hierarchy (3+ levels deep).
+ * Grandparent → Parent → Child
+ * When child is active, both parent AND grandparent should be active.
+ */
+TEST(ns_multilevel_hierarchy)
+{
+	struct file_handle *gp_handle, *p_handle, *c_handle;
+	int ret, pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 gp_id, p_id, c_id;
+	char gp_buf[sizeof(*gp_handle) + MAX_HANDLE_SZ];
+	char p_buf[sizeof(*p_handle) + MAX_HANDLE_SZ];
+	char c_buf[sizeof(*c_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+
+		/* Create grandparent user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int gp_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (gp_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(gp_fd, NS_GET_ID, &gp_id) < 0) {
+			close(gp_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(gp_fd);
+
+		/* Create parent user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int p_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (p_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(p_fd, NS_GET_ID, &p_id) < 0) {
+			close(p_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(p_fd);
+
+		/* Create child user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int c_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (c_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(c_fd, NS_GET_ID, &c_id) < 0) {
+			close(c_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(c_fd);
+
+		/* Send all three namespace IDs */
+		write(pipefd[1], &gp_id, sizeof(gp_id));
+		write(pipefd[1], &p_id, sizeof(p_id));
+		write(pipefd[1], &c_id, sizeof(c_id));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+
+	/* Read all three namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &gp_id, sizeof(gp_id));
+	if (ret != sizeof(gp_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read grandparent namespace ID from child");
+	}
+
+	ret = read(pipefd[0], &p_id, sizeof(p_id));
+	if (ret != sizeof(p_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read parent namespace ID from child");
+	}
+
+	ret = read(pipefd[0], &c_id, sizeof(c_id));
+	close(pipefd[0]);
+	if (ret != sizeof(c_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read child namespace ID from child");
+	}
+
+	/* Construct file handles from namespace IDs */
+	gp_handle = (struct file_handle *)gp_buf;
+	gp_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	gp_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *gp_fh = (struct nsfs_file_handle *)gp_handle->f_handle;
+	gp_fh->ns_id = gp_id;
+	gp_fh->ns_type = 0;
+	gp_fh->ns_inum = 0;
+
+	p_handle = (struct file_handle *)p_buf;
+	p_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	p_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *p_fh = (struct nsfs_file_handle *)p_handle->f_handle;
+	p_fh->ns_id = p_id;
+	p_fh->ns_type = 0;
+	p_fh->ns_inum = 0;
+
+	c_handle = (struct file_handle *)c_buf;
+	c_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	c_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *c_fh = (struct nsfs_file_handle *)c_handle->f_handle;
+	c_fh->ns_id = c_id;
+	c_fh->ns_type = 0;
+	c_fh->ns_inum = 0;
+
+	/* Open child before process exits */
+	int c_fd = open_by_handle_at(FD_NSFS_ROOT, c_handle, O_RDONLY);
+	if (c_fd < 0) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open child namespace");
+	}
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(c_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/*
+	 * With 3-level hierarchy and child active:
+	 * - Child is active (we hold fd)
+	 * - Parent should be active (propagated from child)
+	 * - Grandparent should be active (propagated from parent)
+	 */
+	TH_LOG("Testing parent active when child is active");
+	int p_fd = open_by_handle_at(FD_NSFS_ROOT, p_handle, O_RDONLY);
+	ASSERT_GE(p_fd, 0);
+
+	TH_LOG("Testing grandparent active when child is active");
+	int gp_fd = open_by_handle_at(FD_NSFS_ROOT, gp_handle, O_RDONLY);
+	ASSERT_GE(gp_fd, 0);
+
+	close(c_fd);
+	close(p_fd);
+	close(gp_fd);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 28/50] selftests/namespaces: tenth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (26 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 27/50] selftests/namespaces: ninth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 29/50] selftests/namespaces: eleventh " Christian Brauner
                   ` (25 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test multiple children sharing same parent.
Parent should stay active as long as ANY child is active.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 177 +++++++++++++++++++++
 1 file changed, 177 insertions(+)
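
Note for readers: the parent side of these tests reads each fixed-size
namespace ID from the pipe with a single read(). That is likely
sufficient for 8-byte writes on a pipe. A more defensive variant could
loop until the full value has arrived, as in the sketch below. The
read_full() helper is hypothetical and not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read exactly @len bytes from @fd into @dst.
 * Returns 0 on success, -1 on error or if EOF arrives early.
 */
static int read_full(int fd, void *dst, size_t len)
{
	char *p = dst;

	assert(dst != NULL);
	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			return -1;	/* writer closed the pipe early */
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A caller would replace each read(pipefd[0], &id, sizeof(id)) with
read_full(pipefd[0], &id, sizeof(id)) and skip the test on a nonzero
return.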

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 87b435b64b45..15d001df981c 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -1176,4 +1176,181 @@ TEST(ns_multilevel_hierarchy)
 	close(gp_fd);
 }
 
+/*
+ * Test multiple children sharing same parent.
+ * Parent should stay active as long as ANY child is active.
+ */
+TEST(ns_multiple_children_same_parent)
+{
+	struct file_handle *p_handle, *c1_handle, *c2_handle;
+	int ret, pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 p_id, c1_id, c2_id;
+	char p_buf[sizeof(*p_handle) + MAX_HANDLE_SZ];
+	char c1_buf[sizeof(*c1_handle) + MAX_HANDLE_SZ];
+	char c2_buf[sizeof(*c2_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+
+		/* Create parent user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int p_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (p_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(p_fd, NS_GET_ID, &p_id) < 0) {
+			close(p_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(p_fd);
+
+		/* Create first child user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int c1_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (c1_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(c1_fd, NS_GET_ID, &c1_id) < 0) {
+			close(c1_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(c1_fd);
+
+		/* Ideally: return to the parent user namespace and create a sibling. */
+		/* That is not easy to do here, so unshare a network namespace instead.
+		 * Note that this namespace is owned by the first child user namespace. */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int c2_fd = open("/proc/self/ns/net", O_RDONLY);
+		if (c2_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(c2_fd, NS_GET_ID, &c2_id) < 0) {
+			close(c2_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(c2_fd);
+
+		/* Send all namespace IDs */
+		write(pipefd[1], &p_id, sizeof(p_id));
+		write(pipefd[1], &c1_id, sizeof(c1_id));
+		write(pipefd[1], &c2_id, sizeof(c2_id));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+
+	/* Read all three namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &p_id, sizeof(p_id));
+	if (ret != sizeof(p_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read parent namespace ID");
+	}
+
+	ret = read(pipefd[0], &c1_id, sizeof(c1_id));
+	if (ret != sizeof(c1_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read first child namespace ID");
+	}
+
+	ret = read(pipefd[0], &c2_id, sizeof(c2_id));
+	close(pipefd[0]);
+	if (ret != sizeof(c2_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read second child namespace ID");
+	}
+
+	/* Construct file handles from namespace IDs */
+	p_handle = (struct file_handle *)p_buf;
+	p_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	p_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *p_fh = (struct nsfs_file_handle *)p_handle->f_handle;
+	p_fh->ns_id = p_id;
+	p_fh->ns_type = 0;
+	p_fh->ns_inum = 0;
+
+	c1_handle = (struct file_handle *)c1_buf;
+	c1_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	c1_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *c1_fh = (struct nsfs_file_handle *)c1_handle->f_handle;
+	c1_fh->ns_id = c1_id;
+	c1_fh->ns_type = 0;
+	c1_fh->ns_inum = 0;
+
+	c2_handle = (struct file_handle *)c2_buf;
+	c2_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	c2_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *c2_fh = (struct nsfs_file_handle *)c2_handle->f_handle;
+	c2_fh->ns_id = c2_id;
+	c2_fh->ns_type = 0;
+	c2_fh->ns_inum = 0;
+
+	/* Open both children before process exits */
+	int c1_fd = open_by_handle_at(FD_NSFS_ROOT, c1_handle, O_RDONLY);
+	int c2_fd = open_by_handle_at(FD_NSFS_ROOT, c2_handle, O_RDONLY);
+
+	if (c1_fd < 0 || c2_fd < 0) {
+		if (c1_fd >= 0) close(c1_fd);
+		if (c2_fd >= 0) close(c2_fd);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open child namespaces");
+	}
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(c1_fd);
+		close(c2_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/* Parent should be active (both children active) */
+	TH_LOG("Both children active - parent should be active");
+	int p_fd = open_by_handle_at(FD_NSFS_ROOT, p_handle, O_RDONLY);
+	ASSERT_GE(p_fd, 0);
+	close(p_fd);
+
+	/* Close first child - parent should STILL be active */
+	TH_LOG("Closing first child - parent should still be active");
+	close(c1_fd);
+	p_fd = open_by_handle_at(FD_NSFS_ROOT, p_handle, O_RDONLY);
+	ASSERT_GE(p_fd, 0);
+	close(p_fd);
+
+	/* Close second child - NOW parent should become inactive */
+	TH_LOG("Closing second child - parent should become inactive");
+	close(c2_fd);
+	p_fd = open_by_handle_at(FD_NSFS_ROOT, p_handle, O_RDONLY);
+	if (p_fd >= 0) {
+		close(p_fd);
+		TH_LOG("Warning: Parent still active after all children inactive");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 29/50] selftests/namespaces: eleventh active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (27 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 28/50] selftests/namespaces: tenth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 30/50] selftests/namespaces: twelfth " Christian Brauner
                   ` (24 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that different namespace types with same owner all contribute
active references to the owning user namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 178 +++++++++++++++++++++
 1 file changed, 178 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 15d001df981c..3c2f99b25067 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -1353,4 +1353,182 @@ TEST(ns_multiple_children_same_parent)
 	}
 }
 
+/*
+ * Test that different namespace types with same owner all contribute
+ * active references to the owning user namespace.
+ */
+TEST(ns_different_types_same_owner)
+{
+	struct file_handle *u_handle, *n_handle, *ut_handle;
+	int ret, pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 u_id, n_id, ut_id;
+	char u_buf[sizeof(*u_handle) + MAX_HANDLE_SZ];
+	char n_buf[sizeof(*n_handle) + MAX_HANDLE_SZ];
+	char ut_buf[sizeof(*ut_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+
+		/* Create user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int u_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (u_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(u_fd, NS_GET_ID, &u_id) < 0) {
+			close(u_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(u_fd);
+
+		/* Create network namespace (owned by user namespace) */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int n_fd = open("/proc/self/ns/net", O_RDONLY);
+		if (n_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(n_fd, NS_GET_ID, &n_id) < 0) {
+			close(n_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(n_fd);
+
+		/* Create UTS namespace (also owned by user namespace) */
+		if (unshare(CLONE_NEWUTS) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int ut_fd = open("/proc/self/ns/uts", O_RDONLY);
+		if (ut_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(ut_fd, NS_GET_ID, &ut_id) < 0) {
+			close(ut_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(ut_fd);
+
+		/* Send all namespace IDs */
+		write(pipefd[1], &u_id, sizeof(u_id));
+		write(pipefd[1], &n_id, sizeof(n_id));
+		write(pipefd[1], &ut_id, sizeof(ut_id));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+
+	/* Read all three namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &u_id, sizeof(u_id));
+	if (ret != sizeof(u_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read user namespace ID");
+	}
+
+	ret = read(pipefd[0], &n_id, sizeof(n_id));
+	if (ret != sizeof(n_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read network namespace ID");
+	}
+
+	ret = read(pipefd[0], &ut_id, sizeof(ut_id));
+	close(pipefd[0]);
+	if (ret != sizeof(ut_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read UTS namespace ID");
+	}
+
+	/* Construct file handles from namespace IDs */
+	u_handle = (struct file_handle *)u_buf;
+	u_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	u_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *u_fh = (struct nsfs_file_handle *)u_handle->f_handle;
+	u_fh->ns_id = u_id;
+	u_fh->ns_type = 0;
+	u_fh->ns_inum = 0;
+
+	n_handle = (struct file_handle *)n_buf;
+	n_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	n_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *n_fh = (struct nsfs_file_handle *)n_handle->f_handle;
+	n_fh->ns_id = n_id;
+	n_fh->ns_type = 0;
+	n_fh->ns_inum = 0;
+
+	ut_handle = (struct file_handle *)ut_buf;
+	ut_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	ut_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *ut_fh = (struct nsfs_file_handle *)ut_handle->f_handle;
+	ut_fh->ns_id = ut_id;
+	ut_fh->ns_type = 0;
+	ut_fh->ns_inum = 0;
+
+	/* Open both non-user namespaces before the child exits */
+	int n_fd = open_by_handle_at(FD_NSFS_ROOT, n_handle, O_RDONLY);
+	int ut_fd = open_by_handle_at(FD_NSFS_ROOT, ut_handle, O_RDONLY);
+
+	if (n_fd < 0 || ut_fd < 0) {
+		if (n_fd >= 0) close(n_fd);
+		if (ut_fd >= 0) close(ut_fd);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open namespaces");
+	}
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(n_fd);
+		close(ut_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/*
+	 * Both network and UTS namespaces are active.
+	 * User namespace should be active (gets 2 active refs).
+	 */
+	TH_LOG("Both net and uts active - user namespace should be active");
+	int u_fd = open_by_handle_at(FD_NSFS_ROOT, u_handle, O_RDONLY);
+	ASSERT_GE(u_fd, 0);
+	close(u_fd);
+
+	/* Close network namespace - user namespace should STILL be active */
+	TH_LOG("Closing network ns - user ns should still be active (uts still active)");
+	close(n_fd);
+	u_fd = open_by_handle_at(FD_NSFS_ROOT, u_handle, O_RDONLY);
+	ASSERT_GE(u_fd, 0);
+	close(u_fd);
+
+	/* Close UTS namespace - user namespace should become inactive */
+	TH_LOG("Closing uts ns - user ns should become inactive");
+	close(ut_fd);
+	u_fd = open_by_handle_at(FD_NSFS_ROOT, u_handle, O_RDONLY);
+	if (u_fd >= 0) {
+		close(u_fd);
+		TH_LOG("Warning: User namespace still active after all owned namespaces inactive");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 30/50] selftests/namespaces: twelfth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (28 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 29/50] selftests/namespaces: eleventh " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 31/50] selftests/namespaces: thirteenth " Christian Brauner
                   ` (23 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test hierarchical propagation through a deep namespace hierarchy.
Create: init_user_ns -> user_A -> user_B -> net_ns
While net_ns is active, both user_A and user_B must stay active.
This verifies that the conditional recursion in __ns_ref_active_put()
works.
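
The propagation rule exercised here can be sketched in userspace. This is
a toy model, not the kernel implementation: struct toy_ns and the toy_*
helpers are hypothetical names, and the body of __ns_ref_active_put() is
not part of this patch. The sketch assumes that a namespace pins its owner
only when its own active count goes 0 -> 1 and releases the owner only
when the count goes 1 -> 0, which is the behavior the test expects.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a namespace; not the kernel's data structure. */
struct toy_ns {
	struct toy_ns *owner;	/* owning user namespace, NULL at the root */
	unsigned int active;	/* active reference count */
};

/* Take an active reference; only the 0 -> 1 transition pins the owner. */
static void toy_active_get(struct toy_ns *ns)
{
	for (; ns; ns = ns->owner) {
		if (ns->active++ > 0)
			break;	/* already active: owner is already pinned */
	}
}

/* Drop an active reference; only the 1 -> 0 transition walks upward. */
static void toy_active_put(struct toy_ns *ns)
{
	for (; ns; ns = ns->owner) {
		assert(ns->active > 0);
		if (--ns->active > 0)
			break;	/* still active: leave the owner alone */
	}
}

/* A namespace counts as active while it holds at least one reference. */
static int toy_is_active(const struct toy_ns *ns)
{
	return ns->active > 0;
}
```

In this model a single toy_active_get() on net_ns activates user_B and
user_A as well. Dropping the net_ns reference only deactivates user_B if
nothing else, such as an open file descriptor, still holds user_B active.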

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 182 +++++++++++++++++++++
 1 file changed, 182 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 3c2f99b25067..63cc88fe5cc1 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -1531,4 +1531,186 @@ TEST(ns_different_types_same_owner)
 	}
 }
 
+/*
+ * Test hierarchical propagation with deep namespace hierarchy.
+ * Create: init_user_ns -> user_A -> user_B -> net_ns
+ * When net_ns is active, both user_A and user_B should be active.
+ * This verifies the conditional recursion in __ns_ref_active_put() works.
+ */
+TEST(ns_deep_hierarchy_propagation)
+{
+	struct file_handle *ua_handle, *ub_handle, *net_handle;
+	int ret, pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 ua_id, ub_id, net_id;
+	char ua_buf[sizeof(*ua_handle) + MAX_HANDLE_SZ];
+	char ub_buf[sizeof(*ub_handle) + MAX_HANDLE_SZ];
+	char net_buf[sizeof(*net_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+
+		/* Create user_A -> user_B -> net hierarchy */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int ua_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (ua_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(ua_fd, NS_GET_ID, &ua_id) < 0) {
+			close(ua_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(ua_fd);
+
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int ub_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (ub_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(ub_fd, NS_GET_ID, &ub_id) < 0) {
+			close(ub_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(ub_fd);
+
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int net_fd = open("/proc/self/ns/net", O_RDONLY);
+		if (net_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(net_fd, NS_GET_ID, &net_id) < 0) {
+			close(net_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(net_fd);
+
+		/* Send all three namespace IDs */
+		write(pipefd[1], &ua_id, sizeof(ua_id));
+		write(pipefd[1], &ub_id, sizeof(ub_id));
+		write(pipefd[1], &net_id, sizeof(net_id));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+
+	/* Read all three namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &ua_id, sizeof(ua_id));
+	if (ret != sizeof(ua_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read user_A namespace ID");
+	}
+
+	ret = read(pipefd[0], &ub_id, sizeof(ub_id));
+	if (ret != sizeof(ub_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read user_B namespace ID");
+	}
+
+	ret = read(pipefd[0], &net_id, sizeof(net_id));
+	close(pipefd[0]);
+	if (ret != sizeof(net_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read network namespace ID");
+	}
+
+	/* Construct file handles from namespace IDs */
+	ua_handle = (struct file_handle *)ua_buf;
+	ua_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	ua_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *ua_fh = (struct nsfs_file_handle *)ua_handle->f_handle;
+	ua_fh->ns_id = ua_id;
+	ua_fh->ns_type = 0;
+	ua_fh->ns_inum = 0;
+
+	ub_handle = (struct file_handle *)ub_buf;
+	ub_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	ub_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *ub_fh = (struct nsfs_file_handle *)ub_handle->f_handle;
+	ub_fh->ns_id = ub_id;
+	ub_fh->ns_type = 0;
+	ub_fh->ns_inum = 0;
+
+	net_handle = (struct file_handle *)net_buf;
+	net_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	net_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *net_fh = (struct nsfs_file_handle *)net_handle->f_handle;
+	net_fh->ns_id = net_id;
+	net_fh->ns_type = 0;
+	net_fh->ns_inum = 0;
+
+	/* Open net_ns before child exits to keep it active */
+	int net_fd = open_by_handle_at(FD_NSFS_ROOT, net_handle, O_RDONLY);
+	if (net_fd < 0) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open network namespace");
+	}
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(net_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/* With net_ns active, both user_A and user_B should be active */
+	TH_LOG("Testing user_B active (net_ns active causes propagation)");
+	int ub_fd = open_by_handle_at(FD_NSFS_ROOT, ub_handle, O_RDONLY);
+	ASSERT_GE(ub_fd, 0);
+
+	TH_LOG("Testing user_A active (propagated through user_B)");
+	int ua_fd = open_by_handle_at(FD_NSFS_ROOT, ua_handle, O_RDONLY);
+	ASSERT_GE(ua_fd, 0);
+
+	/* Close net_ns - user_B should stay active (we hold direct ref) */
+	TH_LOG("Closing net_ns, user_B should remain active (direct ref held)");
+	close(net_fd);
+	int ub_fd2 = open_by_handle_at(FD_NSFS_ROOT, ub_handle, O_RDONLY);
+	ASSERT_GE(ub_fd2, 0);
+	close(ub_fd2);
+
+	/* Close user_B - user_A should stay active (we hold direct ref) */
+	TH_LOG("Closing user_B, user_A should remain active (direct ref held)");
+	close(ub_fd);
+	int ua_fd2 = open_by_handle_at(FD_NSFS_ROOT, ua_handle, O_RDONLY);
+	ASSERT_GE(ua_fd2, 0);
+	close(ua_fd2);
+
+	/* Close user_A - everything should become inactive */
+	TH_LOG("Closing user_A, all should become inactive");
+	close(ua_fd);
+
+	/* All should now be inactive */
+	ua_fd = open_by_handle_at(FD_NSFS_ROOT, ua_handle, O_RDONLY);
+	if (ua_fd >= 0) {
+		close(ua_fd);
+		TH_LOG("Warning: user_A still active");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 31/50] selftests/namespaces: thirteenth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (29 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 30/50] selftests/namespaces: twelfth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 32/50] selftests/namespaces: fourteenth " Christian Brauner
                   ` (22 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that parent stays active as long as ANY child is active.
Create parent user namespace with two child net namespaces.
Parent should remain active until BOTH children are inactive.
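
This test keeps the child alive with a second pipe until the parent has
opened the namespaces. The handshake on its own can be sketched as below.
It is a minimal illustration under stated assumptions: handshake_demo()
and DEMO_VALUE are hypothetical names, and DEMO_VALUE merely stands in
for a namespace ID sent over the data pipe.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_VALUE 0x1234ULL	/* stands in for a namespace ID */

/*
 * Fork a child that reports a value and then blocks until released.
 * Returns 0 if the child was still alive after the parent read the
 * value and exited cleanly once released, -1 otherwise.
 */
static int handshake_demo(uint64_t *value_out)
{
	int datapipe[2], syncpipe[2], status = 0;
	int got_value, child_alive;
	uint64_t value = 0;
	char go = 'G';
	pid_t pid;

	if (pipe(datapipe) < 0)
		return -1;
	if (pipe(syncpipe) < 0) {
		close(datapipe[0]);
		close(datapipe[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(datapipe[0]);
		close(datapipe[1]);
		close(syncpipe[0]);
		close(syncpipe[1]);
		return -1;
	}

	if (pid == 0) {
		uint64_t v = DEMO_VALUE;
		char c;

		close(datapipe[0]);
		close(syncpipe[1]);
		if (write(datapipe[1], &v, sizeof(v)) != (ssize_t)sizeof(v))
			_exit(1);
		close(datapipe[1]);
		/* Block here until the parent sends the release byte. */
		if (read(syncpipe[0], &c, 1) != 1)
			_exit(1);
		_exit(0);
	}

	close(datapipe[1]);
	close(syncpipe[0]);

	got_value = read(datapipe[0], &value, sizeof(value)) ==
		    (ssize_t)sizeof(value);
	close(datapipe[0]);

	/* waitpid() with WNOHANG returns 0 while the child has not exited. */
	child_alive = waitpid(pid, &status, WNOHANG) == 0;

	/* Work that needs the child alive would go here. */

	/* Release the child, then reap it. */
	if (child_alive && write(syncpipe[1], &go, 1) != 1)
		got_value = 0;
	close(syncpipe[1]);
	if (child_alive && waitpid(pid, &status, 0) != pid)
		return -1;

	if (!got_value || !child_alive)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	*value_out = value;
	return 0;
}
```

Because the child cannot leave its read() before the release byte is
written, anything the parent does between reading the value and writing
the byte is guaranteed to happen while the child's namespaces still have
a live task in them.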

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 201 +++++++++++++++++++++
 1 file changed, 201 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 63cc88fe5cc1..4c077223b05c 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -1713,4 +1713,205 @@ TEST(ns_deep_hierarchy_propagation)
 	}
 }
 
+/*
+ * Test that parent stays active as long as ANY child is active.
+ * Create parent user namespace with two child net namespaces.
+ * Parent should remain active until BOTH children are inactive.
+ */
+TEST(ns_parent_multiple_children_refcount)
+{
+	struct file_handle *parent_handle, *net1_handle, *net2_handle;
+	int ret, pipefd[2], syncpipe[2];
+	pid_t pid;
+	int status;
+	__u64 p_id, n1_id, n2_id;
+	char p_buf[sizeof(*parent_handle) + MAX_HANDLE_SZ];
+	char n1_buf[sizeof(*net1_handle) + MAX_HANDLE_SZ];
+	char n2_buf[sizeof(*net2_handle) + MAX_HANDLE_SZ];
+	char sync_byte;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	ASSERT_EQ(pipe(syncpipe), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+		close(syncpipe[1]);
+
+		/* Create parent user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int p_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (p_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(p_fd, NS_GET_ID, &p_id) < 0) {
+			close(p_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(p_fd);
+
+		/* Create first network namespace */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+
+		int n1_fd = open("/proc/self/ns/net", O_RDONLY);
+		if (n1_fd < 0) {
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+		if (ioctl(n1_fd, NS_GET_ID, &n1_id) < 0) {
+			close(n1_fd);
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+		/* Keep n1_fd open so first namespace stays active */
+
+		/* Create second network namespace */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(n1_fd);
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+
+		int n2_fd = open("/proc/self/ns/net", O_RDONLY);
+		if (n2_fd < 0) {
+			close(n1_fd);
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+		if (ioctl(n2_fd, NS_GET_ID, &n2_id) < 0) {
+			close(n1_fd);
+			close(n2_fd);
+			close(pipefd[1]);
+			close(syncpipe[0]);
+			exit(1);
+		}
+		/* Keep both n1_fd and n2_fd open */
+
+		/* Send all namespace IDs */
+		write(pipefd[1], &p_id, sizeof(p_id));
+		write(pipefd[1], &n1_id, sizeof(n1_id));
+		write(pipefd[1], &n2_id, sizeof(n2_id));
+		close(pipefd[1]);
+
+		/* Wait for parent to signal before exiting */
+		read(syncpipe[0], &sync_byte, 1);
+		close(syncpipe[0]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+	close(syncpipe[0]);
+
+	/* Read all three namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &p_id, sizeof(p_id));
+	if (ret != sizeof(p_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read parent namespace ID");
+	}
+
+	ret = read(pipefd[0], &n1_id, sizeof(n1_id));
+	if (ret != sizeof(n1_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read first network namespace ID");
+	}
+
+	ret = read(pipefd[0], &n2_id, sizeof(n2_id));
+	close(pipefd[0]);
+	if (ret != sizeof(n2_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read second network namespace ID");
+	}
+
+	/* Construct file handles from namespace IDs */
+	parent_handle = (struct file_handle *)p_buf;
+	parent_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	parent_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *p_fh = (struct nsfs_file_handle *)parent_handle->f_handle;
+	p_fh->ns_id = p_id;
+	p_fh->ns_type = 0;
+	p_fh->ns_inum = 0;
+
+	net1_handle = (struct file_handle *)n1_buf;
+	net1_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	net1_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *n1_fh = (struct nsfs_file_handle *)net1_handle->f_handle;
+	n1_fh->ns_id = n1_id;
+	n1_fh->ns_type = 0;
+	n1_fh->ns_inum = 0;
+
+	net2_handle = (struct file_handle *)n2_buf;
+	net2_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	net2_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *n2_fh = (struct nsfs_file_handle *)net2_handle->f_handle;
+	n2_fh->ns_id = n2_id;
+	n2_fh->ns_type = 0;
+	n2_fh->ns_inum = 0;
+
+	/* Open both net namespaces while child is still alive */
+	int n1_fd = open_by_handle_at(FD_NSFS_ROOT, net1_handle, O_RDONLY);
+	int n2_fd = open_by_handle_at(FD_NSFS_ROOT, net2_handle, O_RDONLY);
+	if (n1_fd < 0 || n2_fd < 0) {
+		if (n1_fd >= 0) close(n1_fd);
+		if (n2_fd >= 0) close(n2_fd);
+		sync_byte = 'G';
+		write(syncpipe[1], &sync_byte, 1);
+		close(syncpipe[1]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open net namespaces");
+	}
+
+	/* Signal child that we have opened the namespaces */
+	sync_byte = 'G';
+	write(syncpipe[1], &sync_byte, 1);
+	close(syncpipe[1]);
+
+	/* Wait for child to exit */
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(n1_fd);
+		close(n2_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/* Parent should be active (has 2 active children) */
+	TH_LOG("Both net namespaces active - parent should be active");
+	int p_fd = open_by_handle_at(FD_NSFS_ROOT, parent_handle, O_RDONLY);
+	ASSERT_GE(p_fd, 0);
+	close(p_fd);
+
+	/* Close first net namespace - parent should STILL be active */
+	TH_LOG("Closing first net ns - parent should still be active");
+	close(n1_fd);
+	p_fd = open_by_handle_at(FD_NSFS_ROOT, parent_handle, O_RDONLY);
+	ASSERT_GE(p_fd, 0);
+	close(p_fd);
+
+	/* Close second net namespace - parent should become inactive */
+	TH_LOG("Closing second net ns - parent should become inactive");
+	close(n2_fd);
+	p_fd = open_by_handle_at(FD_NSFS_ROOT, parent_handle, O_RDONLY);
+	if (p_fd >= 0) {
+		close(p_fd);
+		TH_LOG("Warning: Parent still active after all children inactive");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 32/50] selftests/namespaces: fourteenth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (30 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 31/50] selftests/namespaces: thirteenth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 33/50] selftests/namespaces: fifteenth " Christian Brauner
                   ` (21 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that a user namespace that is itself a child also propagates its
active reference correctly. Create user_A -> user_B and verify that
user_A is active while user_B is active. This differs from the cases
where the child is a non-user namespace.
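
The active reference tests build each file handle by hand from a
namespace ID. A helper along the following lines could factor that out.
This is only a sketch under assumptions: ns_handle_buf, ns_handle_init(),
struct demo_nsfs_handle and DEMO_FILEID_NSFS are hypothetical names. The
payload fields mirror the ns_id, ns_type and ns_inum members the tests
set, but their widths and the handle type value are placeholders here;
real code would use struct nsfs_file_handle and FILEID_NSFS from the
headers of this series.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for struct file_handle and MAX_HANDLE_SZ */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>

#ifndef MAX_HANDLE_SZ
#define MAX_HANDLE_SZ 128
#endif

/* Placeholder for FILEID_NSFS; the value is an assumption. */
#define DEMO_FILEID_NSFS 0xf1

/* Stand-in for struct nsfs_file_handle; field widths are assumed. */
struct demo_nsfs_handle {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/* Storage that is large enough and suitably aligned for a handle. */
union ns_handle_buf {
	struct file_handle fh;
	char raw[sizeof(struct file_handle) + MAX_HANDLE_SZ];
};

/* Fill buf with a handle that identifies a namespace by ID alone. */
static struct file_handle *ns_handle_init(union ns_handle_buf *buf,
					  uint64_t ns_id)
{
	/* ns_type and ns_inum stay 0, as in the tests. */
	struct demo_nsfs_handle payload = { .ns_id = ns_id };

	assert(buf != NULL);
	memset(buf, 0, sizeof(*buf));
	buf->fh.handle_bytes = sizeof(payload);
	buf->fh.handle_type = DEMO_FILEID_NSFS;
	memcpy(buf->fh.f_handle, &payload, sizeof(payload));
	return &buf->fh;
}
```

The returned pointer can be passed to open_by_handle_at() in place of the
per-test casts. Copying the payload with memcpy() avoids relying on the
alignment of a plain char array.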

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 138 +++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 4c077223b05c..1eb4dc07e924 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -1914,4 +1914,142 @@ TEST(ns_parent_multiple_children_refcount)
 	}
 }
 
+/*
+ * Test that user namespace as a child also propagates correctly.
+ * Create user_A -> user_B, verify when user_B is active that user_A
+ * is also active. This is different from non-user namespace children.
+ */
+TEST(ns_userns_child_propagation)
+{
+	struct file_handle *ua_handle, *ub_handle;
+	int ret, pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 ua_id, ub_id;
+	char ua_buf[sizeof(*ua_handle) + MAX_HANDLE_SZ];
+	char ub_buf[sizeof(*ub_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+
+		/* Create user_A */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int ua_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (ua_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(ua_fd, NS_GET_ID, &ua_id) < 0) {
+			close(ua_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(ua_fd);
+
+		/* Create user_B (child of user_A) */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int ub_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (ub_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(ub_fd, NS_GET_ID, &ub_id) < 0) {
+			close(ub_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(ub_fd);
+
+		/* Send both namespace IDs */
+		write(pipefd[1], &ua_id, sizeof(ua_id));
+		write(pipefd[1], &ub_id, sizeof(ub_id));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+
+	/* Read both namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &ua_id, sizeof(ua_id));
+	if (ret != sizeof(ua_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read user_A namespace ID");
+	}
+
+	ret = read(pipefd[0], &ub_id, sizeof(ub_id));
+	close(pipefd[0]);
+	if (ret != sizeof(ub_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read user_B namespace ID");
+	}
+
+	/* Construct file handles from namespace IDs */
+	ua_handle = (struct file_handle *)ua_buf;
+	ua_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	ua_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *ua_fh = (struct nsfs_file_handle *)ua_handle->f_handle;
+	ua_fh->ns_id = ua_id;
+	ua_fh->ns_type = 0;
+	ua_fh->ns_inum = 0;
+
+	ub_handle = (struct file_handle *)ub_buf;
+	ub_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	ub_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *ub_fh = (struct nsfs_file_handle *)ub_handle->f_handle;
+	ub_fh->ns_id = ub_id;
+	ub_fh->ns_type = 0;
+	ub_fh->ns_inum = 0;
+
+	/* Open user_B before child exits */
+	int ub_fd = open_by_handle_at(FD_NSFS_ROOT, ub_handle, O_RDONLY);
+	if (ub_fd < 0) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open user_B");
+	}
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(ub_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/* With user_B active, user_A should also be active */
+	TH_LOG("Testing user_A active when child user_B is active");
+	int ua_fd = open_by_handle_at(FD_NSFS_ROOT, ua_handle, O_RDONLY);
+	ASSERT_GE(ua_fd, 0);
+
+	/* Close user_B */
+	TH_LOG("Closing user_B");
+	close(ub_fd);
+
+	/* user_A should remain active (we hold direct ref) */
+	int ua_fd2 = open_by_handle_at(FD_NSFS_ROOT, ua_handle, O_RDONLY);
+	ASSERT_GE(ua_fd2, 0);
+	close(ua_fd2);
+
+	/* Close user_A - should become inactive */
+	TH_LOG("Closing user_A - should become inactive");
+	close(ua_fd);
+
+	ua_fd = open_by_handle_at(FD_NSFS_ROOT, ua_handle, O_RDONLY);
+	if (ua_fd >= 0) {
+		close(ua_fd);
+		TH_LOG("Warning: user_A still active");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 33/50] selftests/namespaces: fifteenth active reference count tests
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (31 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 32/50] selftests/namespaces: fourteenth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 34/50] selftests/namespaces: add listns() wrapper Christian Brauner
                   ` (20 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test different namespace types (net, uts) all contributing
active references to the same owning user namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/ns_active_ref_test.c      | 171 +++++++++++++++++++++
 1 file changed, 171 insertions(+)

diff --git a/tools/testing/selftests/namespaces/ns_active_ref_test.c b/tools/testing/selftests/namespaces/ns_active_ref_test.c
index 1eb4dc07e924..8b1553be6881 100644
--- a/tools/testing/selftests/namespaces/ns_active_ref_test.c
+++ b/tools/testing/selftests/namespaces/ns_active_ref_test.c
@@ -2052,4 +2052,175 @@ TEST(ns_userns_child_propagation)
 	}
 }
 
+/*
+ * Test different namespace types (net, uts) all contributing
+ * active references to the same owning user namespace.
+ */
+TEST(ns_mixed_types_same_owner)
+{
+	struct file_handle *user_handle, *net_handle, *uts_handle;
+	int ret, pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 u_id, n_id, ut_id;
+	char u_buf[sizeof(*user_handle) + MAX_HANDLE_SZ];
+	char n_buf[sizeof(*net_handle) + MAX_HANDLE_SZ];
+	char ut_buf[sizeof(*uts_handle) + MAX_HANDLE_SZ];
+
+	ASSERT_EQ(pipe(pipefd), 0);
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipefd[0]);
+
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int u_fd = open("/proc/self/ns/user", O_RDONLY);
+		if (u_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(u_fd, NS_GET_ID, &u_id) < 0) {
+			close(u_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(u_fd);
+
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int n_fd = open("/proc/self/ns/net", O_RDONLY);
+		if (n_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(n_fd, NS_GET_ID, &n_id) < 0) {
+			close(n_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(n_fd);
+
+		if (unshare(CLONE_NEWUTS) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		int ut_fd = open("/proc/self/ns/uts", O_RDONLY);
+		if (ut_fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+		if (ioctl(ut_fd, NS_GET_ID, &ut_id) < 0) {
+			close(ut_fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(ut_fd);
+
+		/* Send all namespace IDs */
+		write(pipefd[1], &u_id, sizeof(u_id));
+		write(pipefd[1], &n_id, sizeof(n_id));
+		write(pipefd[1], &ut_id, sizeof(ut_id));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	close(pipefd[1]);
+
+	/* Read all three namespace IDs - fixed size, no parsing needed */
+	ret = read(pipefd[0], &u_id, sizeof(u_id));
+	if (ret != sizeof(u_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read user namespace ID");
+	}
+
+	ret = read(pipefd[0], &n_id, sizeof(n_id));
+	if (ret != sizeof(n_id)) {
+		close(pipefd[0]);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read network namespace ID");
+	}
+
+	ret = read(pipefd[0], &ut_id, sizeof(ut_id));
+	close(pipefd[0]);
+	if (ret != sizeof(ut_id)) {
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to read UTS namespace ID");
+	}
+
+	/* Construct file handles from namespace IDs */
+	user_handle = (struct file_handle *)u_buf;
+	user_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	user_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *u_fh = (struct nsfs_file_handle *)user_handle->f_handle;
+	u_fh->ns_id = u_id;
+	u_fh->ns_type = 0;
+	u_fh->ns_inum = 0;
+
+	net_handle = (struct file_handle *)n_buf;
+	net_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	net_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *n_fh = (struct nsfs_file_handle *)net_handle->f_handle;
+	n_fh->ns_id = n_id;
+	n_fh->ns_type = 0;
+	n_fh->ns_inum = 0;
+
+	uts_handle = (struct file_handle *)ut_buf;
+	uts_handle->handle_bytes = sizeof(struct nsfs_file_handle);
+	uts_handle->handle_type = FILEID_NSFS;
+	struct nsfs_file_handle *ut_fh = (struct nsfs_file_handle *)uts_handle->f_handle;
+	ut_fh->ns_id = ut_id;
+	ut_fh->ns_type = 0;
+	ut_fh->ns_inum = 0;
+
+	/* Open both non-user namespaces */
+	int n_fd = open_by_handle_at(FD_NSFS_ROOT, net_handle, O_RDONLY);
+	int ut_fd = open_by_handle_at(FD_NSFS_ROOT, uts_handle, O_RDONLY);
+	if (n_fd < 0 || ut_fd < 0) {
+		if (n_fd >= 0) close(n_fd);
+		if (ut_fd >= 0) close(ut_fd);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to open namespaces");
+	}
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+	if (WEXITSTATUS(status) != 0) {
+		close(n_fd);
+		close(ut_fd);
+		SKIP(return, "Child failed");
+	}
+
+	/* User namespace should be active (2 active children) */
+	TH_LOG("Both net and uts active - user ns should be active");
+	int u_fd = open_by_handle_at(FD_NSFS_ROOT, user_handle, O_RDONLY);
+	ASSERT_GE(u_fd, 0);
+	close(u_fd);
+
+	/* Close net - user ns should STILL be active (uts still active) */
+	TH_LOG("Closing net - user ns should still be active");
+	close(n_fd);
+	u_fd = open_by_handle_at(FD_NSFS_ROOT, user_handle, O_RDONLY);
+	ASSERT_GE(u_fd, 0);
+	close(u_fd);
+
+	/* Close uts - user ns should become inactive */
+	TH_LOG("Closing uts - user ns should become inactive");
+	close(ut_fd);
+	u_fd = open_by_handle_at(FD_NSFS_ROOT, user_handle, O_RDONLY);
+	if (u_fd >= 0) {
+		close(u_fd);
+		TH_LOG("Warning: User ns still active");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 34/50] selftests/namespaces: add listns() wrapper
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (32 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 33/50] selftests/namespaces: fifteenth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 35/50] selftests/namespaces: first listns() test Christian Brauner
                   ` (19 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Add a wrapper for the listns() system call.
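
Tests built on this wrapper may want to skip on kernels without the
system call. A minimal sketch of such a probe follows. listns_available()
is a hypothetical helper, the fallback number is the generic-table value
from the wrapper (alpha and MIPS use different numbers), and treating
every error other than ENOSYS as "implemented" is an assumption that can
be wrong under a syscall filter.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_listns
#define __NR_listns 470	/* generic syscall table; alpha and MIPS differ */
#endif

/*
 * Return 0 if the kernel reports ENOSYS for listns(), 1 otherwise.
 * The call passes NULL arguments on purpose: a kernel that implements
 * the system call is expected to reject them with some other error.
 */
static int listns_available(void)
{
	long ret;

	errno = 0;
	ret = syscall(__NR_listns, NULL, NULL, (size_t)0, 0U);
	if (ret < 0 && errno == ENOSYS)
		return 0;
	return 1;
}
```

A test could call this first and skip when it returns 0.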

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/wrappers.h | 35 +++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/tools/testing/selftests/namespaces/wrappers.h b/tools/testing/selftests/namespaces/wrappers.h
new file mode 100644
index 000000000000..9741a64a5b1d
--- /dev/null
+++ b/tools/testing/selftests/namespaces/wrappers.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/nsfs.h>
+#include <linux/types.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#ifndef __SELFTESTS_NAMESPACES_WRAPPERS_H__
+#define __SELFTESTS_NAMESPACES_WRAPPERS_H__
+
+#ifndef __NR_listns
+	#if defined __alpha__
+		#define __NR_listns 580
+	#elif defined _MIPS_SIM
+		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
+			#define __NR_listns 4470
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
+			#define __NR_listns 6470
+		#endif
+		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
+			#define __NR_listns 5470
+		#endif
+	#else
+		#define __NR_listns 470
+	#endif
+#endif
+
+static inline int sys_listns(const struct ns_id_req *req, __u64 *ns_ids,
+			     size_t nr_ns_ids, unsigned int flags)
+{
+	return syscall(__NR_listns, req, ns_ids, nr_ns_ids, flags);
+}
+
+#endif /* __SELFTESTS_NAMESPACES_WRAPPERS_H__ */

-- 
2.47.3



* [PATCH RFC DRAFT 35/50] selftests/namespaces: first listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (33 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 34/50] selftests/namespaces: add listns() wrapper Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 36/50] selftests/namespaces: second " Christian Brauner
                   ` (18 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test basic listns() functionality with the unified namespace tree.
List all active namespaces globally.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/.gitignore    |  1 +
 tools/testing/selftests/namespaces/Makefile      |  3 +-
 tools/testing/selftests/namespaces/listns_test.c | 57 ++++++++++++++++++++++++
 3 files changed, 60 insertions(+), 1 deletion(-)
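
For trying the call outside the kselftest harness, a minimal standalone
sketch could look like the following. The struct layout is an assumption
inferred from the field names the test initializes (the real definition
is struct ns_id_req in <linux/nsfs.h> from this series), 470 is the
fallback number wrappers.h uses on architectures other than alpha and
MIPS, and ns_id_req_sketch and listns_all() are names made up for this
example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_listns
#define __NR_listns 470	/* assumption: same fallback as wrappers.h */
#endif

/* Assumed layout, mirroring the fields the selftest initializes. */
struct ns_id_req_sketch {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t spare2;
	uint64_t user_ns_id;
};

/*
 * List namespaces of all types, globally. Returns the number of IDs
 * stored in @ids, or a negative errno value (-ENOSYS on kernels
 * without listns()).
 */
static long listns_all(uint64_t *ids, size_t nr_ids)
{
	struct ns_id_req_sketch req = {
		.size = sizeof(req),	/* all other fields stay zero */
	};
	long ret;

	assert(ids != NULL || nr_ids == 0);

	ret = syscall(__NR_listns, &req, ids, nr_ids, 0);
	return ret < 0 ? -errno : ret;
}
```

Returning -errno instead of -1 keeps the error code next to the result,
so a caller can tell "not supported" from a real failure without
touching errno again.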

diff --git a/tools/testing/selftests/namespaces/.gitignore b/tools/testing/selftests/namespaces/.gitignore
index 100cc5bfef04..5065f07e92c9 100644
--- a/tools/testing/selftests/namespaces/.gitignore
+++ b/tools/testing/selftests/namespaces/.gitignore
@@ -2,3 +2,4 @@ nsid_test
 file_handle_test
 init_ino_test
 ns_active_ref_test
+listns_test
diff --git a/tools/testing/selftests/namespaces/Makefile b/tools/testing/selftests/namespaces/Makefile
index 5cea938cdde8..de708f4df159 100644
--- a/tools/testing/selftests/namespaces/Makefile
+++ b/tools/testing/selftests/namespaces/Makefile
@@ -2,9 +2,10 @@
 CFLAGS += -Wall -O0 -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
 LDLIBS += -lcap
 
-TEST_GEN_PROGS := nsid_test file_handle_test init_ino_test ns_active_ref_test
+TEST_GEN_PROGS := nsid_test file_handle_test init_ino_test ns_active_ref_test listns_test
 
 include ../lib.mk
 
 $(OUTPUT)/ns_active_ref_test: ../filesystems/utils.c
+$(OUTPUT)/listns_test: ../filesystems/utils.c
 
diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
new file mode 100644
index 000000000000..cb42827d3dfe
--- /dev/null
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/nsfs.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "../kselftest_harness.h"
+#include "../filesystems/utils.h"
+#include "wrappers.h"
+
+/*
+ * Test basic listns() functionality with the unified namespace tree.
+ * List all active namespaces globally.
+ */
+TEST(listns_basic_unified)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = 0,  /* All types */
+		.spare2 = 0,
+		.user_ns_id = 0,  /* Global listing */
+	};
+	__u64 ns_ids[100];
+	ssize_t ret;
+
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+	if (ret < 0) {
+		if (errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		TH_LOG("listns failed: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+
+	/* Should find at least the initial namespaces */
+	ASSERT_GT(ret, 0);
+	TH_LOG("Found %zd active namespaces", ret);
+
+	/* Verify all returned IDs are non-zero */
+	for (ssize_t i = 0; i < ret; i++) {
+		ASSERT_NE(ns_ids[i], 0);
+		TH_LOG("  [%zd] ns_id: %llu", i, (unsigned long long)ns_ids[i]);
+	}
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 36/50] selftests/namespaces: second listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (34 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 35/50] selftests/namespaces: first listns() test Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 37/50] selftests/namespaces: third " Christian Brauner
                   ` (17 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test listns() with type filtering.
List only network namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 61 ++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index cb42827d3dfe..64249502ac49 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -54,4 +54,65 @@ TEST(listns_basic_unified)
 	}
 }
 
+/*
+ * Test listns() with type filtering.
+ * List only network namespaces.
+ */
+TEST(listns_filter_by_type)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = CLONE_NEWNET,  /* Only network namespaces */
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 ns_ids[100];
+	ssize_t ret;
+
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+	if (ret < 0) {
+		if (errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		TH_LOG("listns failed: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+	ASSERT_GE(ret, 0);
+
+	/* Should find at least init_net */
+	ASSERT_GT(ret, 0);
+	TH_LOG("Found %zd active network namespaces", ret);
+
+	/* Verify we can open each namespace and it's actually a network namespace */
+	for (ssize_t i = 0; i < ret && i < 5; i++) {
+		struct nsfs_file_handle nsfh = {
+			.ns_id = ns_ids[i],
+			.ns_type = CLONE_NEWNET,
+			.ns_inum = 0,
+		};
+		struct file_handle *fh;
+		int fd;
+
+		fh = (struct file_handle *)malloc(sizeof(*fh) + sizeof(nsfh));
+		ASSERT_NE(fh, NULL);
+		fh->handle_bytes = sizeof(nsfh);
+		fh->handle_type = 0;
+		memcpy(fh->f_handle, &nsfh, sizeof(nsfh));
+
+		fd = open_by_handle_at(-10003, fh, O_RDONLY);
+		free(fh);
+
+		if (fd >= 0) {
+			int ns_type;
+			/* Verify it's a network namespace via ioctl */
+			ns_type = ioctl(fd, NS_GET_NSTYPE);
+			if (ns_type >= 0) {
+				ASSERT_EQ(ns_type, CLONE_NEWNET);
+			}
+			close(fd);
+		}
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 37/50] selftests/namespaces: third listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (35 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 36/50] selftests/namespaces: second " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 38/50] selftests/namespaces: fourth " Christian Brauner
                   ` (16 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test listns() pagination.
List namespaces in batches.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 53 ++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
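
The test fetches two batches by hand. Generalized to a loop, the same
cursor scheme could be sketched as below. It assumes what the test
checks, namely that IDs come back in increasing order and that ns_id
means "resume after this ID", and additionally that a short batch means
nothing is left. The callback type and the fake in-memory backend are
hypothetical and only there so the loop can run without the syscall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical callback: store up to @nr IDs greater than @after in
 * @out and return how many were stored, or a negative value on error.
 */
typedef ssize_t (*ns_fetch_fn)(void *ctx, uint64_t after, uint64_t *out,
			       size_t nr);

/*
 * Collect the IDs the backend reports into @all (capacity @cap) by
 * fetching @batch IDs at a time and resuming after the last one seen.
 */
static ssize_t ns_collect_all(ns_fetch_fn fetch, void *ctx, size_t batch,
			      uint64_t *all, size_t cap)
{
	uint64_t after = 0;
	size_t total = 0;

	assert(batch > 0);

	while (total < cap) {
		size_t want = cap - total < batch ? cap - total : batch;
		ssize_t got = fetch(ctx, after, all + total, want);

		if (got < 0)
			return got;
		if (got == 0)
			break;
		/* The cursor only works if IDs keep increasing. */
		assert(all[total] > after);
		after = all[total + got - 1];
		total += (size_t)got;
		if ((size_t)got < want)
			break;	/* short batch: assume nothing is left */
	}
	return (ssize_t)total;
}

/* Stand-in backend for demonstration: a sorted array of fake IDs. */
struct fake_ns_list {
	const uint64_t *ids;
	size_t nr;
};

static ssize_t fake_fetch(void *ctx, uint64_t after, uint64_t *out,
			  size_t nr)
{
	const struct fake_ns_list *list = ctx;
	size_t n = 0;

	for (size_t i = 0; i < list->nr && n < nr; i++)
		if (list->ids[i] > after)
			out[n++] = list->ids[i];
	return (ssize_t)n;
}
```

A real caller would pass a callback that sets req.ns_id to the cursor
and invokes sys_listns().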

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index 64249502ac49..7dff63a00263 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -115,4 +115,57 @@ TEST(listns_filter_by_type)
 	}
 }
 
+/*
+ * Test listns() pagination.
+ * List namespaces in batches.
+ */
+TEST(listns_pagination)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = 0,
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 batch1[2], batch2[2];
+	ssize_t ret1, ret2;
+
+	/* Get first batch */
+	ret1 = sys_listns(&req, batch1, ARRAY_SIZE(batch1), 0);
+	if (ret1 < 0) {
+		if (errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		TH_LOG("listns failed: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+	ASSERT_GE(ret1, 0);
+
+	if (ret1 == 0)
+		SKIP(return, "No namespaces found");
+
+	TH_LOG("First batch: %zd namespaces", ret1);
+
+	/* Get second batch using last ID from first batch */
+	if (ret1 == ARRAY_SIZE(batch1)) {
+		req.ns_id = batch1[ret1 - 1];
+		ret2 = sys_listns(&req, batch2, ARRAY_SIZE(batch2), 0);
+		ASSERT_GE(ret2, 0);
+
+		TH_LOG("Second batch: %zd namespaces (after ns_id=%llu)",
+		       ret2, (unsigned long long)req.ns_id);
+
+		/* If we got more results, verify IDs are monotonically increasing */
+		if (ret2 > 0) {
+			ASSERT_GT(batch2[0], batch1[ret1 - 1]);
+			TH_LOG("Pagination working: %llu > %llu",
+			       (unsigned long long)batch2[0],
+			       (unsigned long long)batch1[ret1 - 1]);
+		}
+	} else {
+		TH_LOG("All namespaces fit in first batch");
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 38/50] selftests/namespaces: fourth listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (36 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 37/50] selftests/namespaces: third " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 39/50] selftests/namespaces: fifth " Christian Brauner
                   ` (15 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test listns() with LISTNS_CURRENT_USER.
List namespaces owned by current user namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 33 ++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index 7dff63a00263..457298cb4c64 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -168,4 +168,37 @@ TEST(listns_pagination)
 	}
 }
 
+/*
+ * Test listns() with LISTNS_CURRENT_USER.
+ * List namespaces owned by current user namespace.
+ */
+TEST(listns_current_user)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = 0,
+		.spare2 = 0,
+		.user_ns_id = LISTNS_CURRENT_USER,
+	};
+	__u64 ns_ids[100];
+	ssize_t ret;
+
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+	if (ret < 0) {
+		if (errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		TH_LOG("listns failed: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+	ASSERT_GE(ret, 0);
+
+	/* Should find at least the initial namespaces if we're in init_user_ns */
+	TH_LOG("Found %zd namespaces owned by current user namespace", ret);
+
+	for (ssize_t i = 0; i < ret; i++)
+		TH_LOG("  [%zd] ns_id: %llu", i, (unsigned long long)ns_ids[i]);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 39/50] selftests/namespaces: fifth listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (37 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 38/50] selftests/namespaces: fourth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 40/50] selftests/namespaces: sixth " Christian Brauner
                   ` (14 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that listns() only returns active namespaces.
Create a namespace, let it become inactive, verify it's not listed.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 126 +++++++++++++++++++++++
 1 file changed, 126 insertions(+)
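
The test only logs a warning when the namespace is still listed after
the child exits. If a stricter comparison of two listns() snapshots is
wanted, it could be sketched with two small helpers like these. Both
names are hypothetical, and the linear search is only meant for the
small arrays used in these tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if @id occurs in the first @nr entries of @ids. */
static bool ns_id_listed(const uint64_t *ids, size_t nr, uint64_t id)
{
	for (size_t i = 0; i < nr; i++)
		if (ids[i] == id)
			return true;
	return false;
}

/*
 * Store the IDs present in @before but absent from @after in @gone
 * (which must have room for @nr_before entries) and return their count.
 */
static size_t ns_ids_gone(const uint64_t *before, size_t nr_before,
			  const uint64_t *after, size_t nr_after,
			  uint64_t *gone)
{
	size_t n = 0;

	assert(gone != NULL || nr_before == 0);

	for (size_t i = 0; i < nr_before; i++)
		if (!ns_id_listed(after, nr_after, before[i]))
			gone[n++] = before[i];
	return n;
}
```

With the "during" list as @before and the "after" list as @after, the
child's namespace ID would be expected to show up in @gone.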

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index 457298cb4c64..e854794abe56 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -201,4 +201,130 @@ TEST(listns_current_user)
 		TH_LOG("  [%zd] ns_id: %llu", i, (unsigned long long)ns_ids[i]);
 }
 
+/*
+ * Test that listns() only returns active namespaces.
+ * Create a namespace, let it become inactive, verify it's not listed.
+ */
+TEST(listns_only_active)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = CLONE_NEWNET,
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 ns_ids_before[100], ns_ids_after[100];
+	ssize_t ret_before, ret_after;
+	int pipefd[2];
+	pid_t pid;
+	__u64 new_ns_id = 0;
+	int status;
+
+	/* Get initial list */
+	ret_before = sys_listns(&req, ns_ids_before, ARRAY_SIZE(ns_ids_before), 0);
+	if (ret_before < 0) {
+		if (errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		TH_LOG("listns failed: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+	ASSERT_GE(ret_before, 0);
+
+	TH_LOG("Before: %zd active network namespaces", ret_before);
+
+	/* Create a new namespace in a child process and get its ID */
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		__u64 ns_id;
+
+		close(pipefd[0]);
+
+		/* Create new network namespace */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get its ID */
+		fd = open("/proc/self/ns/net", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &ns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Send ID to parent */
+		write(pipefd[1], &ns_id, sizeof(ns_id));
+		close(pipefd[1]);
+
+		/* Keep namespace active briefly */
+		usleep(100000);
+		exit(0);
+	}
+
+	/* Parent reads the new namespace ID */
+	{
+		int bytes;
+
+		close(pipefd[1]);
+		bytes = read(pipefd[0], &new_ns_id, sizeof(new_ns_id));
+		close(pipefd[0]);
+
+		if (bytes == sizeof(new_ns_id)) {
+			__u64 ns_ids_during[100];
+			int ret_during;
+
+			TH_LOG("Child created namespace with ID %llu", (unsigned long long)new_ns_id);
+
+			/* List namespaces while child is still alive - should see new one */
+			ret_during = sys_listns(&req, ns_ids_during, ARRAY_SIZE(ns_ids_during), 0);
+			ASSERT_GE(ret_during, 0);
+			TH_LOG("During: %d active network namespaces", ret_during);
+
+			/* Should have at least as many namespaces as before */
+			ASSERT_GE(ret_during, ret_before);
+		}
+	}
+
+	/* Wait for child to exit */
+	waitpid(pid, &status, 0);
+
+	/* Give time for namespace to become inactive */
+	usleep(100000);
+
+	/* List namespaces after child exits - should not see new one */
+	ret_after = sys_listns(&req, ns_ids_after, ARRAY_SIZE(ns_ids_after), 0);
+	ASSERT_GE(ret_after, 0);
+	TH_LOG("After: %zd active network namespaces", ret_after);
+
+	/* Verify the new namespace ID is not in the after list */
+	if (new_ns_id != 0) {
+		bool found = false;
+
+		for (ssize_t i = 0; i < ret_after; i++) {
+			if (ns_ids_after[i] == new_ns_id) {
+				found = true;
+				break;
+			}
+		}
+		if (found) {
+			TH_LOG("Warning: Namespace %llu still active after process exit",
+			       (unsigned long long)new_ns_id);
+		}
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 40/50] selftests/namespaces: sixth listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (38 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 39/50] selftests/namespaces: fifth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 41/50] selftests/namespaces: seventh " Christian Brauner
                   ` (13 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test listns() with specific user namespace ID.
Create a user namespace and list namespaces it owns.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 109 +++++++++++++++++++++++
 1 file changed, 109 insertions(+)
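
One detail worth noting in the test: the child waits with
read(pipefd[1], ...), which is the write end of the pipe. Such a read()
fails right away with EBADF instead of blocking, so the child may exit
before the parent calls listns(). A possible alternative is a second
pipe that the child reads from until the parent closes it. The sketch
below shows only that handshake; run_child_until_released() is a
hypothetical name and the namespace setup is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that reports @value to the parent and then stays alive
 * until the parent closes its end of the "go" pipe. Returns the child's
 * exit status (0 on success) or -1 on failure; *@seen receives the
 * value read by the parent.
 */
static int run_child_until_released(uint64_t value, uint64_t *seen)
{
	int report[2], go[2], status;
	pid_t pid;
	ssize_t n;

	assert(seen != NULL);

	if (pipe(report) < 0)
		return -1;
	if (pipe(go) < 0) {
		close(report[0]);
		close(report[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(report[0]);
		close(report[1]);
		close(go[0]);
		close(go[1]);
		return -1;
	}

	if (pid == 0) {
		char buf;

		close(report[0]);
		close(go[1]);

		if (write(report[1], &value, sizeof(value)) !=
		    (ssize_t)sizeof(value))
			_exit(1);
		close(report[1]);

		/* Blocks until the parent closes go[1]; read() returns 0. */
		if (read(go[0], &buf, 1) != 0)
			_exit(1);
		_exit(0);
	}

	close(report[1]);
	close(go[0]);

	n = read(report[0], seen, sizeof(*seen));
	close(report[0]);

	/* The parent would inspect the child's namespaces here. */

	close(go[1]);	/* release the child */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (n != (ssize_t)sizeof(*seen) || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Because the child closes its copy of go[1] right after fork(), the
parent's close() is the last writer going away, which is what turns the
blocked read() into an end-of-file.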

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index e854794abe56..e1e90ef933cf 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -327,4 +327,113 @@ TEST(listns_only_active)
 	}
 }
 
+/*
+ * Test listns() with specific user namespace ID.
+ * Create a user namespace and list namespaces it owns.
+ */
+TEST(listns_specific_userns)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = 0,
+		.spare2 = 0,
+		.user_ns_id = 0,  /* Will be filled with created userns ID */
+	};
+	__u64 ns_ids[100];
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	__u64 user_ns_id = 0;
+	int bytes;
+	ssize_t ret;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		__u64 ns_id;
+		char buf;
+
+		close(pipefd[0]);
+
+		/* Create new user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get user namespace ID */
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &ns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Send ID to parent */
+		write(pipefd[1], &ns_id, sizeof(ns_id));
+
+		/* Create some namespaces owned by this user namespace */
+		unshare(CLONE_NEWNET);
+		unshare(CLONE_NEWUTS);
+
+		/* Wait for parent signal */
+		read(pipefd[1], &buf, 1);
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+	bytes = read(pipefd[0], &user_ns_id, sizeof(user_ns_id));
+
+	if (bytes != sizeof(user_ns_id)) {
+		close(pipefd[0]);
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to get user namespace ID from child");
+	}
+
+	TH_LOG("Child created user namespace with ID %llu", (unsigned long long)user_ns_id);
+
+	/* List namespaces owned by this user namespace */
+	req.user_ns_id = user_ns_id;
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+
+	if (ret < 0) {
+		int saved_errno = errno; /* cleanup below may clobber errno */
+		TH_LOG("listns failed: %s (errno=%d)", strerror(saved_errno), saved_errno);
+		close(pipefd[0]);
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		if (saved_errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		ASSERT_GE(ret, 0);
+	}
+
+	TH_LOG("Found %zd namespaces owned by user namespace %llu", ret,
+	       (unsigned long long)user_ns_id);
+
+	/* Should find at least the network and UTS namespaces we created */
+	if (ret > 0) {
+		for (ssize_t i = 0; i < ret && i < 10; i++)
+			TH_LOG("  [%zd] ns_id: %llu", i, (unsigned long long)ns_ids[i]);
+	}
+
+	/* Signal child to exit */
+	close(pipefd[0]);
+	waitpid(pid, &status, 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 41/50] selftests/namespaces: seventh listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (39 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 40/50] selftests/namespaces: sixth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 42/50] selftests/namespaces: ninth " Christian Brauner
                   ` (12 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test listns() with multiple namespace types filter.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 31 ++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
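
The test passes CLONE_NEWNET | CLONE_NEWUTS, so ns_type is used as a
bitmask of CLONE_NEW* flags. For logging such a mask, a helper along
these lines might be handy. It is a sketch only: the function name is
hypothetical and the short names simply follow the entries under
/proc/<pid>/ns/.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <linux/sched.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* for older UAPI headers */
#endif

static const struct {
	unsigned int flag;
	const char *name;
} ns_type_names[] = {
	{ CLONE_NEWCGROUP, "cgroup" },
	{ CLONE_NEWIPC,    "ipc" },
	{ CLONE_NEWNS,     "mnt" },
	{ CLONE_NEWNET,    "net" },
	{ CLONE_NEWPID,    "pid" },
	{ CLONE_NEWTIME,   "time" },
	{ CLONE_NEWUSER,   "user" },
	{ CLONE_NEWUTS,    "uts" },
};

/*
 * Write a "net|uts"-style description of @mask into @buf. Returns true
 * if every bit in @mask was a known namespace type and fit into @buf.
 */
static bool ns_type_mask_describe(unsigned int mask, char *buf, size_t len)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(ns_type_names) / sizeof(ns_type_names[0]); i++) {
		int n;

		if (!(mask & ns_type_names[i].flag))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", ns_type_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			return false;
		}
		used += (size_t)n;
		mask &= ~ns_type_names[i].flag;
	}
	return mask == 0;
}
```

Bits that are not namespace types are left out of the string and make
the helper return false, which is a cheap way to catch a mistyped mask
before it reaches the syscall.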

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index e1e90ef933cf..6afffddf9764 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -436,4 +436,35 @@ TEST(listns_specific_userns)
 	waitpid(pid, &status, 0);
 }
 
+/*
+ * Test listns() with multiple namespace types filter.
+ */
+TEST(listns_multiple_types)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = CLONE_NEWNET | CLONE_NEWUTS,  /* Network and UTS */
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 ns_ids[100];
+	ssize_t ret;
+
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+	if (ret < 0) {
+		if (errno == ENOSYS)
+			SKIP(return, "listns() not supported");
+		TH_LOG("listns failed: %s (errno=%d)", strerror(errno), errno);
+		ASSERT_TRUE(false);
+	}
+	ASSERT_GE(ret, 0);
+
+	TH_LOG("Found %zd active network/UTS namespaces", ret);
+
+	for (ssize_t i = 0; i < ret; i++)
+		TH_LOG("  [%zd] ns_id: %llu", i, (unsigned long long)ns_ids[i]);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 42/50] selftests/namespaces: ninth listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (40 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 41/50] selftests/namespaces: seventh " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 43/50] " Christian Brauner
                   ` (11 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that hierarchical active reference propagation keeps parent
user namespaces visible in listns().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 135 +++++++++++++++++++++++
 1 file changed, 135 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index 6afffddf9764..ddf4509d5cd6 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -467,4 +467,139 @@ TEST(listns_multiple_types)
 		TH_LOG("  [%zd] ns_id: %llu", i, (unsigned long long)ns_ids[i]);
 }
 
+/*
+ * Test that hierarchical active reference propagation keeps parent
+ * user namespaces visible in listns().
+ */
+TEST(listns_hierarchical_visibility)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = CLONE_NEWUSER,
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 parent_ns_id = 0, child_ns_id = 0;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	int bytes;
+	__u64 ns_ids[100];
+	ssize_t ret;
+	bool found_parent, found_child;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf;
+
+		close(pipefd[0]);
+
+		/* Create parent user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &parent_ns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Create child user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &child_ns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Send both IDs to parent */
+		write(pipefd[1], &parent_ns_id, sizeof(parent_ns_id));
+		write(pipefd[1], &child_ns_id, sizeof(child_ns_id));
+
+		/* Wait for parent signal */
+		read(pipefd[1], &buf, 1);
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	/* Read both namespace IDs */
+	bytes = read(pipefd[0], &parent_ns_id, sizeof(parent_ns_id));
+	bytes += read(pipefd[0], &child_ns_id, sizeof(child_ns_id));
+
+	if (bytes != (int)(2 * sizeof(__u64))) {
+		close(pipefd[0]);
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "Failed to get namespace IDs from child");
+	}
+
+	TH_LOG("Parent user namespace ID: %llu", (unsigned long long)parent_ns_id);
+	TH_LOG("Child user namespace ID: %llu", (unsigned long long)child_ns_id);
+
+	/* List all user namespaces */
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+
+	if (ret < 0 && errno == ENOSYS) {
+		close(pipefd[0]);
+		kill(pid, SIGKILL);
+		waitpid(pid, NULL, 0);
+		SKIP(return, "listns() not supported");
+	}
+
+	ASSERT_GE(ret, 0);
+	TH_LOG("Found %zd active user namespaces", ret);
+
+	/* Both parent and child should be visible (active due to child process) */
+	found_parent = false;
+	found_child = false;
+	for (ssize_t i = 0; i < ret; i++) {
+		if (ns_ids[i] == parent_ns_id)
+			found_parent = true;
+		if (ns_ids[i] == child_ns_id)
+			found_child = true;
+	}
+
+	TH_LOG("Parent namespace %s, child namespace %s",
+	       found_parent ? "found" : "NOT FOUND",
+	       found_child ? "found" : "NOT FOUND");
+
+	ASSERT_TRUE(found_child);
+	/* With hierarchical propagation, parent should also be active */
+	ASSERT_TRUE(found_parent);
+
+	/* Signal child to exit */
+	close(pipefd[0]);
+	waitpid(pid, &status, 0);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 43/50] selftests/namespaces: ninth listns() test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (41 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 42/50] selftests/namespaces: ninth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 44/50] selftests/namespaces: first listns() permission test Christian Brauner
                   ` (10 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test error cases for listns().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/listns_test.c | 51 ++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
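
Most of the checks in this test only log a warning, and the
"ret >= 0 || errno == ENOSYS" pattern reads errno even when the call
succeeded, where its value is stale. If the checks are tightened later,
the three possible outcomes could be separated with a helper such as
this one. It is a sketch; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

enum check_result {
	CHECK_PASS,	/* call failed with the expected errno */
	CHECK_SKIP,	/* syscall not available (ENOSYS) */
	CHECK_FAIL,	/* call succeeded or failed with another errno */
};

/*
 * Classify the outcome of a call that is expected to fail. @ret is the
 * syscall return value and @err the errno captured right after the
 * call; @err is ignored when @ret >= 0 because errno is then stale.
 */
static enum check_result check_expected_errno(long ret, int err, int expected)
{
	assert(expected > 0);

	if (ret >= 0)
		return CHECK_FAIL;
	if (err == ENOSYS)
		return CHECK_SKIP;
	return err == expected ? CHECK_PASS : CHECK_FAIL;
}
```

In the test, each call site would capture errno immediately after
sys_listns() and pass it in, for example with EINVAL as the expected
value for the invalid-flags case.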

diff --git a/tools/testing/selftests/namespaces/listns_test.c b/tools/testing/selftests/namespaces/listns_test.c
index ddf4509d5cd6..eb44f50ab77a 100644
--- a/tools/testing/selftests/namespaces/listns_test.c
+++ b/tools/testing/selftests/namespaces/listns_test.c
@@ -602,4 +602,55 @@ TEST(listns_hierarchical_visibility)
 	waitpid(pid, &status, 0);
 }
 
+/*
+ * Test error cases for listns().
+ */
+TEST(listns_error_cases)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = 0,
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 ns_ids[10];
+	int ret;
+
+	/* Test with invalid flags */
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0xFFFF);
+	if (ret >= 0 || errno == ENOSYS) {
+		if (errno != ENOSYS) {
+			TH_LOG("Warning: Expected EINVAL for invalid flags, got success");
+		}
+	} else {
+		ASSERT_EQ(errno, EINVAL);
+	}
+
+	/* Test with NULL ns_ids array */
+	ret = sys_listns(&req, NULL, 10, 0);
+	if (ret >= 0) {
+		TH_LOG("Warning: Expected EFAULT for NULL array, got success");
+	}
+
+	/* Test with invalid spare field */
+	req.spare = 1;
+	ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+	if (ret >= 0 || errno == ENOSYS) {
+		if (errno != ENOSYS) {
+			TH_LOG("Warning: Expected EINVAL for non-zero spare, got success");
+		}
+	}
+	req.spare = 0;
+
+	/* Test with huge nr_ns_ids */
+	ret = sys_listns(&req, ns_ids, 2000000, 0);
+	if (ret >= 0 || errno == ENOSYS) {
+		if (errno != ENOSYS) {
+			TH_LOG("Warning: Expected EOVERFLOW for huge count, got success");
+		}
+	}
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3


* [PATCH RFC DRAFT 44/50] selftests/namespaces: first listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (42 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 43/50] " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 45/50] selftests/namespaces: second " Christian Brauner
                   ` (9 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that unprivileged users can only see namespaces they're currently
in. Create a namespace, drop privileges, verify we can only see our own
namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/namespaces/.gitignore      |   1 +
 tools/testing/selftests/namespaces/Makefile        |   3 +-
 .../selftests/namespaces/listns_permissions_test.c | 134 +++++++++++++++++++++
 3 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/namespaces/.gitignore b/tools/testing/selftests/namespaces/.gitignore
index 5065f07e92c9..17f9c675a60b 100644
--- a/tools/testing/selftests/namespaces/.gitignore
+++ b/tools/testing/selftests/namespaces/.gitignore
@@ -3,3 +3,4 @@ file_handle_test
 init_ino_test
 ns_active_ref_test
 listns_test
+listns_permissions_test
diff --git a/tools/testing/selftests/namespaces/Makefile b/tools/testing/selftests/namespaces/Makefile
index de708f4df159..2dd22bc68b89 100644
--- a/tools/testing/selftests/namespaces/Makefile
+++ b/tools/testing/selftests/namespaces/Makefile
@@ -2,10 +2,11 @@
 CFLAGS += -Wall -O0 -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
 LDLIBS += -lcap
 
-TEST_GEN_PROGS := nsid_test file_handle_test init_ino_test ns_active_ref_test listns_test
+TEST_GEN_PROGS := nsid_test file_handle_test init_ino_test ns_active_ref_test listns_test listns_permissions_test
 
 include ../lib.mk
 
 $(OUTPUT)/ns_active_ref_test: ../filesystems/utils.c
 $(OUTPUT)/listns_test: ../filesystems/utils.c
+$(OUTPUT)/listns_permissions_test: ../filesystems/utils.c
 
diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
new file mode 100644
index 000000000000..87ec71560d99
--- /dev/null
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/nsfs.h>
+#include <sys/capability.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "../kselftest_harness.h"
+#include "../filesystems/utils.h"
+#include "wrappers.h"
+
+/*
+ * Test that unprivileged users can only see namespaces they're currently in.
+ * Create a namespace, drop privileges, verify we can only see our own namespaces.
+ */
+TEST(listns_unprivileged_current_only)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = CLONE_NEWNET,
+		.spare2 = 0,
+		.user_ns_id = 0,
+	};
+	__u64 ns_ids[100];
+	ssize_t ret;
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	bool found_ours;
+	int unexpected_count;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		__u64 our_netns_id;
+		bool found_ours;
+		int unexpected_count;
+
+		close(pipefd[0]);
+
+		/* Create user namespace to be unprivileged */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Create a network namespace */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get our network namespace ID */
+		fd = open("/proc/self/ns/net", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &our_netns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Now we're unprivileged - list all network namespaces */
+		ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* We should only see our own network namespace */
+		found_ours = false;
+		unexpected_count = 0;
+
+		for (ssize_t i = 0; i < ret; i++) {
+			if (ns_ids[i] == our_netns_id) {
+				found_ours = true;
+			} else {
+				/* This is either init_net (which we can see) or unexpected */
+				unexpected_count++;
+			}
+		}
+
+		/* Send results to parent */
+		write(pipefd[1], &found_ours, sizeof(found_ours));
+		write(pipefd[1], &unexpected_count, sizeof(unexpected_count));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	found_ours = false;
+	unexpected_count = 0;
+	read(pipefd[0], &found_ours, sizeof(found_ours));
+	read(pipefd[0], &unexpected_count, sizeof(unexpected_count));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to setup namespaces or run listns");
+	}
+
+	/* Child should have seen its own namespace */
+	ASSERT_TRUE(found_ours);
+
+	TH_LOG("Unprivileged child saw its own namespace, plus %d others (likely init_net)",
+			unexpected_count);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 45/50] selftests/namespaces: second listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (43 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 44/50] selftests/namespaces: first listns() permission test Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 46/50] selftests/namespaces: third " Christian Brauner
                   ` (8 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that users with CAP_SYS_ADMIN in a user namespace can see
all namespaces owned by that user namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/listns_permissions_test.c | 103 +++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
index 87ec71560d99..803c42fc76ec 100644
--- a/tools/testing/selftests/namespaces/listns_permissions_test.c
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -131,4 +131,107 @@ TEST(listns_unprivileged_current_only)
 			unexpected_count);
 }
 
+/*
+ * Test that users with CAP_SYS_ADMIN in a user namespace can see
+ * all namespaces owned by that user namespace.
+ */
+TEST(listns_cap_sys_admin_in_userns)
+{
+	struct ns_id_req req = {
+		.size = sizeof(req),
+		.spare = 0,
+		.ns_id = 0,
+		.ns_type = 0,  /* All types */
+		.spare2 = 0,
+		.user_ns_id = 0,  /* Will be set to our created user namespace */
+	};
+	__u64 ns_ids[100];
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	bool success;
+	ssize_t count;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		__u64 userns_id;
+		ssize_t ret;
+		int min_expected;
+		bool success;
+
+		close(pipefd[0]);
+
+		/* Create user namespace - we'll have CAP_SYS_ADMIN in it */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get the user namespace ID */
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &userns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Create several namespaces owned by this user namespace */
+		unshare(CLONE_NEWNET);
+		unshare(CLONE_NEWUTS);
+		unshare(CLONE_NEWIPC);
+
+		/* List namespaces owned by our user namespace */
+		req.user_ns_id = userns_id;
+		ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+		if (ret < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/*
+		 * We have CAP_SYS_ADMIN in this user namespace,
+		 * so we should see all namespaces owned by it.
+		 * That includes: net, uts, ipc, and the user namespace itself.
+		 */
+		min_expected = 4;
+		success = (ret >= min_expected);
+
+		write(pipefd[1], &success, sizeof(success));
+		write(pipefd[1], &ret, sizeof(ret));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	success = false;
+	count = 0;
+	read(pipefd[0], &success, sizeof(success));
+	read(pipefd[0], &count, sizeof(count));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to setup namespaces");
+	}
+
+	ASSERT_TRUE(success);
+	TH_LOG("User with CAP_SYS_ADMIN saw %zd namespaces owned by their user namespace",
+			count);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 46/50] selftests/namespaces: third listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (44 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 45/50] selftests/namespaces: second " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 47/50] selftests/namespaces: fourth " Christian Brauner
                   ` (7 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that users cannot see namespaces from unrelated user namespaces.
Create two sibling user namespaces, verify they can't see each other's
owned namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/listns_permissions_test.c | 138 +++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
index 803c42fc76ec..4e47b4c82c56 100644
--- a/tools/testing/selftests/namespaces/listns_permissions_test.c
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -234,4 +234,142 @@ TEST(listns_cap_sys_admin_in_userns)
 			count);
 }
 
+/*
+ * Test that users cannot see namespaces from unrelated user namespaces.
+ * Create two sibling user namespaces, verify they can't see each other's
+ * owned namespaces.
+ */
+TEST(listns_cannot_see_sibling_userns_namespaces)
+{
+	int pipefd[2];
+	pid_t pid1, pid2;
+	int status;
+	__u64 netns_a_id;
+	int pipefd2[2];
+	bool found_sibling_netns;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	/* Fork first child - creates user namespace A */
+	pid1 = fork();
+	ASSERT_GE(pid1, 0);
+
+	if (pid1 == 0) {
+		int fd;
+		__u64 netns_a_id;
+		char buf;
+
+		close(pipefd[0]);
+
+		/* Create user namespace A */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Create network namespace owned by user namespace A */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get network namespace ID */
+		fd = open("/proc/self/ns/net", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &netns_a_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Send namespace ID to parent */
+		write(pipefd[1], &netns_a_id, sizeof(netns_a_id));
+
+		/* Keep alive for sibling to check */
+		read(pipefd[1], &buf, 1);
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent reads namespace A ID */
+	close(pipefd[1]);
+	netns_a_id = 0;
+	read(pipefd[0], &netns_a_id, sizeof(netns_a_id));
+
+	TH_LOG("User namespace A created network namespace with ID %llu",
+	       (unsigned long long)netns_a_id);
+
+	/* Fork second child - creates user namespace B */
+	ASSERT_EQ(pipe(pipefd2), 0);
+
+	pid2 = fork();
+	ASSERT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		struct ns_id_req req = {
+			.size = sizeof(req),
+			.spare = 0,
+			.ns_id = 0,
+			.ns_type = CLONE_NEWNET,
+			.spare2 = 0,
+			.user_ns_id = 0,
+		};
+		__u64 ns_ids[100];
+		ssize_t ret;
+		bool found_sibling_netns;
+
+		close(pipefd[0]);
+		close(pipefd2[0]);
+
+		/* Create user namespace B (sibling to A) */
+		if (setup_userns() < 0) {
+			close(pipefd2[1]);
+			exit(1);
+		}
+
+		/* Try to list all network namespaces */
+		ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+
+		found_sibling_netns = false;
+		if (ret > 0) {
+			for (ssize_t i = 0; i < ret; i++) {
+				if (ns_ids[i] == netns_a_id) {
+					found_sibling_netns = true;
+					break;
+				}
+			}
+		}
+
+		/* We should NOT see the sibling's network namespace */
+		write(pipefd2[1], &found_sibling_netns, sizeof(found_sibling_netns));
+		close(pipefd2[1]);
+		exit(0);
+	}
+
+	/* Parent reads result from second child */
+	close(pipefd2[1]);
+	found_sibling_netns = false;
+	read(pipefd2[0], &found_sibling_netns, sizeof(found_sibling_netns));
+	close(pipefd2[0]);
+
+	/* Signal first child to exit */
+	close(pipefd[0]);
+
+	/* Wait for both children */
+	waitpid(pid2, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	waitpid(pid1, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	/* Second child should NOT have seen first child's namespace */
+	ASSERT_FALSE(found_sibling_netns);
+	TH_LOG("User namespace B correctly could not see sibling namespace A's network namespace");
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 47/50] selftests/namespaces: fourth listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (45 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 46/50] selftests/namespaces: third " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 48/50] selftests/namespaces: fifth " Christian Brauner
                   ` (6 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test permission checking with LISTNS_CURRENT_USER.
Verify that listing with LISTNS_CURRENT_USER respects permissions.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/listns_permissions_test.c | 79 ++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
index 4e47b4c82c56..ff42109779ca 100644
--- a/tools/testing/selftests/namespaces/listns_permissions_test.c
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -372,4 +372,83 @@ TEST(listns_cannot_see_sibling_userns_namespaces)
 	TH_LOG("User namespace B correctly could not see sibling namespace A's network namespace");
 }
 
+/*
+ * Test permission checking with LISTNS_CURRENT_USER.
+ * Verify that listing with LISTNS_CURRENT_USER respects permissions.
+ */
+TEST(listns_current_user_permissions)
+{
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	bool success;
+	ssize_t count;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		struct ns_id_req req = {
+			.size = sizeof(req),
+			.spare = 0,
+			.ns_id = 0,
+			.ns_type = 0,
+			.spare2 = 0,
+			.user_ns_id = LISTNS_CURRENT_USER,
+		};
+		__u64 ns_ids[100];
+		ssize_t ret;
+		bool success;
+
+		close(pipefd[0]);
+
+		/* Create user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Create some namespaces owned by this user namespace */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (unshare(CLONE_NEWUTS) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* List with LISTNS_CURRENT_USER - should see our owned namespaces */
+		ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+
+		success = (ret >= 3);  /* At least user, net, uts */
+		write(pipefd[1], &success, sizeof(success));
+		write(pipefd[1], &ret, sizeof(ret));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	success = false;
+	count = 0;
+	read(pipefd[0], &success, sizeof(success));
+	read(pipefd[0], &count, sizeof(count));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to setup namespaces");
+	}
+
+	ASSERT_TRUE(success);
+	TH_LOG("LISTNS_CURRENT_USER returned %zd namespaces", count);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 48/50] selftests/namespaces: fifth listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (46 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 47/50] selftests/namespaces: fourth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 49/50] selftests/namespaces: sixth " Christian Brauner
                   ` (5 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that CAP_SYS_ADMIN in parent user namespace allows seeing
child user namespace's owned namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/listns_permissions_test.c | 122 +++++++++++++++++++++
 1 file changed, 122 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
index ff42109779ca..07c0c2be0aa5 100644
--- a/tools/testing/selftests/namespaces/listns_permissions_test.c
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -451,4 +451,126 @@ TEST(listns_current_user_permissions)
 	TH_LOG("LISTNS_CURRENT_USER returned %zd namespaces", count);
 }
 
+/*
+ * Test that CAP_SYS_ADMIN in parent user namespace allows seeing
+ * child user namespace's owned namespaces.
+ */
+TEST(listns_parent_userns_cap_sys_admin)
+{
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	bool found_child_userns;
+	ssize_t count;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		__u64 parent_userns_id;
+		__u64 child_userns_id;
+		struct ns_id_req req;
+		__u64 ns_ids[100];
+		ssize_t ret;
+		bool found_child_userns;
+
+		close(pipefd[0]);
+
+		/* Create parent user namespace - we have CAP_SYS_ADMIN in it */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get parent user namespace ID */
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &parent_userns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Create child user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get child user namespace ID */
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &child_userns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* Create namespaces owned by child user namespace */
+		if (unshare(CLONE_NEWNET) < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* List namespaces owned by parent user namespace */
+		req.size = sizeof(req);
+		req.spare = 0;
+		req.ns_id = 0;
+		req.ns_type = 0;
+		req.spare2 = 0;
+		req.user_ns_id = parent_userns_id;
+
+		ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+
+		/* Should see child user namespace in the list */
+		found_child_userns = false;
+		if (ret > 0) {
+			for (ssize_t i = 0; i < ret; i++) {
+				if (ns_ids[i] == child_userns_id) {
+					found_child_userns = true;
+					break;
+				}
+			}
+		}
+
+		write(pipefd[1], &found_child_userns, sizeof(found_child_userns));
+		write(pipefd[1], &ret, sizeof(ret));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	found_child_userns = false;
+	count = 0;
+	read(pipefd[0], &found_child_userns, sizeof(found_child_userns));
+	read(pipefd[0], &count, sizeof(count));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to setup namespaces");
+	}
+
+	ASSERT_TRUE(found_child_userns);
+	TH_LOG("Process with CAP_SYS_ADMIN in parent user namespace saw child user namespace (total: %zd)",
+			count);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 49/50] selftests/namespaces: sixth listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (47 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 48/50] selftests/namespaces: fifth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 11:43 ` [PATCH RFC DRAFT 50/50] selftests/namespaces: seventh " Christian Brauner
                   ` (4 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that we can see user namespaces we have CAP_SYS_ADMIN inside of.
This is different from seeing namespaces owned by a user namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/listns_permissions_test.c | 90 ++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
index 07c0c2be0aa5..709250ce1542 100644
--- a/tools/testing/selftests/namespaces/listns_permissions_test.c
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -573,4 +573,94 @@ TEST(listns_parent_userns_cap_sys_admin)
 			count);
 }
 
+/*
+ * Test that we can see user namespaces we have CAP_SYS_ADMIN inside of.
+ * This is different from seeing namespaces owned by a user namespace.
+ */
+TEST(listns_cap_sys_admin_inside_userns)
+{
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	bool found_ours;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		__u64 our_userns_id;
+		struct ns_id_req req;
+		__u64 ns_ids[100];
+		ssize_t ret;
+		bool found_ours;
+
+		close(pipefd[0]);
+
+		/* Create user namespace - we have CAP_SYS_ADMIN inside it */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Get our user namespace ID */
+		fd = open("/proc/self/ns/user", O_RDONLY);
+		if (fd < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		if (ioctl(fd, NS_GET_ID, &our_userns_id) < 0) {
+			close(fd);
+			close(pipefd[1]);
+			exit(1);
+		}
+		close(fd);
+
+		/* List all user namespaces globally */
+		req.size = sizeof(req);
+		req.spare = 0;
+		req.ns_id = 0;
+		req.ns_type = CLONE_NEWUSER;
+		req.spare2 = 0;
+		req.user_ns_id = 0;
+
+		ret = sys_listns(&req, ns_ids, ARRAY_SIZE(ns_ids), 0);
+
+		/* We should be able to see our own user namespace */
+		found_ours = false;
+		if (ret > 0) {
+			for (ssize_t i = 0; i < ret; i++) {
+				if (ns_ids[i] == our_userns_id) {
+					found_ours = true;
+					break;
+				}
+			}
+		}
+
+		write(pipefd[1], &found_ours, sizeof(found_ours));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	found_ours = false;
+	read(pipefd[0], &found_ours, sizeof(found_ours));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to setup namespace");
+	}
+
+	ASSERT_TRUE(found_ours);
+	TH_LOG("Process can see user namespace it has CAP_SYS_ADMIN inside of");
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* [PATCH RFC DRAFT 50/50] selftests/namespaces: seventh listns() permission test
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (48 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 49/50] selftests/namespaces: sixth " Christian Brauner
@ 2025-10-21 11:43 ` Christian Brauner
  2025-10-21 14:34 ` [PATCH RFC DRAFT 00/50] nstree: listns() Josef Bacik
                   ` (3 subsequent siblings)
  53 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-21 11:43 UTC (permalink / raw)
  To: linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann, Christian Brauner

Test that dropping CAP_SYS_ADMIN restricts what we can see.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/namespaces/listns_permissions_test.c | 111 +++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/tools/testing/selftests/namespaces/listns_permissions_test.c b/tools/testing/selftests/namespaces/listns_permissions_test.c
index 709250ce1542..9d1767e8b804 100644
--- a/tools/testing/selftests/namespaces/listns_permissions_test.c
+++ b/tools/testing/selftests/namespaces/listns_permissions_test.c
@@ -663,4 +663,115 @@ TEST(listns_cap_sys_admin_inside_userns)
 	TH_LOG("Process can see user namespace it has CAP_SYS_ADMIN inside of");
 }
 
+/*
+ * Test that dropping CAP_SYS_ADMIN restricts what we can see.
+ */
+TEST(listns_drop_cap_sys_admin)
+{
+	cap_t caps;
+	cap_value_t cap_list[1] = { CAP_SYS_ADMIN };
+
+	/* This test needs to start with CAP_SYS_ADMIN */
+	caps = cap_get_proc();
+	if (!caps) {
+		SKIP(return, "Cannot get capabilities");
+	}
+
+	cap_flag_value_t cap_val;
+	if (cap_get_flag(caps, CAP_SYS_ADMIN, CAP_EFFECTIVE, &cap_val) < 0) {
+		cap_free(caps);
+		SKIP(return, "Cannot check CAP_SYS_ADMIN");
+	}
+
+	if (cap_val != CAP_SET) {
+		cap_free(caps);
+		SKIP(return, "Test needs CAP_SYS_ADMIN to start");
+	}
+	cap_free(caps);
+
+	int pipefd[2];
+	pid_t pid;
+	int status;
+	bool correct;
+	ssize_t count_before, count_after;
+
+	ASSERT_EQ(pipe(pipefd), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		struct ns_id_req req = {
+			.size = sizeof(req),
+			.spare = 0,
+			.ns_id = 0,
+			.ns_type = CLONE_NEWNET,
+			.spare2 = 0,
+			.user_ns_id = LISTNS_CURRENT_USER,
+		};
+		__u64 ns_ids_before[100];
+		ssize_t count_before;
+		__u64 ns_ids_after[100];
+		ssize_t count_after;
+		bool correct;
+
+		close(pipefd[0]);
+
+		/* Create user namespace */
+		if (setup_userns() < 0) {
+			close(pipefd[1]);
+			exit(1);
+		}
+
+		/* Count namespaces with CAP_SYS_ADMIN */
+		count_before = sys_listns(&req, ns_ids_before, ARRAY_SIZE(ns_ids_before), 0);
+
+		/* Drop CAP_SYS_ADMIN */
+		caps = cap_get_proc();
+		if (caps) {
+			cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_CLEAR);
+			cap_set_flag(caps, CAP_PERMITTED, 1, cap_list, CAP_CLEAR);
+			cap_set_proc(caps);
+			cap_free(caps);
+		}
+
+		/* Ensure we can't regain the capability */
+		prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+
+		/* Count namespaces without CAP_SYS_ADMIN */
+		count_after = sys_listns(&req, ns_ids_after, ARRAY_SIZE(ns_ids_after), 0);
+
+		/* Without CAP_SYS_ADMIN, we should see same or fewer namespaces */
+		correct = (count_after <= count_before);
+
+		write(pipefd[1], &correct, sizeof(correct));
+		write(pipefd[1], &count_before, sizeof(count_before));
+		write(pipefd[1], &count_after, sizeof(count_after));
+		close(pipefd[1]);
+		exit(0);
+	}
+
+	/* Parent */
+	close(pipefd[1]);
+
+	correct = false;
+	count_before = 0;
+	count_after = 0;
+	read(pipefd[0], &correct, sizeof(correct));
+	read(pipefd[0], &count_before, sizeof(count_before));
+	read(pipefd[0], &count_after, sizeof(count_after));
+	close(pipefd[0]);
+
+	waitpid(pid, &status, 0);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	if (WEXITSTATUS(status) != 0) {
+		SKIP(return, "Child failed to setup namespace");
+	}
+
+	ASSERT_TRUE(correct);
+	TH_LOG("With CAP_SYS_ADMIN: %zd namespaces, without: %zd namespaces",
+			count_before, count_after);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.3



* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (49 preceding siblings ...)
  2025-10-21 11:43 ` [PATCH RFC DRAFT 50/50] selftests/namespaces: seventh " Christian Brauner
@ 2025-10-21 14:34 ` Josef Bacik
  2025-10-22  8:34   ` Christian Brauner
  2025-10-21 14:41 ` [syzbot ci] " syzbot ci
                   ` (2 subsequent siblings)
  53 siblings, 1 reply; 59+ messages in thread
From: Josef Bacik @ 2025-10-21 14:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Jann Horn, Mike Yuan,
	Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
	Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo,
	Johannes Weiner, Thomas Gleixner, Alexander Viro, Jan Kara,
	linux-kernel, cgroups, bpf, Eric Dumazet, Jakub Kicinski, netdev,
	Arnd Bergmann

On Tue, Oct 21, 2025 at 01:43:06PM +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need help here! Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or been reaped, all
> namespace file descriptors for that namespace have been closed and all
> bind-mounts for that namespace unmounted it ceases to appear in the
> listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the opener's credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the user namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility all
> namespaces that are still pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
> 
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persist the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a hole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
> 
> So two options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
>

I think the active reference counts are just nice to have, if I'm not missing
something we still have to figure out which pid is using the namespace we may
want to enter, so there's already a "time of check, time of use" issue. I think
if we want to have the active count we can do it just as an advisory thing, have
a flag that says "this ns is dying and you can't do anything with it", and then
for network namespaces we can just never set the flag and let the existing
SIOCGSKNS ioctl work as is.

The bigger question (and sorry I didn't think about this before now), is how are
we going to integrate this into the rest of the NS related syscalls? Having
programmatic introspection is excellent from a usability point of view, but we
also want to be able to have an easy way to get a PID from these namespaces, and
even eventually do things like setns() based on these IDs. Those are followup
series of course, but we should at least have a plan for them. Thanks,

Josef

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-21 14:34 ` [PATCH RFC DRAFT 00/50] nstree: listns() Josef Bacik
@ 2025-10-22  8:34   ` Christian Brauner
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-22  8:34 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-fsdevel, Jeff Layton, Jann Horn, Mike Yuan,
	Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
	Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo,
	Johannes Weiner, Thomas Gleixner, Alexander Viro, Jan Kara,
	linux-kernel, cgroups, bpf, Eric Dumazet, Jakub Kicinski, netdev,
	Arnd Bergmann

On Tue, Oct 21, 2025 at 10:34:54AM -0400, Josef Bacik wrote:
> On Tue, Oct 21, 2025 at 01:43:06PM +0200, Christian Brauner wrote:
> > Hey,
> > 
> > As announced a while ago this is the next step building on the nstree
> > work from prior cycles. There's a bunch of fixes and semantic cleanups
> > in here and a ton of tests.
> > 
> > I need help here!: Consider the following current design:
> > 
> > Currently listns() is relying on active namespace reference counts which
> > are introduced alongside this series.
> > 
> > The active reference count of a namespace consists of the live tasks
> > that make use of this namespace and any namespace file descriptors that
> > explicitly pin the namespace.
> > 
> > Once all tasks making use of this namespace have exited or reaped, all
> > namespace file descriptors for that namespace have been closed and all
> > bind-mounts for that namespace unmounted it ceases to appear in the
> > listns() output.
> > 
> > My reason for introducing the active reference count was that namespaces
> > might obviously still be pinned internally for various reasons. For
> > example the user namespace might still be pinned because there are still
> > open files that have stashed the openers credentials in file->f_cred, or
> > the last reference might be put with an rcu delay keeping that namespace
> > active on the namespace lists.
> > 
> > But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> > Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> > which uses lazy TLB destruction.
> > 
> > When this option is set a userspace task's struct mm_struct may be used
> > for kernel threads such as the idle task and will only be destroyed once
> > the cpu's runqueue switches back to another task. So the kernel thread
> > will take a reference on the struct mm_struct pinning it.
> > 
> > And for ptrace() based access checks struct mm_struct stashes the user
> > namespace of the task that struct mm_struct belonged to originally and
> > thus takes a reference to the users namespace and pins it.
> > 
> > So on an idle system such user namespaces can be persisted for pretty
> > arbitrary amounts of time via struct mm_struct.
> > 
> > Now, without the active reference count regulating visibility all
> > namespace that still are pinned in some way on the system will appear in
> > the listns() output and can be reopened using namespace file handles.
> > 
> > Of course that requires suitable privileges and it's not really a
> > concern per se because a task could've also persist the namespace
> > recorded in struct mm_struct explicitly and then the idle task would
> > still reuse that struct mm_struct and another task could still happily
> > setns() to it afaict and reuse it for something else.
> > 
> > The active reference count though has drawbacks itself. Namely that
> > socket files break the assumption that namespaces can only be opened if
> > there's either live processes pinning the namespace or there are file
> > descriptors open that pin the namespace itself as the socket SIOCGSKNS
> > ioctl() can be used to open a network namespace based on a socket which
> > only indirectly pins a network namespace.
> > 
> > So that punches a hole in the active reference count tracking. So this
> > will have to be handled as right now socket file descriptors that pin a
> > network namespace that don't have an active reference anymore (no live
> > processes, not explicit persistence via namespace fds) can't be used to
> > issue a SIOCGSKNS ioctl() to open the associated network namespace.
> > 
> > So two options I see if the api is based on ids:
> > 
> > (1) We use the active reference count and somehow also make it work with
> >     sockets.
> > (2) The active reference count is not needed and we say that listns() is
> >     an introspection system call anyway so we just always list
> >     namespaces regardless of why they are still pinned: files,
> >     mm_struct, network devices, everything is fair game.
> > (3) Throw hands up in the air and just not do it.
> >
> 
> I think the active reference counts are just nice to have, if I'm not missing
> something we still have to figure out which pid is using the namespace we may
> want to enter, so there's already a "time of check, time of use" issue. I think
> if we want to have the active count we can do it just as an advisory thing, have
> a flag that says "this ns is dying and you can't do anything with it", and then
> for network namespaces we can just never set the flag and let the existing
> SIOCGSKNS ioctl work as is.
> 
> The bigger question (and sorry I didn't think about this before now), is how are
> we going to integrate this into the rest of the NS related syscalls? Having
> programmatic introspection is excellent from a usability point of view, but we
> also want to be able to have an easy way to get a PID from these namespaces, and
> even eventually do things like setns() based on these IDs. Those are followup
> series of course, but we should at least have a plan for them. Thanks,

I don't think we even need to have separate system calls to operate
directly on the IDs; that's why I added namespace file handles.

We have listns() to iterate through namespaces in various ways.
This will be followed by statns() which will indeed operate on these
IDs to retrieve namespace specific information.

I already have that one drafted as well. (That can contain all kinds of
namespace specific information like number of mounts (mntns), number
of sockets (netns), number of network devices (netns), number of
processes (pidns), what have you. Although what to expose I'm leaving to
the individual namespaces to figure out. IOW, I'm not going to figure
out what information statns() should expose for network namespaces.
I'll leave that to net/.)

But to your other point: to perform traditional operations like setns()
or all of the ioctls associated with such namespaces, it's pretty easy:

  struct file_handle *net_handle;
  char net_buf[sizeof(*net_handle) + MAX_HANDLE_SZ];

  net_handle = (struct file_handle *)net_buf;
  net_handle->handle_bytes = sizeof(struct nsfs_file_handle);
  net_handle->handle_type = FILEID_NSFS;
  struct nsfs_file_handle *net_fh = (struct nsfs_file_handle *)net_handle->f_handle;
  net_fh->ns_id = netns_id;

Now obviously that should exist in a nice simple define and aligned and
not nastily open-coded like I did here but you get the point. The
namespace id is sufficient to open an fd to it which is the main api to
perform actual semantic operations on it.

  /*
   * As long as the caller has CAP_SYS_ADMIN in the owning user namespace
   * of the network namespace or is located in the network namespace they
   * can open a file descriptor to it just like with
   * /proc/<pid>/ns/<ns_type>
   */
  int netns_fd = open_by_handle_at(FD_NSFS_ROOT, net_handle, O_RDONLY);
  
  setns(netns_fd, CLONE_NEWNET);
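
As a rough sketch of what the "nice simple define and aligned" variant
could look like: the struct below places the two header fields of
struct file_handle directly in front of the nsfs payload, so the payload
lands where f_handle[] would start and the u64 is naturally aligned. All
example_* names are made up for illustration and are not part of the
series; the handle type is passed in by the caller so that no uapi
constant value is assumed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the nsfs file handle payload discussed in this thread. */
struct example_nsfs_payload {
	uint64_t ns_id;
	uint32_t ns_type;	/* may stay 0 for the lax, id-only form */
	uint32_t ns_inum;	/* may stay 0 for the lax, id-only form */
};

/*
 * Same leading fields as struct file_handle (handle_bytes, handle_type),
 * followed by the payload at the offset where f_handle[] begins.
 * Hypothetical helper type, not something the series defines.
 */
struct example_ns_handle {
	unsigned int handle_bytes;
	int handle_type;
	struct example_nsfs_payload payload;
};

/* Fill in an id-only handle; handle_type is supplied by the caller. */
void example_ns_handle_init(struct example_ns_handle *h, int handle_type,
			    uint64_t ns_id)
{
	memset(h, 0, sizeof(*h));
	h->handle_bytes = sizeof(h->payload);
	h->handle_type = handle_type;
	h->payload.ns_id = ns_id;
}
```

With such a helper the caller would presumably pass FILEID_NSFS as the
handle type and hand (struct file_handle *)&h to open_by_handle_at()
exactly as in the open-coded version above.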

Getting pids from pid namespaces or translating them between pid
namespaces is something I added a while ago as well, via ioctl_nsfs:

/* Translate pid from target pid namespace into the caller's pid namespace. */
#define NS_GET_PID_FROM_PIDNS	_IOR(NSIO, 0x6, int)
/* Return thread-group leader id of pid in the callers pid namespace. */
#define NS_GET_TGID_FROM_PIDNS	_IOR(NSIO, 0x7, int)
/* Translate pid from caller's pid namespace into a target pid namespace. */
#define NS_GET_PID_IN_PIDNS	_IOR(NSIO, 0x8, int)
/* Return thread-group leader id of pid in the target pid namespace. */
#define NS_GET_TGID_IN_PIDNS	_IOR(NSIO, 0x9, int)

That also works fd-based and so is covered by open_by_handle_at().

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [syzbot ci] Re: nstree: listns()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (50 preceding siblings ...)
  2025-10-21 14:34 ` [PATCH RFC DRAFT 00/50] nstree: listns() Josef Bacik
@ 2025-10-21 14:41 ` syzbot ci
  2025-10-22 11:00 ` [PATCH RFC DRAFT 00/50] " Ferenc Fejes
  2025-10-22 11:28 ` Jeff Layton
  53 siblings, 0 replies; 59+ messages in thread
From: syzbot ci @ 2025-10-21 14:41 UTC (permalink / raw)
  To: amir73il, arnd, bpf, brauner, cgroups, cyphar, daan.j.demeyer,
	edumazet, hannes, jack, jannh, jlayton, josef, kuba,
	linux-fsdevel, linux-kernel, me, mzxreary, netdev, tglx, tj, viro,
	zbyszek
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] nstree: listns()
https://lore.kernel.org/all/20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org
* [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags
* [PATCH RFC DRAFT 02/50] nsfs: use inode_just_drop()
* [PATCH RFC DRAFT 03/50] nsfs: raise DCACHE_DONTCACHE explicitly
* [PATCH RFC DRAFT 04/50] pidfs: raise DCACHE_DONTCACHE explicitly
* [PATCH RFC DRAFT 05/50] nsfs: raise SB_I_NODEV and SB_I_NOEXEC
* [PATCH RFC DRAFT 06/50] nstree: simplify return
* [PATCH RFC DRAFT 07/50] ns: initialize ns_list_node for initial namespaces
* [PATCH RFC DRAFT 08/50] ns: add __ns_ref_read()
* [PATCH RFC DRAFT 09/50] ns: add active reference count
* [PATCH RFC DRAFT 10/50] ns: use anonymous struct to group list member
* [PATCH RFC DRAFT 11/50] nstree: introduce a unified tree
* [PATCH RFC DRAFT 12/50] nstree: allow lookup solely based on inode
* [PATCH RFC DRAFT 13/50] nstree: assign fixed ids to the initial namespaces
* [PATCH RFC DRAFT 14/50] ns: maintain list of owned namespaces
* [PATCH RFC DRAFT 15/50] nstree: add listns()
* [PATCH RFC DRAFT 16/50] arch: hookup listns() system call
* [PATCH RFC DRAFT 17/50] nsfs: update tools header
* [PATCH RFC DRAFT 18/50] selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
* [PATCH RFC DRAFT 19/50] selftests/namespaces: first active reference count tests
* [PATCH RFC DRAFT 20/50] selftests/namespaces: second active reference count tests
* [PATCH RFC DRAFT 21/50] selftests/namespaces: third active reference count tests
* [PATCH RFC DRAFT 22/50] selftests/namespaces: fourth active reference count tests
* [PATCH RFC DRAFT 23/50] selftests/namespaces: fifth active reference count tests
* [PATCH RFC DRAFT 24/50] selftests/namespaces: sixth active reference count tests
* [PATCH RFC DRAFT 25/50] selftests/namespaces: seventh active reference count tests
* [PATCH RFC DRAFT 26/50] selftests/namespaces: eigth active reference count tests
* [PATCH RFC DRAFT 27/50] selftests/namespaces: ninth active reference count tests
* [PATCH RFC DRAFT 28/50] selftests/namespaces: tenth active reference count tests
* [PATCH RFC DRAFT 29/50] selftests/namespaces: eleventh active reference count tests
* [PATCH RFC DRAFT 30/50] selftests/namespaces: twelth active reference count tests
* [PATCH RFC DRAFT 31/50] selftests/namespaces: thirteenth active reference count tests
* [PATCH RFC DRAFT 32/50] selftests/namespaces: fourteenth active reference count tests
* [PATCH RFC DRAFT 33/50] selftests/namespaces: fifteenth active reference count tests
* [PATCH RFC DRAFT 34/50] selftests/namespaces: add listns() wrapper
* [PATCH RFC DRAFT 35/50] selftests/namespaces: first listns() test
* [PATCH RFC DRAFT 36/50] selftests/namespaces: second listns() test
* [PATCH RFC DRAFT 37/50] selftests/namespaces: third listns() test
* [PATCH RFC DRAFT 38/50] selftests/namespaces: fourth listns() test
* [PATCH RFC DRAFT 39/50] selftests/namespaces: fifth listns() test
* [PATCH RFC DRAFT 40/50] selftests/namespaces: sixth listns() test
* [PATCH RFC DRAFT 41/50] selftests/namespaces: seventh listns() test
* [PATCH RFC DRAFT 42/50] selftests/namespaces: ninth listns() test
* [PATCH RFC DRAFT 43/50] selftests/namespaces: ninth listns() test
* [PATCH RFC DRAFT 44/50] selftests/namespaces: first listns() permission test
* [PATCH RFC DRAFT 45/50] selftests/namespaces: second listns() permission test
* [PATCH RFC DRAFT 46/50] selftests/namespaces: third listns() permission test
* [PATCH RFC DRAFT 47/50] selftests/namespaces: fourth listns() permission test
* [PATCH RFC DRAFT 48/50] selftests/namespaces: fifth listns() permission test
* [PATCH RFC DRAFT 49/50] selftests/namespaces: sixth listns() permission test
* [PATCH RFC DRAFT 50/50] selftests/namespaces: seventh listns() permission test

and found the following issue:
WARNING in __ns_tree_add_raw

Full report is available here:
https://ci.syzbot.org/series/03ca38c3-876c-4231-aa06-ddb0bc8a30ad

***

WARNING in __ns_tree_add_raw

tree:      bpf
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf.git
base:      5fb750e8a9ae123b2034771b864b8a21dbef65cd
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/156cf21b-68f9-423c-807a-3dd094e6aed8/config

------------[ cut here ]------------
WARNING: CPU: 1 PID: 5816 at kernel/nstree.c:189 __ns_tree_add_raw+0xa92/0xb30
Modules linked in:
CPU: 1 UID: 0 PID: 5816 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__ns_tree_add_raw+0xa92/0xb30
Code: 32 00 90 0f 0b 90 42 80 3c 23 00 0f 85 1e fc ff ff e9 21 fc ff ff e8 ed 78 32 00 90 0f 0b 90 e9 66 fc ff ff e8 df 78 32 00 90 <0f> 0b 90 e9 53 ff ff ff 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c ef
RSP: 0018:ffffc90003f27c30 EFLAGS: 00010293
RAX: ffffffff818e0051 RBX: 1ffffffff16db871 RCX: ffff88810ffe0000
RDX: 0000000000000000 RSI: ffff8881bbf5e9a8 RDI: ffff88816d0e6e00
RBP: ffff88816d0e6e00 R08: ffff88816d0e6e3f R09: 0000000000000000
R10: ffff88816d0e6e30 R11: ffffffff81b988c0 R12: dffffc0000000000
R13: ffff88816d0e6e40 R14: ffffffff8b6dc388 R15: ffff8881bbf5e9a8
FS:  000055558630f500(0000) GS:ffff8882a9d04000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5c3f03529c CR3: 000000010b5bc000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 copy_cgroup_ns+0x373/0x5f0
 create_new_namespaces+0x358/0x720
 unshare_nsproxy_namespaces+0x11c/0x170
 ksys_unshare+0x4c8/0x8c0
 __x64_sys_unshare+0x38/0x50
 do_syscall_64+0xfa/0xfa0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5c3ef907c7
Code: 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc62c39c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007ffc62c39c90 RCX: 00007f5c3ef907c7
RDX: 0000000000000000 RSI: 00007f5c3f03529c RDI: 0000000002000000
RBP: 00007ffc62c39d20 R08: 0000000000000000 R09: 00007f5c3fd1d6c0
R10: 0000000000044000 R11: 0000000000000246 R12: 00007ffc62c39d20
R13: 00007ffc62c39d28 R14: 0000000000000009 R15: 0000000000000000
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (51 preceding siblings ...)
  2025-10-21 14:41 ` [syzbot ci] " syzbot ci
@ 2025-10-22 11:00 ` Ferenc Fejes
  2025-10-24 14:50   ` Christian Brauner
  2025-10-22 11:28 ` Jeff Layton
  53 siblings, 1 reply; 59+ messages in thread
From: Ferenc Fejes @ 2025-10-22 11:00 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, Josef Bacik, Jeff Layton
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann

On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need help here!: Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or reaped, all
> namespace file descriptors for that namespace have been closed and all
> bind-mounts for that namespace unmounted it ceases to appear in the
> listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the openers credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the users namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility all
> namespace that still are pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
> 
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persist the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a hole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
> 
> So two options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
> 
> =====================================================================
> 
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
> 
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
> 
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
>    running process but are kept alive by file descriptors, bind mounts,
>    or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
>    namespaces.
> 
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.

I've been waiting for such an API for years; thanks for working on it. I mostly
deal with network namespaces, where points 2 and 3 are especially painful.

Recently, I've used this eBPF snippet to discover (at most 1024, because of the
verifier's halt checking) network namespaces, even if no process is attached.
But I can't do anything with it in userspace since it's not possible to pass the
inode number or netns cookie value to setns()...

extern const void net_namespace_list __ksym;
static void list_all_netns()
{
    struct list_head *nslist = 
	bpf_core_cast(&net_namespace_list, struct list_head);

    struct list_head *iter = nslist->next;

    bpf_repeat(1024) {
        const struct net *net = 
		bpf_core_cast(container_of(iter, struct net, list), struct net);

        // bpf_printk("net: %p inode: %u cookie: %lu", 
	//	net, net->ns.inum, net->net_cookie);

        if (iter->next == nslist)
            break;
        iter = iter->next;
    }
}

> 
> /*
>  * @req: Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids: User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags: Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
> 
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
> 
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user
>  *                                   namespace
>  *              Other value: Namespaces owned by the specified user
>  *                           namespace ID
>  */
> struct ns_id_req {
>         __u32 size;         /* sizeof(struct ns_id_req) */
>         __u32 spare;        /* Reserved, must be 0 */
>         __u64 ns_id;        /* Last seen namespace ID (for pagination) */
>         __u32 ns_type;      /* Filter by namespace type(s) */
>         __u32 spare2;       /* Reserved, must be 0 */
>         __u64 user_ns_id;   /* Filter by owning user namespace */
> };
> 
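
To make sure I read the pagination rule right (ns_id = 0 on the first
call, then the last returned ID), here is a small sketch of the loop I
would write. Since the syscall number is not settled, the call goes
through a callback with the listns() signature, and a fake backend
stands in for the kernel. Everything named example_* is hypothetical,
and the fake assumes that a call returns only IDs greater than
req->ns_id, which is my reading of the description and not something I
verified against the patches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Mirrors the struct ns_id_req layout quoted above. */
struct example_ns_id_req {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t spare2;
	uint64_t user_ns_id;
};

/* Same shape as the proposed listns() prototype. */
typedef ssize_t (*example_listns_fn)(const struct example_ns_id_req *req,
				     uint64_t *ns_ids, size_t nr_ns_ids,
				     unsigned int flags);

/* Fake backend for illustration: returns known IDs above req->ns_id. */
ssize_t example_fake_listns(const struct example_ns_id_req *req,
			    uint64_t *ns_ids, size_t nr_ns_ids,
			    unsigned int flags)
{
	static const uint64_t known[] = { 3, 7, 9, 12, 20 };
	size_t n = 0;

	if (flags != 0 || req->size != sizeof(*req))
		return -1;
	for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
		if (n == nr_ns_ids)
			break;
		if (known[i] > req->ns_id)
			ns_ids[n++] = known[i];
	}
	return (ssize_t)n;
}

/*
 * Collect up to cap IDs in pages of at most batch entries.
 * Returns the number of IDs stored or a negative value on error.
 */
ssize_t example_list_all(example_listns_fn listns_cb, uint32_t ns_type,
			 uint64_t *out, size_t cap, size_t batch)
{
	struct example_ns_id_req req;
	size_t total = 0;

	memset(&req, 0, sizeof(req));
	req.size = sizeof(req);
	req.ns_type = ns_type;	/* 0: all namespace types */

	while (total < cap) {
		size_t want = cap - total < batch ? cap - total : batch;
		ssize_t n = listns_cb(&req, out + total, want, 0);

		if (n < 0)
			return n;
		if (n == 0)
			break;	/* nothing left to list */
		total += (size_t)n;
		req.ns_id = out[total - 1];	/* resume after last ID */
	}
	return (ssize_t)total;
}
```

Swapping example_fake_listns for a thin syscall() wrapper should be all
that is needed once a syscall number exists.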

After this is merged, do you see any chance for backports? Does it rely on
recent bits that are hard or impossible to backport? I'm not aware of
backported syscalls, but this would be really nice to see in older kernels.

Ferenc

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-22 11:00 ` [PATCH RFC DRAFT 00/50] " Ferenc Fejes
@ 2025-10-24 14:50   ` Christian Brauner
  2025-10-27 10:49     ` Ferenc Fejes
  0 siblings, 1 reply; 59+ messages in thread
From: Christian Brauner @ 2025-10-24 14:50 UTC (permalink / raw)
  To: Ferenc Fejes
  Cc: linux-fsdevel, Josef Bacik, Jeff Layton, Jann Horn, Mike Yuan,
	Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
	Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo,
	Johannes Weiner, Thomas Gleixner, Alexander Viro, Jan Kara,
	linux-kernel, cgroups, bpf, Eric Dumazet, Jakub Kicinski, netdev,
	Arnd Bergmann

> > Add a new listns() system call that allows userspace to iterate through
> > namespaces in the system. This provides a programmatic interface to
> > discover and inspect namespaces, enhancing existing namespace apis.
> > 
> > Currently, there is no direct way for userspace to enumerate namespaces
> > in the system. Applications must resort to scanning /proc/<pid>/ns/
> > across all processes, which is:
> > 
> > 1. Inefficient - requires iterating over all processes
> > 2. Incomplete - misses inactive namespaces that aren't attached to any
> >    running process but are kept alive by file descriptors, bind mounts,
> >    or parent namespace references
> > 3. Permission-heavy - requires access to /proc for many processes
> > 4. No ordering or ownership.
> > 5. No filtering per namespace type: Must always iterate and check all
> >    namespaces.
> > 
> > The list goes on. The listns() system call solves these problems by
> > providing direct kernel-level enumeration of namespaces. It is similar
> > to listmount() but obviously tailored to namespaces.
> 
> I've been waiting for such an API for years; thanks for working on it. I mostly
> deal with network namespaces, where points 2 and 3 are especially painful.
> 
> Recently, I've used this eBPF snippet to discover (at most 1024, because of the
> verifier's halt checking) network namespaces, even if no process is attached.
> But I can't do anything with it in userspace since it's not possible to pass the
> inode number or netns cookie value to setns()...

I've mentioned it in the cover letter and in my earlier reply to Josef:

On v6.18+ kernels it is possible to generate and open file handles to
namespaces. This is probably an api that people outside of fs/ proper
aren't all that familiar with.

In essence it allows you to refer to files - or, more generally, kernel
objects that may be referenced via files - via opaque handles instead
of paths.

For regular filesystems that are multi-instance (IOW, you can have
multiple btrfs or ext4 filesystems mounted) such file handles cannot be
used without providing a file descriptor to another object in the
filesystem that is used to resolve the file handle...

However, for single-instance filesystems like pidfs and nsfs that's not
required which is why I added:

FD_PIDFS_ROOT
FD_NSFS_ROOT

which means that you can open both pidfds and namespaces via
open_by_handle_at() purely based on the file handle. I call such file
handles "exhaustive file handles" because they fully describe the object
to be resolvable without any further information.

They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
check that regular file handles are and so can be used even by
unprivileged code as long as the caller is sufficiently privileged over
the relevant object (pid resolvable in caller's pid namespace for pidfds,
or caller located in namespace or privileged over the owning user
namespace of the relevant namespace for nsfs).

File handles for namespaces have the following uapi:

struct nsfs_file_handle {
	__u64 ns_id;
	__u32 ns_type;
	__u32 ns_inum;
};

#define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
#define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */

and it is explicitly allowed to generate such file handles manually in
userspace. When the kernel generates a namespace file handle via
name_to_handle_at() it will return: ns_id, ns_type, and ns_inum but
userspace is allowed to provide the kernel with a laxer file handle
where only the ns_id is filled in but ns_type and ns_inum are zero - at
least after this patch series.

So for your case, where you even know the inode number, ns type, and ns
id, you can fill in a struct nsfs_file_handle (see my reply to Josef or
the (ugly) tests for how), call

fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);

and open the namespace (provided it is still active).

> 
> extern const void net_namespace_list __ksym;
> static void list_all_netns()
> {
>     struct list_head *nslist = 
> 	bpf_core_cast(&net_namespace_list, struct list_head);
> 
>     struct list_head *iter = nslist->next;
> 
>     bpf_repeat(1024) {

This isn't needed anymore. I've implemented it in a bpf-friendly way so
it's possible to add kfuncs that would allow you to iterate through the
various namespace trees (locklessly).

If this is merged then I'll likely design that bpf part myself.

> After this is merged, do you see any chance for backports? Does it rely on recent
> bits that are hard or impossible to backport? I'm not aware of backported syscalls
> but this would be really nice to see in older kernels.

Uhm, what downstream entities managing kernels do is not my concern, but
for upstream it's certainly not an option. There's a lot of preparatory
work that would have to be backported.


* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-24 14:50   ` Christian Brauner
@ 2025-10-27 10:49     ` Ferenc Fejes
  0 siblings, 0 replies; 59+ messages in thread
From: Ferenc Fejes @ 2025-10-27 10:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Josef Bacik, Jeff Layton, Jann Horn, Mike Yuan,
	Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
	Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo,
	Johannes Weiner, Thomas Gleixner, Alexander Viro, Jan Kara,
	linux-kernel, cgroups, bpf, Eric Dumazet, Jakub Kicinski, netdev,
	Arnd Bergmann

On Fri, 2025-10-24 at 16:50 +0200, Christian Brauner wrote:
> > > Add a new listns() system call that allows userspace to iterate through
> > > namespaces in the system. This provides a programmatic interface to
> > > discover and inspect namespaces, enhancing existing namespace apis.
> > > 
> > > Currently, there is no direct way for userspace to enumerate namespaces
> > > in the system. Applications must resort to scanning /proc/<pid>/ns/
> > > across all processes, which is:
> > > 
> > > 1. Inefficient - requires iterating over all processes
> > > 2. Incomplete - misses inactive namespaces that aren't attached to any
> > >    running process but are kept alive by file descriptors, bind mounts,
> > >    or parent namespace references
> > > 3. Permission-heavy - requires access to /proc for many processes
> > > 4. No ordering or ownership.
> > > 5. No filtering per namespace type: Must always iterate and check all
> > >    namespaces.
> > > 
> > > The list goes on. The listns() system call solves these problems by
> > > providing direct kernel-level enumeration of namespaces. It is similar
> > > to listmount() but obviously tailored to namespaces.
> > 
> > > I've been waiting for such an API for years; thanks for working on it. I
> > > mostly deal with network namespaces, where points 2 and 3 are especially
> > > painful.
> > > 
> > > Recently, I've used this eBPF snippet to discover (at most 1024, because of
> > > the verifier's halt checking) network namespaces, even if no process is
> > > attached. But I can't do anything with it in userspace since it's not
> > > possible to pass the inode number or netns cookie value to setns()...
> 
> I've mentioned it in the cover letter and in my earlier reply to Josef:
> 
> On v6.18+ kernels it is possible to generate and open file handles to
> namespaces. This is probably an api that people outside of fs/ proper
> aren't all that familiar with.
> 
> In essence it allows you to refer to files (or, more generally, to
> kernel objects that may be referenced via files) via opaque handles
> instead of paths.
> 
> For regular filesystems that are multi-instance (IOW, you can have
> multiple btrfs or ext4 filesystems mounted) such file handles cannot be
> used without providing a file descriptor to another object in the
> filesystem that is used to resolve the file handle...
> 
> However, for single-instance filesystems like pidfs and nsfs that's not
> required which is why I added:
> 
> FD_PIDFS_ROOT
> FD_NSFS_ROOT
> 
> which means that you can open both pidfds and namespaces via
> open_by_handle_at() purely based on the file handle. I call such file
> handles "exhaustive file handles" because they fully describe the object
> to be resolvable without any further information.
> 
> They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
> check that regular file handles are and so can be used even by
> unprivileged code as long as the caller is sufficiently privileged over
> the relevant object (pid resolvable in caller's pid namespace for pidfds,
> or caller located in namespace or privileged over the owning user
> namespace of the relevant namespace for nsfs).
> 
> File handles for namespaces have the following uapi:
> 
> struct nsfs_file_handle {
> 	__u64 ns_id;
> 	__u32 ns_type;
> 	__u32 ns_inum;
> };
> 
> #define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
> #define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
> 
> and it is explicitly allowed to generate such file handles manually in
> userspace. When the kernel generates a namespace file handle via
> name_to_handle_at() it will return ns_id, ns_type, and ns_inum, but
> userspace is allowed to provide the kernel with a laxer file handle
> where only ns_id is filled in and ns_type and ns_inum are zero, at
> least after this patch series.
> 
> So for your case, where you even know the inode number, ns type, and ns id,
> you can fill in a struct nsfs_file_handle (see my reply to Josef or the
> (ugly) tests for examples), call
> 
> fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);
> 
> and open the namespace (provided it is still active).
> 
> > 
> > extern const void net_namespace_list __ksym;
> > static void list_all_netns()
> > {
> >     struct list_head *nslist = 
> > 	bpf_core_cast(&net_namespace_list, struct list_head);
> > 
> >     struct list_head *iter = nslist->next;
> > 
> >     bpf_repeat(1024) {
> 
> This isn't needed anymore. I've implemented it in a bpf-friendly way so
> it's possible to add kfuncs that would allow you to iterate through the
> various namespace trees (locklessly).
> 
> If this is merged then I'll likely design that bpf part myself.

Excellent, thanks for the detailed explanation, noted! Well, I guess I have to
keep a closer eye on recent ns changes; I was aware of pidfs but not the
helpers you just mentioned.

> 
> > After this merged, do you see any chance for backports? Does it rely on
> > recent bits which is hard/impossible to backport? I'm not aware of
> > backported syscalls but this would be really nice to see in older kernels.
> 
> Uhm, what downstream entities managing kernels do is not my concern, but
> for upstream it's certainly not an option. There's a lot of preparatory
> work that would have to be backported.

I was curious about the upstream option, but I see this isn't feasible. Anyway,
it's great we will have this in the future, thanks for doing it!

Ferenc


* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
                   ` (52 preceding siblings ...)
  2025-10-22 11:00 ` [PATCH RFC DRAFT 00/50] " Ferenc Fejes
@ 2025-10-22 11:28 ` Jeff Layton
  2025-10-24 14:54   ` Christian Brauner
  53 siblings, 1 reply; 59+ messages in thread
From: Jeff Layton @ 2025-10-22 11:28 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, Josef Bacik
  Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek,
	Lennart Poettering, Daan De Meyer, Aleksa Sarai, Amir Goldstein,
	Tejun Heo, Johannes Weiner, Thomas Gleixner, Alexander Viro,
	Jan Kara, linux-kernel, cgroups, bpf, Eric Dumazet,
	Jakub Kicinski, netdev, Arnd Bergmann

On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need help here! Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or been reaped, all
> namespace file descriptors for that namespace have been closed and all
> bind-mounts for that namespace unmounted it ceases to appear in the
> listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the opener's credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the user namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility all
> namespaces that are still pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
> 
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persisted the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a hole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
> 

Is this capability something we need to preserve? It seems like the
fact that SIOCGSKNS works when there are no active references left
might have been an accident. Is there a legit use-case for allowing
that?

I don't see a problem with active+passive refcounts. They're more
complicated to deal with, but we've used them elsewhere so it's a
pattern we all know (even if we don't necessarily love them).

I'll also point out that net namespaces already have two refcounts for
this exact reason. Do you plan to replace the passive refcount in
struct net with the new passive refcount you're implementing here?
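
To make the active/passive split concrete, here is a rough userspace
model using C11 atomics. It is only a sketch and not the series'
implementation: struct ns_model and all helper names are hypothetical,
and the two counters are kept independent for simplicity. The passive
count keeps the object allocated; the active count decides whether it
would be listed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; not the kernel's struct ns_common. */
struct ns_model {
	atomic_int passive; /* lifetime: any pin, including internal ones */
	atomic_int active;  /* visibility: tasks, namespace fds, bind mounts */
};

/* The creating task starts out holding one reference of each kind. */
void ns_model_init(struct ns_model *ns)
{
	atomic_init(&ns->passive, 1);
	atomic_init(&ns->active, 1);
}

/* Internal pin (think f_cred or mm_struct): lifetime only. */
void ns_model_get_passive(struct ns_model *ns)
{
	atomic_fetch_add(&ns->passive, 1);
}

/* Returns true when the last passive reference went away (free now). */
bool ns_model_put_passive(struct ns_model *ns)
{
	return atomic_fetch_sub(&ns->passive, 1) == 1;
}

/* Explicit user of the namespace: a task or a namespace fd. */
void ns_model_get_active(struct ns_model *ns)
{
	atomic_fetch_add(&ns->active, 1);
}

void ns_model_put_active(struct ns_model *ns)
{
	atomic_fetch_sub(&ns->active, 1);
}

/* A listing call in this model would only report visible namespaces. */
bool ns_model_visible(struct ns_model *ns)
{
	return atomic_load(&ns->active) > 0;
}
```

In this model a namespace pinned only by an internal reference stays
allocated (passive > 0) but drops out of the listing once the last
active reference is put.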

> So two options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
> 

Is listns() the only reason we'd need active/passive refcounts? It
seems like we might need them for other reasons (e.g. struct net).

In any case, given that this is a privileged syscall, I don't
necessarily see a problem with #2 here. Leaked namespaces can be a
problem and we don't have good visibility into them at the moment.

IMO, even if you keep the active+passive refcounts, it would be good to
be able to tell listns() to return all the namespaces, and not just the
ones that are still active. Maybe that can be the first flag for this
new syscall?

> =====================================================================
> 
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
> 
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
> 
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
>    running process but are kept alive by file descriptors, bind mounts,
>    or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
>    namespaces.
> 
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.
> 
> /*
>  * @req: Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids: User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags: Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
> 
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
> 
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
>  *              Other value: Namespaces owned by the specified user namespace ID
>  */
> struct ns_id_req {
>         __u32 size;         /* sizeof(struct ns_id_req) */
>         __u32 spare;        /* Reserved, must be 0 */
>         __u64 ns_id;        /* Last seen namespace ID (for pagination) */
>         __u32 ns_type;      /* Filter by namespace type(s) */
>         __u32 spare2;       /* Reserved, must be 0 */
>         __u64 user_ns_id;   /* Filter by owning user namespace */
> };
> 
> Example 1: List all namespaces
> 
> void list_all_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,      /* Start from beginning */
> 		.ns_type = 0,    /* All types */
> 		.user_ns_id = 0, /* All user namespaces */
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	printf("All namespaces in the system:\n");
> 	do {
> 		ret = listns(&req, ids, 100, 0);
> 		if (ret < 0) {
> 			perror("listns");
> 			break;
> 		}
> 
> 		for (ssize_t i = 0; i < ret; i++)
> 			printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);
> 
> 		/* Continue from last seen ID */
> 		if (ret > 0)
> 			req.ns_id = ids[ret - 1];
> 	} while (ret == 100); /* Buffer was full, more may exist */
> }
> 
> Example 2: List network namespaces only
> 
> void list_network_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = NET_NS, /* Only network namespaces */
> 		.user_ns_id = 0,
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	ret = listns(&req, ids, 100, 0);
> 	if (ret < 0) {
> 		perror("listns");
> 		return;
> 	}
> 
> 	printf("Network namespaces: %zd found\n", ret);
> 	for (ssize_t i = 0; i < ret; i++)
> 		printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
> }
> 
> Example 3: List namespaces owned by current user namespace
> 
> void list_owned_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = 0,                      /* All types */
> 		.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	ret = listns(&req, ids, 100, 0);
> 	if (ret < 0) {
> 		perror("listns");
> 		return;
> 	}
> 
> 	printf("Namespaces owned by my user namespace: %zd\n", ret);
> 	for (ssize_t i = 0; i < ret; i++)
> 		printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
> }
> 
> Example 4: List multiple namespace types
> 
> void list_network_and_mount_namespaces(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = NET_NS | MNT_NS, /* Network and mount */
> 		.user_ns_id = 0,
> 	};
> 	uint64_t ids[100];
> 	ssize_t ret;
> 
> 	ret = listns(&req, ids, 100, 0);
> 	if (ret < 0) {
> 		perror("listns");
> 		return;
> 	}
> 
> 	printf("Network and mount namespaces: %zd found\n", ret);
> }
> 
> Example 5: Pagination through large namespace sets
> 
> void list_all_with_pagination(void)
> {
> 	struct ns_id_req req = {
> 		.size = sizeof(req),
> 		.ns_id = 0,
> 		.ns_type = 0,
> 		.user_ns_id = 0,
> 	};
> 	uint64_t ids[50];
> 	size_t total = 0;
> 	ssize_t ret;
> 
> 	printf("Enumerating all namespaces with pagination:\n");
> 
> 	while (1) {
> 		ret = listns(&req, ids, 50, 0);
> 		if (ret < 0) {
> 			perror("listns");
> 			break;
> 		}
> 		if (ret == 0)
> 			break; /* No more namespaces */
> 
> 		total += ret;
> 		printf("  Batch: %zd namespaces\n", ret);
> 
> 		/* Last ID in this batch becomes start of next batch */
> 		req.ns_id = ids[ret - 1];
> 
> 		if (ret < 50)
> 			break; /* Partial batch = end of results */
> 	}
> 
> 	printf("Total: %zu namespaces\n", total);
> }
> 
> listns() respects namespace isolation and capabilities:
> 
> (1) Global listing (user_ns_id = 0):
>     - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
>     - OR the namespace must be in the caller's namespace context (e.g.,
>       a namespace the caller is currently using)
>     - User namespaces additionally allow listing if the caller has
>       CAP_SYS_ADMIN in that user namespace itself
> (2) Owner-filtered listing (user_ns_id != 0):
>     - Requires CAP_SYS_ADMIN in the specified owner user namespace
>     - OR the namespace must be in the caller's namespace context
>     - This allows unprivileged processes to enumerate namespaces they own
> (3) Visibility:
>     - Only "active" namespaces are listed
>     - A namespace is active if it has a non-zero __ns_ref_active count
>     - This includes namespaces used by running processes, held by open
>       file descriptors, or kept active by bind mounts
>     - Inactive namespaces (kept alive only by internal kernel
>       references) are not visible via listns()
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (50):
>       libfs: allow to specify s_d_flags
>       nsfs: use inode_just_drop()
>       nsfs: raise DCACHE_DONTCACHE explicitly
>       pidfs: raise DCACHE_DONTCACHE explicitly
>       nsfs: raise SB_I_NODEV and SB_I_NOEXEC
>       nstree: simplify return
>       ns: initialize ns_list_node for initial namespaces
>       ns: add __ns_ref_read()
>       ns: add active reference count
>       ns: use anonymous struct to group list member
>       nstree: introduce a unified tree
>       nstree: allow lookup solely based on inode
>       nstree: assign fixed ids to the initial namespaces
>       ns: maintain list of owned namespaces
>       nstree: add listns()
>       arch: hookup listns() system call
>       nsfs: update tools header
>       selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
>       selftests/namespaces: first active reference count tests
>       selftests/namespaces: second active reference count tests
>       selftests/namespaces: third active reference count tests
>       selftests/namespaces: fourth active reference count tests
>       selftests/namespaces: fifth active reference count tests
>       selftests/namespaces: sixth active reference count tests
>       selftests/namespaces: seventh active reference count tests
>       selftests/namespaces: eigth active reference count tests
>       selftests/namespaces: ninth active reference count tests
>       selftests/namespaces: tenth active reference count tests
>       selftests/namespaces: eleventh active reference count tests
>       selftests/namespaces: twelth active reference count tests
>       selftests/namespaces: thirteenth active reference count tests
>       selftests/namespaces: fourteenth active reference count tests
>       selftests/namespaces: fifteenth active reference count tests
>       selftests/namespaces: add listns() wrapper
>       selftests/namespaces: first listns() test
>       selftests/namespaces: second listns() test
>       selftests/namespaces: third listns() test
>       selftests/namespaces: fourth listns() test
>       selftests/namespaces: fifth listns() test
>       selftests/namespaces: sixth listns() test
>       selftests/namespaces: seventh listns() test
>       selftests/namespaces: ninth listns() test
>       selftests/namespaces: ninth listns() test
>       selftests/namespaces: first listns() permission test
>       selftests/namespaces: second listns() permission test
>       selftests/namespaces: third listns() permission test
>       selftests/namespaces: fourth listns() permission test
>       selftests/namespaces: fifth listns() permission test
>       selftests/namespaces: sixth listns() permission test
>       selftests/namespaces: seventh listns() permission test
> 
>  arch/alpha/kernel/syscalls/syscall.tbl             |    1 +
>  arch/arm/tools/syscall.tbl                         |    1 +
>  arch/arm64/tools/syscall_32.tbl                    |    1 +
>  arch/m68k/kernel/syscalls/syscall.tbl              |    1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl        |    1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl          |    1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl          |    1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl          |    1 +
>  arch/parisc/kernel/syscalls/syscall.tbl            |    1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl           |    1 +
>  arch/s390/kernel/syscalls/syscall.tbl              |    1 +
>  arch/sh/kernel/syscalls/syscall.tbl                |    1 +
>  arch/sparc/kernel/syscalls/syscall.tbl             |    1 +
>  arch/x86/entry/syscalls/syscall_32.tbl             |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl             |    1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl            |    1 +
>  fs/libfs.c                                         |    1 +
>  fs/namespace.c                                     |    8 +-
>  fs/nsfs.c                                          |   79 +-
>  fs/pidfs.c                                         |    1 +
>  include/linux/ns_common.h                          |  147 +-
>  include/linux/nsfs.h                               |    3 +
>  include/linux/nstree.h                             |   26 +-
>  include/linux/pseudo_fs.h                          |    1 +
>  include/linux/syscalls.h                           |    4 +
>  include/uapi/asm-generic/unistd.h                  |    4 +-
>  include/uapi/linux/nsfs.h                          |   58 +
>  init/version-timestamp.c                           |    5 +
>  ipc/msgutil.c                                      |    5 +
>  ipc/namespace.c                                    |    1 +
>  kernel/cgroup/cgroup.c                             |    5 +
>  kernel/cgroup/namespace.c                          |    1 +
>  kernel/cred.c                                      |   17 +
>  kernel/exit.c                                      |    1 +
>  kernel/nscommon.c                                  |   59 +-
>  kernel/nsproxy.c                                   |    7 +
>  kernel/nstree.c                                    |  527 ++++-
>  kernel/pid.c                                       |   15 +
>  kernel/pid_namespace.c                             |    1 +
>  kernel/time/namespace.c                            |    6 +
>  kernel/user.c                                      |    5 +
>  kernel/user_namespace.c                            |    1 +
>  kernel/utsname.c                                   |    1 +
>  net/core/net_namespace.c                           |    3 +-
>  scripts/syscall.tbl                                |    1 +
>  tools/include/uapi/linux/nsfs.h                    |   70 +
>  tools/testing/selftests/filesystems/utils.c        |    2 +-
>  tools/testing/selftests/namespaces/.gitignore      |    3 +
>  tools/testing/selftests/namespaces/Makefile        |    7 +-
>  .../selftests/namespaces/listns_permissions_test.c |  777 +++++++
>  tools/testing/selftests/namespaces/listns_test.c   |  656 ++++++
>  .../selftests/namespaces/ns_active_ref_test.c      | 2226 ++++++++++++++++++++
>  tools/testing/selftests/namespaces/wrappers.h      |   35 +
>  53 files changed, 4737 insertions(+), 48 deletions(-)
> ---
> base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
> change-id: 20251020-work-namespace-nstree-listns-9fd71518515c

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH RFC DRAFT 00/50] nstree: listns()
  2025-10-22 11:28 ` Jeff Layton
@ 2025-10-24 14:54   ` Christian Brauner
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2025-10-24 14:54 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, Josef Bacik, Jann Horn, Mike Yuan,
	Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
	Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo,
	Johannes Weiner, Thomas Gleixner, Alexander Viro, Jan Kara,
	linux-kernel, cgroups, bpf, Eric Dumazet, Jakub Kicinski, netdev,
	Arnd Bergmann

> > So that punches a hole in the active reference count tracking. So this
> > will have to be handled as right now socket file descriptors that pin a
> > network namespace that don't have an active reference anymore (no live
> > processes, not explicit persistence via namespace fds) can't be used to
> > issue a SIOCGSKNS ioctl() to open the associated network namespace.
> > 
> 
> Is this capability something we need to preserve? It seems like the
> fact that SIOCGSKNS works when there are no active references left
> might have been an accident. Is there a legit use-case for allowing
> that?

I've solved that use-case now and have added a large testsuite to verify
that it works.

> 
> I don't see a problem with active+passive refcounts. They're more
> complicated to deal with, but we've used them elsewhere so it's a
> pattern we all know (even if we don't necessarily love them).

+1

> I'll also point out that net namespaces already have two refcounts for
> this exact reason. Do you plan to replace the passive refcount in
> struct net with the new passive refcount you're implementing here?

Yeah, that's an option. I think that in the future it should also be
possible to completely drop the net/ internal network namespace tracking
and rely on the nstree infrastructure only. But that's work for the
future.

> 
> > So two options I see if the api is based on ids:
> > 
> > (1) We use the active reference count and somehow also make it work with
> >     sockets.
> > (2) The active reference count is not needed and we say that listns() is
> >     an introspection system call anyway so we just always list
> >     namespaces regardless of why they are still pinned: files,
> >     mm_struct, network devices, everything is fair game.
> > (3) Throw hands up in the air and just not do it.
> > 
> 
> Is listns() the only reason we'd need active/passive refcounts? It
> seems like we might need them for other reasons (e.g. struct net).

Yes.

> IMO, even if you keep the active+passive refcounts, it would be good to
> be able to tell listns() to return all the namespaces, and not just the
> ones that are still active. Maybe that can be the first flag for this
> new syscall?

Certainly possible but that would be pure introspection. But as I said
elsewhere, I have implemented the nstree infrastructure in a way that
it will allow bpf to walk the namespace trees and that would obviously
also include all namespaces that are not active anymore. 



end of thread, other threads:[~2025-10-27 10:49 UTC | newest]

Thread overview: 59+ messages
2025-10-21 11:43 [PATCH RFC DRAFT 00/50] nstree: listns() Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 01/50] libfs: allow to specify s_d_flags Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 02/50] nsfs: use inode_just_drop() Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 03/50] nsfs: raise DCACHE_DONTCACHE explicitly Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 04/50] pidfs: " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 05/50] nsfs: raise SB_I_NODEV and SB_I_NOEXEC Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 06/50] nstree: simplify return Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 07/50] ns: initialize ns_list_node for initial namespaces Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 08/50] ns: add __ns_ref_read() Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 09/50] ns: add active reference count Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 10/50] ns: use anonymous struct to group list member Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 11/50] nstree: introduce a unified tree Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 12/50] nstree: allow lookup solely based on inode Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 13/50] nstree: assign fixed ids to the initial namespaces Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 14/50] ns: maintain list of owned namespaces Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 15/50] nstree: add listns() Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 16/50] arch: hookup listns() system call Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 17/50] nsfs: update tools header Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 18/50] selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 19/50] selftests/namespaces: first active reference count tests Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 20/50] selftests/namespaces: second " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 21/50] selftests/namespaces: third " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 22/50] selftests/namespaces: fourth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 23/50] selftests/namespaces: fifth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 24/50] selftests/namespaces: sixth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 25/50] selftests/namespaces: seventh " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 26/50] selftests/namespaces: eigth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 27/50] selftests/namespaces: ninth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 28/50] selftests/namespaces: tenth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 29/50] selftests/namespaces: eleventh " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 30/50] selftests/namespaces: twelth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 31/50] selftests/namespaces: thirteenth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 32/50] selftests/namespaces: fourteenth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 33/50] selftests/namespaces: fifteenth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 34/50] selftests/namespaces: add listns() wrapper Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 35/50] selftests/namespaces: first listns() test Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 36/50] selftests/namespaces: second " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 37/50] selftests/namespaces: third " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 38/50] selftests/namespaces: fourth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 39/50] selftests/namespaces: fifth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 40/50] selftests/namespaces: sixth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 41/50] selftests/namespaces: seventh " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 42/50] selftests/namespaces: ninth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 43/50] " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 44/50] selftests/namespaces: first listns() permission test Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 45/50] selftests/namespaces: second " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 46/50] selftests/namespaces: third " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 47/50] selftests/namespaces: fourth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 48/50] selftests/namespaces: fifth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 49/50] selftests/namespaces: sixth " Christian Brauner
2025-10-21 11:43 ` [PATCH RFC DRAFT 50/50] selftests/namespaces: seventh " Christian Brauner
2025-10-21 14:34 ` [PATCH RFC DRAFT 00/50] nstree: listns() Josef Bacik
2025-10-22  8:34   ` Christian Brauner
2025-10-21 14:41 ` [syzbot ci] " syzbot ci
2025-10-22 11:00 ` [PATCH RFC DRAFT 00/50] " Ferenc Fejes
2025-10-24 14:50   ` Christian Brauner
2025-10-27 10:49     ` Ferenc Fejes
2025-10-22 11:28 ` Jeff Layton
2025-10-24 14:54   ` Christian Brauner
