From: Ferenc Fejes <ferenc@fejes.dev>
To: Christian Brauner <brauner@kernel.org>,
linux-fsdevel@vger.kernel.org,
Josef Bacik <josef@toxicpanda.com>,
Jeff Layton <jlayton@kernel.org>
Cc: "Jann Horn" <jannh@google.com>, "Mike Yuan" <me@yhndnzj.com>,
"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>,
"Lennart Poettering" <mzxreary@0pointer.de>,
"Daan De Meyer" <daan.j.demeyer@gmail.com>,
"Aleksa Sarai" <cyphar@cyphar.com>,
"Amir Goldstein" <amir73il@gmail.com>,
"Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Alexander Viro" <viro@zeniv.linux.org.uk>,
"Jan Kara" <jack@suse.cz>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
bpf@vger.kernel.org, "Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
netdev@vger.kernel.org, "Arnd Bergmann" <arnd@arndb.de>
Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
Date: Wed, 22 Oct 2025 13:00:01 +0200
Message-ID: <f708a1119b2ad8cf2514b1df128a4ef7cf21c636.camel@fejes.dev>
In-Reply-To: <20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org>
On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
>
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
>
> I need help here! Consider the following current design:
>
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
>
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
>
> Once all tasks making use of this namespace have exited or been reaped,
> all namespace file descriptors for that namespace have been closed, and
> all bind-mounts for that namespace have been unmounted, it ceases to
> appear in the listns() output.
>
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the opener's credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
>
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
>
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
>
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the user namespace and pins it.
>
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
>
> Now, without the active reference count regulating visibility, all
> namespaces that are still pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
>
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persisted the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
>
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
>
> So that punches a hole in the active reference count tracking. So this
> will have to be handled, as right now socket file descriptors that pin a
> network namespace that doesn't have an active reference anymore (no live
> processes, no explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
>
> So three options I see if the api is based on ids:
>
> (1) We use the active reference count and somehow also make it work with
> sockets.
> (2) The active reference count is not needed and we say that listns() is
> an introspection system call anyway so we just always list
> namespaces regardless of why they are still pinned: files,
> mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
>
> =====================================================================
>
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
>
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
>
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
> running process but are kept alive by file descriptors, bind mounts,
> or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
> namespaces.
>
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.
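For reference, the /proc scan that the quoted list describes looks roughly like the sketch below. It is a minimal illustration, not code from the series: the helper name scan_proc_netns() is made up here, and it only looks at network namespaces. It can only see namespaces attached to a process whose /proc entry is readable, which is exactly the limitation in points 2 and 3.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: collect the distinct network namespace inode
 * numbers reachable through /proc/<pid>/ns/net.
 * Returns the number of unique inodes stored in inodes[] (at most max),
 * or -1 if /proc cannot be opened.
 */
static int scan_proc_netns(ino_t *inodes, int max)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int count = 0;

	if (!proc)
		return -1;

	while (count < max && (de = readdir(proc)) != NULL) {
		char path[288];
		struct stat st;
		int i, seen = 0;

		/* Only entries starting with a digit are process directories. */
		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;

		snprintf(path, sizeof(path), "/proc/%s/ns/net", de->d_name);

		/*
		 * stat() follows the magic link to the nsfs inode. It fails
		 * if we lack permission or the process has already exited;
		 * such namespaces are silently missed.
		 */
		if (stat(path, &st) < 0)
			continue;

		/* Linear de-duplication is enough for a sketch. */
		for (i = 0; i < count; i++) {
			if (inodes[i] == st.st_ino) {
				seen = 1;
				break;
			}
		}
		if (!seen)
			inodes[count++] = st.st_ino;
	}

	closedir(proc);
	return count;
}
```

Every process has to be visited even when most of them share a single namespace, and a namespace that is kept alive only by a bind mount or an open fd never shows up.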
I've been waiting for such an API for years; thanks for working on it. I mostly
deal with network namespaces, where points 2 and 3 are especially painful.
Recently, I've used this eBPF snippet to discover (at most 1024, because of the
verifier's halt checking) network namespaces, even if no process is attached.
But I can't do anything with it in userspace since it's not possible to pass the
inode number or netns cookie value to setns()...

extern const void net_namespace_list __ksym;

static void list_all_netns()
{
	struct list_head *nslist =
		bpf_core_cast(&net_namespace_list, struct list_head);
	struct list_head *iter = nslist->next;

	bpf_repeat(1024) {
		const struct net *net =
			bpf_core_cast(container_of(iter, struct net, list),
				      struct net);

		// bpf_printk("net: %p inode: %u cookie: %lu",
		//            net, net->ns.inum, net->net_cookie);

		if (iter->next == nslist)
			break;
		iter = iter->next;
	}
}

>
> /*
>  * @req:       Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids:    User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags:     Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
>
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
>
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user
>  *                                   namespace
>  *              Other value: Namespaces owned by the specified user
>  *                           namespace ID
>  */
> struct ns_id_req {
> 	__u32 size;       /* sizeof(struct ns_id_req) */
> 	__u32 spare;      /* Reserved, must be 0 */
> 	__u64 ns_id;      /* Last seen namespace ID (for pagination) */
> 	__u32 ns_type;    /* Filter by namespace type(s) */
> 	__u32 spare2;     /* Reserved, must be 0 */
> 	__u64 user_ns_id; /* Filter by owning user namespace */
> };
>
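The pagination described in the quoted comment could be driven from userspace roughly as sketched below. This is only an illustration under stated assumptions: the struct is copied from the cover letter rather than from a released uapi header, the syscall number is not known here so the call goes through a function pointer (listns_fn, a made-up name), and fake_listns() is a stand-in stub that hands out IDs 1..150 so the loop can be exercised. The sketch also assumes that a call made after the last ID returns 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Layout as given in the cover letter; not from a released header. */
struct ns_id_req {
	uint32_t size;       /* sizeof(struct ns_id_req) */
	uint32_t spare;      /* Reserved, must be 0 */
	uint64_t ns_id;      /* Last seen namespace ID (for pagination) */
	uint32_t ns_type;    /* Filter by namespace type(s) */
	uint32_t spare2;     /* Reserved, must be 0 */
	uint64_t user_ns_id; /* Filter by owning user namespace */
};

/* Hypothetical: type of a listns() wrapper, so a stub can be plugged in. */
typedef ssize_t (*listns_fn)(const struct ns_id_req *req, uint64_t *ns_ids,
			     size_t nr_ns_ids, unsigned int flags);

/* Stand-in for the real syscall: pretends namespaces 1..150 exist. */
static ssize_t fake_listns(const struct ns_id_req *req, uint64_t *ns_ids,
			   size_t nr_ns_ids, unsigned int flags)
{
	const uint64_t last = 150;
	size_t n = 0;
	uint64_t id;

	if (flags || req->size != sizeof(*req))
		return -1;

	/* Return IDs strictly after the last one the caller has seen. */
	for (id = req->ns_id + 1; id <= last && n < nr_ns_ids; id++)
		ns_ids[n++] = id;

	return (ssize_t)n;
}

/*
 * Collect up to max namespace IDs of the given type(s) into out[],
 * fetching them in batches of at most 64.
 * Returns the number of IDs collected, or a negative value on error.
 */
static ssize_t collect_ns_ids(listns_fn listns, uint32_t ns_type,
			      uint64_t *out, size_t max)
{
	struct ns_id_req req;
	size_t total = 0;

	memset(&req, 0, sizeof(req));
	req.size = sizeof(req);
	req.ns_type = ns_type; /* 0 means all namespace types */

	while (total < max) {
		size_t room = max - total;
		ssize_t n = listns(&req, out + total,
				   room < 64 ? room : 64, 0);

		if (n < 0)
			return n;
		if (n == 0)
			break; /* nothing left to list */

		total += (size_t)n;
		/* Resume after the last ID returned by this batch. */
		req.ns_id = out[total - 1];
	}

	return (ssize_t)total;
}
```

With the stub, collect_ns_ids(fake_listns, 0, ids, 256) takes three non-empty batches and then one empty one to finish. A real caller would pass a wrapper around the actual system call instead.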
Once this is merged, do you see any chance of backports? Does it rely on recent
bits that are hard or impossible to backport? I'm not aware of any backported
syscalls, but this would be really nice to have in older kernels.

Ferenc