Date: Fri, 24 Oct 2025 16:50:44 +0200
From: Christian Brauner
To: Ferenc Fejes
Cc: linux-fsdevel@vger.kernel.org, Josef Bacik, Jeff Layton, Jann Horn,
    Mike Yuan, Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
    Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo,
    Johannes Weiner, Thomas Gleixner, Alexander Viro, Jan Kara,
    linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
    bpf@vger.kernel.org, Eric Dumazet, Jakub Kicinski,
    netdev@vger.kernel.org, Arnd Bergmann
Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
Message-ID: <20251024-rostig-stier-0bcd991850f5@brauner>
References: <20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org>

> > Add a new listns() system call that allows userspace to iterate through
> > namespaces in the system. This provides a programmatic interface to
> > discover and inspect namespaces, enhancing existing namespace apis.
> >
> > Currently, there is no direct way for userspace to enumerate namespaces
> > in the system. Applications must resort to scanning /proc//ns/
> > across all processes, which is:
> >
> > 1. Inefficient - requires iterating over all processes
> > 2. Incomplete - misses inactive namespaces that aren't attached to any
> >    running process but are kept alive by file descriptors, bind mounts,
> >    or parent namespace references
> > 3. Permission-heavy - requires access to /proc for many processes
> > 4. No ordering or ownership.
> > 5. No filtering per namespace type: Must always iterate and check all
> >    namespaces.
> >
> > The list goes on.
> > The listns() system call solves these problems by
> > providing direct kernel-level enumeration of namespaces. It is similar
> > to listmount() but obviously tailored to namespaces.
>
> I've been waiting for such an API for years; thanks for working on it. I mostly
> deal with network namespaces, where points 2 and 3 are especially painful.
>
> Recently, I've used this eBPF snippet to discover (at most 1024, because of the
> verifier's halt checking) network namespaces, even if no process is attached.
> But I can't do anything with it in userspace since it's not possible to pass the
> inode number or netns cookie value to setns()...

I've mentioned it in the cover letter and in my earlier reply to Josef:

On v6.18+ kernels it is possible to generate and open file handles to
namespaces. This is probably an API that people outside of fs/ proper
aren't all that familiar with.

In essence it allows you to refer to files - or, more generally, to
kernel objects that may be referenced via files - via opaque handles
instead of paths.

For regular filesystems that are multi-instance (IOW, you can have
multiple btrfs or ext4 filesystems mounted) such file handles cannot be
used without providing a file descriptor to another object in the
filesystem that is used to resolve the file handle...

However, for single-instance filesystems like pidfs and nsfs that's not
required, which is why I added:

	FD_PIDFS_ROOT
	FD_NSFS_ROOT

which means that you can open both pidfds and namespaces via
open_by_handle_at() purely based on the file handle. I call such file
handles "exhaustive file handles" because they fully describe the
object, so it can be resolved without any further information.
They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
check that regular file handles are, and so can be used even by
unprivileged code as long as the caller is sufficiently privileged over
the relevant object (pid resolvable in the caller's pid namespace for
pidfds, or caller located in the namespace or privileged over the owning
user namespace of the relevant namespace for nsfs).

File handles for namespaces have the following uapi:

struct nsfs_file_handle {
	__u64 ns_id;
	__u32 ns_type;
	__u32 ns_inum;
};

#define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
#define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */

and it is explicitly allowed to generate such file handles manually in
userspace. When the kernel generates a namespace file handle via
name_to_handle_at() it will return ns_id, ns_type, and ns_inum, but
userspace is allowed to provide the kernel with a laxer file handle
where only the ns_id is filled in and ns_type and ns_inum are zero - at
least after this patch series.

So for your case, where you even know inode number, ns type, and ns id,
you can fill in a struct nsfs_file_handle (look at my reply to Josef or
at the (ugly) tests for how), then call

	fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);

and open the namespace (provided it is still active).

>
> extern const void net_namespace_list __ksym;
> static void list_all_netns()
> {
>     struct list_head *nslist =
>         bpf_core_cast(&net_namespace_list, struct list_head);
>
>     struct list_head *iter = nslist->next;
>
>     bpf_repeat(1024) {

This isn't needed anymore. I've implemented it in a bpf-friendly way so
it's possible to add kfuncs that would allow you to iterate through the
various namespace trees (locklessly).

If this is merged then I'll likely design that bpf part myself.

> After this merged, do you see any chance for backports? Does it rely on recent
> bits which is hard/impossible to backport?
> I'm not aware of backported syscalls
> but this would be really nice to see in older kernels.

Uhm, what downstream entities managing kernels do is not my concern, but
for upstream it's certainly not an option. There's a lot of preparatory
work that would have to be backported.
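
For the file handle part above, a rough userspace sketch of what
"fill in a struct nsfs_file_handle" could look like. This is only an
illustration under a few assumptions, not the code from the tests:
struct ns_handle_payload, ns_handle_alloc() and ns_open() are
hypothetical names, the payload struct is a local mirror of the uapi
layout shown earlier, and both handle_type and the root fd are left to
the caller (copy handle_type from a handle that name_to_handle_at()
returned for an nsfs file on the same kernel, and pass FD_NSFS_ROOT from
your kernel's uapi headers as root_fd).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the uapi struct quoted in the mail (hypothetical name). */
struct ns_handle_payload {
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t ns_inum;
};

/* Matches NSFS_FILE_HANDLE_SIZE_VER0 from the mail. */
static_assert(sizeof(struct ns_handle_payload) == 16,
	      "unexpected payload size");

/*
 * Allocate a struct file_handle carrying the namespace payload.
 * handle_type is an assumption left to the caller, see the text above.
 * ns_type and ns_inum may be zero for the laxer handle form.
 * Returns NULL on allocation failure; the caller frees the result.
 */
struct file_handle *ns_handle_alloc(int handle_type, uint64_t ns_id,
				    uint32_t ns_type, uint32_t ns_inum)
{
	struct ns_handle_payload payload = {
		.ns_id = ns_id,
		.ns_type = ns_type,
		.ns_inum = ns_inum,
	};
	struct file_handle *fh;

	fh = calloc(1, sizeof(*fh) + sizeof(payload));
	if (!fh)
		return NULL;

	fh->handle_bytes = sizeof(payload);
	fh->handle_type = handle_type;
	memcpy(fh->f_handle, &payload, sizeof(payload));
	return fh;
}

/*
 * Open the namespace described by fh. root_fd is expected to be
 * FD_NSFS_ROOT; it is a parameter so the sketch also builds against
 * older headers. Returns an fd, or -1 with errno set.
 */
int ns_open(int root_fd, struct file_handle *fh)
{
	return open_by_handle_at(root_fd, fh, O_RDONLY);
}
```

On a kernel without namespace file handles ns_open() is expected to
simply fail, so callers should check the return value and errno before
handing the fd to setns().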