Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
From: Ferenc Fejes
To: Christian Brauner
Cc: linux-fsdevel@vger.kernel.org, Josef Bacik, Jeff Layton, Jann Horn,
 Mike Yuan, Zbigniew Jędrzejewski-Szmek, Lennart Poettering, Daan De Meyer,
 Aleksa Sarai, Amir Goldstein, Tejun Heo, Johannes Weiner, Thomas Gleixner,
 Alexander Viro, Jan Kara, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, bpf@vger.kernel.org, Eric Dumazet,
 Jakub Kicinski, netdev@vger.kernel.org, Arnd Bergmann
Date: Mon, 27 Oct 2025 11:49:20 +0100
In-Reply-To: <20251024-rostig-stier-0bcd991850f5@brauner>
References: <20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org>
 <20251024-rostig-stier-0bcd991850f5@brauner>

On Fri, 2025-10-24 at 16:50 +0200, Christian Brauner wrote:
> > > Add a new listns() system call that allows userspace to iterate
> > > through namespaces in the system. This provides a programmatic
> > > interface to discover and inspect namespaces, enhancing existing
> > > namespace apis.
> > >
> > > Currently, there is no direct way for userspace to enumerate
> > > namespaces in the system. Applications must resort to scanning
> > > /proc/<pid>/ns/ across all processes, which is:
> > >
> > > 1. Inefficient - requires iterating over all processes
> > > 2. Incomplete - misses inactive namespaces that aren't attached to
> > >    any running process but are kept alive by file descriptors, bind
> > >    mounts, or parent namespace references
> > > 3. Permission-heavy - requires access to /proc for many processes
> > > 4. No ordering or ownership.
> > > 5. No filtering per namespace type: must always iterate and check
> > >    all namespaces.
> > >
> > > The list goes on. The listns() system call solves these problems by
> > > providing direct kernel-level enumeration of namespaces. It is
> > > similar to listmount() but obviously tailored to namespaces.
> >
> > I've been waiting for such an API for years; thanks for working on it.
> > I mostly deal with network namespaces, where points 2 and 3 are
> > especially painful.
> >
> > Recently, I've used this eBPF snippet to discover (at most 1024,
> > because of the verifier's halt checking) network namespaces, even if
> > no process is attached. But I can't do anything with it in userspace
> > since it's not possible to pass the inode number or netns cookie value
> > to setns()...
>
> I've mentioned it in the cover letter and in my earlier reply to Josef:
>
> On v6.18+ kernels it is possible to generate and open file handles to
> namespaces. This is probably an api that people outside of fs/ proper
> aren't all that familiar with.
>
> In essence it allows you to refer to files - or, more generally, kernel
> objects that may be referenced via files - via opaque handles instead
> of paths.
>
> For regular filesystems that are multi-instance (IOW, you can have
> multiple btrfs or ext4 filesystems mounted) such file handles cannot be
> used without providing a file descriptor to another object in the
> filesystem that is used to resolve the file handle...
>
> However, for single-instance filesystems like pidfs and nsfs that's not
> required, which is why I added:
>
> FD_PIDFS_ROOT
> FD_NSFS_ROOT
>
> which means that you can open both pidfds and namespaces via
> open_by_handle_at() purely based on the file handle. I call such file
> handles "exhaustive file handles" because they fully describe the
> object to be resolvable without any further information.
>
> They are also not subject to the capable(CAP_DAC_READ_SEARCH)
> permission check that regular file handles are and so can be used even
> by unprivileged code as long as the caller is sufficiently privileged
> over the relevant object (for pidfds, the pid must be resolvable in the
> caller's pid namespace; for nsfs, the caller must be located in the
> namespace or be privileged over its owning user namespace).
>
> File handles for namespaces have the following uapi:
>
> struct nsfs_file_handle {
>         __u64 ns_id;
>         __u32 ns_type;
>         __u32 ns_inum;
> };
>
> #define NSFS_FILE_HANDLE_SIZE_VER0   16 /* sizeof first published struct */
> #define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
>
> and it is explicitly allowed to generate such file handles manually in
> userspace. When the kernel generates a namespace file handle via
> name_to_handle_at() it will return ns_id, ns_type, and ns_inum, but
> userspace is allowed to provide the kernel with a laxer file handle
> where only the ns_id is filled in and ns_type and ns_inum are zero - at
> least after this patch series.
>
> So for your case, where you even know the inode number, ns type, and
> ns id, you can fill in a struct nsfs_file_handle and either look at my
> reply to Josef or at the (ugly) tests:
>
> fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);
>
> and can open the namespace (provided it is still active).
>
> >
> > extern const void net_namespace_list __ksym;
> > static void list_all_netns()
> > {
> >         struct list_head *nslist =
> >                 bpf_core_cast(&net_namespace_list, struct list_head);
> >
> >         struct list_head *iter = nslist->next;
> >
> >         bpf_repeat(1024) {
>
> This isn't needed anymore. I've implemented it in a bpf-friendly way so
> it's possible to add kfuncs that would allow you to iterate through the
> various namespace trees (locklessly).
>
> If this is merged then I'll likely design that bpf part myself.

Excellent, thanks for the detailed explanation, noted! Well, I guess I
have to keep a closer eye on recent ns changes; I was aware of pidfs but
not the helpers you just mentioned.

>
> > After this is merged, do you see any chance for backports? Does it
> > rely on recent bits which are hard/impossible to backport? I'm not
> > aware of backported syscalls, but this would be really nice to see in
> > older kernels.
>
> Uhm, what downstream entities managing kernels do is not my concern,
> but for upstream it's certainly not an option. There's a lot of
> preparatory work that would have to be backported.

I was curious about the upstream option, but I see this isn't feasible.
Anyway, it's great that we will have this in the future. Thanks for
doing it!

Ferenc
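
For reference, the open-by-ns_id flow described above could look roughly
like the following userspace sketch. This is an illustration only, not
code from the series or its tests: it assumes a v6.18+ kernel, assumes
that struct nsfs_file_handle and FD_NSFS_ROOT are exported through
<linux/nsfs.h> (the exact uapi header may differ), and relies on
name_to_handle_at() working on nsfs objects as described above.

/*
 * Sketch: open a namespace purely from its ns_id using an "exhaustive"
 * nsfs file handle. Assumptions as stated above.
 */
#define _GNU_SOURCE
#include <fcntl.h>        /* struct file_handle, name/open_by_handle_at() */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <linux/nsfs.h>   /* struct nsfs_file_handle, FD_NSFS_ROOT (assumed) */

static int open_ns_by_id(uint64_t ns_id)
{
	struct nsfs_file_handle *nsh;
	struct file_handle *fh;
	int mount_id, fd;

	fh = calloc(1, sizeof(*fh) + sizeof(*nsh));
	if (!fh)
		return -1;
	fh->handle_bytes = sizeof(*nsh);

	/*
	 * Let the kernel encode a handle for a namespace we can reach via
	 * /proc so the nsfs handle_type doesn't have to be hard-coded.
	 */
	if (name_to_handle_at(AT_FDCWD, "/proc/self/ns/net", fh, &mount_id,
			      AT_SYMLINK_FOLLOW) < 0) {
		free(fh);
		return -1;
	}

	/*
	 * Replace the payload with the "lax" handle the series permits:
	 * only ns_id filled in, ns_type and ns_inum left zero.
	 */
	nsh = (struct nsfs_file_handle *)fh->f_handle;
	memset(nsh, 0, sizeof(*nsh));
	nsh->ns_id = ns_id;

	/* Exhaustive handle: no additional fd is needed to resolve it. */
	fd = open_by_handle_at(FD_NSFS_ROOT, fh, O_RDONLY);
	free(fh);
	return fd;
}

int main(int argc, char **argv)
{
	int nsfd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <ns_id>\n", argv[0]);
		return 1;
	}

	nsfd = open_ns_by_id(strtoull(argv[1], NULL, 0));
	if (nsfd < 0) {
		perror("open namespace by ns_id");
		return 1;
	}

	/* nsfd can now be handed to setns(nsfd, 0), ioctl(NS_GET_*), etc. */
	close(nsfd);
	return 0;
}

If the target namespace has already been torn down, open_by_handle_at()
simply fails, matching the "provided it is still active" caveat above.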