Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
From: Ferenc Fejes
To: Christian Brauner, linux-fsdevel@vger.kernel.org, Josef Bacik, Jeff Layton
Cc: Jann Horn, Mike Yuan, Zbigniew Jędrzejewski-Szmek, Lennart Poettering,
 Daan De Meyer, Aleksa Sarai, Amir Goldstein, Tejun Heo, Johannes Weiner,
 Thomas Gleixner, Alexander Viro, Jan Kara, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, bpf@vger.kernel.org, Eric Dumazet, Jakub Kicinski,
 netdev@vger.kernel.org, Arnd Bergmann
Date: Wed, 22 Oct 2025 13:00:01 +0200
In-Reply-To: <20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org>
References: <20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
>
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
>
> I need helper here!: Consider the following current design:
>
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
>
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
>
> Once all tasks making use of this namespace have exited or reaped, all
> namespace file descriptors for that namespace have been closed and all
> bind-mounts for that namespace unmounted it ceases to appear in the
> listns() output.
>
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the openers credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
>
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
>
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
>
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the users namespace and pins it.
>
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
>
> Now, without the active reference count regulating visibility all
> namespace that still are pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
>
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persist the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
>
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
>
> So that punches a whole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
>
> So two options I see if the api is based on ids:
>
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
>
> ======================================================================
>
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system.
> This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
>
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc//ns/
> across all processes, which is:
>
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
>    running process but are kept alive by file descriptors, bind mounts,
>    or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
>    namespaces.
>
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.

I've been waiting for such an API for years; thanks for working on it. I
mostly deal with network namespaces, where points 2 and 3 are especially
painful.

Recently, I've used this eBPF snippet to discover (at most 1024, because of
the verifier's halt checking) network namespaces, even if no process is
attached. But I can't do anything with it in userspace since it's not
possible to pass the inode number or netns cookie value to setns()...

/* Head of the kernel's global list of network namespaces. */
extern const void net_namespace_list __ksym;

static void list_all_netns()
{
    struct list_head *nslist =
        bpf_core_cast(&net_namespace_list, struct list_head);

    struct list_head *iter = nslist->next;

    /* Bounded loop so the verifier can prove termination. */
    bpf_repeat(1024) {
        const struct net *net =
            bpf_core_cast(container_of(iter, struct net, list), struct net);

        // bpf_printk("net: %p inode: %u cookie: %lu",
        //            net, net->ns.inum, net->net_cookie);

        /* Stop once the walk wraps around to the list head. */
        if (iter->next == nslist)
            break;
        iter = iter->next;
    }
}

>
> /*
>  * @req: Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids: User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags: Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
>
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
>
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
>  *              Other value: Namespaces owned by the specified user namespace ID
>  */
> struct ns_id_req {
>         __u32 size;         /* sizeof(struct ns_id_req) */
>         __u32 spare;        /* Reserved, must be 0 */
>         __u64 ns_id;        /* Last seen namespace ID (for pagination) */
>         __u32 ns_type;      /* Filter by namespace type(s) */
>         __u32 spare2;       /* Reserved, must be 0 */
>         __u64 user_ns_id;   /* Filter by owning user namespace */
> };
>

After this is merged, do you see any chance for backports? Does it rely on
recent bits that are hard or impossible to backport? I'm not aware of any
backported syscalls, but this would be really nice to see in older kernels.

Ferenc
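
P.S. To check my reading of the @ns_id pagination contract quoted above,
here is a rough userspace sketch. It does not call the real system call:
listns_stub() and fake_ns_ids are hypothetical stand-ins I made up, and the
stub assumes that IDs come back in ascending order and that only IDs
strictly greater than req->ns_id are returned. The cover letter does not
spell out either point, so treat both as assumptions. The struct layout is
copied from the quoted description, and the stub returns a negative error
code directly, as the "Returns:" text describes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Layout copied from the quoted cover letter (uintN_t instead of __uN). */
struct ns_id_req {
	uint32_t size;        /* sizeof(struct ns_id_req) */
	uint32_t spare;       /* Reserved, must be 0 */
	uint64_t ns_id;       /* Last seen namespace ID (for pagination) */
	uint32_t ns_type;     /* Filter by namespace type(s) */
	uint32_t spare2;      /* Reserved, must be 0 */
	uint64_t user_ns_id;  /* Filter by owning user namespace */
};

/* HYPOTHETICAL data: pretend these are the namespace IDs on the system. */
static const uint64_t fake_ns_ids[] = { 3, 7, 12, 40, 41 };
#define NR_FAKE_IDS (sizeof(fake_ns_ids) / sizeof(fake_ns_ids[0]))

/*
 * HYPOTHETICAL stand-in for the proposed listns(). It ignores the type and
 * owner filters and assumes ascending order with an exclusive cursor.
 */
ssize_t listns_stub(const struct ns_id_req *req, uint64_t *ns_ids,
		    size_t nr_ns_ids, unsigned int flags)
{
	size_t n = 0;

	if (flags || req->size != sizeof(*req) || req->spare || req->spare2)
		return -EINVAL;

	for (size_t i = 0; i < NR_FAKE_IDS && n < nr_ns_ids; i++) {
		if (fake_ns_ids[i] > req->ns_id)
			ns_ids[n++] = fake_ns_ids[i];
	}
	return (ssize_t)n;
}

/*
 * Walk every ID in batches of at most `batch` entries (1..16).
 * Returns the total number of IDs seen, or a negative error code.
 */
ssize_t count_all_ns(size_t batch)
{
	struct ns_id_req req = { .size = sizeof(req) }; /* ns_id = 0: start */
	uint64_t ids[16];
	ssize_t total = 0, n;

	if (batch == 0 || batch > 16)
		return -EINVAL;

	while ((n = listns_stub(&req, ids, batch, 0)) > 0) {
		total += n;
		req.ns_id = ids[n - 1]; /* resume after the last returned ID */
	}
	return n < 0 ? n : total;
}
```

With a batch size of 2 the loop needs four calls to see all five fake IDs;
the last call returns 0 and ends the walk. If the real call treats the
cursor differently, only the comparison in the stub would change.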