From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>,
Jeff Layton <jlayton@kernel.org>
Cc: "Jann Horn" <jannh@google.com>, "Mike Yuan" <me@yhndnzj.com>,
"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>,
"Lennart Poettering" <mzxreary@0pointer.de>,
"Daan De Meyer" <daan.j.demeyer@gmail.com>,
"Aleksa Sarai" <cyphar@cyphar.com>,
"Amir Goldstein" <amir73il@gmail.com>,
"Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Alexander Viro" <viro@zeniv.linux.org.uk>,
"Jan Kara" <jack@suse.cz>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
bpf@vger.kernel.org, "Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
netdev@vger.kernel.org, "Arnd Bergmann" <arnd@arndb.de>,
"Christian Brauner" <brauner@kernel.org>,
syzbot+1957b26299cf3ff7890c@syzkaller.appspotmail.com
Subject: [PATCH 0/8] ns: fixes for namespace iteration and active reference counting
Date: Sun, 09 Nov 2025 22:11:21 +0100
Message-ID: <20251109-namespace-6-19-fixes-v1-0-ae8a4ad5a3b3@kernel.org>
* Make sure to initialize the active reference count for the initial
network namespace and prevent __ns_common_init() from returning too
early.
* Make sure that passive reference counts are dropped outside of RCU
read-side critical sections, as some namespaces, such as the mount
namespace, do in fact sleep when putting the last reference.
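To illustrate the second point, here is a minimal userspace sketch of the
"pin under the lock, put after unlocking" pattern. Everything in it is a
hypothetical stand-in (struct ns_model, model_read_lock(),
model_put_passive(), model_walk()); none of these names are kernel APIs, and
the lock is only a counter that lets the put helper check it is not called
from inside the read section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
static int model_read_depth;    /* > 0 while inside the modeled read section */

static void model_read_lock(void)   { model_read_depth++; }
static void model_read_unlock(void) { model_read_depth--; }

struct ns_model {
	int passive;                /* modeled passive reference count */
};

static void model_get_passive(struct ns_model *ns)
{
	ns->passive++;
}

/* Dropping the last reference may sleep, so it must run unlocked. */
static void model_put_passive(struct ns_model *ns)
{
	assert(model_read_depth == 0);
	assert(ns->passive > 0);
	ns->passive--;
}

/* Visit every entry; returns the number of entries visited. */
static size_t model_walk(struct ns_model *tbl, size_t n)
{
	struct ns_model *prev = NULL;
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		struct ns_model *cur;

		model_read_lock();
		cur = &tbl[i];
		model_get_passive(cur);     /* pin the entry while locked */
		model_read_unlock();

		if (prev)
			model_put_passive(prev);    /* drop the old pin unlocked */
		prev = cur;
		visited++;
	}
	if (prev)
		model_put_passive(prev);
	return visited;
}
```

Calling model_walk() on a small array leaves every passive count at its
starting value, and the assert in model_put_passive() would fire if a put
were ever moved inside the locked region.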
* The setns() system call supports:
(1) namespace file descriptors (nsfd)
(2) process file descriptors (pidfd)
When using nsfds, the namespaces will remain active because they are
pinned by the VFS. However, when pidfds are used, things are more
complicated.
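For readers less familiar with the two calling conventions, a rough
userspace sketch follows. The helper names setns_via_nsfd() and
setns_via_pidfd() are made up for this example; the calls they wrap
(open(), setns(), and the pidfd_open system call via syscall()) are the
regular Linux interfaces. Actually joining a namespace needs the usual
privileges, so treat this as an illustration rather than a test program.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* (1) nsfd: join one namespace through its /proc/<pid>/ns/<name> file. */
static int setns_via_nsfd(pid_t pid, const char *name, int nstype)
{
	char path[64];
	int fd, ret;

	snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, name);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	ret = setns(fd, nstype);    /* the open file pins the namespace */
	close(fd);
	return ret;
}

/* (2) pidfd: join several namespaces of the target task in one call. */
static int setns_via_pidfd(pid_t pid, int nstypes)
{
	int pidfd, ret;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;
	ret = setns(pidfd, nstypes);    /* the pidfd pins the task, not its namespaces */
	close(pidfd);
	return ret;
}
```

For example, setns_via_pidfd(pid, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
corresponds to the setns() call made by P2 in the scenario discussed in this
series.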
If the target task exits and passes through exit_nsproxy_namespaces(),
or is reaped and thus also passes through exit_cred_namespaces(), after
the setns()'ing task has called prepare_nsset() but before it has called
commit_nsset(), the active reference count of the set of namespaces it
wants to setns() to might have been dropped already:
P1                                                              P2

pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                pidfd = pidfd_open(pid_p1)
                                                                setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                prepare_nsset()

exit(0)
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 1
-> exit_nsproxy_namespaces()
-> exit_cred_namespaces()

// ns_active_ref_put() will also put
// the reference on the owner of the
// namespace. If the only reason the
// owning namespace was alive was
// because it was a parent of @ns
// its active reference count now goes
// to zero... --------------------------------
//                                           |
// ns->__ns_active_ref == 0                  |
// parent_ns->__ns_active_ref == 0           |
                                             |                  commit_nsset()
                                             -----------------> // If setns()
                                                                // now manages to install the namespaces
                                                                // it will call ns_active_ref_get()
                                                                // on them thus bumping the active reference
                                                                // count from zero again but without also
                                                                // taking the required reference on the owner.
                                                                // Thus we get:
                                                                //
                                                                // ns->__ns_active_ref == 1
                                                                // parent_ns->__ns_active_ref == 0
When someone later calls ns_active_ref_put() on @ns, it will underflow
parent_ns->__ns_active_ref. This triggers a splat from our asserts,
which conclude that there are still active references when in fact the
counter just underflowed.
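The interleaving above can be replayed sequentially in a small userspace
model. This is only a sketch under simplifying assumptions: struct ns_model
and the model_*() helpers are invented for illustration and merely imitate
the counting rules described in the text, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; names only loosely mirror the text. */
struct ns_model {
	int active_ref;             /* models __ns_active_ref */
	struct ns_model *owner;     /* owning namespace, NULL at the root */
};

/* Dropping the last active reference also drops the one on the owner. */
static void model_active_put(struct ns_model *ns)
{
	for (; ns; ns = ns->owner) {
		assert(ns->active_ref > 0);     /* underflow check */
		if (--ns->active_ref > 0)
			break;
	}
}

/* Naive get: bumps @ns only, even when resurrecting it from zero. */
static void model_active_get_naive(struct ns_model *ns)
{
	ns->active_ref++;
}

/* Replays the race; returns 1 if the owner is left unreferenced. */
static int model_replay_race(void)
{
	struct ns_model parent = { .active_ref = 1, .owner = NULL };
	struct ns_model ns = { .active_ref = 1, .owner = &parent };

	/* P2: prepare_nsset() holds passive references only (not modeled). */
	model_active_put(&ns);          /* P1 exits: both counts drop to 0 */
	model_active_get_naive(&ns);    /* P2: commit_nsset() resurrects ns */

	return ns.active_ref == 1 && parent.active_ref == 0;
}
```

In this model, one more model_active_put() on ns after the replay would
walk up to the owner with a count of zero and trip the assert, which is the
userspace analogue of the splat.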
So resurrect the ownership chain if necessary as well. If the caller
succeeded in grabbing passive references to the set of namespaces, the
setns() should simply succeed even if the target task exits or gets
reaped in the meantime.
The race is rare and can only be triggered when using pidfds to setns()
to namespaces. Also note that active references on initial namespaces
are no-ops.
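As a rough illustration of what resurrecting the ownership chain means,
here is a userspace sketch in which a get that takes a count from zero to
one keeps walking up and re-takes the reference on the owner, while initial
namespaces are skipped entirely. struct ns_model, its initial flag, and the
model_*() helpers are assumptions made for this example and are not the
functions touched by the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel implementation. */
struct ns_model {
	int active_ref;             /* models __ns_active_ref */
	struct ns_model *owner;     /* owning namespace, NULL at the root */
	bool initial;               /* initial namespace: get/put are no-ops */
};

/* Get that re-takes the owner reference when resurrecting from zero. */
static void model_active_get(struct ns_model *ns)
{
	for (; ns && !ns->initial; ns = ns->owner) {
		if (ns->active_ref++ > 0)
			break;      /* already active: the owner is still pinned */
	}
}

/* Put that releases the owner reference when the count hits zero. */
static void model_active_put(struct ns_model *ns)
{
	for (; ns && !ns->initial; ns = ns->owner) {
		assert(ns->active_ref > 0);     /* underflow check */
		if (--ns->active_ref > 0)
			break;
	}
}

/* Replays exit, setns(), and a later put; returns 1 if counts balance. */
static int model_replay_fixed(void)
{
	struct ns_model init = { .active_ref = 1, .owner = NULL, .initial = true };
	struct ns_model parent = { .active_ref = 1, .owner = &init, .initial = false };
	struct ns_model ns = { .active_ref = 1, .owner = &parent, .initial = false };

	model_active_put(&ns);      /* target task exits: ns and parent reach 0 */
	model_active_get(&ns);      /* setns() commits: both are resurrected */
	if (ns.active_ref != 1 || parent.active_ref != 1)
		return 0;

	model_active_put(&ns);      /* later put: no underflow on the owner */
	return ns.active_ref == 0 && parent.active_ref == 0 &&
	       init.active_ref == 1;
}
```

With the chain-walking get, the later put brings both counts back to zero
without touching the initial namespace.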
Since we now always handle parent references directly we can drop
ns_ref_active_get_owner() when adding a namespace to a namespace tree.
This is now all handled uniformly in the places where the new namespaces
actually become active.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (8):
      ns: don't skip active reference count initialization
      ns: don't increment or decrement initial namespaces
      ns: make sure reference are dropped outside of rcu lock
      ns: return EFAULT on put_user() error
      ns: handle setns(pidfd, ...) cleanly
      ns: add asserts for active refcount underflow
      selftests/namespaces: add active reference count regression test
      selftests/namespaces: test for efault
 fs/nsfs.c                                          |   2 +-
 include/linux/ns_common.h                          |  49 +-
 kernel/nscommon.c                                  |  52 +-
 kernel/nstree.c                                    |  44 +-
 tools/testing/selftests/namespaces/.gitignore      |   2 +
 tools/testing/selftests/namespaces/Makefile        |   6 +-
 .../selftests/namespaces/listns_efault_test.c      | 521 +++++++++++++++++++++
 .../namespaces/regression_pidfd_setns_test.c       | 113 +++++
 8 files changed, 715 insertions(+), 74 deletions(-)
---
base-commit: 8ebfb9896c97ab609222460e705f425cb3f0aad0
change-id: 20251109-namespace-6-19-fixes-5bbff9fc6267