public inbox for linux-fsdevel@vger.kernel.org
From: Christian Brauner <brauner@kernel.org>
To: Jeff Layton <jlayton@kernel.org>
Cc: linux-fsdevel@vger.kernel.org,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Amir Goldstein <amir73il@gmail.com>,
	 Josef Bacik <josef@toxicpanda.com>, Jan Kara <jack@suse.cz>,
	Aleksa Sarai <cyphar@cyphar.com>
Subject: Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
Date: Tue, 6 Jan 2026 23:47:08 +0100
Message-ID: <20260106-asphalt-lasziv-e830a46e8287@brauner>
In-Reply-To: <6efb8f5d904c3cc4273aef725ca0bd43b05902eb.camel@kernel.org>

On Mon, Jan 05, 2026 at 03:29:35PM -0500, Jeff Layton wrote:
> On Mon, 2025-12-29 at 14:03 +0100, Christian Brauner wrote:
> > When creating containers the setup usually involves using CLONE_NEWNS
> > via clone3() or unshare(). This copies the caller's complete mount
> > namespace. The runtime will also assemble a new rootfs and then use
> > pivot_root() to switch the old mount tree with the new rootfs. Afterward
> > it will recursively umount the old mount tree thereby getting rid of all
> > mounts.
> > 
> > On a basic system here where the mount table isn't particularly large
> > this still copies about 30 mounts. Copying all of these mounts only to
> > get rid of them later is pretty wasteful.
> > 
> > This is exacerbated if intermediary mount namespaces are used that only
> > exist for a very short amount of time and are immediately destroyed
> > again causing a ton of mounts to be copied and destroyed needlessly.
> > 
> > With a large mount table and a system where thousands or ten-thousands
> > of namespaces are spawned in parallel this quickly becomes a bottleneck
> > increasing contention on the semaphore.
> > 
> > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > returning a file descriptor referring to that mount tree
> > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > to a new mount namespace. In that new mount namespace the copied mount
> > tree has been mounted on top of a copy of the real rootfs.
> > 
> > The caller can setns() into that mount namespace and perform any
> > additional setup such as move_mount()ing detached mounts in there.
> > 
> > This allows OPEN_TREE_NAMESPACE to function as a combined
> > unshare(CLONE_NEWNS) and pivot_root().
> > 
> > A caller may for example choose to create an extremely minimal rootfs:
> > 
> > fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> > 
> > This will create a mount namespace where "wootwoot" has become the
> > rootfs mounted on top of the real rootfs. The caller can now setns()
> > into this new mount namespace and assemble additional mounts.
> > 
> > This also works with user namespaces:
> > 
> > unshare(CLONE_NEWUSER);
> > fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> > 
> > which creates a new mount namespace owned by the earlier created user
> > namespace with "wootwoot" as the rootfs mounted on top of the real
> > rootfs.
> > 
> > This will scale a lot better when creating tons of mount namespaces and
> > will allow to get rid of a lot of unnecessary mount and umount cycles.
> > It also allows to create mount namespaces without needing to spawn
> > throwaway helper processes.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > Christian Brauner (2):
> >       mount: add OPEN_TREE_NAMESPACE
> >       selftests/open_tree: add OPEN_TREE_NAMESPACE tests
> > 
> >  fs/internal.h                                      |    1 +
> >  fs/namespace.c                                     |  155 ++-
> >  fs/nsfs.c                                          |   13 +
> >  include/uapi/linux/mount.h                         |    3 +-
> >  .../selftests/filesystems/open_tree_ns/.gitignore  |    1 +
> >  .../selftests/filesystems/open_tree_ns/Makefile    |   10 +
> >  .../filesystems/open_tree_ns/open_tree_ns_test.c   | 1030 ++++++++++++++++++++
> >  tools/testing/selftests/filesystems/utils.c        |   26 +
> >  tools/testing/selftests/filesystems/utils.h        |    1 +
> >  9 files changed, 1223 insertions(+), 17 deletions(-)
> > ---
> > base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
> > change-id: 20251229-work-empty-namespace-352a9c2dfe0a
> 
> I sat down today and rolled the attached program. It's a nonsensical
> test that just tries to fork new tasks that then spawn new mount
> namespaces and switch into them as quickly as possible.
> 
> Assuming that I've done this correctly, this gives me rough numbers
> from a test host that I checked out inside Meta:
> 
> With the older pivot_root() based method, I can create about 73k
> "containers" in 60s. With the newer open_tree() method, I can create
> about 109k in the same time. So it seems like the new method is roughly
> 40% faster than the older scheme (and a lot less syscalls too).
> 
> Note that the run_pivot() routine in the reproducer is based on a
> strace of an earlier reproducer. That one used minijail0 to create the
> containers. It's possible that there are more efficient ways to do what
> it's doing with the existing APIs. It seems to do some weird stuff too
> (e.g. setting everything to MS_PRIVATE twice under the old root).
> Spawning a real container might have other bottlenecks too.
> 
> Still, this extension to open_tree() seems like a good idea overall,
> and gets rid of a lot of useless work that we currently do when
> spawning a container. The only real downside that I can see is that
> container orchestrators will need changes to use the new method.
> 
> You can add:
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Tested-by: Jeff Layton <jlayton@kernel.org>

Thank you for testing this!
The basic test looks correct. The pivot_root(".", ".") part could be
simplified a tiny bit but that shouldn't matter.

Fwiw, I think swapping out the rootfs isn't something that can always
be avoided in the manner I illustrated. Some users want to spawn an
empty mount namespace (e.g., just a tmpfs on top of the real rootfs) and
then assemble a detached mount tree and swap the two mounts.

But I have a better way of doing this than what pivot_root() currently
does. The main problem with pivot_root() is not just that it moves the
old rootfs to any other location on the new rootfs. It also takes the
tasklist read lock and walks all processes on the system trying to find
any process that uses the old rootfs as its fs root or its pwd and then
rechroots and repwds all of these processes into the new rootfs.

But for 90% of the use-cases (containers) that is not needed. When the
container's mount namespace and rootfs are set up the task creating that
container is the only task that is using the old rootfs and that task
could very well just rechroot itself after it unmounted the old rootfs.

So in essence pivot_root() holds the tasklist lock and walks all tasks on
the system for no reason. If the user has a beefy and busy machine with
lots of processes coming and going each pivot_root() penalizes the whole
system.
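
For reference, the pivot_root(".", ".") idiom mentioned earlier is usually
written roughly as below. This is only a minimal sketch following the
sequence documented in pivot_root(2). The helper name
pivot_root_in_place() is made up for illustration, and it assumes the
caller already sits in its own mount namespace and that new_root is a
mount point. The pivot_root() call in the middle is the step that
triggers the tasklist walk described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch series): switch the root of the
 * current mount namespace to new_root without a separate put_old directory.
 * Returns 0 on success, -1 with errno set on failure.
 */
int pivot_root_in_place(const char *new_root)
{
	/* Make new_root the cwd so that "." names it in the next call. */
	if (chdir(new_root) < 0)
		return -1;

	/* new_root becomes the root; the old root ends up stacked at "/". */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -1;

	/* Per pivot_root(2), "." now resolves to the old root: detach it. */
	if (umount2(".", MNT_DETACH) < 0)
		return -1;

	/* Re-anchor the cwd in the new root. */
	return chdir("/");
}
```

A caller would typically unshare(CLONE_NEWNS), make the old tree private,
bind-mount the new rootfs onto itself and then call
pivot_root_in_place("/path/to/new/rootfs").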

I have a patchset that allows MOVE_MOUNT_BENEATH to work with the
rootfs (and an extension that adds MOVE_MOUNT_PIVOT_ROOT which optionally
does the rechrooting in case someone really needs it).

With MOVE_MOUNT_BENEATH working with the real rootfs that effectively
means one can stuff a new rootfs under the current rootfs, unmount the
old rootfs and chroot into the new rootfs and then be done.

That completely avoids the tasklist locking and has other benefits.

* You get the pivot_root(".", ".") trick for free.

* MOVE_MOUNT_BENEATH works with detached mounts meaning you can assemble
  your whole rootfs in a detached mount tree (since detached mounts can
  now be mounted onto other detached mounts) and then swap the old
  rootfs with your new rootfs.

* MOVE_MOUNT_BENEATH works with mount propagation. (Which means you
  could live-update the rootfs for all services. It would be a bit more
  complicated to actually make this work nicely but it would work.)
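
The MOVE_MOUNT_BENEATH flow described above could look roughly like the
sketch below. This is an illustration under assumptions, not the
patchset itself. It only works on a kernel that accepts
MOVE_MOUNT_BENEATH with the rootfs as target, which current kernels
reject. The helper name swap_rootfs_beneath() is hypothetical, and
fd_newroot is assumed to refer to a detached mount tree (e.g. from
open_tree() with OPEN_TREE_CLONE). The fallback defines are only there
for older userspace headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_move_mount
#define SYS_move_mount 429		/* unified syscall table number */
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200	/* uapi value since Linux 6.5 */
#endif

/*
 * Hypothetical helper: replace the current rootfs with the detached mount
 * tree behind fd_newroot without calling pivot_root().
 * ASSUMPTION: the kernel allows MOVE_MOUNT_BENEATH on the rootfs.
 * Returns 0 on success, -1 with errno set on failure.
 */
int swap_rootfs_beneath(int fd_newroot)
{
	if (fd_newroot < 0) {
		errno = EBADF;
		return -1;
	}

	/* Stuff the new rootfs under the current rootfs. */
	if (syscall(SYS_move_mount, fd_newroot, "", AT_FDCWD, "/",
		    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH) < 0)
		return -1;

	/* Unmount the old rootfs, which now sits on top of the new one. */
	if (umount2("/", MNT_DETACH) < 0)
		return -1;

	/* Only this task needs fixing up: chroot into the new rootfs. */
	if (fchdir(fd_newroot) < 0)
		return -1;
	return chroot(".");
}
```

No other task is touched here, so nothing in this sequence needs the
tasklist lock.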

I just had a few thoughts in this area I wanted to note.
