public inbox for linux-fsdevel@vger.kernel.org
From: Jeff Layton <jlayton@kernel.org>
To: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Amir Goldstein <amir73il@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>, Jan Kara <jack@suse.cz>,
	Aleksa Sarai <cyphar@cyphar.com>
Subject: Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
Date: Mon, 05 Jan 2026 15:29:35 -0500
Message-ID: <6efb8f5d904c3cc4273aef725ca0bd43b05902eb.camel@kernel.org>
In-Reply-To: <20251229-work-empty-namespace-v1-0-bfb24c7b061f@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5033 bytes --]

On Mon, 2025-12-29 at 14:03 +0100, Christian Brauner wrote:
> When creating containers the setup usually involves using CLONE_NEWNS
> via clone3() or unshare(). This copies the caller's complete mount
> namespace. The runtime will also assemble a new rootfs and then use
> pivot_root() to switch the old mount tree with the new rootfs. Afterward
> it will recursively umount the old mount tree thereby getting rid of all
> mounts.
> 
> On a basic system here where the mount table isn't particularly large
> this still copies about 30 mounts. Copying all of these mounts only to
> get rid of them later is pretty wasteful.
> 
> This is exacerbated if intermediary mount namespaces are used that only
> exist for a very short amount of time and are immediately destroyed
> again causing a ton of mounts to be copied and destroyed needlessly.
> 
> With a large mount table and a system where thousands or tens of
> thousands of namespaces are spawned in parallel, this quickly becomes a
> bottleneck, increasing contention on the semaphore.
> 
> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> returning a file descriptor referring to that mount tree
> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> to a new mount namespace. In that new mount namespace the copied mount
> tree has been mounted on top of a copy of the real rootfs.
> 
> The caller can setns() into that mount namespace and perform any
> additional setup such as move_mount()ing detached mounts in there.
> 
> This allows OPEN_TREE_NAMESPACE to function as a combined
> unshare(CLONE_NEWNS) and pivot_root().
> 
> A caller may for example choose to create an extremely minimal rootfs:
> 
> fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> 
> This will create a mount namespace where "wootwoot" has become the
> rootfs mounted on top of the real rootfs. The caller can now setns()
> into this new mount namespace and assemble additional mounts.
> 
> This also works with user namespaces:
> 
> unshare(CLONE_NEWUSER);
> fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> 
> which creates a new mount namespace owned by the earlier created user
> namespace with "wootwoot" as the rootfs mounted on top of the real
> rootfs.
> 
> This will scale a lot better when creating tons of mount namespaces and
> will make it possible to get rid of a lot of unnecessary mount and
> umount cycles. It also makes it possible to create mount namespaces
> without needing to spawn throwaway helper processes.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (2):
>       mount: add OPEN_TREE_NAMESPACE
>       selftests/open_tree: add OPEN_TREE_NAMESPACE tests
> 
>  fs/internal.h                                      |    1 +
>  fs/namespace.c                                     |  155 ++-
>  fs/nsfs.c                                          |   13 +
>  include/uapi/linux/mount.h                         |    3 +-
>  .../selftests/filesystems/open_tree_ns/.gitignore  |    1 +
>  .../selftests/filesystems/open_tree_ns/Makefile    |   10 +
>  .../filesystems/open_tree_ns/open_tree_ns_test.c   | 1030 ++++++++++++++++++++
>  tools/testing/selftests/filesystems/utils.c        |   26 +
>  tools/testing/selftests/filesystems/utils.h        |    1 +
>  9 files changed, 1223 insertions(+), 17 deletions(-)
> ---
> base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
> change-id: 20251229-work-empty-namespace-352a9c2dfe0a

I sat down today and rolled the attached program. It's a nonsensical
test that just tries to fork new tasks that then spawn new mount
namespaces and switch into them as quickly as possible.

Assuming that I've done this correctly, this gives me rough numbers
from a test host that I checked out inside Meta:

With the older pivot_root() based method, I can create about 73k
"containers" in 60s. With the newer open_tree() method, I can create
about 109k in the same time. So the new method gets roughly 50% more
throughput than the older scheme (about a third less time per
container), with far fewer syscalls too.
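
As a sanity check on those figures, here is a minimal sketch of the
arithmetic. The helper names are made up for illustration and are not
part of the attached reproducer. It reports the gain both as extra
throughput and as time saved per container, since the two percentages
differ for the same pair of counts.

```c
#include <assert.h>
#include <stdio.h>

/* Extra items completed in the same wall-clock window, in percent. */
static double throughput_gain_pct(double old_count, double new_count)
{
	assert(old_count > 0.0);
	return (new_count - old_count) / old_count * 100.0;
}

/* Reduction in average time per item for the same window, in percent. */
static double time_saved_pct(double old_count, double new_count)
{
	assert(new_count > 0.0);
	return (1.0 - old_count / new_count) * 100.0;
}

/* Print both views of the same comparison. */
static void print_comparison(double old_count, double new_count)
{
	printf("throughput gain: %.1f%%\n",
	       throughput_gain_pct(old_count, new_count));
	printf("time saved per container: %.1f%%\n",
	       time_saved_pct(old_count, new_count));
}
```

Calling print_comparison(73000, 109000) with the rough counts above
gives a throughput gain of about 49% and about 33% less time per
container.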

Note that the run_pivot() routine in the reproducer is based on a
strace of an earlier reproducer. That one used minijail0 to create the
containers. It's possible that there are more efficient ways to do what
it's doing with the existing APIs. It seems to do some weird stuff too
(e.g. setting everything to MS_PRIVATE twice under the old root).
Spawning a real container might have other bottlenecks too.

Still, this extension to open_tree() seems like a good idea overall,
and gets rid of a lot of useless work that we currently do when
spawning a container. The only real downside that I can see is that
container orchestrators will need changes to use the new method.
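
To give an idea of what such a change might look like, here is a rough
sketch of a runtime probe with a fallback. It is not taken from any
existing orchestrator. Both helper names are hypothetical, the flag
value 2 is the one used in the attached reproducer, and I am assuming
that kernels without the flag reject it with EINVAL (or ENOSYS if
open_tree() is missing altogether). EINVAL can have other causes, so
treat this as a heuristic.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: same value the reproducer passes to open_tree(). */
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE 2
#endif

/*
 * Hypothetical helper: does this errno suggest the kernel lacks
 * OPEN_TREE_NAMESPACE (assumed EINVAL) or open_tree() itself (ENOSYS)?
 */
static bool opentree_ns_unsupported(int err)
{
	return err == EINVAL || err == ENOSYS;
}

/*
 * Hypothetical helper: try the new method first.
 * Returns 0 on success, 1 if the caller should fall back to the
 * unshare()/pivot_root() sequence, -1 on any other error (errno set).
 */
static int enter_mntns_opentree(const char *rootdir)
{
	int fd, ret, saved;

	assert(rootdir != NULL);

	fd = syscall(SYS_open_tree, AT_FDCWD, rootdir, OPEN_TREE_NAMESPACE);
	if (fd < 0)
		return opentree_ns_unsupported(errno) ? 1 : -1;

	ret = setns(fd, CLONE_NEWNS);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;
	return ret ? -1 : 0;
}
```

A caller would run the older pivot_root() sequence only when this
returns 1, and could cache that result so the probe happens once per
process rather than once per container.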

You can add:

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: spawnbench.c --]
[-- Type: text/x-csrc, Size: 2910 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <getopt.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <stdbool.h>

#define CONCURRENCY	88
#define DURATION	60
#define ROOTDIR		"/var/empty"

static uint64_t	total_containers;
static bool		opentree;

static void run_pivot(void)
{
	int ret, oldfd, newfd;

	ret = unshare(CLONE_NEWNS);
	if (ret) {
		perror("unshare");
		exit(1);
	}

	ret = mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	oldfd = openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_DIRECTORY);
	if (oldfd < 0) {
		perror("openat");
		exit(1);
	}

	newfd = openat(AT_FDCWD, ROOTDIR, O_RDONLY|O_CLOEXEC|O_DIRECTORY);
	if (newfd < 0) {
		perror("openat");
		exit(1);
	}

	ret = mount(ROOTDIR, ROOTDIR, NULL, MS_BIND|MS_REC, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	ret = chdir(ROOTDIR);
	if (ret) {
		perror("chdir");
		exit(1);
	}

	ret = syscall(SYS_pivot_root, ".", ".");
	if (ret) {
		perror("pivot_root");
		exit(1);
	}

	ret = fchdir(oldfd);
	if (ret) {
		perror("fchdir");
		exit(1);
	}

	ret = mount(NULL, ".", NULL, MS_REC|MS_PRIVATE, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	ret = umount2(".", MNT_DETACH);
	if (ret) {
		perror("umount");
		exit(1);
	}

	ret = fchdir(newfd);
	if (ret) {
		perror("fchdir");
		exit(1);
	}

	close(oldfd);
	close(newfd);
}

static void run_opentree(void)
{
	int fd, ret;

	// 2 == OPEN_TREE_NAMESPACE
	fd = syscall(SYS_open_tree, AT_FDCWD, ROOTDIR, 2);
	if (fd < 0) {
		perror("open_tree");
		exit(1);
	}

	ret = setns(fd, CLONE_NEWNS);
	if (ret) {
		perror("setns");
		exit(1);
	}

	close(fd);
}

static void *run_container(void *arg)
{
	pid_t pid;

	for (;;) {
		pid = fork();
		if (pid == 0) {
			if (opentree)
				run_opentree();
			else
				run_pivot();
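			/*
			 * Child is done: leave with _exit() so it does not
			 * flush stdio buffers inherited from the parent.
			 */
			_exit(0);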
		} else if (pid > 0) {
			waitpid(pid, NULL, 0);
			__atomic_add_fetch(&total_containers, 1, __ATOMIC_RELAXED);
		} else {
			perror("fork");
			exit(1);
		}
	}

	return NULL;
}

int main(int argc, char **argv)
{
	int i, opt;
	int concurrency = CONCURRENCY;
	int duration = DURATION;
	time_t start, now;

	while ((opt = getopt(argc, argv, "c:d:o")) != -1) {
		switch (opt) {
		case 'c':
			concurrency = atoi(optarg);
			break;
		case 'd':
			duration = atoi(optarg);
			break;
		case 'o':
			opentree = true;
			break;
		}
	}

	for (i = 0; i < concurrency; ++i) {
		pthread_t tid;
		int ret;

		ret = pthread_create(&tid, NULL, run_container, NULL);
		if (ret != 0) {
			fprintf(stderr, "pthread_create failed! %d\n", ret);
			exit(1);
		}
	}


	/* Initialize 'now' too; it is read in the loop condition below. */
	start = now = time(NULL);
	while ((now - start) < duration) {
		uint64_t val;

		__atomic_load(&total_containers, &val, __ATOMIC_RELAXED);
		/* uint64_t is not unsigned long everywhere; cast for a portable format. */
		printf("Total containers: %llu\n", (unsigned long long)val);
		sleep(2);
		now = time(NULL);
	}
}
