[PATCH 0/2] mount: add OPEN_TREE_NAMESPACE

public inbox for linux-fsdevel@vger.kernel.org

* [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
@ 2025-12-29 13:03 Christian Brauner
  2025-12-29 13:03 ` [PATCH 1/2] " Christian Brauner
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Christian Brauner @ 2025-12-29 13:03 UTC (permalink / raw)
  To: linux-fsdevel, Jeff Layton
  Cc: Alexander Viro, Amir Goldstein, Josef Bacik, Jan Kara,
	Aleksa Sarai, Christian Brauner

When creating containers the setup usually involves using CLONE_NEWNS
via clone3() or unshare(). This copies the caller's complete mount
namespace. The runtime will also assemble a new rootfs and then use
pivot_root() to switch the old mount tree with the new rootfs. Afterward
it will recursively umount the old mount tree thereby getting rid of all
mounts.
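
For reference, the sequence described above looks roughly like the
sketch below. It follows the pivot_root(".", ".") idiom from the
pivot_root(2) man page. legacy_switch_root() is a hypothetical helper
name used only for illustration, and real container runtimes differ in
the details (propagation settings, error handling):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch the caller into a fresh mount namespace
 * whose root is @new_root. Returns 0 on success or -errno on failure.
 */
static int legacy_switch_root(const char *new_root)
{
	assert(new_root != NULL);

	/* Copies every mount in the caller's mount namespace. */
	if (unshare(CLONE_NEWNS) < 0)
		return -errno;

	/* Keep later mount events from propagating to the old namespace. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
		return -errno;

	/* pivot_root() requires the new root to be a mount point. */
	if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
		return -errno;

	if (chdir(new_root) < 0)
		return -errno;

	/* Stack the old root on top of the new root ... */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -errno;

	/* ... and detach it again, discarding all the copied mounts. */
	if (umount2(".", MNT_DETACH) < 0)
		return -errno;

	return 0;
}
```

Every mount copied by unshare() in the first step is thrown away again
by the final umount2().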

On a basic system here where the mount table isn't particularly large
this still copies about 30 mounts. Copying all of these mounts only to
get rid of them later is pretty wasteful.

This is exacerbated if intermediary mount namespaces are used that only
exist for a very short amount of time and are immediately destroyed
again causing a ton of mounts to be copied and destroyed needlessly.

With a large mount table and a system where thousands or tens of
thousands of namespaces are spawned in parallel this quickly becomes a
bottleneck, increasing contention on the namespace semaphore.

Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
returning a file descriptor referring to that mount tree
OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
to a new mount namespace. In that new mount namespace the copied mount
tree has been mounted on top of a copy of the real rootfs.
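
The flag rules that patch 1 adds to vfs_open_tree() can be modelled in
userspace roughly as follows. This is only a sketch:
check_open_tree_flags() and demo_flag_rules() are hypothetical names,
and the fallback defines assume the usual Linux UAPI values plus the
OPEN_TREE_NAMESPACE value proposed in this series:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Fallbacks for older headers (assumed UAPI values). */
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE (1 << 0)
#endif
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE (1 << 1) /* value proposed in patch 1 */
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif

/* Hypothetical model of the flag checks: returns 0 or -EINVAL. */
static int check_open_tree_flags(unsigned int flags)
{
	const unsigned int valid = AT_EMPTY_PATH | AT_NO_AUTOMOUNT |
				   AT_RECURSIVE | AT_SYMLINK_NOFOLLOW |
				   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
				   OPEN_TREE_NAMESPACE;
	const unsigned int mode = flags &
				  (OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE);

	/* Unknown flags are rejected. */
	if (flags & ~valid)
		return -EINVAL;

	/* AT_RECURSIVE only makes sense when a mount tree is copied. */
	if ((flags & AT_RECURSIVE) && !mode)
		return -EINVAL;

	/* OPEN_TREE_CLONE and OPEN_TREE_NAMESPACE are mutually exclusive. */
	if (mode == (OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE))
		return -EINVAL;

	return 0;
}

/* Usage example: the combinations exercised by the flag checks. */
static void demo_flag_rules(void)
{
	assert(check_open_tree_flags(0) == 0);
	assert(check_open_tree_flags(OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC) == 0);
	assert(check_open_tree_flags(OPEN_TREE_NAMESPACE | AT_RECURSIVE) == 0);
	assert(check_open_tree_flags(AT_RECURSIVE) == -EINVAL);
	assert(check_open_tree_flags(OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE) == -EINVAL);
}
```

The privilege checks and the path lookup that follow in the kernel are
deliberately left out of this model.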

The caller can setns() into that mount namespace and perform any
additional setup such as move_mount()ing detached mounts in there.

This allows OPEN_TREE_NAMESPACE to function as a combined
unshare(CLONE_NEWNS) and pivot_root().

A caller may for example choose to create an extremely minimal rootfs:

fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

This will create a mount namespace where "wootwoot" has become the
rootfs mounted on top of the real rootfs. The caller can now setns()
into this new mount namespace and assemble additional mounts.

This also works with user namespaces:

unshare(CLONE_NEWUSER);
fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

which creates a new mount namespace owned by the earlier created user
namespace with "wootwoot" as the rootfs mounted on top of the real
rootfs.
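
From userspace, both variants could be driven roughly as in the sketch
below. The helper names are hypothetical, the raw syscall() is used
because a libc wrapper may not be available, and the fallback syscall
number (428) is an assumption that holds for most but not all
architectures. On kernels without this series the call is expected to
fail with EINVAL:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_open_tree
#define SYS_open_tree 428 /* assumption: generic syscall number */
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE (1 << 1) /* value proposed in patch 1 */
#endif

/*
 * Hypothetical helper: create a mount namespace with @rootfs as its
 * root, optionally owned by a freshly unshared user namespace.
 * Returns a mount namespace fd, or -errno on failure.
 */
static int open_rootfs_namespace(const char *rootfs, bool new_userns)
{
	long fd;

	assert(rootfs != NULL);

	/* Optional: become privileged over a new user namespace first. */
	if (new_userns && unshare(CLONE_NEWUSER) < 0)
		return -errno;

	fd = syscall(SYS_open_tree, AT_FDCWD, rootfs,
		     OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
	if (fd < 0)
		return -errno;

	return (int)fd;
}

/* Hypothetical helper: enter the namespace. Returns 0 or -errno. */
static int enter_mount_namespace(int fd_mntns)
{
	if (setns(fd_mntns, CLONE_NEWNS) < 0)
		return -errno;
	return 0;
}
```

A caller would pass the returned fd to enter_mount_namespace() and then
assemble any additional mounts from inside the new namespace.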

This will scale a lot better when creating tons of mount namespaces and
will get rid of a lot of unnecessary mount and umount cycles. It also
allows creating mount namespaces without needing to spawn throwaway
helper processes.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
      mount: add OPEN_TREE_NAMESPACE
      selftests/open_tree: add OPEN_TREE_NAMESPACE tests

 fs/internal.h                                      |    1 +
 fs/namespace.c                                     |  155 ++-
 fs/nsfs.c                                          |   13 +
 include/uapi/linux/mount.h                         |    3 +-
 .../selftests/filesystems/open_tree_ns/.gitignore  |    1 +
 .../selftests/filesystems/open_tree_ns/Makefile    |   10 +
 .../filesystems/open_tree_ns/open_tree_ns_test.c   | 1030 ++++++++++++++++++++
 tools/testing/selftests/filesystems/utils.c        |   26 +
 tools/testing/selftests/filesystems/utils.h        |    1 +
 9 files changed, 1223 insertions(+), 17 deletions(-)
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20251229-work-empty-namespace-352a9c2dfe0a


* [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2025-12-29 13:03 [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Christian Brauner
@ 2025-12-29 13:03 ` Christian Brauner
  2026-01-08 22:37   ` Aleksa Sarai
  2026-02-24 11:23   ` Florian Weimer
  2025-12-29 13:03 ` [PATCH 2/2] selftests/open_tree: add OPEN_TREE_NAMESPACE tests Christian Brauner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2025-12-29 13:03 UTC (permalink / raw)
  To: linux-fsdevel, Jeff Layton
  Cc: Alexander Viro, Amir Goldstein, Josef Bacik, Jan Kara,
	Aleksa Sarai, Christian Brauner

When creating containers the setup usually involves using CLONE_NEWNS
via clone3() or unshare(). This copies the caller's complete mount
namespace. The runtime will also assemble a new rootfs and then use
pivot_root() to switch the old mount tree with the new rootfs. Afterward
it will recursively umount the old mount tree thereby getting rid of all
mounts.

On a basic system here where the mount table isn't particularly large
this still copies about 30 mounts. Copying all of these mounts only to
get rid of them later is pretty wasteful.

This is exacerbated if intermediary mount namespaces are used that only
exist for a very short amount of time and are immediately destroyed
again causing a ton of mounts to be copied and destroyed needlessly.

With a large mount table and a system where thousands or tens of
thousands of containers are spawned in parallel this quickly becomes a
bottleneck, increasing contention on the namespace semaphore.

Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
returning a file descriptor referring to that mount tree
OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
to a new mount namespace. In that new mount namespace the copied mount
tree has been mounted on top of a copy of the real rootfs.

The caller can setns() into that mount namespace and perform any
additionally required setup such as move_mount()ing detached mounts in
there.

This allows OPEN_TREE_NAMESPACE to function as a combined
unshare(CLONE_NEWNS) and pivot_root().

A caller may for example choose to create an extremely minimal rootfs:

fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

This will create a mount namespace where "wootwoot" has become the
rootfs mounted on top of the real rootfs. The caller can now setns()
into this new mount namespace and assemble additional mounts.

This also works with user namespaces:

unshare(CLONE_NEWUSER);
fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

which creates a new mount namespace owned by the earlier created user
namespace with "wootwoot" as the rootfs mounted on top of the real
rootfs.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/internal.h              |   1 +
 fs/namespace.c             | 155 ++++++++++++++++++++++++++++++++++++++++-----
 fs/nsfs.c                  |  13 ++++
 include/uapi/linux/mount.h |   3 +-
 4 files changed, 155 insertions(+), 17 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index ab638d41ab81..b5aad5265e0e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -247,6 +247,7 @@ extern void mnt_pin_kill(struct mount *m);
  */
 extern const struct dentry_operations ns_dentry_operations;
 int open_namespace(struct ns_common *ns);
+struct file *open_namespace_file(struct ns_common *ns);
 
 /*
  * fs/stat.c:
diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..fd9698671c70 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2796,6 +2796,9 @@ static inline void unlock_mount(struct pinned_mountpoint *m)
 		__unlock_mount(m);
 }
 
+static void lock_mount_exact(const struct path *path,
+			     struct pinned_mountpoint *mp);
+
 #define LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) \
 	struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \
 	do_lock_mount((path), &mp, (beneath))
@@ -2946,10 +2949,11 @@ static inline bool may_copy_tree(const struct path *path)
 	return check_anonymous_mnt(mnt);
 }
 
-
-static struct mount *__do_loopback(const struct path *old_path, int recurse)
+static struct mount *__do_loopback(const struct path *old_path,
+				   unsigned int flags, unsigned int copy_flags)
 {
 	struct mount *old = real_mount(old_path->mnt);
+	bool recurse = flags & AT_RECURSIVE;
 
 	if (IS_MNT_UNBINDABLE(old))
 		return ERR_PTR(-EINVAL);
@@ -2960,10 +2964,22 @@ static struct mount *__do_loopback(const struct path *old_path, int recurse)
 	if (!recurse && __has_locked_children(old, old_path->dentry))
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * When creating a new mount namespace we don't want to copy over
+	 * mounts of mount namespaces to avoid the risk of cycles and also to
+	 * minimize the default complex interdependencies between mount
+	 * namespaces.
+	 *
+	 * We could of course check whether all mount namespace files avoid
+	 * creating cycles, but let's keep this simple.
+	 */
+	if (!(flags & OPEN_TREE_NAMESPACE))
+		copy_flags |= CL_COPY_MNT_NS_FILE;
+
 	if (recurse)
-		return copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
-	else
-		return clone_mnt(old, old_path->dentry, 0);
+		return copy_tree(old, old_path->dentry, copy_flags);
+
+	return clone_mnt(old, old_path->dentry, copy_flags);
 }
 
 /*
@@ -2974,7 +2990,9 @@ static int do_loopback(const struct path *path, const char *old_name,
 {
 	struct path old_path __free(path_put) = {};
 	struct mount *mnt = NULL;
+	unsigned int flags = recurse ? AT_RECURSIVE : 0;
 	int err;
+
 	if (!old_name || !*old_name)
 		return -EINVAL;
 	err = kern_path(old_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path);
@@ -2991,7 +3009,7 @@ static int do_loopback(const struct path *path, const char *old_name,
 	if (!check_mnt(mp.parent))
 		return -EINVAL;
 
-	mnt = __do_loopback(&old_path, recurse);
+	mnt = __do_loopback(&old_path, flags, 0);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
@@ -3004,7 +3022,7 @@ static int do_loopback(const struct path *path, const char *old_name,
 	return err;
 }
 
-static struct mnt_namespace *get_detached_copy(const struct path *path, bool recursive)
+static struct mnt_namespace *get_detached_copy(const struct path *path, unsigned int flags)
 {
 	struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns;
 	struct user_namespace *user_ns = mnt_ns->user_ns;
@@ -3029,7 +3047,7 @@ static struct mnt_namespace *get_detached_copy(const struct path *path, bool rec
 			ns->seq_origin = src_mnt_ns->ns.ns_id;
 	}
 
-	mnt = __do_loopback(path, recursive);
+	mnt = __do_loopback(path, flags, 0);
 	if (IS_ERR(mnt)) {
 		emptied_ns = ns;
 		return ERR_CAST(mnt);
@@ -3043,9 +3061,9 @@ static struct mnt_namespace *get_detached_copy(const struct path *path, bool rec
 	return ns;
 }
 
-static struct file *open_detached_copy(struct path *path, bool recursive)
+static struct file *open_detached_copy(struct path *path, unsigned int flags)
 {
-	struct mnt_namespace *ns = get_detached_copy(path, recursive);
+	struct mnt_namespace *ns = get_detached_copy(path, flags);
 	struct file *file;
 
 	if (IS_ERR(ns))
@@ -3061,21 +3079,114 @@ static struct file *open_detached_copy(struct path *path, bool recursive)
 	return file;
 }
 
+DEFINE_FREE(put_empty_mnt_ns, struct mnt_namespace *,
+	    if (!IS_ERR_OR_NULL(_T)) free_mnt_ns(_T))
+
+static struct file *open_new_namespace(struct path *path, unsigned int flags)
+{
+	struct mnt_namespace *new_ns __free(put_empty_mnt_ns) = NULL;
+	struct path to_path __free(path_put) = {};
+	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	struct user_namespace *user_ns = current_user_ns();
+	struct mount *new_ns_root;
+	struct mount *mnt;
+	struct ns_common *new_ns_common;
+	unsigned int copy_flags = 0;
+	bool locked = false;
+
+	if (user_ns != ns->user_ns)
+		copy_flags |= CL_SLAVE;
+
+	new_ns = alloc_mnt_ns(user_ns, false);
+	if (IS_ERR(new_ns))
+		return ERR_CAST(new_ns);
+
+	scoped_guard(namespace_excl) {
+		new_ns_root = clone_mnt(ns->root, ns->root->mnt.mnt_root, copy_flags);
+		if (IS_ERR(new_ns_root))
+			return ERR_CAST(new_ns_root);
+
+		/*
+		 * If the real rootfs had a locked mount on top of it somewhere
+		 * in the stack, lock the new mount tree as well so it can't be
+		 * exposed.
+		 */
+		mnt = ns->root;
+		while (mnt->overmount) {
+			mnt = mnt->overmount;
+			if (mnt->mnt.mnt_flags & MNT_LOCKED)
+				locked = true;
+		}
+	}
+
+	/*
+	 * We dropped the namespace semaphore so we can actually lock
+	 * the copy for mounting. The copied mount isn't attached to any
+	 * mount namespace and it is thus excluded from any propagation.
+	 * So realistically we're isolated and the mount can't be
+	 * overmounted.
+	 */
+
+	/* Borrow the reference from clone_mnt(). */
+	to_path.mnt = &new_ns_root->mnt;
+	to_path.dentry = dget(new_ns_root->mnt.mnt_root);
+
+	/* Now lock for actual mounting. */
+	LOCK_MOUNT_EXACT(mp, &to_path);
+	if (unlikely(IS_ERR(mp.parent)))
+		return ERR_CAST(mp.parent);
+
+	/*
+	 * We don't emulate unshare()ing a mount namespace. We stick to the
+	 * restrictions of creating detached bind-mounts. It has a lot
+	 * saner and simpler semantics.
+	 */
+	mnt = __do_loopback(path, flags, copy_flags);
+	if (IS_ERR(mnt))
+		return ERR_CAST(mnt);
+
+	scoped_guard(mount_writer) {
+		if (locked)
+			mnt->mnt.mnt_flags |= MNT_LOCKED;
+		/*
+		 * Now mount the detached tree on top of the copy of the
+		 * real rootfs we created.
+		 */
+		attach_mnt(mnt, new_ns_root, mp.mp);
+		if (user_ns != ns->user_ns)
+			lock_mnt_tree(new_ns_root);
+	}
+
+	/* Add all mounts to the new namespace. */
+	for (struct mount *p = new_ns_root; p; p = next_mnt(p, new_ns_root)) {
+		mnt_add_to_ns(new_ns, p);
+		new_ns->nr_mounts++;
+	}
+
+	new_ns->root = real_mount(no_free_ptr(to_path.mnt));
+	ns_tree_add_raw(new_ns);
+	new_ns_common = to_ns_common(no_free_ptr(new_ns));
+	return open_namespace_file(new_ns_common);
+}
+
 static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned int flags)
 {
 	int ret;
 	struct path path __free(path_put) = {};
 	int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
-	bool detached = flags & OPEN_TREE_CLONE;
 
 	BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
 
 	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
 		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
-		      OPEN_TREE_CLOEXEC))
+		      OPEN_TREE_CLOEXEC | OPEN_TREE_NAMESPACE))
 		return ERR_PTR(-EINVAL);
 
-	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) ==
+	    AT_RECURSIVE)
+		return ERR_PTR(-EINVAL);
+
+	if (hweight32(flags & (OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) > 1)
 		return ERR_PTR(-EINVAL);
 
 	if (flags & AT_NO_AUTOMOUNT)
@@ -3085,15 +3196,27 @@ static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned
 	if (flags & AT_EMPTY_PATH)
 		lookup_flags |= LOOKUP_EMPTY;
 
-	if (detached && !may_mount())
+	/*
+	 * If we create a new mount namespace with the cloned mount tree we
+	 * just care about being privileged over our current user namespace.
+	 * The new mount namespace will be owned by it.
+	 */
+	if ((flags & OPEN_TREE_NAMESPACE) &&
+	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if ((flags & OPEN_TREE_CLONE) && !may_mount())
 		return ERR_PTR(-EPERM);
 
 	ret = user_path_at(dfd, filename, lookup_flags, &path);
 	if (unlikely(ret))
 		return ERR_PTR(ret);
 
-	if (detached)
-		return open_detached_copy(&path, flags & AT_RECURSIVE);
+	if (flags & OPEN_TREE_NAMESPACE)
+		return open_new_namespace(&path, flags);
+
+	if (flags & OPEN_TREE_CLONE)
+		return open_detached_copy(&path, flags);
 
 	return dentry_open(&path, O_PATH, current_cred());
 }
diff --git a/fs/nsfs.c b/fs/nsfs.c
index bf27d5da91f1..db91de208645 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -99,6 +99,19 @@ int ns_get_path(struct path *path, struct task_struct *task,
 	return ns_get_path_cb(path, ns_get_path_task, &args);
 }
 
+struct file *open_namespace_file(struct ns_common *ns)
+{
+	struct path path __free(path_put) = {};
+	int err;
+
+	/* call first to consume reference */
+	err = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	return dentry_open(&path, O_RDONLY, current_cred());
+}
+
 /**
  * open_namespace - open a namespace
  * @ns: the namespace to open
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 5d3f8c9e3a62..acbc22241c9c 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -61,7 +61,8 @@
 /*
  * open_tree() flags.
  */
-#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
+#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
+#define OPEN_TREE_NAMESPACE	(1 << 1)	/* Clone the target tree into a new mount namespace */
 #define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
 
 /*

-- 
2.47.3



* [PATCH 2/2] selftests/open_tree: add OPEN_TREE_NAMESPACE tests
  2025-12-29 13:03 [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Christian Brauner
  2025-12-29 13:03 ` [PATCH 1/2] " Christian Brauner
@ 2025-12-29 13:03 ` Christian Brauner
  2025-12-29 15:24 ` [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2025-12-29 13:03 UTC (permalink / raw)
  To: linux-fsdevel, Jeff Layton
  Cc: Alexander Viro, Amir Goldstein, Josef Bacik, Jan Kara,
	Aleksa Sarai, Christian Brauner

Add tests for OPEN_TREE_NAMESPACE.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../selftests/filesystems/open_tree_ns/.gitignore  |    1 +
 .../selftests/filesystems/open_tree_ns/Makefile    |   10 +
 .../filesystems/open_tree_ns/open_tree_ns_test.c   | 1030 ++++++++++++++++++++
 tools/testing/selftests/filesystems/utils.c        |   26 +
 tools/testing/selftests/filesystems/utils.h        |    1 +
 5 files changed, 1068 insertions(+)

diff --git a/tools/testing/selftests/filesystems/open_tree_ns/.gitignore b/tools/testing/selftests/filesystems/open_tree_ns/.gitignore
new file mode 100644
index 000000000000..fb12b93fbcaa
--- /dev/null
+++ b/tools/testing/selftests/filesystems/open_tree_ns/.gitignore
@@ -0,0 +1 @@
+open_tree_ns_test
diff --git a/tools/testing/selftests/filesystems/open_tree_ns/Makefile b/tools/testing/selftests/filesystems/open_tree_ns/Makefile
new file mode 100644
index 000000000000..73c03c4a7ef6
--- /dev/null
+++ b/tools/testing/selftests/filesystems/open_tree_ns/Makefile
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+TEST_GEN_PROGS := open_tree_ns_test
+
+CFLAGS := -Wall -Werror -g $(KHDR_INCLUDES)
+LDLIBS := -lcap
+
+include ../../lib.mk
+
+$(OUTPUT)/open_tree_ns_test: open_tree_ns_test.c ../utils.c
+	$(CC) $(CFLAGS) -o $@ $^ $(LDLIBS)
diff --git a/tools/testing/selftests/filesystems/open_tree_ns/open_tree_ns_test.c b/tools/testing/selftests/filesystems/open_tree_ns/open_tree_ns_test.c
new file mode 100644
index 000000000000..9711556280ae
--- /dev/null
+++ b/tools/testing/selftests/filesystems/open_tree_ns/open_tree_ns_test.c
@@ -0,0 +1,1030 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test for OPEN_TREE_NAMESPACE flag.
+ *
+ * Test that open_tree() with OPEN_TREE_NAMESPACE creates a new mount
+ * namespace containing the specified mount tree.
+ */
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/nsfs.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../wrappers.h"
+#include "../statmount/statmount.h"
+#include "../utils.h"
+#include "../../kselftest_harness.h"
+
+#ifndef OPEN_TREE_NAMESPACE
+#define OPEN_TREE_NAMESPACE	(1 << 1)
+#endif
+
+static int get_mnt_ns_id(int fd, uint64_t *mnt_ns_id)
+{
+	if (ioctl(fd, NS_GET_MNTNS_ID, mnt_ns_id) < 0)
+		return -errno;
+	return 0;
+}
+
+static int get_mnt_ns_id_from_path(const char *path, uint64_t *mnt_ns_id)
+{
+	int fd, ret;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return -errno;
+
+	ret = get_mnt_ns_id(fd, mnt_ns_id);
+	close(fd);
+	return ret;
+}
+
+#define STATMOUNT_BUFSIZE (1 << 15)
+
+static struct statmount *statmount_alloc(uint64_t mnt_id, uint64_t mnt_ns_id, uint64_t mask)
+{
+	struct statmount *buf;
+	size_t bufsize = STATMOUNT_BUFSIZE;
+	int ret;
+
+	for (;;) {
+		buf = malloc(bufsize);
+		if (!buf)
+			return NULL;
+
+		ret = statmount(mnt_id, mnt_ns_id, mask, buf, bufsize, 0);
+		if (ret == 0)
+			return buf;
+
+		free(buf);
+		if (errno != EOVERFLOW)
+			return NULL;
+
+		bufsize <<= 1;
+	}
+}
+
+static void log_mount(struct __test_metadata *_metadata, struct statmount *sm)
+{
+	const char *fs_type = "";
+	const char *mnt_root = "";
+	const char *mnt_point = "";
+
+	if (sm->mask & STATMOUNT_FS_TYPE)
+		fs_type = sm->str + sm->fs_type;
+	if (sm->mask & STATMOUNT_MNT_ROOT)
+		mnt_root = sm->str + sm->mnt_root;
+	if (sm->mask & STATMOUNT_MNT_POINT)
+		mnt_point = sm->str + sm->mnt_point;
+
+	TH_LOG("  mnt_id: %llu, parent_id: %llu, fs_type: %s, root: %s, point: %s",
+	       (unsigned long long)sm->mnt_id,
+	       (unsigned long long)sm->mnt_parent_id,
+	       fs_type, mnt_root, mnt_point);
+}
+
+static void dump_mounts(struct __test_metadata *_metadata, uint64_t mnt_ns_id)
+{
+	uint64_t list[256];
+	ssize_t nr_mounts;
+
+	nr_mounts = listmount(LSMT_ROOT, mnt_ns_id, 0, list, 256, 0);
+	if (nr_mounts < 0) {
+		TH_LOG("listmount failed: %s", strerror(errno));
+		return;
+	}
+
+	TH_LOG("Mount namespace %llu contains %zd mount(s):",
+	       (unsigned long long)mnt_ns_id, nr_mounts);
+
+	for (ssize_t i = 0; i < nr_mounts; i++) {
+		struct statmount *sm;
+
+		sm = statmount_alloc(list[i], mnt_ns_id,
+				     STATMOUNT_MNT_BASIC |
+				     STATMOUNT_FS_TYPE |
+				     STATMOUNT_MNT_ROOT |
+				     STATMOUNT_MNT_POINT);
+		if (!sm) {
+			TH_LOG("  [%zd] mnt_id %llu: statmount failed: %s",
+			       i, (unsigned long long)list[i], strerror(errno));
+			continue;
+		}
+
+		log_mount(_metadata, sm);
+		free(sm);
+	}
+}
+
+FIXTURE(open_tree_ns)
+{
+	int fd;
+	uint64_t current_ns_id;
+};
+
+FIXTURE_VARIANT(open_tree_ns)
+{
+	const char *path;
+	unsigned int flags;
+	bool expect_success;
+	bool expect_different_ns;
+	int min_mounts;
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, basic_root)
+{
+	.path = "/",
+	.flags = OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC,
+	.expect_success = true,
+	.expect_different_ns = true,
+	/*
+	 * The empty rootfs is hidden from listmount()/mountinfo,
+	 * so we only see the bind mount on top of it.
+	 */
+	.min_mounts = 1,
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, recursive_root)
+{
+	.path = "/",
+	.flags = OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC,
+	.expect_success = true,
+	.expect_different_ns = true,
+	.min_mounts = 1,
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, subdir_tmp)
+{
+	.path = "/tmp",
+	.flags = OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC,
+	.expect_success = true,
+	.expect_different_ns = true,
+	.min_mounts = 1,
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, subdir_proc)
+{
+	.path = "/proc",
+	.flags = OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC,
+	.expect_success = true,
+	.expect_different_ns = true,
+	.min_mounts = 1,
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, recursive_tmp)
+{
+	.path = "/tmp",
+	.flags = OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC,
+	.expect_success = true,
+	.expect_different_ns = true,
+	.min_mounts = 1,
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, recursive_run)
+{
+	.path = "/run",
+	.flags = OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC,
+	.expect_success = true,
+	.expect_different_ns = true,
+	.min_mounts = 1,
+};
+
+FIXTURE_VARIANT_ADD(open_tree_ns, invalid_recursive_alone)
+{
+	.path = "/",
+	.flags = AT_RECURSIVE | OPEN_TREE_CLOEXEC,
+	.expect_success = false,
+	.expect_different_ns = false,
+	.min_mounts = 0,
+};
+
+FIXTURE_SETUP(open_tree_ns)
+{
+	int ret;
+
+	self->fd = -1;
+
+	/* Check if open_tree syscall is supported */
+	ret = sys_open_tree(-1, NULL, 0);
+	if (ret == -1 && errno == ENOSYS)
+		SKIP(return, "open_tree() syscall not supported");
+
+	/* Check if statmount/listmount are supported */
+	ret = statmount(0, 0, 0, NULL, 0, 0);
+	if (ret == -1 && errno == ENOSYS)
+		SKIP(return, "statmount() syscall not supported");
+
+	/* Get current mount namespace ID for comparison */
+	ret = get_mnt_ns_id_from_path("/proc/self/ns/mnt", &self->current_ns_id);
+	if (ret < 0)
+		SKIP(return, "Failed to get current mount namespace ID");
+}
+
+FIXTURE_TEARDOWN(open_tree_ns)
+{
+	if (self->fd >= 0)
+		close(self->fd);
+}
+
+TEST_F(open_tree_ns, create_namespace)
+{
+	uint64_t new_ns_id;
+	uint64_t list[256];
+	ssize_t nr_mounts;
+	int ret;
+
+	self->fd = sys_open_tree(AT_FDCWD, variant->path, variant->flags);
+
+	if (!variant->expect_success) {
+		ASSERT_LT(self->fd, 0);
+		ASSERT_EQ(errno, EINVAL);
+		return;
+	}
+
+	if (self->fd < 0 && errno == EINVAL)
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+
+	ASSERT_GE(self->fd, 0);
+
+	/* Verify we can get the namespace ID */
+	ret = get_mnt_ns_id(self->fd, &new_ns_id);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify it's a different namespace */
+	if (variant->expect_different_ns)
+		ASSERT_NE(new_ns_id, self->current_ns_id);
+
+	/* List mounts in the new namespace */
+	nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, 0);
+	ASSERT_GE(nr_mounts, 0) {
+		TH_LOG("%m - listmount failed");
+	}
+
+	/* Verify minimum expected mounts */
+	ASSERT_GE(nr_mounts, variant->min_mounts);
+	TH_LOG("Namespace contains %zd mounts", nr_mounts);
+}
+
+TEST_F(open_tree_ns, setns_into_namespace)
+{
+	uint64_t new_ns_id;
+	pid_t pid;
+	int status;
+	int ret;
+
+	/* Skip variants that do not request OPEN_TREE_NAMESPACE */
+	if (!(variant->flags & OPEN_TREE_NAMESPACE))
+		SKIP(return, "setns test requires OPEN_TREE_NAMESPACE");
+
+	self->fd = sys_open_tree(AT_FDCWD, variant->path, variant->flags);
+	if (self->fd < 0 && errno == EINVAL)
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+
+	ASSERT_GE(self->fd, 0);
+
+	/* Get namespace ID and dump all mounts */
+	ret = get_mnt_ns_id(self->fd, &new_ns_id);
+	ASSERT_EQ(ret, 0);
+
+	dump_mounts(_metadata, new_ns_id);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* Child: try to enter the namespace */
+		if (setns(self->fd, CLONE_NEWNS) < 0)
+			_exit(1);
+		_exit(0);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+}
+
+TEST_F(open_tree_ns, verify_mount_properties)
+{
+	struct statmount sm;
+	uint64_t new_ns_id;
+	uint64_t list[256];
+	ssize_t nr_mounts;
+	int ret;
+
+	/* Only test with basic flags on root */
+	if (variant->flags != (OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC) ||
+	    strcmp(variant->path, "/") != 0)
+		SKIP(return, "mount properties test only for basic / case");
+
+	self->fd = sys_open_tree(AT_FDCWD, "/", OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
+	if (self->fd < 0 && errno == EINVAL)
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+
+	ASSERT_GE(self->fd, 0);
+
+	ret = get_mnt_ns_id(self->fd, &new_ns_id);
+	ASSERT_EQ(ret, 0);
+
+	nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, 0);
+	ASSERT_GE(nr_mounts, 1);
+
+	/* Get info about the root mount (the bind mount, rootfs is hidden) */
+	ret = statmount(list[0], new_ns_id, STATMOUNT_MNT_BASIC, &sm, sizeof(sm), 0);
+	ASSERT_EQ(ret, 0);
+
+	ASSERT_NE(sm.mnt_id, sm.mnt_parent_id);
+
+	TH_LOG("Root mount id: %llu, parent: %llu",
+	       (unsigned long long)sm.mnt_id,
+	       (unsigned long long)sm.mnt_parent_id);
+}
+
+FIXTURE(open_tree_ns_caps)
+{
+	bool has_caps;
+};
+
+FIXTURE_SETUP(open_tree_ns_caps)
+{
+	int ret;
+
+	/* Check if open_tree syscall is supported */
+	ret = sys_open_tree(-1, NULL, 0);
+	if (ret == -1 && errno == ENOSYS)
+		SKIP(return, "open_tree() syscall not supported");
+
+	self->has_caps = (geteuid() == 0);
+}
+
+FIXTURE_TEARDOWN(open_tree_ns_caps)
+{
+}
+
+TEST_F(open_tree_ns_caps, requires_cap_sys_admin)
+{
+	pid_t pid;
+	int status;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+
+		/* Child: drop privileges using utils.h helper */
+		if (enter_userns() != 0)
+			_exit(2);
+
+		/* Drop all caps using utils.h helper */
+		if (caps_down() == 0)
+			_exit(3);
+
+		fd = sys_open_tree(AT_FDCWD, "/",
+				   OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
+		if (fd >= 0) {
+			close(fd);
+			/* Should have failed without caps */
+			_exit(1);
+		}
+
+		if (errno == EPERM)
+			_exit(0);
+
+		/* EINVAL means OPEN_TREE_NAMESPACE not supported */
+		if (errno == EINVAL)
+			_exit(4);
+
+		/* Unexpected error */
+		_exit(5);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	switch (WEXITSTATUS(status)) {
+	case 0:
+		/* Expected: EPERM without caps */
+		break;
+	case 1:
+		ASSERT_FALSE(true) TH_LOG("OPEN_TREE_NAMESPACE succeeded without caps");
+		break;
+	case 2:
+		SKIP(return, "enter_userns failed");
+		break;
+	case 3:
+		SKIP(return, "caps_down failed");
+		break;
+	case 4:
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+		break;
+	default:
+		ASSERT_FALSE(true) TH_LOG("Unexpected error in child (exit %d)",
+					  WEXITSTATUS(status));
+		break;
+	}
+}
+
+FIXTURE(open_tree_ns_userns)
+{
+	int fd;
+};
+
+FIXTURE_SETUP(open_tree_ns_userns)
+{
+	int ret;
+
+	self->fd = -1;
+
+	/* Check if open_tree syscall is supported */
+	ret = sys_open_tree(-1, NULL, 0);
+	if (ret == -1 && errno == ENOSYS)
+		SKIP(return, "open_tree() syscall not supported");
+
+	/* Check if statmount/listmount are supported */
+	ret = statmount(0, 0, 0, NULL, 0, 0);
+	if (ret == -1 && errno == ENOSYS)
+		SKIP(return, "statmount() syscall not supported");
+}
+
+FIXTURE_TEARDOWN(open_tree_ns_userns)
+{
+	if (self->fd >= 0)
+		close(self->fd);
+}
+
+TEST_F(open_tree_ns_userns, create_in_userns)
+{
+	pid_t pid;
+	int status;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		uint64_t new_ns_id;
+		uint64_t list[256];
+		ssize_t nr_mounts;
+		int fd;
+
+		/* Create new user namespace (also creates mount namespace) */
+		if (enter_userns() != 0)
+			_exit(2);
+
+		/* Now we have CAP_SYS_ADMIN in the user namespace */
+		fd = sys_open_tree(AT_FDCWD, "/",
+				   OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
+		if (fd < 0) {
+			if (errno == EINVAL)
+				_exit(4); /* OPEN_TREE_NAMESPACE not supported */
+			_exit(1);
+		}
+
+		/* Verify we can get the namespace ID */
+		if (get_mnt_ns_id(fd, &new_ns_id) != 0)
+			_exit(5);
+
+		/* Verify we can list mounts in the new namespace */
+		nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, 0);
+		if (nr_mounts < 0)
+			_exit(6);
+
+		/* Should have at least 1 mount */
+		if (nr_mounts < 1)
+			_exit(7);
+
+		close(fd);
+		_exit(0);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	switch (WEXITSTATUS(status)) {
+	case 0:
+		/* Success */
+		break;
+	case 1:
+		ASSERT_FALSE(true) TH_LOG("open_tree(OPEN_TREE_NAMESPACE) failed in userns");
+		break;
+	case 2:
+		SKIP(return, "enter_userns failed");
+		break;
+	case 4:
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+		break;
+	case 5:
+		ASSERT_FALSE(true) TH_LOG("Failed to get mount namespace ID");
+		break;
+	case 6:
+		ASSERT_FALSE(true) TH_LOG("listmount failed in new namespace");
+		break;
+	case 7:
+		ASSERT_FALSE(true) TH_LOG("New namespace has no mounts");
+		break;
+	default:
+		ASSERT_FALSE(true) TH_LOG("Unexpected error in child (exit %d)",
+					  WEXITSTATUS(status));
+		break;
+	}
+}
+
+TEST_F(open_tree_ns_userns, setns_in_userns)
+{
+	pid_t pid;
+	int status;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		uint64_t new_ns_id;
+		int fd;
+		pid_t inner_pid;
+		int inner_status;
+
+		/* Create new user namespace */
+		if (enter_userns() != 0)
+			_exit(2);
+
+		fd = sys_open_tree(AT_FDCWD, "/",
+				   OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
+		if (fd < 0) {
+			if (errno == EINVAL)
+				_exit(4);
+			_exit(1);
+		}
+
+		if (get_mnt_ns_id(fd, &new_ns_id) != 0)
+			_exit(5);
+
+		/* Fork again to test setns into the new namespace */
+		inner_pid = fork();
+		if (inner_pid < 0)
+			_exit(8);
+
+		if (inner_pid == 0) {
+			/* Inner child: enter the new namespace */
+			if (setns(fd, CLONE_NEWNS) < 0)
+				_exit(1);
+			_exit(0);
+		}
+
+		if (waitpid(inner_pid, &inner_status, 0) != inner_pid)
+			_exit(9);
+
+		if (!WIFEXITED(inner_status) || WEXITSTATUS(inner_status) != 0)
+			_exit(10);
+
+		close(fd);
+		_exit(0);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	switch (WEXITSTATUS(status)) {
+	case 0:
+		/* Success */
+		break;
+	case 1:
+		ASSERT_FALSE(true) TH_LOG("open_tree or setns failed in userns");
+		break;
+	case 2:
+		SKIP(return, "enter_userns failed");
+		break;
+	case 4:
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+		break;
+	case 5:
+		ASSERT_FALSE(true) TH_LOG("Failed to get mount namespace ID");
+		break;
+	case 8:
+		ASSERT_FALSE(true) TH_LOG("Inner fork failed");
+		break;
+	case 9:
+		ASSERT_FALSE(true) TH_LOG("Inner waitpid failed");
+		break;
+	case 10:
+		ASSERT_FALSE(true) TH_LOG("setns into new namespace failed");
+		break;
+	default:
+		ASSERT_FALSE(true) TH_LOG("Unexpected error in child (exit %d)",
+					  WEXITSTATUS(status));
+		break;
+	}
+}
+
+TEST_F(open_tree_ns_userns, recursive_in_userns)
+{
+	pid_t pid;
+	int status;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		uint64_t new_ns_id;
+		uint64_t list[256];
+		ssize_t nr_mounts;
+		int fd;
+
+		/* Create new user namespace */
+		if (enter_userns() != 0)
+			_exit(2);
+
+		/* Test recursive flag in userns */
+		fd = sys_open_tree(AT_FDCWD, "/",
+				   OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC);
+		if (fd < 0) {
+			if (errno == EINVAL)
+				_exit(4);
+			_exit(1);
+		}
+
+		if (get_mnt_ns_id(fd, &new_ns_id) != 0)
+			_exit(5);
+
+		nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, 0);
+		if (nr_mounts < 0)
+			_exit(6);
+
+		/* Recursive should copy submounts too */
+		if (nr_mounts < 1)
+			_exit(7);
+
+		close(fd);
+		_exit(0);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	switch (WEXITSTATUS(status)) {
+	case 0:
+		/* Success */
+		break;
+	case 1:
+		ASSERT_FALSE(true) TH_LOG("open_tree(OPEN_TREE_NAMESPACE|AT_RECURSIVE) failed in userns");
+		break;
+	case 2:
+		SKIP(return, "enter_userns failed");
+		break;
+	case 4:
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+		break;
+	case 5:
+		ASSERT_FALSE(true) TH_LOG("Failed to get mount namespace ID");
+		break;
+	case 6:
+		ASSERT_FALSE(true) TH_LOG("listmount failed in new namespace");
+		break;
+	case 7:
+		ASSERT_FALSE(true) TH_LOG("New namespace has no mounts");
+		break;
+	default:
+		ASSERT_FALSE(true) TH_LOG("Unexpected error in child (exit %d)",
+					  WEXITSTATUS(status));
+		break;
+	}
+}
+
+TEST_F(open_tree_ns_userns, umount_fails_einval)
+{
+	pid_t pid;
+	int status;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		uint64_t new_ns_id;
+		uint64_t list[256];
+		ssize_t nr_mounts;
+		int fd;
+		ssize_t i;
+
+		/* Create new user namespace */
+		if (enter_userns() != 0)
+			_exit(2);
+
+		fd = sys_open_tree(AT_FDCWD, "/",
+				   OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC);
+		if (fd < 0) {
+			if (errno == EINVAL)
+				_exit(4);
+			_exit(1);
+		}
+
+		if (get_mnt_ns_id(fd, &new_ns_id) != 0)
+			_exit(5);
+
+		/* Get all mounts in the new namespace */
+		nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, LISTMOUNT_REVERSE);
+		if (nr_mounts < 0)
+			_exit(9);
+
+		if (nr_mounts < 1)
+			_exit(10);
+
+		/* Enter the new namespace */
+		if (setns(fd, CLONE_NEWNS) < 0)
+			_exit(6);
+
+		for (i = 0; i < nr_mounts; i++) {
+			struct statmount *sm;
+			const char *mnt_point;
+
+			sm = statmount_alloc(list[i], new_ns_id,
+					     STATMOUNT_MNT_POINT);
+			if (!sm)
+				_exit(11);
+
+			mnt_point = sm->str + sm->mnt_point;
+
+			TH_LOG("Trying to umount %s", mnt_point);
+			if (umount2(mnt_point, MNT_DETACH) == 0) {
+				free(sm);
+				_exit(7);
+			}
+
+			if (errno != EINVAL) {
+				/* Wrong error */
+				free(sm);
+				_exit(8);
+			}
+
+			free(sm);
+		}
+
+		close(fd);
+		_exit(0);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	switch (WEXITSTATUS(status)) {
+	case 0:
+		break;
+	case 1:
+		ASSERT_FALSE(true) TH_LOG("open_tree(OPEN_TREE_NAMESPACE) failed");
+		break;
+	case 2:
+		SKIP(return, "enter_userns failed");
+		break;
+	case 4:
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+		break;
+	case 5:
+		ASSERT_FALSE(true) TH_LOG("Failed to get mount namespace ID");
+		break;
+	case 6:
+		ASSERT_FALSE(true) TH_LOG("setns into new namespace failed");
+		break;
+	case 7:
+		ASSERT_FALSE(true) TH_LOG("umount succeeded but should have failed with EINVAL");
+		break;
+	case 8:
+		ASSERT_FALSE(true) TH_LOG("umount failed with wrong error (expected EINVAL)");
+		break;
+	case 9:
+		ASSERT_FALSE(true) TH_LOG("listmount failed");
+		break;
+	case 10:
+		ASSERT_FALSE(true) TH_LOG("No mounts in new namespace");
+		break;
+	case 11:
+		ASSERT_FALSE(true) TH_LOG("statmount_alloc failed");
+		break;
+	default:
+		ASSERT_FALSE(true) TH_LOG("Unexpected error in child (exit %d)",
+					  WEXITSTATUS(status));
+		break;
+	}
+}
+
+TEST_F(open_tree_ns_userns, umount_succeeds)
+{
+	pid_t pid;
+	int status;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		uint64_t new_ns_id;
+		uint64_t list[256];
+		ssize_t nr_mounts;
+		int fd;
+		ssize_t i;
+
+		if (unshare(CLONE_NEWNS))
+			_exit(1);
+
+		if (sys_mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) != 0)
+			_exit(1);
+
+		fd = sys_open_tree(AT_FDCWD, "/",
+				   OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC);
+		if (fd < 0) {
+			if (errno == EINVAL)
+				_exit(4);
+			_exit(1);
+		}
+
+		if (get_mnt_ns_id(fd, &new_ns_id) != 0)
+			_exit(5);
+
+		/* Get all mounts in the new namespace */
+		nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, LISTMOUNT_REVERSE);
+		if (nr_mounts < 0)
+			_exit(9);
+
+		if (nr_mounts < 1)
+			_exit(10);
+
+		/* Enter the new namespace */
+		if (setns(fd, CLONE_NEWNS) < 0)
+			_exit(6);
+
+		for (i = 0; i < nr_mounts; i++) {
+			struct statmount *sm;
+			const char *mnt_point;
+
+			sm = statmount_alloc(list[i], new_ns_id,
+					     STATMOUNT_MNT_POINT);
+			if (!sm)
+				_exit(11);
+
+			mnt_point = sm->str + sm->mnt_point;
+
+			TH_LOG("Trying to umount %s", mnt_point);
+			if (umount2(mnt_point, MNT_DETACH) != 0) {
+				free(sm);
+				_exit(7);
+			}
+
+			free(sm);
+		}
+
+		close(fd);
+		_exit(0);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+
+	switch (WEXITSTATUS(status)) {
+	case 0:
+		break;
+	case 1:
+		ASSERT_FALSE(true) TH_LOG("open_tree(OPEN_TREE_NAMESPACE) failed");
+		break;
+	case 2:
+		SKIP(return, "setup_userns failed");
+		break;
+	case 4:
+		SKIP(return, "OPEN_TREE_NAMESPACE not supported");
+		break;
+	case 5:
+		ASSERT_FALSE(true) TH_LOG("Failed to get mount namespace ID");
+		break;
+	case 6:
+		ASSERT_FALSE(true) TH_LOG("setns into new namespace failed");
+		break;
+	case 7:
+		ASSERT_FALSE(true) TH_LOG("umount failed but should have succeeded");
+		break;
+	case 9:
+		ASSERT_FALSE(true) TH_LOG("listmount failed");
+		break;
+	case 10:
+		ASSERT_FALSE(true) TH_LOG("No mounts in new namespace");
+		break;
+	case 11:
+		ASSERT_FALSE(true) TH_LOG("statmount_alloc failed");
+		break;
+	default:
+		ASSERT_FALSE(true) TH_LOG("Unexpected error in child (exit %d)",
+					  WEXITSTATUS(status));
+		break;
+	}
+}
+
+FIXTURE(open_tree_ns_unbindable)
+{
+	char tmpdir[PATH_MAX];
+	bool mounted;
+};
+
+FIXTURE_SETUP(open_tree_ns_unbindable)
+{
+	int ret;
+
+	self->mounted = false;
+
+	/* Check if open_tree syscall is supported */
+	ret = sys_open_tree(-1, NULL, 0);
+	if (ret == -1 && errno == ENOSYS)
+		SKIP(return, "open_tree() syscall not supported");
+
+	/* Create a temporary directory for the test mount */
+	snprintf(self->tmpdir, sizeof(self->tmpdir),
+		 "/tmp/open_tree_ns_test.XXXXXX");
+	ASSERT_NE(mkdtemp(self->tmpdir), NULL);
+
+	/* Mount tmpfs there */
+	ret = mount("tmpfs", self->tmpdir, "tmpfs", 0, NULL);
+	if (ret < 0) {
+		rmdir(self->tmpdir);
+		SKIP(return, "Failed to mount tmpfs");
+	}
+	self->mounted = true;
+
+	ret = mount(NULL, self->tmpdir, NULL, MS_UNBINDABLE, NULL);
+	if (ret < 0) {
+		rmdir(self->tmpdir);
+		SKIP(return, "Failed to make tmpfs unbindable");
+	}
+}
+
+FIXTURE_TEARDOWN(open_tree_ns_unbindable)
+{
+	if (self->mounted)
+		umount2(self->tmpdir, MNT_DETACH);
+	rmdir(self->tmpdir);
+}
+
+TEST_F(open_tree_ns_unbindable, fails_on_unbindable)
+{
+	int fd;
+
+	fd = sys_open_tree(AT_FDCWD, self->tmpdir,
+			   OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
+	ASSERT_LT(fd, 0);
+}
+
+TEST_F(open_tree_ns_unbindable, recursive_skips_on_unbindable)
+{
+	uint64_t new_ns_id;
+	uint64_t list[256];
+	ssize_t nr_mounts;
+	int fd;
+	ssize_t i;
+	bool found_unbindable = false;
+
+	fd = sys_open_tree(AT_FDCWD, "/",
+			   OPEN_TREE_NAMESPACE | AT_RECURSIVE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(fd, 0);
+
+	ASSERT_EQ(get_mnt_ns_id(fd, &new_ns_id), 0);
+
+	nr_mounts = listmount(LSMT_ROOT, new_ns_id, 0, list, 256, 0);
+	ASSERT_GE(nr_mounts, 0) {
+		TH_LOG("listmount failed: %m");
+	}
+
+	/*
+	 * Iterate through all mounts in the new namespace and verify
+	 * the unbindable tmpfs mount was silently dropped.
+	 */
+	for (i = 0; i < nr_mounts; i++) {
+		struct statmount *sm;
+		const char *mnt_point;
+
+		sm = statmount_alloc(list[i], new_ns_id, STATMOUNT_MNT_POINT);
+		ASSERT_NE(sm, NULL) {
+			TH_LOG("statmount_alloc failed for mnt_id %llu",
+			       (unsigned long long)list[i]);
+		}
+
+		mnt_point = sm->str + sm->mnt_point;
+
+		if (strcmp(mnt_point, self->tmpdir) == 0) {
+			TH_LOG("Found unbindable mount at %s (should have been dropped)",
+			       mnt_point);
+			found_unbindable = true;
+		}
+
+		free(sm);
+	}
+
+	ASSERT_FALSE(found_unbindable) {
+		TH_LOG("Unbindable mount at %s was not dropped", self->tmpdir);
+	}
+
+	close(fd);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/filesystems/utils.c b/tools/testing/selftests/filesystems/utils.c
index c9dd5412b37b..d6f26f849053 100644
--- a/tools/testing/selftests/filesystems/utils.c
+++ b/tools/testing/selftests/filesystems/utils.c
@@ -515,6 +515,32 @@ int setup_userns(void)
 	return 0;
 }
 
+int enter_userns(void)
+{
+	int ret;
+	char buf[32];
+	uid_t uid = getuid();
+	gid_t gid = getgid();
+
+	ret = unshare(CLONE_NEWUSER);
+	if (ret)
+		return ret;
+
+	sprintf(buf, "0 %d 1", uid);
+	ret = write_file("/proc/self/uid_map", buf);
+	if (ret)
+		return ret;
+	ret = write_file("/proc/self/setgroups", "deny");
+	if (ret)
+		return ret;
+	sprintf(buf, "0 %d 1", gid);
+	ret = write_file("/proc/self/gid_map", buf);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 /* caps_down - lower all effective caps */
 int caps_down(void)
 {
diff --git a/tools/testing/selftests/filesystems/utils.h b/tools/testing/selftests/filesystems/utils.h
index 70f7ccc607f4..0bccfed666a9 100644
--- a/tools/testing/selftests/filesystems/utils.h
+++ b/tools/testing/selftests/filesystems/utils.h
@@ -28,6 +28,7 @@ extern int cap_down(cap_value_t down);
 
 extern bool switch_ids(uid_t uid, gid_t gid);
 extern int setup_userns(void);
+extern int enter_userns(void);
 
 static inline bool switch_userns(int fd, uid_t uid, gid_t gid, bool drop_caps)
 {

-- 
2.47.3



* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2025-12-29 13:03 [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Christian Brauner
  2025-12-29 13:03 ` [PATCH 1/2] " Christian Brauner
  2025-12-29 13:03 ` [PATCH 2/2] selftests/open_tree: add OPEN_TREE_NAMESPACE tests Christian Brauner
@ 2025-12-29 15:24 ` Jeff Layton
  2026-01-05 20:29 ` Jeff Layton
  2026-01-19 17:11 ` Askar Safin
  4 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2025-12-29 15:24 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel
  Cc: Alexander Viro, Amir Goldstein, Josef Bacik, Jan Kara,
	Aleksa Sarai

On Mon, 2025-12-29 at 14:03 +0100, Christian Brauner wrote:
> When creating containers the setup usually involves using CLONE_NEWNS
> via clone3() or unshare(). This copies the caller's complete mount
> namespace. The runtime will also assemble a new rootfs and then use
> pivot_root() to switch the old mount tree with the new rootfs. Afterward
> it will recursively umount the old mount tree thereby getting rid of all
> mounts.
> 
> On a basic system here where the mount table isn't particularly large
> this still copies about 30 mounts. Copying all of these mounts only to
> get rid of them later is pretty wasteful.
> 
> This is exacerbated if intermediary mount namespaces are used that only
> exist for a very short amount of time and are immediately destroyed
> again causing a ton of mounts to be copied and destroyed needlessly.
> 
> With a large mount table and a system where thousands or tens of
> thousands of namespaces are spawned in parallel this quickly becomes a
> bottleneck increasing contention on the semaphore.
> 
> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> returning a file descriptor referring to that mount tree
> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> to a new mount namespace. In that new mount namespace the copied mount
> tree has been mounted on top of a copy of the real rootfs.
> 
> The caller can setns() into that mount namespace and perform any
> additional setup such as move_mount()ing detached mounts in there.
> 
> This allows OPEN_TREE_NAMESPACE to function as a combined
> unshare(CLONE_NEWNS) and pivot_root().
> 
> A caller may for example choose to create an extremely minimal rootfs:
> 
> fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> 
> This will create a mount namespace where "wootwoot" has become the
> rootfs mounted on top of the real rootfs. The caller can now setns()
> into this new mount namespace and assemble additional mounts.
> 
> This also works with user namespaces:
> 
> unshare(CLONE_NEWUSER);
> fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> 
> which creates a new mount namespace owned by the earlier created user
> namespace with "wootwoot" as the rootfs mounted on top of the real
> rootfs.
> 
> This will scale a lot better when creating tons of mount namespaces and
> will make it possible to get rid of a lot of unnecessary mount and
> umount cycles. It also allows creating mount namespaces without needing
> to spawn throwaway helper processes.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (2):
>       mount: add OPEN_TREE_NAMESPACE
>       selftests/open_tree: add OPEN_TREE_NAMESPACE tests
> 
>  fs/internal.h                                      |    1 +
>  fs/namespace.c                                     |  155 ++-
>  fs/nsfs.c                                          |   13 +
>  include/uapi/linux/mount.h                         |    3 +-
>  .../selftests/filesystems/open_tree_ns/.gitignore  |    1 +
>  .../selftests/filesystems/open_tree_ns/Makefile    |   10 +
>  .../filesystems/open_tree_ns/open_tree_ns_test.c   | 1030 ++++++++++++++++++++
>  tools/testing/selftests/filesystems/utils.c        |   26 +
>  tools/testing/selftests/filesystems/utils.h        |    1 +
>  9 files changed, 1223 insertions(+), 17 deletions(-)
> ---
> base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
> change-id: 20251229-work-empty-namespace-352a9c2dfe0a

Thanks Christian,

This looks really cool. FWIW, I had a discussion at LPC with Aleksa and
Christian a few weeks ago about the performance problems we were seeing
with spawning a lot of containers in parallel. Cloning a mount
namespace is rather expensive and involves global locks. Hopefully this
would help reduce the contention for them.

I'm on holiday until next week, but I'll plan to play with this when I
get back, and see how it performs vs. the traditional
unshare()/pivot_root().

Cheers!
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2025-12-29 13:03 [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Christian Brauner
                   ` (2 preceding siblings ...)
  2025-12-29 15:24 ` [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Jeff Layton
@ 2026-01-05 20:29 ` Jeff Layton
  2026-01-06 22:47   ` Christian Brauner
  2026-01-19 17:11 ` Askar Safin
  4 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2026-01-05 20:29 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel
  Cc: Alexander Viro, Amir Goldstein, Josef Bacik, Jan Kara,
	Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 5033 bytes --]

On Mon, 2025-12-29 at 14:03 +0100, Christian Brauner wrote:
> [...]

I sat down today and rolled the attached program. It's a nonsensical
test that just tries to fork new tasks that then spawn new mount
namespaces and switch into them as quickly as possible.

Assuming that I've done this correctly, this gives me rough numbers
from a test host that I checked out inside Meta:

With the older pivot_root() based method, I can create about 73k
"containers" in 60s. With the newer open_tree() method, I can create
about 109k in the same time. So it seems like the new method gives
roughly 50% more throughput than the older scheme (and uses far fewer
syscalls too).
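
As a rough cross-check of these figures, the arithmetic can be written
out. This is only an illustrative sketch: the helper names are made up
here, and the inputs are the approximate 73k and 109k counts quoted
above. Depending on whether "faster" means more containers per unit
time or less time per container, the gain comes out at roughly 49% or
roughly 33%.

```c
#include <assert.h>

/*
 * Percentage by which the new throughput exceeds the old one.
 * For 73000 -> 109000 containers per run this is about 49.3.
 */
static double throughput_gain_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/*
 * Percentage of wall-clock time saved per container when throughput
 * rises from "before" to "after" over the same run length.
 * For 73000 -> 109000 this is about 33.0.
 */
static double time_saved_pct(double before, double after)
{
	assert(after > 0.0);
	return (1.0 - before / after) * 100.0;
}
```

Calling throughput_gain_pct(73000, 109000) and
time_saved_pct(73000, 109000) gives the two numbers mentioned above.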

Note that the run_pivot() routine in the reproducer is based on a
strace of an earlier reproducer. That one used minijail0 to create the
containers. It's possible that there are more efficient ways to do what
it's doing with the existing APIs. It seems to do some weird stuff too
(e.g. setting everything to MS_PRIVATE twice under the old root).
Spawning a real container might have other bottlenecks too.

Still, this extension to open_tree() seems like a good idea overall,
and gets rid of a lot of useless work that we currently do when
spawning a container. The only real downside that I can see is that
container orchestrators will need changes to use the new method.
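
One possible shape for such a change is a probe with a fallback to the
old path. The sketch below is an assumption-laden illustration, not
code from the series: enter_new_mntns() and classify_open_tree_errno()
are hypothetical helpers, the value 2 for OPEN_TREE_NAMESPACE is the
one hard-coded in the attached spawnbench.c, and treating EINVAL as
"flag not supported" mirrors what the selftests in this series do.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_open_tree
#define SYS_open_tree 428	/* number on x86_64 and most other arches */
#endif
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE 2	/* assumed: value used in spawnbench.c */
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef CLONE_NEWNS
#define CLONE_NEWNS 0x00020000
#endif

enum mntns_method { MNTNS_OPEN_TREE, MNTNS_FALLBACK, MNTNS_ERROR };

/* EINVAL (unknown flag) or ENOSYS (no open_tree()) selects the old path. */
static enum mntns_method classify_open_tree_errno(int err)
{
	if (err == 0)
		return MNTNS_OPEN_TREE;
	if (err == EINVAL || err == ENOSYS)
		return MNTNS_FALLBACK;
	return MNTNS_ERROR;
}

/*
 * Hypothetical helper: try OPEN_TREE_NAMESPACE first and fall back to a
 * plain unshare(CLONE_NEWNS). Returns 0 or a negative errno value. On
 * MNTNS_FALLBACK the caller still has to pivot_root() into rootfs itself.
 * Raw syscalls are used so the sketch does not depend on libc wrappers.
 */
int enter_new_mntns(const char *rootfs, enum mntns_method *how)
{
	int fd, err;

	fd = (int)syscall(SYS_open_tree, AT_FDCWD, rootfs,
			  OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
	if (fd >= 0) {
		*how = MNTNS_OPEN_TREE;
		err = syscall(SYS_setns, fd, CLONE_NEWNS) < 0 ? errno : 0;
		close(fd);
		return -err;
	}

	err = errno;
	*how = classify_open_tree_errno(err);
	if (*how != MNTNS_FALLBACK)
		return -err;

	if (syscall(SYS_unshare, CLONE_NEWNS) < 0)
		return -errno;
	return 0;
}
```

A runtime could call enter_new_mntns() once per container and only run
its existing pivot_root() sequence when *how comes back as
MNTNS_FALLBACK.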

You can add:

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: spawnbench.c --]
[-- Type: text/x-csrc, Size: 2910 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <getopt.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <stdbool.h>

#define CONCURRENCY	88
#define DURATION	60
#define ROOTDIR		"/var/empty"

static uint64_t	total_containers;
static bool		opentree;

static void run_pivot()
{
	int ret, oldfd, newfd;

	ret = unshare(CLONE_NEWNS);
	if (ret) {
		perror("unshare");
		exit(1);
	}

	ret = mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	oldfd = openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_DIRECTORY);
	if (oldfd < 0) {
		perror("openat");
		exit(1);
	}

	newfd = openat(AT_FDCWD, ROOTDIR, O_RDONLY|O_CLOEXEC|O_DIRECTORY);
	if (newfd < 0) {
		perror("openat");
		exit(1);
	}

	ret = mount(ROOTDIR, ROOTDIR, NULL, MS_BIND|MS_REC, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	ret = chdir(ROOTDIR);
	if (ret) {
		perror("chdir");
		exit(1);
	}

	ret = syscall(SYS_pivot_root, ".", ".");
	if (ret) {
		perror("pivot_root");
		exit(1);
	}

	ret = fchdir(oldfd);
	if (ret) {
		perror("fchdir");
		exit(1);
	}

	ret = mount(NULL, ".", NULL, MS_REC|MS_PRIVATE, NULL);
	if (ret) {
		perror("mount");
		exit(1);
	}

	ret = umount2(".", MNT_DETACH);
	if (ret) {
		perror("umount");
		exit(1);
	}

	ret = fchdir(newfd);
	if (ret) {
		perror("fchdir");
		exit(1);
	}

	close(oldfd);
	close(newfd);
}

static void run_opentree()
{
	int fd, ret;

	// 2 == OPEN_TREE_NAMESPACE
	fd = syscall(SYS_open_tree, AT_FDCWD, ROOTDIR, 2);
	if (fd < 0) {
		perror("open_tree");
		exit(1);
	}

	ret = setns(fd, CLONE_NEWNS);
	if (ret) {
		perror("setns");
		exit(1);
	}

	close(fd);
}

static void *run_container(void *arg)
{
	pid_t pid;

	for (;;) {
		pid = fork();
		if (pid == 0) {
			if (opentree)
				run_opentree();
			else
				run_pivot();
			break;
		} else if (pid > 0) {
			waitpid(pid, NULL, 0);
			__atomic_add_fetch(&total_containers, 1, __ATOMIC_RELAXED);
		} else {
			perror("fork");
			exit(1);
		}
	}

	return NULL;
}

int main(int argc, char **argv)
{
	int i, opt;
	int concurrency = CONCURRENCY;
	int duration = DURATION;
	time_t start, now;

	while ((opt = getopt(argc, argv, "c:d:o")) != -1) {
		switch (opt) {
		case 'c':
			concurrency = atoi(optarg);
			break;
		case 'd':
			duration = atoi(optarg);
			break;
		case 'o':
			opentree = true;
			break;
		}
	}

	for (i = 0; i < concurrency; ++i) {
		pthread_t tid;
		int ret;

		ret = pthread_create(&tid, NULL, run_container, NULL);
		if (ret != 0) {
			fprintf(stderr, "pthread_create failed! %d\n", ret);
			exit(1);
		}
	}


	/* Initialise now as well; the loop condition reads it before the first sleep. */
	start = now = time(NULL);
	while ((now - start) < duration) {
		uint64_t val;

		__atomic_load(&total_containers, &val, __ATOMIC_RELAXED);
		printf("Total containers: %llu\n", (unsigned long long)val);
		sleep(2);
		now = time(NULL);
	}
}


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-05 20:29 ` Jeff Layton
@ 2026-01-06 22:47   ` Christian Brauner
  0 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2026-01-06 22:47 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, Alexander Viro, Amir Goldstein, Josef Bacik,
	Jan Kara, Aleksa Sarai

On Mon, Jan 05, 2026 at 03:29:35PM -0500, Jeff Layton wrote:
> On Mon, 2025-12-29 at 14:03 +0100, Christian Brauner wrote:
> > [...]
> 
> I sat down today and rolled the attached program. It's a nonsensical
> test that just tries to fork new tasks that then spawn new mount
> namespaces and switch into them as quickly as possible.
> 
> Assuming that I've done this correctly, this gives me rough numbers
> from a test host that I checked out inside Meta:
> 
> With the older pivot_root() based method, I can create about 73k
> "containers" in 60s. With the newer open_tree() method, I can create
> about 109k in the same time. So it seems like the new method gives
> roughly 50% more throughput than the older scheme (and uses far fewer
> syscalls too).
> 
> Note that the run_pivot() routine in the reproducer is based on a
> strace of an earlier reproducer. That one used minijail0 to create the
> containers. It's possible that there are more efficient ways to do what
> it's doing with the existing APIs. It seems to do some weird stuff too
> (e.g. setting everything to MS_PRIVATE twice under the old root).
> Spawning a real container might have other bottlenecks too.
> 
> Still, this extension to open_tree() seems like a good idea overall,
> and gets rid of a lot of useless work that we currently do when
> spawning a container. The only real downside that I can see is that
> container orchestrators will need changes to use the new method.
> 
> You can add:
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Tested-by: Jeff Layton <jlayton@kernel.org>

Thank you for testing this!
The basic test looks correct. The pivot_root(".", ".") part could be
simplified a tiny bit but that shouldn't matter.

Fwiw, I think swapping out the rootfs isn't something that can always
be avoided in the manner I illustrated. Some users want to spawn an
empty mount namespace (e.g., just a tmpfs on top of the real rootfs) and
then assemble a detached mount tree and swap the two mounts.

But I have a better way of doing this than what pivot_root() currently
does. The main problem with pivot_root() is not just that it moves the
old rootfs to some other location on the new rootfs. It also takes the
tasklist read lock and walks all processes on the system, trying to find
any process that uses the old rootfs as its fs root or its pwd, and then
rechroots and repwds all of these processes into the new rootfs.

But for 90% of the use-cases (containers) that is not needed. When the
container's mount namespace and rootfs are set up, the task creating
that container is the only task that is using the old rootfs, and that
task could very well just rechroot itself after it has unmounted the old
rootfs.

So in essence pivot_root() holds the tasklist lock and walks all tasks
on the system for no reason. If the user has a beefy and busy machine
with lots of processes coming and going, each pivot_root() penalizes the
whole system.

I have a patchset that allows MOVE_MOUNT_BENEATH to work with the
rootfs (and an extension that adds MOVE_MOUNT_PIVOT_ROOT, which
optionally does the rechrooting in case someone really needs it).

With MOVE_MOUNT_BENEATH working with the real rootfs that effectively
means one can stuff a new rootfs under the current rootfs, unmount the
old rootfs and chroot into the new rootfs and then be done.

That completely avoids the tasklist locking and has other benefits.

* You get the pivot_root(".", ".") trick for free.

* MOVE_MOUNT_BENEATH works with detached mounts meaning you can assemble
  your whole rootfs in a detached mount tree (since detached mounts can
  now be mounted onto other detached mounts) and then swap the old
  rootfs with your new rootfs.

* MOVE_MOUNT_BENEATH works with mount propagation. (Which means you
  could live-update the rootfs for all services. It would be a bit more
  complicated to actually make this work nicely but it would work.)
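
A rough sketch of the "stuff beneath, unmount, chroot" sequence
described above. This assumes a kernel where MOVE_MOUNT_BENEATH is
permitted on the rootfs, which is not the case without the patchset
mentioned here. The helper name swap_rootfs_beneath(), the exact call
ordering and the use of fchdir() + chroot(".") to reach the new rootfs
are assumptions made for illustration, not taken from that patchset.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values as in the mainline UAPI headers. */
#ifndef SYS_move_mount
#define SYS_move_mount 429	/* unified syscall number on most arches */
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

static_assert(MOVE_MOUNT_BENEATH == 0x00000200,
	      "unexpected MOVE_MOUNT_BENEATH value");

/*
 * Hypothetical helper: swap the rootfs without pivot_root().
 * fd_newroot is expected to refer to a detached mount tree, e.g. one
 * obtained via open_tree(..., OPEN_TREE_CLONE).
 * Returns 0 on success, -1 with errno set by the first failing step.
 */
int swap_rootfs_beneath(int fd_newroot)
{
	/* 1. Stuff the new rootfs beneath the current rootfs. */
	if (syscall(SYS_move_mount, fd_newroot, "", AT_FDCWD, "/",
		    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH) < 0)
		return -1;

	/* 2. Unmount the old rootfs, which now sits on top. */
	if (umount2("/", MNT_DETACH) < 0)
		return -1;

	/* 3. Only the calling task rechroots itself into the new rootfs. */
	if (fchdir(fd_newroot) < 0 || chroot(".") < 0)
		return -1;

	return 0;
}
```

No other task is touched here, which is the whole point compared to
pivot_root().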

I just had a few thoughts in this area I wanted to note.

* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2025-12-29 13:03 ` [PATCH 1/2] " Christian Brauner
@ 2026-01-08 22:37   ` Aleksa Sarai
  2026-01-12 13:00     ` Christian Brauner
  2026-02-24 11:23   ` Florian Weimer
  1 sibling, 1 reply; 24+ messages in thread
From: Aleksa Sarai @ 2026-01-08 22:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara

On 2025-12-29, Christian Brauner <brauner@kernel.org> wrote:
> When creating containers the setup usually involves using CLONE_NEWNS
> via clone3() or unshare(). This copies the caller's complete mount
> namespace. The runtime will also assemble a new rootfs and then use
> pivot_root() to switch the old mount tree with the new rootfs. Afterward
> it will recursively umount the old mount tree thereby getting rid of all
> mounts.
> 
> On a basic system here where the mount table isn't particularly large
> this still copies about 30 mounts. Copying all of these mounts only to
> get rid of them later is pretty wasteful.
> 
> This is exacerbated if intermediary mount namespaces are used that only
> exist for a very short amount of time and are immediately destroyed
> again causing a ton of mounts to be copied and destroyed needlessly.
> 
> With a large mount table and a system where thousands or ten-thousands
> of containers are spawned in parallel this quickly becomes a bottleneck
> increasing contention on the semaphore.
> 
> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> returning a file descriptor referring to that mount tree
> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> to a new mount namespace. In that new mount namespace the copied mount
> tree has been mounted on top of a copy of the real rootfs.
> 
> The caller can setns() into that mount namespace and perform any
> additionally required setup such as move_mount() detached mounts in
> there.
> 
> This allows OPEN_TREE_NAMESPACE to function as a combined
> unshare(CLONE_NEWNS) and pivot_root().
> 
> A caller may for example choose to create an extremely minimal rootfs:
> 
> fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> 
> This will create a mount namespace where "wootwoot" has become the
> rootfs mounted on top of the real rootfs. The caller can now setns()
> into this new mount namespace and assemble additional mounts.
> 
> This also works with user namespaces:
> 
> unshare(CLONE_NEWUSER);
> fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
> 
> which creates a new mount namespace owned by the earlier created user
> namespace with "wootwoot" as the rootfs mounted on top of the real
> rootfs.
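
For reference, the userspace side of the quoted example might look
roughly like the sketch below. The helper name enter_new_mntns() is
made up for illustration, and the fallback value for
OPEN_TREE_NAMESPACE is simply the one proposed in this patch, so it
could still change before anything lands in mainline. The raw
syscall() is used only because libc wrappers for open_tree() are not
available everywhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_open_tree
#define SYS_open_tree 428	/* unified syscall number on most arches */
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE (1 << 1)	/* value proposed in this patch */
#endif

static_assert((OPEN_TREE_CLONE & OPEN_TREE_NAMESPACE) == 0,
	      "OPEN_TREE_CLONE and OPEN_TREE_NAMESPACE must be distinct bits");

/*
 * Hypothetical helper: create a new mount namespace whose rootfs is a
 * copy of the mount tree at @rootfs and switch the caller into it.
 * Returns 0 on success, -1 with errno set on failure.
 */
int enter_new_mntns(const char *rootfs)
{
	int fd_mntns, ret, saved_errno;

	fd_mntns = syscall(SYS_open_tree, -EBADF, rootfs,
			   OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
	if (fd_mntns < 0)
		return -1;

	/* The returned fd refers to the new mount namespace. */
	ret = setns(fd_mntns, CLONE_NEWNS);

	/* Don't let close() clobber the errno from setns(). */
	saved_errno = errno;
	close(fd_mntns);
	errno = saved_errno;

	return ret;
}
```

A caller would pass something like "/var/lib/containers/wootwoot" and
then assemble any additional mounts from inside the new namespace. On
a kernel without this patch the open_tree() call is expected to fail
with EINVAL.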

I think there are a few other things I would really like (with my runc
hat on), such as being able to move_mount(2) into this new kind of
OPEN_TREE_NAMESPACE handle.

However, on the whole I think this seems very reasonable and is much
simpler than I anticipated when we talked about this at LPC. I've just
taken a general look but feel free to take my

Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>

and also maybe a

Suggested-by: Aleksa Sarai <cyphar@cyphar.com>

If you feel it's appropriate, since we came up with this together at
LPC? ;)

> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/internal.h              |   1 +
>  fs/namespace.c             | 155 ++++++++++++++++++++++++++++++++++++++++-----
>  fs/nsfs.c                  |  13 ++++
>  include/uapi/linux/mount.h |   3 +-
>  4 files changed, 155 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index ab638d41ab81..b5aad5265e0e 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -247,6 +247,7 @@ extern void mnt_pin_kill(struct mount *m);
>   */
>  extern const struct dentry_operations ns_dentry_operations;
>  int open_namespace(struct ns_common *ns);
> +struct file *open_namespace_file(struct ns_common *ns);
>  
>  /*
>   * fs/stat.c:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c58674a20cad..fd9698671c70 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2796,6 +2796,9 @@ static inline void unlock_mount(struct pinned_mountpoint *m)
>  		__unlock_mount(m);
>  }
>  
> +static void lock_mount_exact(const struct path *path,
> +			     struct pinned_mountpoint *mp);
> +
>  #define LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) \
>  	struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \
>  	do_lock_mount((path), &mp, (beneath))
> @@ -2946,10 +2949,11 @@ static inline bool may_copy_tree(const struct path *path)
>  	return check_anonymous_mnt(mnt);
>  }
>  
> -
> -static struct mount *__do_loopback(const struct path *old_path, int recurse)
> +static struct mount *__do_loopback(const struct path *old_path,
> +				   unsigned int flags, unsigned int copy_flags)
>  {
>  	struct mount *old = real_mount(old_path->mnt);
> +	bool recurse = flags & AT_RECURSIVE;
>  
>  	if (IS_MNT_UNBINDABLE(old))
>  		return ERR_PTR(-EINVAL);
> @@ -2960,10 +2964,22 @@ static struct mount *__do_loopback(const struct path *old_path, int recurse)
>  	if (!recurse && __has_locked_children(old, old_path->dentry))
>  		return ERR_PTR(-EINVAL);
>  
> +	/*
> +	 * When creating a new mount namespace we don't want to copy over
> +	 * mounts of mount namespaces to avoid the risk of cycles and also to
> +	 * minimize the default complex interdependencies between mount
> +	 * namespaces.
> +	 *
> +	 * We could ofc just check whether all mount namespace files aren't
> +	 * creating cycles but really let's keep this simple.
> +	 */
> +	if (!(flags & OPEN_TREE_NAMESPACE))
> +		copy_flags |= CL_COPY_MNT_NS_FILE;

I kind of think this is a somewhat theoretical issue but I don't think
we'll be bitten by it. My gut feeling is that I'd prefer this to be an
OPEN_TREE_* flag that you have to set (so we can support this in the
future) but that's kinda ugly too...

> +
>  	if (recurse)
> -		return copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
> -	else
> -		return clone_mnt(old, old_path->dentry, 0);
> +		return copy_tree(old, old_path->dentry, copy_flags);
> +
> +	return clone_mnt(old, old_path->dentry, copy_flags);
>  }
>  
>  /*
> @@ -2974,7 +2990,9 @@ static int do_loopback(const struct path *path, const char *old_name,
>  {
>  	struct path old_path __free(path_put) = {};
>  	struct mount *mnt = NULL;
> +	unsigned int flags = recurse ? AT_RECURSIVE : 0;
>  	int err;
> +
>  	if (!old_name || !*old_name)
>  		return -EINVAL;
>  	err = kern_path(old_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path);
> @@ -2991,7 +3009,7 @@ static int do_loopback(const struct path *path, const char *old_name,
>  	if (!check_mnt(mp.parent))
>  		return -EINVAL;
>  
> -	mnt = __do_loopback(&old_path, recurse);
> +	mnt = __do_loopback(&old_path, flags, 0);
>  	if (IS_ERR(mnt))
>  		return PTR_ERR(mnt);
>  
> @@ -3004,7 +3022,7 @@ static int do_loopback(const struct path *path, const char *old_name,
>  	return err;
>  }
>  
> -static struct mnt_namespace *get_detached_copy(const struct path *path, bool recursive)
> +static struct mnt_namespace *get_detached_copy(const struct path *path, unsigned int flags)
>  {
>  	struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns;
>  	struct user_namespace *user_ns = mnt_ns->user_ns;
> @@ -3029,7 +3047,7 @@ static struct mnt_namespace *get_detached_copy(const struct path *path, bool rec
>  			ns->seq_origin = src_mnt_ns->ns.ns_id;
>  	}
>  
> -	mnt = __do_loopback(path, recursive);
> +	mnt = __do_loopback(path, flags, 0);
>  	if (IS_ERR(mnt)) {
>  		emptied_ns = ns;
>  		return ERR_CAST(mnt);
> @@ -3043,9 +3061,9 @@ static struct mnt_namespace *get_detached_copy(const struct path *path, bool rec
>  	return ns;
>  }
>  
> -static struct file *open_detached_copy(struct path *path, bool recursive)
> +static struct file *open_detached_copy(struct path *path, unsigned int flags)
>  {
> -	struct mnt_namespace *ns = get_detached_copy(path, recursive);
> +	struct mnt_namespace *ns = get_detached_copy(path, flags);
>  	struct file *file;
>  
>  	if (IS_ERR(ns))
> @@ -3061,21 +3079,114 @@ static struct file *open_detached_copy(struct path *path, bool recursive)
>  	return file;
>  }
>  
> +DEFINE_FREE(put_empty_mnt_ns, struct mnt_namespace *,
> +	    if (!IS_ERR_OR_NULL(_T)) free_mnt_ns(_T))
> +
> +static struct file *open_new_namespace(struct path *path, unsigned int flags)
> +{
> +	struct mnt_namespace *new_ns __free(put_empty_mnt_ns) = NULL;
> +	struct path to_path __free(path_put) = {};
> +	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
> +	struct user_namespace *user_ns = current_user_ns();
> +	struct mount *new_ns_root;
> +	struct mount *mnt;
> +	struct ns_common *new_ns_common;
> +	unsigned int copy_flags = 0;
> +	bool locked = false;
> +
> +	if (user_ns != ns->user_ns)
> +		copy_flags |= CL_SLAVE;
> +
> +	new_ns = alloc_mnt_ns(user_ns, false);
> +	if (IS_ERR(new_ns))
> +		return ERR_CAST(new_ns);
> +
> +	scoped_guard(namespace_excl) {
> +		new_ns_root = clone_mnt(ns->root, ns->root->mnt.mnt_root, copy_flags);
> +		if (IS_ERR(new_ns_root))
> +			return ERR_CAST(new_ns_root);
> +
> +		/*
> +		 * If the real rootfs had a locked mount on top of it somewhere
> +		 * in the stack, lock the new mount tree as well so it can't be
> +		 * exposed.
> +		 */
> +		mnt = ns->root;
> +		while (mnt->overmount) {
> +			mnt = mnt->overmount;
> +			if (mnt->mnt.mnt_flags & MNT_LOCKED)
> +				locked = true;
> +		}
> +	}
> +
> +	/*
> +	 * We dropped the namespace semaphore so we can actually lock
> +	 * the copy for mounting. The copied mount isn't attached to any
> +	 * mount namespace and it is thus excluded from any propagation.
> +	 * So realistically we're isolated and the mount can't be
> +	 * overmounted.
> +	 */
> +
> +	/* Borrow the reference from clone_mnt(). */
> +	to_path.mnt = &new_ns_root->mnt;
> +	to_path.dentry = dget(new_ns_root->mnt.mnt_root);
> +
> +	/* Now lock for actual mounting. */
> +	LOCK_MOUNT_EXACT(mp, &to_path);
> +	if (unlikely(IS_ERR(mp.parent)))
> +		return ERR_CAST(mp.parent);
> +
> +	/*
> +	 * We don't emulate unshare()ing a mount namespace. We stick to the
> +	 * restrictions of creating detached bind-mounts. It has a lot
> +	 * saner and simpler semantics.
> +	 */
> +	mnt = __do_loopback(path, flags, copy_flags);
> +	if (IS_ERR(mnt))
> +		return ERR_CAST(mnt);
> +
> +	scoped_guard(mount_writer) {
> +		if (locked)
> +			mnt->mnt.mnt_flags |= MNT_LOCKED;
> +		/*
> +		 * Now mount the detached tree on top of the copy of the
> +		 * real rootfs we created.
> +		 */
> +		attach_mnt(mnt, new_ns_root, mp.mp);
> +		if (user_ns != ns->user_ns)
> +			lock_mnt_tree(new_ns_root);
> +	}
> +
> +	/* Add all mounts to the new namespace. */
> +	for (struct mount *p = new_ns_root; p; p = next_mnt(p, new_ns_root)) {
> +		mnt_add_to_ns(new_ns, p);
> +		new_ns->nr_mounts++;
> +	}
> +
> +	new_ns->root = real_mount(no_free_ptr(to_path.mnt));
> +	ns_tree_add_raw(new_ns);
> +	new_ns_common = to_ns_common(no_free_ptr(new_ns));
> +	return open_namespace_file(new_ns_common);
> +}
> +
>  static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned int flags)
>  {
>  	int ret;
>  	struct path path __free(path_put) = {};
>  	int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
> -	bool detached = flags & OPEN_TREE_CLONE;
>  
>  	BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
>  
>  	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
>  		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
> -		      OPEN_TREE_CLOEXEC))
> +		      OPEN_TREE_CLOEXEC | OPEN_TREE_NAMESPACE))
>  		return ERR_PTR(-EINVAL);
>  
> -	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
> +	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) ==
> +	    AT_RECURSIVE)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (hweight32(flags & (OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) > 1)
>  		return ERR_PTR(-EINVAL);
>  
>  	if (flags & AT_NO_AUTOMOUNT)
> @@ -3085,15 +3196,27 @@ static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned
>  	if (flags & AT_EMPTY_PATH)
>  		lookup_flags |= LOOKUP_EMPTY;
>  
> -	if (detached && !may_mount())
> +	/*
> +	 * If we create a new mount namespace with the cloned mount tree we
> +	 * just care about being privileged over our current user namespace.
> +	 * The new mount namespace will be owned by it.
> +	 */
> +	if ((flags & OPEN_TREE_NAMESPACE) &&
> +	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	if ((flags & OPEN_TREE_CLONE) && !may_mount())
>  		return ERR_PTR(-EPERM);
>  
>  	ret = user_path_at(dfd, filename, lookup_flags, &path);
>  	if (unlikely(ret))
>  		return ERR_PTR(ret);
>  
> -	if (detached)
> -		return open_detached_copy(&path, flags & AT_RECURSIVE);
> +	if (flags & OPEN_TREE_NAMESPACE)
> +		return open_new_namespace(&path, flags);
> +
> +	if (flags & OPEN_TREE_CLONE)
> +		return open_detached_copy(&path, flags);
>  
>  	return dentry_open(&path, O_PATH, current_cred());
>  }
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index bf27d5da91f1..db91de208645 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -99,6 +99,19 @@ int ns_get_path(struct path *path, struct task_struct *task,
>  	return ns_get_path_cb(path, ns_get_path_task, &args);
>  }
>  
> +struct file *open_namespace_file(struct ns_common *ns)
> +{
> +	struct path path __free(path_put) = {};
> +	int err;
> +
> +	/* call first to consume reference */
> +	err = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	return dentry_open(&path, O_RDONLY, current_cred());
> +}
> +
>  /**
>   * open_namespace - open a namespace
>   * @ns: the namespace to open
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index 5d3f8c9e3a62..acbc22241c9c 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -61,7 +61,8 @@
>  /*
>   * open_tree() flags.
>   */
> -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
> +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
> +#define OPEN_TREE_NAMESPACE	(1 << 1)	/* Clone the target tree into a new mount namespace */
>  #define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
>  
>  /*
> 
> -- 
> 2.47.3
> 
> 
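
As a reading aid, the flag validation in the vfs_open_tree() hunk
above can be mirrored in userspace as follows. This is a sketch only:
check_open_tree_flags() is a made-up name, __builtin_popcount() stands
in for the kernel's hweight32(), and the OPEN_TREE_NAMESPACE fallback
is the value from this patch. It shows which flag combinations the
patch rejects with -EINVAL before any path lookup happens.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE (1 << 1)	/* value proposed in this patch */
#endif

static_assert((OPEN_TREE_CLONE & OPEN_TREE_NAMESPACE) == 0,
	      "OPEN_TREE_CLONE and OPEN_TREE_NAMESPACE must be distinct bits");

/* Userspace mirror of the checks; returns 0 or -EINVAL. */
int check_open_tree_flags(unsigned int flags)
{
	/* Reject unknown flags. */
	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
		      OPEN_TREE_CLOEXEC | OPEN_TREE_NAMESPACE))
		return -EINVAL;

	/* AT_RECURSIVE only makes sense when a tree is being copied. */
	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) ==
	    AT_RECURSIVE)
		return -EINVAL;

	/* OPEN_TREE_CLONE and OPEN_TREE_NAMESPACE are mutually exclusive. */
	if (__builtin_popcount(flags & (OPEN_TREE_CLONE |
					OPEN_TREE_NAMESPACE)) > 1)
		return -EINVAL;

	return 0;
}
```

So AT_RECURSIVE | OPEN_TREE_NAMESPACE is accepted, while
OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE is not.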

-- 
Aleksa Sarai
https://www.cyphar.com/

* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-08 22:37   ` Aleksa Sarai
@ 2026-01-12 13:00     ` Christian Brauner
  2026-01-12 13:37       ` Aleksa Sarai
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2026-01-12 13:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara

> I think there are a few other things I would really like (with my runc
> hat on), such as being able to move_mount(2) into this new kind of
> OPEN_TREE_NAMESPACE handle.

As you know I have outlined multiple ways we can do it on-list and
then again in Tokyo. There's a ToDo item I'm going to get to soon.
Around end of November someone asked about related changes again. I have
the beginnings for that in a branch. I just need to get to it but I need
to catch up with the list first...

> However, on the whole I think this seems very reasonable and is much
> simpler than I anticipated when we talked about this at LPC. I've just
> taken a general look but feel free to take my
> 
> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> and also maybe a
> 
> Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> If you feel it's appropriate, since we came up with this together at
> LPC? ;)

Eh, well, this rubs me the wrong way a little bit, I have to say. The
idea was basically just "what if we had an empty mount namespace" which
doesn't really work without other work I mentioned there. So to me this
was basically a fragment thrown around that I then turned into the
open_tree() idea that I explained later and which is rather different.
So adding a "Suggested-by" for that kinda makes it sound like I was the
code monkey which I very much disagree with. But I'll give us both a
Suggested-by line which should solve this nicely.

* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-12 13:00     ` Christian Brauner
@ 2026-01-12 13:37       ` Aleksa Sarai
  0 siblings, 0 replies; 24+ messages in thread
From: Aleksa Sarai @ 2026-01-12 13:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara

On 2026-01-12, Christian Brauner <brauner@kernel.org> wrote:
> > I think there are a few other things I would really like (with my runc
> > hat on), such as being able to move_mount(2) into this new kind of
> > OPEN_TREE_NAMESPACE handle.
> 
> As you know I have outlined multiple ways how we can do it on-list and
> then again in Tokyo. There's a ToDo item I'm going to get to soon.
> Around end of November someone asked about related changes again. I have
> the beginnings for that in a branch. I just need to get to it but I need
> to catch up with the list first...

Sure, I just wanted to have an on-list note about that bit of the
discussion.

> > However, on the whole I think this seems very reasonable and is much
> > simpler than I anticipated when we talked about this at LPC. I've just
> > taken a general look but feel free to take my
> > 
> > Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
> > 
> > and also maybe a
> > 
> > Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
> > 
> > If you feel it's appropriate, since we came up with this together at
> > LPC? ;)
> 
> Eh, well, this rubs me the wrong way a little bit, I have to say. The
> idea was basically just "what if we had an empty mount namespace" which
> doesn't really work without other work I mentioned there. So to me this
> was basically a fragment thrown around that I then turned into the
> open_tree() idea that I explained later and which is rather different.
> So adding a "Suggested-by" for that kinda makes it sound like I was the
> code monkey which I very much disagree with. But I'll give us both a
> Suggested-by line which should solve this nicely.

Of course I didn't mean to suggest you were a code monkey.

You're quite right that this did turn out differently to the very rough
version we spoke about (though I think the same insight about removing
the wasted cloning for mounts that get blown away alleviating the cost
of the contention on namespace_sem applies regardless) -- in any case,
I'll leave it up to you.

-- 
Aleksa Sarai
https://www.cyphar.com/

* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2025-12-29 13:03 [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Christian Brauner
                   ` (3 preceding siblings ...)
  2026-01-05 20:29 ` Jeff Layton
@ 2026-01-19 17:11 ` Askar Safin
  2026-01-19 19:05   ` Andy Lutomirski
  4 siblings, 1 reply; 24+ messages in thread
From: Askar Safin @ 2026-01-19 17:11 UTC (permalink / raw)
  To: brauner
  Cc: amir73il, cyphar, jack, jlayton, josef, linux-fsdevel, viro,
	Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig

Christian Brauner <brauner@kernel.org>:
> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> returning a file descriptor referring to that mount tree
> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> to a new mount namespace. In that new mount namespace the copied mount
> tree has been mounted on top of a copy of the real rootfs.

I want to point out the security benefits of this.

[[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
I like them, and I think they should get wider exposure. ]]

If this patchset ([1]) and [2] both land (they are both in "next" now and
will likely be submitted to mainline soon) and "nullfs_rootfs" is passed on
the command line, then a mount namespace created by
open_tree(OPEN_TREE_NAMESPACE) will usually contain exactly 2 mounts: nullfs
and whatever was passed to open_tree(OPEN_TREE_NAMESPACE).

This means that even if an attacker is somehow able to unmount its root and
get access to the underlying mounts, the only underlying thing they will
get is nullfs.

This also means that other mounts are not merely hidden in the new
namespace, they are fully absent. This prevents the attacks discussed
here: [3], [4].

This also means that (assuming we have both [1] and [2] and "nullfs_rootfs"
is passed) there is no longer a hidden writable mount shared by all
containers and potentially available to attackers. This is the concern
raised in [5]:

> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> actually _be_ a filesystem. Even with your "fix", containers could communicate
> with each _other_ through it if it becomes accessible. If a container can get
> access to an empty initramfs and write into it, it can ask/answer the question
> "Are there any other containers on this machine running stux24" and then coordinate.

Note: as far as I understand, all actual security bugs are already fixed in
the kernel, runc and similar tools. Still, [1] and [2] reduce the chance of
similar bugs in the future, and this is a very good thing.

Also: [1] and [2] are pretty big changes to how mount namespaces work, so
I added more people and lists to CC.

This mail is an answer to [1].

[1] https://lore.kernel.org/all/20251229-work-empty-namespace-v1-0-bfb24c7b061f@kernel.org/
[2] https://lore.kernel.org/all/20260112-work-immutable-rootfs-v2-0-88dd1c34a204@kernel.org/

[3] https://lore.kernel.org/all/rxh6knvencwjajhgvdgzmrkwmyxwotu3itqyreun3h2pmaujhr@snhuqoq44kkf/
[4] https://github.com/opencontainers/runc/pull/1962
[5] https://lore.kernel.org/all/cec90924-e7ec-377c-fb02-e0f25ab9db73@landley.net/

-- 
Askar Safin

* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 17:11 ` Askar Safin
@ 2026-01-19 19:05   ` Andy Lutomirski
  2026-01-19 22:21     ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2026-01-19 19:05 UTC (permalink / raw)
  To: Askar Safin
  Cc: brauner, amir73il, cyphar, jack, jlayton, josef, linux-fsdevel,
	viro, Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig

On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Christian Brauner <brauner@kernel.org>:
> > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > returning a file descriptor referring to that mount tree
> > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > to a new mount namespace. In that new mount namespace the copied mount
> > tree has been mounted on top of a copy of the real rootfs.
>
> I want to point at security benefits of this.
>
> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> I like them, and I think they should get wider exposure. ]]
>
> If this patchset ([1]) and [2] both land (they are both in "next" now and
> likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> usually contain exactly 2 mounts: nullfs and whatever was passed to
> open_tree(OPEN_TREE_NAMESPACE).
>
> This means that even if attacker somehow is able to unmount its root and
> get access to underlying mounts, then the only underlying thing they will
> get is nullfs.
>
> Also this means that other mounts are not only hidden in new namespace, they
> are fully absent. This prevents attacks discussed here: [3], [4].
>
> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> is passed), there is no anymore hidden writable mount shared by all containers,
> potentially available to attackers. This is concern raised in [5]:
>
> > You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> > actually _be_ a filesystem. Even with your "fix", containers could communicate
> > with each _other_ through it if it becomes accessible. If a container can get
> > access to an empty initramfs and write into it, it can ask/answer the question
> > "Are there any other containers on this machine running stux24" and then coordinate.

I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
path that gives it sensible behavior should be conditional like this.
Either make it *always* mount on top of nullfs (regardless of boot
options) or find some way to have it actually be the root.  I assume
the latter is challenging for some reason.

--Andy

* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 19:05   ` Andy Lutomirski
@ 2026-01-19 22:21     ` Jeff Layton
  2026-01-21 10:20       ` Christian Brauner
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Jeff Layton @ 2026-01-19 22:21 UTC (permalink / raw)
  To: Andy Lutomirski, Askar Safin
  Cc: brauner, amir73il, cyphar, jack, josef, linux-fsdevel, viro,
	Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig

On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
> > 
> > Christian Brauner <brauner@kernel.org>:
> > > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > > returning a file descriptor referring to that mount tree
> > > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > > to a new mount namespace. In that new mount namespace the copied mount
> > > tree has been mounted on top of a copy of the real rootfs.
> > 
> > I want to point at security benefits of this.
> > 
> > [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> > I like them, and I think they should get wider exposure. ]]
> > 
> > If this patchset ([1]) and [2] both land (they are both in "next" now and
> > likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> > command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> > usually contain exactly 2 mounts: nullfs and whatever was passed to
> > open_tree(OPEN_TREE_NAMESPACE).
> > 
> > This means that even if attacker somehow is able to unmount its root and
> > get access to underlying mounts, then the only underlying thing they will
> > get is nullfs.
> > 
> > Also this means that other mounts are not only hidden in new namespace, they
> > are fully absent. This prevents attacks discussed here: [3], [4].
> > 
> > Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> > is passed), there is no anymore hidden writable mount shared by all containers,
> > potentially available to attackers. This is concern raised in [5]:
> > 
> > > You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> > > actually _be_ a filesystem. Even with your "fix", containers could communicate
> > > with each _other_ through it if it becomes accessible. If a container can get
> > > access to an empty initramfs and write into it, it can ask/answer the question
> > > "Are there any other containers on this machine running stux24" and then coordinate.
> 
> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
> path that gives it sensible behavior should be conditional like this.
> Either make it *always* mount on top of nullfs (regardless of boot
> options) or find some way to have it actually be the root.  I assume
> the latter is challenging for some reason.
> 

I think that's the plan. I suggested the same to Christian last week,
and he was amenable to removing the option and just always doing a
nullfs_rootfs mount.

We think that older runtimes should still "just work" with this scheme.
Out of an abundance of caution, we _might_ want a command-line option
to make it go back to the old way, in case we find some userland stuff
that doesn't like this for some reason, but hopefully we won't even need
that.
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 22:21     ` Jeff Layton
@ 2026-01-21 10:20       ` Christian Brauner
  2026-01-21 18:00       ` Andy Lutomirski
  2026-01-21 19:56       ` Rob Landley
  2 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2026-01-21 10:20 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Andy Lutomirski, Askar Safin, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Zhang Yunkai, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	emily, Christoph Hellwig

On Mon, Jan 19, 2026 at 05:21:30PM -0500, Jeff Layton wrote:
> On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
> > On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
> > > 
> > > Christian Brauner <brauner@kernel.org>:
> > > > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > > > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > > > returning a file descriptor referring to that mount tree
> > > > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > > > to a new mount namespace. In that new mount namespace the copied mount
> > > > tree has been mounted on top of a copy of the real rootfs.
> > > 
> > > I want to point at security benefits of this.
> > > 
> > > [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> > > I like them, and I think they should get wider exposure. ]]
> > > 
> > > If this patchset ([1]) and [2] both land (they are both in "next" now and
> > > likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> > > command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> > > usually contain exactly 2 mounts: nullfs and whatever was passed to
> > > open_tree(OPEN_TREE_NAMESPACE).
> > > 
> > > This means that even if attacker somehow is able to unmount its root and
> > > get access to underlying mounts, then the only underlying thing they will
> > > get is nullfs.
> > > 
> > > Also this means that other mounts are not only hidden in new namespace, they
> > > are fully absent. This prevents attacks discussed here: [3], [4].
> > > 
> > > Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> > > is passed), there is no anymore hidden writable mount shared by all containers,
> > > potentially available to attackers. This is concern raised in [5]:
> > > 
> > > > You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> > > > actually _be_ a filesystem. Even with your "fix", containers could communicate
> > > > with each _other_ through it if it becomes accessible. If a container can get
> > > > access to an empty initramfs and write into it, it can ask/answer the question
> > > > "Are there any other containers on this machine running stux24" and then coordinate.
> > 
> > I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
> > path that gives it sensible behavior should be conditional like this.
> > Either make it *always* mount on top of nullfs (regardless of boot
> > options) or find some way to have it actually be the root.  I assume
> > the latter is challenging for some reason.
> > 
> 
> I think that's the plan. I suggested the same to Christian last week,
> and he was amenable to removing the option and just always doing a
> nullfs_rootfs mount.

Whether the underlying mount is nullfs or not is irrelevant. If it's a
regular tmpfs instead of nullfs it works just as well. The new rootfs
will become locked if it has any locked overmounts, and likewise if
it'll be owned by a new userns.


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 22:21     ` Jeff Layton
  2026-01-21 10:20       ` Christian Brauner
@ 2026-01-21 18:00       ` Andy Lutomirski
  2026-01-23 10:23         ` Christian Brauner
  2026-01-21 19:56       ` Rob Landley
  2 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2026-01-21 18:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Askar Safin, brauner, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Yunkai Zhang, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	emily, Christoph Hellwig

> On Jan 19, 2026, at 2:21 PM, Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
>>> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
>>>
>>> Christian Brauner <brauner@kernel.org>:
>>>> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
>>>> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
>>>> returning a file descriptor referring to that mount tree
>>>> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
>>>> to a new mount namespace. In that new mount namespace the copied mount
>>>> tree has been mounted on top of a copy of the real rootfs.
>>>
>>> I want to point at security benefits of this.
>>>
>>> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
>>> I like them, and I think they should get wider exposure. ]]
>>>
>>> If this patchset ([1]) and [2] both land (they are both in "next" now and
>>> likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
>>> command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
>>> usually contain exactly 2 mounts: nullfs and whatever was passed to
>>> open_tree(OPEN_TREE_NAMESPACE).
>>>
>>> This means that even if attacker somehow is able to unmount its root and
>>> get access to underlying mounts, then the only underlying thing they will
>>> get is nullfs.
>>>
>>> Also this means that other mounts are not only hidden in new namespace, they
>>> are fully absent. This prevents attacks discussed here: [3], [4].
>>>
>>> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
>>> is passed), there is no anymore hidden writable mount shared by all containers,
>>> potentially available to attackers. This is concern raised in [5]:
>>>
>>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
>>>> actually _be_ a filesystem. Even with your "fix", containers could communicate
>>>> with each _other_ through it if it becomes accessible. If a container can get
>>>> access to an empty initramfs and write into it, it can ask/answer the question
>>>> "Are there any other containers on this machine running stux24" and then coordinate.
>>
>> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
>> path that gives it sensible behavior should be conditional like this.
>> Either make it *always* mount on top of nullfs (regardless of boot
>> options) or find some way to have it actually be the root.  I assume
>> the latter is challenging for some reason.
>>
>
> I think that's the plan. I suggested the same to Christian last week,
> and he was amenable to removing the option and just always doing a
> nullfs_rootfs mount.
>
> We think that older runtimes should still "just work" with this scheme.
> Out of an abundance of caution, we _might_ want a command-line option
> to make it go back to old way, in case we find some userland stuff that
> doesn't like this for some reason, but hopefully we won't even need
> that.

What I mean is: even if for some reason the kernel is running in a
mode where the *initial* rootfs is a real fs, I think it would be nice
for OPEN_TREE_NAMESPACE to use nullfs.


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 22:21     ` Jeff Layton
  2026-01-21 10:20       ` Christian Brauner
  2026-01-21 18:00       ` Andy Lutomirski
@ 2026-01-21 19:56       ` Rob Landley
  2026-02-19 23:42         ` Askar Safin
  2 siblings, 1 reply; 24+ messages in thread
From: Rob Landley @ 2026-01-21 19:56 UTC (permalink / raw)
  To: Jeff Layton, Andy Lutomirski, Askar Safin
  Cc: amir73il, cyphar, jack, josef, linux-fsdevel, viro,
	Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, emily, Christoph Hellwig

>>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
>>>> actually _be_ a filesystem. Even with your "fix", containers could communicate
>>>> with each _other_ through it if it becomes accessible. If a container can get
>>>> access to an empty initramfs and write into it, it can ask/answer the question
>>>> "Are there any other containers on this machine running stux24" and then coordinate.

Or you could just make the ROOT= codepath remount the empty initramfs -o 
ro like some switch_root implementations do. If the PID 1 you launch 
isn't in initramfs, don't leave initramfs writeable. That seems unlikely 
to break userspace.

(Having permissions to remount initramfs but _not_ having already 
"cracked root" seems... a bit funky? You have waaaaay more faith in 
security modules than I do...)

>> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
>> path that gives it sensible behavior should be conditional like this.
>> Either make it *always* mount on top of nullfs (regardless of boot
>> options) or find some way to have it actually be the root.  I assume
>> the latter is challenging for some reason.
> 
> I think that's the plan. I suggested the same to Christian last week,
> and he was amenable to removing the option and just always doing a
> nullfs_rootfs mount.

Since 2013, initramfs might be ramfs or tmpfs depending on 
circumstances. Adding a third option for it to be nullfs when there's no 
cpio.gz extracted into it seems reasonable. (You can always mount a 
tmpfs _over_ it if you need that later, it's writeable so a PID 1 
launched in it has workspace.)

That said, if you are changing the semantics, right now we switch_root 
from initramfs instead of pivot_root because initramfs couldn't be 
unmounted. With this change would pivot_root become the mechanism for 
initramfs too? (If the cpio.gz recipient wasn't actually rootfs but was 
an overmount the way ROOT= does it.)

Aside: it would be nice if inaccessible mount points could automatically 
be garbage collected. There's already some "lazy umount" plumbing that 
does that when explicitly requested to, but last I checked there were 
cases that didn't get caught. It's been a while though, might already 
have been fixed. Presumably initramfs would always get pinned because 
it's PID 0's / reference...

Also, could you guys make CONFIG_DEVTMPFS_MOUNT work with initramfs? 
I've posted patches for that on and off since 2017, most recent one's 
probably 
https://landley.net/bin/mkroot/0.8.13/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch 
(tested on a 6.17 kernel).

> We think that older runtimes should still "just work" with this scheme.
> Out of an abundance of caution, we _might_ want a command-line option
> to make it go back to old way, in case we find some userland stuff that
> doesn't like this for some reason, but hopefully we won't even need
> that.

I assume it will break stuff, but I also assume the systems it breaks 
will never upgrade to a 7.x kernel because the kernel itself would 
consume all available memory before launching PID 1.

Rob


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-21 18:00       ` Andy Lutomirski
@ 2026-01-23 10:23         ` Christian Brauner
  2026-01-24 10:13           ` Askar Safin
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2026-01-23 10:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jeff Layton, Askar Safin, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Yunkai Zhang, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	emily, Christoph Hellwig

On Wed, Jan 21, 2026 at 10:00:19AM -0800, Andy Lutomirski wrote:
> > On Jan 19, 2026, at 2:21 PM, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
> >>> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
> >>>
> >>> Christian Brauner <brauner@kernel.org>:
> >>>> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> >>>> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> >>>> returning a file descriptor referring to that mount tree
> >>>> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> >>>> to a new mount namespace. In that new mount namespace the copied mount
> >>>> tree has been mounted on top of a copy of the real rootfs.
> >>>
> >>> I want to point at security benefits of this.
> >>>
> >>> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> >>> I like them, and I think they should get wider exposure. ]]
> >>>
> >>> If this patchset ([1]) and [2] both land (they are both in "next" now and
> >>> likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> >>> command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> >>> usually contain exactly 2 mounts: nullfs and whatever was passed to
> >>> open_tree(OPEN_TREE_NAMESPACE).
> >>>
> >>> This means that even if attacker somehow is able to unmount its root and
> >>> get access to underlying mounts, then the only underlying thing they will
> >>> get is nullfs.
> >>>
> >>> Also this means that other mounts are not only hidden in new namespace, they
> >>> are fully absent. This prevents attacks discussed here: [3], [4].
> >>>
> >>> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> >>> is passed), there is no anymore hidden writable mount shared by all containers,
> >>> potentially available to attackers. This is concern raised in [5]:
> >>>
> >>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> >>>> actually _be_ a filesystem. Even with your "fix", containers could communicate
> >>>> with each _other_ through it if it becomes accessible. If a container can get
> >>>> access to an empty initramfs and write into it, it can ask/answer the question
> >>>> "Are there any other containers on this machine running stux24" and then coordinate.
> >>
> >> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
> >> path that gives it sensible behavior should be conditional like this.
> >> Either make it *always* mount on top of nullfs (regardless of boot
> >> options) or find some way to have it actually be the root.  I assume
> >> the latter is challenging for some reason.
> >>
> >
> > I think that's the plan. I suggested the same to Christian last week,
> > and he was amenable to removing the option and just always doing a
> > nullfs_rootfs mount.
> >
> > We think that older runtimes should still "just work" with this scheme.
> > Out of an abundance of caution, we _might_ want a command-line option
> > to make it go back to old way, in case we find some userland stuff that
> > doesn't like this for some reason, but hopefully we won't even need
> > that.
> 
> What I mean is: even if for some reason the kernel is running in a
> mode where the *initial* rootfs is a real fs, I think it would be nice
> for OPEN_TREE_NAMESPACE to use nullfs.

The current patchset makes nullfs unconditional. As each mount
namespace creates a new copy of the namespace root of the namespace it
was created from, all mount namespaces have nullfs as namespace root.
So every OPEN_TREE_NAMESPACE/FSMOUNT_NAMESPACE will be mounted on top of
nullfs as we always take the namespace root. If we have to make nullfs
conditional then yes, we could still do that - although it would be ugly
in various ways.

I would love to keep nullfs unconditional because it means I can wipe a
whole class of MNT_LOCKED nonsense from the face of the earth
afterwards.
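
For readers following along, a rough sketch of how userspace might use
the flag described in this thread: call open_tree() with
OPEN_TREE_NAMESPACE to get a mount namespace fd, then setns() into it.
The helper names are hypothetical. The fallback value (1 << 1) for
OPEN_TREE_NAMESPACE is an assumption and does not appear in the quoted
diff; the real value has to come from <linux/mount.h>. Kernels without
the flag are expected to reject the call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>       /* AT_FDCWD, O_CLOEXEC */
#include <sched.h>       /* setns(), CLONE_NEWNS */
#include <sys/syscall.h> /* SYS_open_tree */
#include <unistd.h>      /* syscall(), close() */

#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE (1 << 1) /* ASSUMPTION: check <linux/mount.h> */
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif

/* Hypothetical helper: returns a mount namespace fd, or -1 with errno set. */
int open_tree_namespace_fd(const char *path)
{
	assert(path != NULL);
#ifdef SYS_open_tree
	return (int)syscall(SYS_open_tree, AT_FDCWD, path,
			    OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/* Hypothetical helper: copy the tree at path into a new mount namespace
 * and switch the calling thread into it. Returns 0 or -1. */
int enter_tree_namespace(const char *path)
{
	int fd = open_tree_namespace_fd(path);
	if (fd < 0)
		return -1;

	int ret = setns(fd, CLONE_NEWNS);
	close(fd);
	return ret;
}
```

After a successful enter_tree_namespace() the caller would see only the
copied tree mounted on top of the namespace root, which per the mail
above is nullfs.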


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-23 10:23         ` Christian Brauner
@ 2026-01-24 10:13           ` Askar Safin
  0 siblings, 0 replies; 24+ messages in thread
From: Askar Safin @ 2026-01-24 10:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Jeff Layton, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Yunkai Zhang, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	Christoph Hellwig

On Fri, Jan 23, 2026 at 1:23 PM Christian Brauner <brauner@kernel.org> wrote:
> The current patchset makes nullfs unconditional. As each mount

Oops, I missed that "fs: use nullfs unconditionally as the real
rootfs" is present in vfs.all.

-- 
Askar Safin


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-21 19:56       ` Rob Landley
@ 2026-02-19 23:42         ` Askar Safin
  0 siblings, 0 replies; 24+ messages in thread
From: Askar Safin @ 2026-02-19 23:42 UTC (permalink / raw)
  To: rob; +Cc: containers, initramfs, linux-api, linux-fsdevel, linux-kernel

Rob Landley <rob@landley.net>:
> Also, could you guys make CONFIG_DEVTMPFS_MOUNT work with initramfs?

I did something similar:
https://lore.kernel.org/initramfs/20260219210312.3468980-1-safinaskar@gmail.com/T/#u

Does this solve your problem?

-- 
Askar Safin


* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2025-12-29 13:03 ` [PATCH 1/2] " Christian Brauner
  2026-01-08 22:37   ` Aleksa Sarai
@ 2026-02-24 11:23   ` Florian Weimer
  2026-02-24 12:05     ` Christian Brauner
  1 sibling, 1 reply; 24+ messages in thread
From: Florian Weimer @ 2026-02-24 11:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi

* Christian Brauner:

> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index 5d3f8c9e3a62..acbc22241c9c 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -61,7 +61,8 @@
>  /*
>   * open_tree() flags.
>   */
> -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
> +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */

This change causes pointless -Werror=undef errors in projects that have
settled on the old definition.

Reported here:

  Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
  <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>

Thanks,
Florian
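
A small illustration of why a value-preserving change can still break
builds. In C, an object-like macro may be redefined without a
diagnostic only if the replacement token sequence is identical, so a
second header that still spells the flag as 1 now conflicts with
(1 << 0). The DEMO_* names are hypothetical stand-ins for the two
spellings; the real clash is between the kernel header and a header
carrying the old definition.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the old and new spelling of the flag. */
#define DEMO_OLD_CLONE 1
#define DEMO_NEW_CLONE (1 << 0)

/* The compiler sees the same value for both... */
static_assert(DEMO_OLD_CLONE == DEMO_NEW_CLONE, "both spellings evaluate to 1");

/* ...but the preprocessor compares spellings, not values. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Returns 1 if both macros expand to the same text, 0 otherwise. */
int demo_same_tokens(void)
{
	return strcmp(DEMO_STR(DEMO_OLD_CLONE), DEMO_STR(DEMO_NEW_CLONE)) == 0;
}
```

Here demo_same_tokens() compares "1" with "(1 << 0)", which is the
difference a redefinition check trips over.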



* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-02-24 11:23   ` Florian Weimer
@ 2026-02-24 12:05     ` Christian Brauner
  2026-02-24 13:30       ` Florian Weimer
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2026-02-24 12:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi

On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> > index 5d3f8c9e3a62..acbc22241c9c 100644
> > --- a/include/uapi/linux/mount.h
> > +++ b/include/uapi/linux/mount.h
> > @@ -61,7 +61,8 @@
> >  /*
> >   * open_tree() flags.
> >   */
> > -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
> > +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
> 
> This change causes pointless -Werror=undef errors in projects that have
> settled on the old definition.
> 
> Reported here:
> 
>   Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>

Send a patch to change it back, please.
Otherwise it might take a few days until I get around to it.


* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-02-24 12:05     ` Christian Brauner
@ 2026-02-24 13:30       ` Florian Weimer
  2026-02-24 14:33         ` Christian Brauner
  0 siblings, 1 reply; 24+ messages in thread
From: Florian Weimer @ 2026-02-24 13:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi

* Christian Brauner:

> On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
>> * Christian Brauner:
>> 
>> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
>> > index 5d3f8c9e3a62..acbc22241c9c 100644
>> > --- a/include/uapi/linux/mount.h
>> > +++ b/include/uapi/linux/mount.h
>> > @@ -61,7 +61,8 @@
>> >  /*
>> >   * open_tree() flags.
>> >   */
>> > -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
>> > +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
>> 
>> This change causes pointless -Werror=undef errors in projects that have
>> settled on the old definition.
>> 
>> Reported here:
>> 
>>   Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
>>   <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
>
> Send a patch to change it back, please.
> Otherwise it might take a few days until I get around to it.

Rudi, could you post a patch?

Thanks,
Florian



* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-02-24 13:30       ` Florian Weimer
@ 2026-02-24 14:33         ` Christian Brauner
  2026-02-26 11:54           ` Jan Kara
  2026-03-02 10:15           ` Florian Weimer
  0 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2026-02-24 14:33 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi

On Tue, Feb 24, 2026 at 02:30:37PM +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
> >> * Christian Brauner:
> >> 
> >> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> >> > index 5d3f8c9e3a62..acbc22241c9c 100644
> >> > --- a/include/uapi/linux/mount.h
> >> > +++ b/include/uapi/linux/mount.h
> >> > @@ -61,7 +61,8 @@
> >> >  /*
> >> >   * open_tree() flags.
> >> >   */
> >> > -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
> >> > +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
> >> 
> >> This change causes pointless -Werror=undef errors in projects that have
> >> settled on the old definition.
> >> 
> >> Reported here:
> >> 
> >>   Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
> >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
> >
> > Send a patch to change it back, please.
> > Otherwise it might take a few days until I get around to it.
> 
> Rudi, could you post a patch?

I'm a bit confused though and not super happy that you're basically
asking us to be so constrained that we aren't even allowed to change 1
to 1 - just syntactically different.


* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-02-24 14:33         ` Christian Brauner
@ 2026-02-26 11:54           ` Jan Kara
  2026-03-02 10:15           ` Florian Weimer
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2026-02-26 11:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Florian Weimer, linux-fsdevel, Jeff Layton, Alexander Viro,
	Amir Goldstein, Josef Bacik, Jan Kara, Aleksa Sarai, linux-api,
	rudi

On Tue 24-02-26 15:33:13, Christian Brauner wrote:
> On Tue, Feb 24, 2026 at 02:30:37PM +0100, Florian Weimer wrote:
> > * Christian Brauner:
> > 
> > > On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
> > >> * Christian Brauner:
> > >> 
> > >> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> > >> > index 5d3f8c9e3a62..acbc22241c9c 100644
> > >> > --- a/include/uapi/linux/mount.h
> > >> > +++ b/include/uapi/linux/mount.h
> > >> > @@ -61,7 +61,8 @@
> > >> >  /*
> > >> >   * open_tree() flags.
> > >> >   */
> > >> > -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
> > >> > +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
> > >> 
> > >> This change causes pointless -Werror=undef errors in projects that have
> > >> settled on the old definition.
> > >> 
> > >> Reported here:
> > >> 
> > >>   Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
> > >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
> > >
> > > Send a patch to change it back, please.
> > > Otherwise it might take a few days until I get around to it.
> > 
> > Rudi, could you post a patch?
> 
> I'm a bit confused though and not super happy that you're basically
> asking us to be so constrained that we aren't even allowed to change 1
> to 1 - just syntactically different.

Agreed, this looks more like a tooling bug than anything else...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/2] mount: add OPEN_TREE_NAMESPACE
  2026-02-24 14:33         ` Christian Brauner
  2026-02-26 11:54           ` Jan Kara
@ 2026-03-02 10:15           ` Florian Weimer
  1 sibling, 0 replies; 24+ messages in thread
From: Florian Weimer @ 2026-03-02 10:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Alexander Viro, Amir Goldstein,
	Josef Bacik, Jan Kara, Aleksa Sarai, linux-api, rudi

* Christian Brauner:

> On Tue, Feb 24, 2026 at 02:30:37PM +0100, Florian Weimer wrote:
>> * Christian Brauner:
>> 
>> > On Tue, Feb 24, 2026 at 12:23:33PM +0100, Florian Weimer wrote:
>> >> * Christian Brauner:
>> >> 
>> >> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
>> >> > index 5d3f8c9e3a62..acbc22241c9c 100644
>> >> > --- a/include/uapi/linux/mount.h
>> >> > +++ b/include/uapi/linux/mount.h
>> >> > @@ -61,7 +61,8 @@
>> >> >  /*
>> >> >   * open_tree() flags.
>> >> >   */
>> >> > -#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
>> >> > +#define OPEN_TREE_CLONE		(1 << 0)	/* Clone the target tree and attach the clone */
>> >> 
>> >> This change causes pointless -Werror=undef errors in projects that have
>> >> settled on the old definition.
>> >> 
>> >> Reported here:
>> >> 
>> >>   Bug 33921 - Building with Linux-7.0-rc1 errors on OPEN_TREE_CLONE
>> >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=33921>
>> >
>> > Send a patch to change it back, please.
>> > Otherwise it might take a few days until I get around to it.
>> 
>> Rudi, could you post a patch?
>
> I'm a bit confused though and not super happy that you're basically
> asking us to be so constrained that we aren't even allowed to change 1
> to 1 - just syntactically different.

I'm not happy about it, either.  But it has happened before, for the
RENAME_* constants I believe.

We are already including <linux/mount.h> from <sys/mount.h>, so we can
work around this reliably on the glibc side, regardless of header
inclusion order.

Thanks,
Florian



Thread overview: 24+ messages
2025-12-29 13:03 [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Christian Brauner
2025-12-29 13:03 ` [PATCH 1/2] " Christian Brauner
2026-01-08 22:37   ` Aleksa Sarai
2026-01-12 13:00     ` Christian Brauner
2026-01-12 13:37       ` Aleksa Sarai
2026-02-24 11:23   ` Florian Weimer
2026-02-24 12:05     ` Christian Brauner
2026-02-24 13:30       ` Florian Weimer
2026-02-24 14:33         ` Christian Brauner
2026-02-26 11:54           ` Jan Kara
2026-03-02 10:15           ` Florian Weimer
2025-12-29 13:03 ` [PATCH 2/2] selftests/open_tree: add OPEN_TREE_NAMESPACE tests Christian Brauner
2025-12-29 15:24 ` [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Jeff Layton
2026-01-05 20:29 ` Jeff Layton
2026-01-06 22:47   ` Christian Brauner
2026-01-19 17:11 ` Askar Safin
2026-01-19 19:05   ` Andy Lutomirski
2026-01-19 22:21     ` Jeff Layton
2026-01-21 10:20       ` Christian Brauner
2026-01-21 18:00       ` Andy Lutomirski
2026-01-23 10:23         ` Christian Brauner
2026-01-24 10:13           ` Askar Safin
2026-01-21 19:56       ` Rob Landley
2026-02-19 23:42         ` Askar Safin
