From: Christian Brauner <brauner@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christian Brauner <brauner@kernel.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [GIT PULL 11/12 for v7.1] vfs mount
Date: Fri, 10 Apr 2026 17:21:34 +0200
Message-ID: <20260410-vfs-mount-v71-89e63a03df4d@brauner>
In-Reply-To: <20260410-vfs-v71-b055f260060c@brauner>
Hey Linus,
/* Summary */
* Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
namespace with the newly created filesystem attached to a copy of the
real rootfs. This returns a namespace file descriptor instead of an
O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for
open_tree().
This allows creating a new filesystem and immediately placing it in a
new mount namespace in a single operation, which is useful for
container runtimes and other namespace-based isolation mechanisms.
This accompanies OPEN_TREE_NAMESPACE and avoids a needless detour via
OPEN_TREE_NAMESPACE to get the same effect. It will be especially
useful when you mount an actual filesystem to be used as the container
rootfs.
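A minimal userspace sketch of the flow described above, not taken from
the series itself. The helper name fsmount_new_ns() is hypothetical,
and the sketch assumes that FSMOUNT_NAMESPACE is provided by an updated
<linux/mount.h>; with older headers it simply reports -ENOSYS rather
than guessing the flag value.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/mount.h>	/* FSOPEN_CLOEXEC, FSCONFIG_CMD_CREATE, FSMOUNT_* */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a filesystem of type @fstype and ask
 * fsmount() for a mount namespace fd instead of an O_PATH mount fd.
 * Returns the fd on success and a negative errno value on failure.
 */
static int fsmount_new_ns(const char *fstype)
{
#ifdef FSMOUNT_NAMESPACE
	int fs_fd, ns_fd, err;

	/* Open a filesystem configuration context. */
	fs_fd = (int)syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (fs_fd < 0)
		return -errno;

	/* Create the superblock; no mount options are set in this sketch. */
	if (syscall(SYS_fsconfig, fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		err = -errno;
		close(fs_fd);
		return err;
	}

	/* Request a new mount namespace containing the new filesystem. */
	ns_fd = (int)syscall(SYS_fsmount, fs_fd,
			     FSMOUNT_CLOEXEC | FSMOUNT_NAMESPACE, 0);
	err = ns_fd < 0 ? -errno : ns_fd;
	close(fs_fd);
	return err;
#else
	/* Headers predate the flag; its value is deliberately not guessed. */
	(void)fstype;
	return -ENOSYS;
#endif
}
```

Since the summary says the returned fd is a namespace fd, a caller
would presumably enter it with setns(fd, CLONE_NEWNS); that step is
left out here.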
* Currently, creating a new mount namespace always copies the entire
mount tree from the caller's namespace. For containers and sandboxes
that intend to build their mount table from scratch this is wasteful:
they inherit a potentially large mount tree only to immediately tear
it down.
This series adds support for creating a mount namespace that contains
only a clone of the root mount, with none of the child mounts. Two new
flags are introduced:
- CLONE_EMPTY_MNTNS (0x400000000) for clone3(), using the 64-bit flag space.
- UNSHARE_EMPTY_MNTNS (0x00100000) for unshare()
Both flags imply CLONE_NEWNS. The resulting namespace contains a
single nullfs root mount with an immutable empty directory. The
intended workflow is to then mount a real filesystem (e.g., tmpfs)
over the root and build the mount table from there.
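The workflow above could look roughly like the following sketch. It
uses only the unshare() variant and the flag value quoted above; the
helper name try_empty_mntns() is hypothetical. It runs in a forked
child so that the caller's own mount namespace is left alone, and it
stops after mounting tmpfs over the root. Any further steps needed to
make that tmpfs the process root are not shown.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef UNSHARE_EMPTY_MNTNS
#define UNSHARE_EMPTY_MNTNS 0x00100000	/* value quoted in the summary */
#endif

/*
 * Hypothetical helper: in a child process, create an empty mount
 * namespace and mount tmpfs over its root. Returns 0 on success or a
 * negative errno value (e.g. -EINVAL on kernels without the flag,
 * -EPERM without sufficient privileges).
 */
static int try_empty_mntns(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -errno;

	if (pid == 0) {
		/* The flag implies CLONE_NEWNS according to the summary. */
		if (unshare(UNSHARE_EMPTY_MNTNS) < 0)
			_exit(errno & 0xff);

		/* Mount a real filesystem over the nullfs root. */
		if (mount("tmpfs", "/", "tmpfs", 0, NULL) < 0)
			_exit(errno & 0xff);

		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -errno;
	if (!WIFEXITED(status))
		return -ECHILD;

	/* The child reports its errno value through the exit status. */
	return -WEXITSTATUS(status);
}
```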
* Allow MOVE_MOUNT_BENEATH to target the caller's rootfs, which makes it
possible to switch out the rootfs without pivot_root(2).
The traditional approach to switching the rootfs involves
pivot_root(2) or a chroot_fs_refs()-based mechanism that atomically
updates fs->root for all tasks sharing the same fs_struct. This has
consequences for fork(), unshare(CLONE_FS), and setns().
This series instead decomposes root-switching into individually
atomic, locally-scoped steps:
    fd_tree = open_tree(-EBADF, "/newroot",
                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    fchdir(fd_tree);
    move_mount(fd_tree, "", AT_FDCWD, "/",
               MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
    chroot(".");
    umount2(".", MNT_DETACH);
Since each step only modifies the caller's own state, the
fork/unshare/setns races are eliminated by design.
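For illustration, the same sequence with error handling might be
wrapped as below. This is a sketch, not code from the series: the
helper name switch_root_beneath() is hypothetical, the raw syscall()
wrappers are used in case the libc lacks open_tree()/move_mount(), and
the fallback flag definitions only apply when the installed headers do
not provide them. On a failure partway through, the caller's cwd may
already have changed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values as in include/uapi/linux/mount.h. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Hypothetical helper: switch the caller's rootfs to a clone of
 * @newroot using the steps listed above. Returns 0 on success or a
 * negative errno value from the first step that failed.
 */
static int switch_root_beneath(const char *newroot)
{
	int err = 0;
	int fd_tree;

	/* Step 1: clone the mount that should become the new rootfs. */
	fd_tree = (int)syscall(SYS_open_tree, -EBADF, newroot,
			       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd_tree < 0)
		return -errno;

	if (fchdir(fd_tree) < 0 ||			/* 2: cwd into the clone */
	    syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, "/",
		    MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH) < 0 ||
							/* 3: attach beneath "/" */
	    chroot(".") < 0 ||				/* 4: make it our root */
	    umount2(".", MNT_DETACH) < 0)		/* 5: detach, as listed above */
		err = -errno;

	close(fd_tree);
	return err;
}
```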
A key step to making this possible is to remove the locked mount
restriction. Originally, MOVE_MOUNT_BENEATH did not support mounting
beneath a mount that is locked. The locked mount protects the
underlying mount from being revealed. This is a core mechanism of
unshare(CLONE_NEWUSER | CLONE_NEWNS). The mounts in the new mount
namespace become locked. That effectively makes the new mount table
useless as the caller cannot ever get rid of any of the mounts no
matter how useless they are.
We can lift this restriction though. We simply transfer the locked
property from the top mount to the mount beneath. This works because
what we care about is to protect the underlying mount aka the parent.
The mount mounted between the parent and the top mount takes over the
job of protecting the parent mount from the top mount. This
leaves us free to remove the locked property from the top mount which
can consequently be unmounted. For example, if we do:
    unshare(CLONE_NEWUSER | CLONE_NEWNS)
and we inherit a clone of procfs on /proc, then currently we cannot
unmount it, as:
    umount -l /proc
will fail with EINVAL because the procfs mount is locked.
After this series we can now do:
    mount --beneath -t tmpfs tmpfs /proc
    umount -l /proc
after which a tmpfs mount has been placed beneath the procfs mount.
The tmpfs mount has become locked and the procfs mount has become
unlocked.
This means you can safely modify an inherited mount table after
unprivileged namespace creation.
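The transfer of the locked property can be pictured with a small toy
model. This is purely illustrative and is not the kernel data
structure or code: struct toy_mount and both helpers are made up for
this sketch, and only the locked bit and the parent link are modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mount: just a name, a locked bit and a parent. */
struct toy_mount {
	const char *name;
	bool locked;
	struct toy_mount *parent;
};

/*
 * Insert @beneath between @top and its parent. The locked property
 * moves from @top to @beneath, which now shields the parent.
 */
static void toy_mount_beneath(struct toy_mount *top, struct toy_mount *beneath)
{
	beneath->parent = top->parent;
	top->parent = beneath;
	beneath->locked = top->locked;
	top->locked = false;
}

/* A locked mount cannot be unmounted; model the EINVAL from above. */
static int toy_umount(const struct toy_mount *m)
{
	return m->locked ? -EINVAL : 0;
}
```

In this model, a locked procfs mount refuses toy_umount() until a tmpfs
has been placed beneath it, after which the tmpfs is locked and the
procfs mount can go away, mirroring the /proc example above.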
Afterwards we simply make it possible to move a mount beneath the
rootfs, which allows upgrading the rootfs.
Removing the locked restriction makes this very useful for containers
created with unshare(CLONE_NEWUSER | CLONE_NEWNS) to reshuffle an
inherited mount table safely and MOVE_MOUNT_BENEATH makes it possible
to switch out the rootfs instead of using the costly pivot_root(2).
/* Testing */
gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3+b1)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
This will have a merge conflict with:
[1]: https://lore.kernel.org/20260410-namespaces-misc-v71-000ced4f8b7a@brauner
[2]: https://lore.kernel.org/20260410-vfs-pidfs-v71-b736f79a20b9@brauner
It can be resolved as follows:
diff --cc include/uapi/linux/sched.h
index 149dbc64923b,4e76fce9f777..000000000000
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@@ -34,11 -34,9 +34,12 @@@
#define CLONE_IO 0x80000000 /* Clone io context */
/* Flags for the clone3() syscall. */
-#define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
-#define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_CLEAR_SIGHAND (1ULL << 32) /* Clear any signal handler and reset to SIG_DFL. */
+#define CLONE_INTO_CGROUP (1ULL << 33) /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_AUTOREAP (1ULL << 34) /* Auto-reap child on exit. */
+#define CLONE_NNP (1ULL << 35) /* Set no_new_privs on child. */
+#define CLONE_PIDFD_AUTOKILL (1ULL << 36) /* Kill child when clone pidfd closes. */
+ #define CLONE_EMPTY_MNTNS (1ULL << 37) /* Create an empty mount namespace. */
/*
* cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --cc kernel/fork.c
index 55a6906d3014,dea6b3454447..000000000000
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@@ -2941,9 -2906,9 +2951,9 @@@ static inline bool clone3_stack_valid(s
static bool clone3_args_valid(struct kernel_clone_args *kargs)
{
/* Verify that no unknown flags are passed along. */
-- if (kargs->flags &
- ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
- CLONE_AUTOREAP | CLONE_NNP | CLONE_PIDFD_AUTOKILL))
- ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND |
- CLONE_INTO_CGROUP | CLONE_EMPTY_MNTNS))
++ if (kargs->flags & ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND |
++ CLONE_INTO_CGROUP | CLONE_AUTOREAP | CLONE_NNP |
++ CLONE_PIDFD_AUTOKILL | CLONE_EMPTY_MNTNS))
return false;
/*
@@@ -3092,9 -3057,11 +3102,9 @@@ void __init proc_caches_init(void
*/
static int check_unshare_flags(unsigned long unshare_flags)
{
- if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
+ if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
- CLONE_NS_ALL))
- CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
- CLONE_NEWTIME | UNSHARE_EMPTY_MNTNS))
++ CLONE_NS_ALL|UNSHARE_EMPTY_MNTNS))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing
diff --cc kernel/nsproxy.c
index 63b44ee79847,1bdc5be2dd20..000000000000
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@@ -211,9 -213,12 +212,10 @@@ int unshare_nsproxy_namespaces(unsigne
struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs)
{
struct user_namespace *user_ns;
+ u64 flags = unshare_flags;
int err = 0;
- if (!(unshare_flags & (CLONE_NS_ALL & ~CLONE_NEWUSER)))
- if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
- CLONE_NEWTIME)))
++ if (!(flags & (CLONE_NS_ALL & ~CLONE_NEWUSER)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
The following changes since commit 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681:
Linux 7.0-rc3 (2026-03-08 16:56:54 -0700)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-7.1-rc1.mount
for you to fetch changes up to ef6960362e5b33402b9977972ab2a9fabc4c8cb9:
selftests/namespaces: remove unused utils.h include from listns_efault_test (2026-03-31 10:58:58 +0200)
----------------------------------------------------------------
vfs-7.1-rc1.mount
Please consider pulling these changes from the signed vfs-7.1-rc1.mount tag.
Thanks!
Christian
----------------------------------------------------------------
Christian Brauner (27):
mount: start iterating from start of rbtree
mount: simplify __do_loopback()
mount: add FSMOUNT_NAMESPACE
tools: update mount.h header
selftests/statmount: add statmount_alloc() helper
selftests: add FSMOUNT_NAMESPACE tests
Merge patch series "fsmount: add FSMOUNT_NAMESPACE"
namespace: allow creating empty mount namespaces
selftests/filesystems: add tests for empty mount namespaces
selftests/filesystems: add clone3 tests for empty mount namespaces
Merge patch series "namespace: allow creating empty mount namespaces"
move_mount: transfer MNT_LOCKED
move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests
Merge patch series "move_mount: expand MOVE_MOUNT_BENEATH"
Merge patch series "fs, audit: Avoid excessive dput/dget in audit_context setup and reset paths"
fs: use path_equal() in fs_struct helpers
fs: document seqlock usage in pwd pool APIs
fs: add drain_fs_pwd_pool() helper
fs: factor out get_fs_pwd_pool_locked() for lock-held callers
Merge branch 'vfs-7.1.fs_struct'
mount: always duplicate mount
selftests/statmount: remove duplicate wait_for_pid()
selftests/empty_mntns: fix statmount_alloc() signature mismatch
selftests/empty_mntns: fix wrong CLONE_EMPTY_MNTNS hex value in comment
selftests/fsmount_ns: add missing TARGETS and fix cap test
selftests/namespaces: remove unused utils.h include from listns_efault_test
Waiman Long (2):
fs: Add a pool of extra fs->pwd references to fs_struct
audit: Use the new {get,put}_fs_pwd_pool() APIs to get/put pwd references
fs/fs_struct.c | 34 +-
fs/namespace.c | 206 ++--
include/linux/fs_struct.h | 43 +-
include/uapi/linux/mount.h | 1 +
include/uapi/linux/sched.h | 7 +
kernel/auditsc.c | 7 +-
kernel/fork.c | 17 +-
kernel/nsproxy.c | 21 +-
tools/include/uapi/linux/mount.h | 14 +-
tools/testing/selftests/Makefile | 3 +
.../selftests/filesystems/empty_mntns/.gitignore | 4 +
.../selftests/filesystems/empty_mntns/Makefile | 12 +
.../empty_mntns/clone3_empty_mntns_test.c | 938 ++++++++++++++++
.../filesystems/empty_mntns/empty_mntns.h | 25 +
.../filesystems/empty_mntns/empty_mntns_test.c | 725 +++++++++++++
.../empty_mntns/overmount_chroot_test.c | 225 ++++
.../selftests/filesystems/fsmount_ns/.gitignore | 1 +
.../selftests/filesystems/fsmount_ns/Makefile | 10 +
.../filesystems/fsmount_ns/fsmount_ns_test.c | 1135 ++++++++++++++++++++
.../selftests/filesystems/move_mount/.gitignore | 2 +
.../selftests/filesystems/move_mount/Makefile | 10 +
.../filesystems/move_mount/move_mount_test.c | 492 +++++++++
.../selftests/filesystems/open_tree_ns/Makefile | 2 +-
.../filesystems/open_tree_ns/open_tree_ns_test.c | 43 +-
.../selftests/filesystems/statmount/statmount.h | 51 +
.../filesystems/statmount/statmount_test.c | 45 +-
.../filesystems/statmount/statmount_test_ns.c | 25 -
tools/testing/selftests/filesystems/utils.c | 4 +-
tools/testing/selftests/filesystems/utils.h | 2 +
.../selftests/namespaces/listns_efault_test.c | 1 -
30 files changed, 3903 insertions(+), 202 deletions(-)
create mode 100644 tools/testing/selftests/filesystems/empty_mntns/.gitignore
create mode 100644 tools/testing/selftests/filesystems/empty_mntns/Makefile
create mode 100644 tools/testing/selftests/filesystems/empty_mntns/clone3_empty_mntns_test.c
create mode 100644 tools/testing/selftests/filesystems/empty_mntns/empty_mntns.h
create mode 100644 tools/testing/selftests/filesystems/empty_mntns/empty_mntns_test.c
create mode 100644 tools/testing/selftests/filesystems/empty_mntns/overmount_chroot_test.c
create mode 100644 tools/testing/selftests/filesystems/fsmount_ns/.gitignore
create mode 100644 tools/testing/selftests/filesystems/fsmount_ns/Makefile
create mode 100644 tools/testing/selftests/filesystems/fsmount_ns/fsmount_ns_test.c
create mode 100644 tools/testing/selftests/filesystems/move_mount/.gitignore
create mode 100644 tools/testing/selftests/filesystems/move_mount/Makefile
create mode 100644 tools/testing/selftests/filesystems/move_mount/move_mount_test.c