[PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH

public inbox for linux-fsdevel@vger.kernel.org

* [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH
@ 2026-02-24  0:40 Christian Brauner
  2026-02-24  0:40 ` [PATCH RFC 1/3] move_mount: transfer MNT_LOCKED Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-24  0:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Jan Kara, Linus Torvalds, Christian Brauner

I'm too tired now to keep refining this but I think it's in good enough
shape for review.

Allow MOVE_MOUNT_BENEATH to target the caller's rootfs so that the
rootfs can be switched out without pivot_root(2).

The traditional approach to switching the rootfs involves pivot_root(2)
or a chroot_fs_refs()-based mechanism that atomically updates fs->root
for all tasks sharing the same fs_struct. This has consequences for
fork(), unshare(CLONE_FS), and setns().

This series instead decomposes root-switching into individually atomic,
locally-scoped steps:

    fd_tree = open_tree(-EBADF, "/newroot",
                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    fchdir(fd_tree);
    move_mount(fd_tree, "", AT_FDCWD, "/",
               MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
    chroot(".");
    umount2(".", MNT_DETACH);

Since each step only modifies the caller's own state, the
fork/unshare/setns races are eliminated by design.
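
For anyone who wants to try the sequence, the steps above can be
sketched with raw syscalls and error handling roughly as follows. This
is only an illustration: switch_root_beneath() is a hypothetical helper
name that is not part of the series, the fallback syscall numbers and
flag values are what I believe the UAPI headers define on most
architectures, and the call can only succeed on a kernel carrying this
series with sufficient privileges over the mount namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed UAPI values). */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif

/* Hypothetical helper: returns 0 on success, -errno on failure. */
static int switch_root_beneath(const char *newroot)
{
	int fd_tree, err;

	assert(newroot != NULL);

	/* Clone the mount at @newroot into a detached tree. */
	fd_tree = syscall(SYS_open_tree, AT_FDCWD, newroot,
			  OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd_tree < 0)
		return -errno;

	/* Enter the clone, then slide it beneath the current root. */
	if (fchdir(fd_tree) ||
	    syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, "/",
		    MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH)) {
		err = -errno;
		close(fd_tree);
		return err;
	}
	close(fd_tree);

	/* Make the clone our root, then drop the old root on top of it. */
	if (chroot("."))
		return -errno;
	if (umount2(".", MNT_DETACH))
		return -errno;

	return 0;
}
```

Note that a failure after fchdir() leaves the caller's cwd changed, so
a real tool would probably want to save and restore it.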

A key step to making this possible is to remove the locked mount
restriction. Currently MOVE_MOUNT_BENEATH doesn't support mounting
beneath a mount that is locked. The locked mount protects the underlying
mount from being revealed. This is a core mechanism of
unshare(CLONE_NEWUSER | CLONE_NEWNS). The mounts in the new mount
namespace become locked. That effectively makes the new mount table
useless as the caller cannot ever get rid of any of the mounts no matter
how useless they are.

We can lift this restriction though. We simply transfer the locked
property from the top mount to the mount beneath. This works because
what we care about is to protect the underlying mount aka the parent.
The mount mounted between the parent and the top mount takes over the
top mount's job of protecting the parent mount. This leaves us free to
remove the locked property from the top mount which can consequently be
unmounted. For example, if we do:

  unshare(CLONE_NEWUSER | CLONE_NEWNS)

and we inherit a clone of procfs on /proc then currently we cannot
unmount it as:

  umount -l /proc

will fail with EINVAL because the procfs mount is locked.

After this series we can now do:

  mount --beneath -t tmpfs tmpfs /proc
  umount -l /proc

The first command places a tmpfs mount beneath the procfs mount. The
tmpfs mount becomes locked and the procfs mount becomes unlocked, so the
second command can detach it.

This means you can safely modify an inherited mount table after
unprivileged namespace creation.
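
The lock transfer can be illustrated with a small userspace toy model.
This is deliberately not kernel code: struct toy_mount and both
functions are made up for illustration, and the real logic lives in
attach_recursive_mnt() and operates on struct mount under the
appropriate locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mount; not the kernel's struct mount. */
struct toy_mount {
	struct toy_mount *parent;	/* mount this one sits on */
	bool locked;			/* models MNT_LOCKED */
};

/*
 * Insert @new_mnt between @top and @top's parent. If @top was locked,
 * @new_mnt takes over the job of hiding the parent and @top is unlocked.
 */
static void toy_mount_beneath(struct toy_mount *new_mnt, struct toy_mount *top)
{
	assert(new_mnt && top && top->parent);

	new_mnt->parent = top->parent;
	top->parent = new_mnt;

	if (top->locked) {
		new_mnt->locked = true;
		top->locked = false;
	}
}

/* Models the umount check: a locked mount cannot be detached. */
static bool toy_can_umount(const struct toy_mount *m)
{
	return !m->locked;
}
```

With a locked procfs on top of some parent, calling
toy_mount_beneath(&tmpfs, &procfs) leaves procfs unmountable-by-choice
(unlocked) and tmpfs locked, so the parent stays hidden either way.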

Afterwards we simply make it possible to move a mount beneath the
rootfs, which allows the rootfs to be upgraded.

Removing the locked restriction makes this very useful for containers
created with unshare(CLONE_NEWUSER | CLONE_NEWNS) to reshuffle an
inherited mount table safely and MOVE_MOUNT_BENEATH makes it possible to
switch out the rootfs instead of using the costly pivot_root(2).

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (3):
      move_mount: transfer MNT_LOCKED
      move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
      selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests

 fs/namespace.c                                     |  37 +-
 tools/testing/selftests/Makefile                   |   1 +
 .../selftests/filesystems/move_mount/.gitignore    |   2 +
 .../selftests/filesystems/move_mount/Makefile      |  10 +
 .../filesystems/move_mount/move_mount_test.c       | 492 +++++++++++++++++++++
 5 files changed, 523 insertions(+), 19 deletions(-)
---
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
change-id: 20260221-work-mount-beneath-rootfs-9164f67b7128


* [PATCH RFC 1/3] move_mount: transfer MNT_LOCKED
  2026-02-24  0:40 [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH Christian Brauner
@ 2026-02-24  0:40 ` Christian Brauner
  2026-02-24  0:40 ` [PATCH RFC 2/3] move_mount: allow MOVE_MOUNT_BENEATH on the rootfs Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-24  0:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Jan Kara, Linus Torvalds, Christian Brauner

When performing a mount-beneath operation the target mount can often be
locked:

  unshare(CLONE_NEWUSER | CLONE_NEWNS);
  mount --beneath -t tmpfs tmpfs /proc

will fail because the procfs mount on /proc became locked when the mount
namespace was created from the parent mount namespace. Same logic for:

  unshare(CLONE_NEWUSER | CLONE_NEWNS);
  mount --beneath -t tmpfs tmpfs /

MNT_LOCKED is raised to prevent an unprivileged mount namespace from
revealing whatever is under a given mount. To replace the rootfs we need
to handle that case though.

We can simply transfer the locked mount property from the top mount to
the mount beneath. The new mount we mounted beneath the top mount takes
over the job of the top mount in protecting the parent mount from being
revealed. This leaves us free to allow the top mount to be unmounted.
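
As a rough sketch of what this enables from userspace, the following
is approximately what "mount --beneath -t tmpfs tmpfs <target>"
followed by "umount -l <target>" boils down to with the new mount API.
The helper name is hypothetical, the fallback syscall numbers and flag
values are assumptions taken from the UAPI headers as I understand
them, and the sequence is only expected to succeed on a locked target
with this patch applied.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed UAPI values). */
#ifndef FSOPEN_CLOEXEC
#define FSOPEN_CLOEXEC 0x00000001
#endif
#ifndef FSMOUNT_CLOEXEC
#define FSMOUNT_CLOEXEC 0x00000001
#endif
#ifndef FSCONFIG_CMD_CREATE
#define FSCONFIG_CMD_CREATE 6
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif

/* Hypothetical helper: returns 0 on success, -errno on failure. */
static int tmpfs_beneath_and_detach_top(const char *target)
{
	int fs_fd, mnt_fd, err;

	assert(target != NULL);

	/* Create a detached tmpfs mount. */
	fs_fd = syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
	if (fs_fd < 0)
		return -errno;
	if (syscall(SYS_fsconfig, fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)) {
		err = -errno;
		close(fs_fd);
		return err;
	}
	mnt_fd = syscall(SYS_fsmount, fs_fd, FSMOUNT_CLOEXEC, 0);
	err = mnt_fd < 0 ? -errno : 0;
	close(fs_fd);
	if (err)
		return err;

	/* Slide it beneath the top mount; MNT_LOCKED moves down with it. */
	err = 0;
	if (syscall(SYS_move_mount, mnt_fd, "", AT_FDCWD, target,
		    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH))
		err = -errno;
	close(mnt_fd);
	if (err)
		return err;

	/* The old top mount is now unlocked and can be detached. */
	if (umount2(target, MNT_DETACH))
		return -errno;

	return 0;
}
```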

This also works during mount propagation and for the
non-MOVE_MOUNT_BENEATH case:

(1) move_mount(MOVE_MOUNT_BENEATH): @source_mnt->overmount always NULL
(2) move_mount():                   @source_mnt->overmount maybe !NULL

For (1) can_move_mount_beneath() rejects overmounted @source_mnt. (We
could allow this but whatever, it's not really a use-case and it's fugly
to move an overmounted mount stack around. What are you even doing? So
let's keep that restriction.)

For (2) we can have @source_mnt overmounted (someone overmounted us
while we locked the target mount). Both are fine. @source_mnt will be
mounted on whatever @q was mounted on and @q will be mounted on the top
of the @source_mnt mount stack. Even in such cases we can unlock @q and
lock @source_mnt if @q was locked.

This effectively makes mount propagation useful in cases where a mount
namespace has a locked mount somewhere and we propagate a new mount
beneath it but the mount namespace could never get at it because the old
top mount remains locked. Again, we just let the newly propagated mount
take over the protection and unlock the top mount.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ebe19ded293a..cdde6c6a30ee 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2636,6 +2636,19 @@ static int attach_recursive_mnt(struct mount *source_mnt,

 			if (unlikely(shorter) && child != source_mnt)
 				mp = shorter;
+			/*
+			 * If @q was locked it was meant to hide
+			 * whatever was under it. Let @child take over
+			 * that job and lock it, then we can unlock @q.
+			 * That'll allow another namespace to shed @q
+			 * and reveal @child. Clearly, that mounter
+			 * consented to this by not severing the mount
+			 * relationship. Otherwise, what's the point.
+			 */
+			if (IS_MNT_LOCKED(q)) {
+				child->mnt.mnt_flags |= MNT_LOCKED;
+				q->mnt.mnt_flags &= ~MNT_LOCKED;
+			}
 			mnt_change_mountpoint(r, mp, q);
 		}
 	}
@@ -3529,9 +3542,6 @@ static int can_move_mount_beneath(const struct mount *mnt_from,
 {
 	struct mount *parent_mnt_to = mnt_to->mnt_parent;

-	if (IS_MNT_LOCKED(mnt_to))
-		return -EINVAL;
-
 	/* Avoid creating shadow mounts during mount propagation. */
 	if (mnt_from->overmount)
 		return -EINVAL;

-- 
2.47.3


* [PATCH RFC 2/3] move_mount: allow MOVE_MOUNT_BENEATH on the rootfs
  2026-02-24  0:40 [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH Christian Brauner
  2026-02-24  0:40 ` [PATCH RFC 1/3] move_mount: transfer MNT_LOCKED Christian Brauner
@ 2026-02-24  0:40 ` Christian Brauner
  2026-02-24  0:40 ` [PATCH RFC 3/3] selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests Christian Brauner
  2026-02-24  1:40 ` [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH Askar Safin
  3 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-24  0:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Jan Kara, Linus Torvalds, Christian Brauner

Allow MOVE_MOUNT_BENEATH to target the caller's rootfs. When the target
of a mount-beneath operation is the caller's root mount, verify that:

(1) The caller is located at the root of the mount, as enforced by
    path_mounted() in do_lock_mount().
(2) Propagation from the parent mount would not overmount the target,
    to avoid propagating beneath the rootfs of other mount namespaces.
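
Userspace can approximate check (1) before attempting the operation.
The sketch below uses statx() and STATX_ATTR_MOUNT_ROOT, which as far
as I know reports whether a path resolves to the root of a mount. The
helper name is hypothetical and this is only a pre-check; the kernel
remains the authority and the result can race with concurrent mount
changes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

/* Fallback for older userspace headers (assumed UAPI value). */
#ifndef STATX_ATTR_MOUNT_ROOT
#define STATX_ATTR_MOUNT_ROOT 0x00002000
#endif

/*
 * Hypothetical helper: returns 1 if @path is the root of a mount,
 * 0 if it is not, and -errno on failure (-EOPNOTSUPP if the kernel
 * does not report the attribute).
 */
static int path_is_mount_root(const char *path)
{
	struct statx sx;

	assert(path != NULL);

	if (statx(AT_FDCWD, path, AT_NO_AUTOMOUNT, 0, &sx))
		return -errno;

	/* Only trust the bit if the kernel says it supports it. */
	if (!(sx.stx_attributes_mask & STATX_ATTR_MOUNT_ROOT))
		return -EOPNOTSUPP;

	return !!(sx.stx_attributes & STATX_ATTR_MOUNT_ROOT);
}
```

A caller that has chroot'ed into a plain subdirectory would be expected
to see 0 for "/" here, which corresponds to the case the kernel rejects
with EINVAL.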

The root-switching is decomposed into individually atomic, locally-scoped
steps: mount-beneath inserts the new root under the old one, chroot(".")
switches the caller's root, and umount2(".", MNT_DETACH) removes the old
root. Since each step only modifies the caller's own state, this avoids
cross-namespace vulnerabilities and inherent fork/unshare/setns races
that a chroot_fs_refs()-based approach would have.

Userspace can use the following workflow to switch roots:

    fd_tree = open_tree(-EBADF, "/newroot",
                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    fchdir(fd_tree);
    move_mount(fd_tree, "", AT_FDCWD, "/",
               MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH);
    chroot(".");
    umount2(".", MNT_DETACH);

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index cdde6c6a30ee..fe9136062614 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2725,7 +2725,7 @@ static inline struct mount *where_to_mount(const struct path *path,
  * In all cases the location must not have been unmounted and the
  * chosen mountpoint must be allowed to be mounted on.  For "beneath"
  * case we also require the location to be at the root of a mount
- * that has a parent (i.e. is not a root of some namespace).
+ * that has something mounted on top of it (i.e. has an overmount).
  */
 static void do_lock_mount(const struct path *path,
 			  struct pinned_mountpoint *res,
@@ -3523,8 +3523,6 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
  * @mnt_to:   mount under which to mount
  * @mp:   mountpoint of @mnt_to
  *
- * - Make sure that nothing can be mounted beneath the caller's current
- *   root or the rootfs of the namespace.
  * - Make sure that the caller can unmount the topmost mount ensuring
  *   that the caller could reveal the underlying mountpoint.
  * - Ensure that nothing has been mounted on top of @mnt_from before we
@@ -3538,7 +3536,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
  */
 static int can_move_mount_beneath(const struct mount *mnt_from,
 				  const struct mount *mnt_to,
-				  const struct mountpoint *mp)
+				  struct pinned_mountpoint *mp)
 {
 	struct mount *parent_mnt_to = mnt_to->mnt_parent;
 
@@ -3546,15 +3544,6 @@ static int can_move_mount_beneath(const struct mount *mnt_from,
 	if (mnt_from->overmount)
 		return -EINVAL;
 
-	/*
-	 * Mounting beneath the rootfs only makes sense when the
-	 * semantics of pivot_root(".", ".") are used.
-	 */
-	if (&mnt_to->mnt == current->fs->root.mnt)
-		return -EINVAL;
-	if (parent_mnt_to == current->nsproxy->mnt_ns->root)
-		return -EINVAL;
-
 	if (mount_is_ancestor(mnt_to, mnt_from))
 		return -EINVAL;
 
@@ -3564,7 +3553,7 @@ static int can_move_mount_beneath(const struct mount *mnt_from,
 	 * propagating a copy @c of @mnt_from on top of @mnt_to. This
 	 * defeats the whole purpose of mounting beneath another mount.
 	 */
-	if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
+	if (propagation_would_overmount(parent_mnt_to, mnt_to, mp->mp))
 		return -EINVAL;
 
 	/*
@@ -3580,7 +3569,7 @@ static int can_move_mount_beneath(const struct mount *mnt_from,
 	 * @mnt_from beneath @mnt_to.
 	 */
 	if (check_mnt(mnt_from) &&
-	    propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+	    propagation_would_overmount(parent_mnt_to, mnt_from, mp->mp))
 		return -EINVAL;
 
 	return 0;
@@ -3689,7 +3678,7 @@ static int do_move_mount(const struct path *old_path,
 
 		if (mp.parent != over->mnt_parent)
 			over = mp.parent->overmount;
-		err = can_move_mount_beneath(old, over, mp.mp);
+		err = can_move_mount_beneath(old, over, &mp);
 		if (err)
 			return err;
 	}

-- 
2.47.3



* [PATCH RFC 3/3] selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests
  2026-02-24  0:40 [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH Christian Brauner
  2026-02-24  0:40 ` [PATCH RFC 1/3] move_mount: transfer MNT_LOCKED Christian Brauner
  2026-02-24  0:40 ` [PATCH RFC 2/3] move_mount: allow MOVE_MOUNT_BENEATH on the rootfs Christian Brauner
@ 2026-02-24  0:40 ` Christian Brauner
  2026-02-24  1:40 ` [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH Askar Safin
  3 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-24  0:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Alexander Viro, Jan Kara, Linus Torvalds, Christian Brauner

Add tests for mounting beneath the rootfs using MOVE_MOUNT_BENEATH:

- beneath_rootfs_success: mount beneath /, fchdir, chroot, umount2
  MNT_DETACH -- verify root changed
- beneath_rootfs_old_root_stacked: after mount-beneath, verify via
  statmount that the old root's parent is the clone
- beneath_rootfs_in_chroot_fail: chroot into subdir of same mount,
  mount-beneath fails (dentry != mnt_root)
- beneath_rootfs_in_chroot_success: chroot into separate tmpfs mount,
  mount-beneath succeeds
- beneath_rootfs_locked_transfer: in user+mount ns: mount-beneath
  rootfs succeeds, MNT_LOCKED transfers, old root unmountable
- beneath_rootfs_locked_containment: in user+mount ns: after full
  root-switch workflow, new root is MNT_LOCKED (containment preserved)
- beneath_non_rootfs_locked_transfer: mounts created before
  unshare(CLONE_NEWUSER | CLONE_NEWNS) become locked; mount-beneath
  transfers MNT_LOCKED, displaced mount can be unmounted
- beneath_non_rootfs_locked_containment: same setup, verify new mount
  is MNT_LOCKED (containment preserved)

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/Makefile                   |   1 +
 .../selftests/filesystems/move_mount/.gitignore    |   2 +
 .../selftests/filesystems/move_mount/Makefile      |  10 +
 .../filesystems/move_mount/move_mount_test.c       | 492 +++++++++++++++++++++
 4 files changed, 505 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 450f13ba4cca..2d05b3e1a26e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -38,6 +38,7 @@ TARGETS += filesystems/overlayfs
 TARGETS += filesystems/statmount
 TARGETS += filesystems/mount-notify
 TARGETS += filesystems/fuse
+TARGETS += filesystems/move_mount
 TARGETS += firmware
 TARGETS += fpu
 TARGETS += ftrace
diff --git a/tools/testing/selftests/filesystems/move_mount/.gitignore b/tools/testing/selftests/filesystems/move_mount/.gitignore
new file mode 100644
index 000000000000..c7557db30671
--- /dev/null
+++ b/tools/testing/selftests/filesystems/move_mount/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+move_mount_test
diff --git a/tools/testing/selftests/filesystems/move_mount/Makefile b/tools/testing/selftests/filesystems/move_mount/Makefile
new file mode 100644
index 000000000000..5c5b199b464b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/move_mount/Makefile
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
+LDLIBS += -lcap
+
+TEST_GEN_PROGS := move_mount_test
+
+include ../../lib.mk
+
+$(OUTPUT)/move_mount_test: ../utils.c
diff --git a/tools/testing/selftests/filesystems/move_mount/move_mount_test.c b/tools/testing/selftests/filesystems/move_mount/move_mount_test.c
new file mode 100644
index 000000000000..f08f94b1f0ec
--- /dev/null
+++ b/tools/testing/selftests/filesystems/move_mount/move_mount_test.c
@@ -0,0 +1,492 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+// Copyright (c) 2026 Christian Brauner <brauner@kernel.org>
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+#include "../wrappers.h"
+#include "../utils.h"
+#include "../statmount/statmount.h"
+#include "../../kselftest_harness.h"
+
+#include <linux/stat.h>
+
+#ifndef MOVE_MOUNT_BENEATH
+#define MOVE_MOUNT_BENEATH 0x00000200
+#endif
+
+static uint64_t get_unique_mnt_id_fd(int fd)
+{
+	struct statx sx;
+	int ret;
+
+	ret = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &sx);
+	if (ret)
+		return 0;
+
+	if (!(sx.stx_mask & STATX_MNT_ID_UNIQUE))
+		return 0;
+
+	return sx.stx_mnt_id;
+}
+
+/*
+ * Create a locked overmount stack at /mnt_dir for testing MNT_LOCKED
+ * transfer on non-rootfs mounts.
+ *
+ * Mounts tmpfs A at /mnt_dir, overmounts with tmpfs B, then enters a
+ * new user+mount namespace where both become locked. Returns the exit
+ * code to use on failure, or 0 on success.
+ */
+static int setup_locked_overmount(void)
+{
+	/* Isolate so mounts don't leak. */
+	if (unshare(CLONE_NEWNS))
+		return 1;
+	if (mount("", "/", NULL, MS_REC | MS_PRIVATE, NULL))
+		return 2;
+
+	/*
+	 * Create mounts while still in the initial user namespace so
+	 * they become locked after the subsequent user namespace
+	 * unshare.
+	 */
+	rmdir("/mnt_dir");
+	if (mkdir("/mnt_dir", 0755))
+		return 3;
+
+	/* Mount tmpfs A */
+	if (mount("tmpfs", "/mnt_dir", "tmpfs", 0, NULL))
+		return 4;
+
+	/* Overmount with tmpfs B */
+	if (mount("tmpfs", "/mnt_dir", "tmpfs", 0, NULL))
+		return 5;
+
+	/*
+	 * Create user+mount namespace. Mounts A and B become locked
+	 * because they might be covering something that is not supposed
+	 * to be revealed.
+	 */
+	if (setup_userns())
+		return 6;
+
+	/* Sanity check: B must be locked */
+	if (!umount2("/mnt_dir", MNT_DETACH) || errno != EINVAL)
+		return 7;
+
+	return 0;
+}
+
+/*
+ * Create a detached tmpfs mount and return its fd, or -1 on failure.
+ */
+static int create_detached_tmpfs(void)
+{
+	int fs_fd, mnt_fd;
+
+	fs_fd = sys_fsopen("tmpfs", FSOPEN_CLOEXEC);
+	if (fs_fd < 0)
+		return -1;
+
+	if (sys_fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)) {
+		close(fs_fd);
+		return -1;
+	}
+
+	mnt_fd = sys_fsmount(fs_fd, FSMOUNT_CLOEXEC, 0);
+	close(fs_fd);
+	return mnt_fd;
+}
+
+FIXTURE(move_mount) {
+	uint64_t orig_root_id;
+};
+
+FIXTURE_SETUP(move_mount)
+{
+	ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+
+	ASSERT_EQ(mount("", "/", NULL, MS_REC | MS_PRIVATE, NULL), 0);
+
+	self->orig_root_id = get_unique_mnt_id("/");
+	ASSERT_NE(self->orig_root_id, 0);
+}
+
+FIXTURE_TEARDOWN(move_mount)
+{
+}
+
+/*
+ * Test successful MOVE_MOUNT_BENEATH on the rootfs.
+ * Mount a clone beneath /, fchdir to the clone, chroot to switch root,
+ * then detach the old root.
+ */
+TEST_F(move_mount, beneath_rootfs_success)
+{
+	int fd_tree, ret;
+	uint64_t clone_id, root_id;
+
+	fd_tree = sys_open_tree(AT_FDCWD, "/",
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(fd_tree, 0);
+
+	clone_id = get_unique_mnt_id_fd(fd_tree);
+	ASSERT_NE(clone_id, 0);
+	ASSERT_NE(clone_id, self->orig_root_id);
+
+	ASSERT_EQ(fchdir(fd_tree), 0);
+
+	ret = sys_move_mount(fd_tree, "", AT_FDCWD, "/",
+			     MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(fd_tree);
+
+	/* Switch root to the clone */
+	ASSERT_EQ(chroot("."), 0);
+
+	/* Verify "/" is now the clone */
+	root_id = get_unique_mnt_id("/");
+	ASSERT_NE(root_id, 0);
+	ASSERT_EQ(root_id, clone_id);
+
+	/* Detach old root */
+	ASSERT_EQ(umount2(".", MNT_DETACH), 0);
+}
+
+/*
+ * Test that after MOVE_MOUNT_BENEATH on the rootfs the old root is
+ * stacked on top of the clone. Verify via statmount that the old
+ * root's parent is the clone.
+ */
+TEST_F(move_mount, beneath_rootfs_old_root_stacked)
+{
+	int fd_tree, ret;
+	uint64_t clone_id;
+	struct statmount sm;
+
+	fd_tree = sys_open_tree(AT_FDCWD, "/",
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(fd_tree, 0);
+
+	clone_id = get_unique_mnt_id_fd(fd_tree);
+	ASSERT_NE(clone_id, 0);
+	ASSERT_NE(clone_id, self->orig_root_id);
+
+	ASSERT_EQ(fchdir(fd_tree), 0);
+
+	ret = sys_move_mount(fd_tree, "", AT_FDCWD, "/",
+			     MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(fd_tree);
+
+	ASSERT_EQ(chroot("."), 0);
+
+	/* Old root's parent should now be the clone */
+	ASSERT_EQ(statmount(self->orig_root_id, 0, 0,
+			     STATMOUNT_MNT_BASIC, &sm, sizeof(sm), 0), 0);
+	ASSERT_EQ(sm.mnt_parent_id, clone_id);
+
+	ASSERT_EQ(umount2(".", MNT_DETACH), 0);
+}
+
+/*
+ * Test that MOVE_MOUNT_BENEATH on rootfs fails when chroot'd into a
+ * subdirectory of the same mount. The caller's fs->root.dentry doesn't
+ * match mnt->mnt_root so the kernel rejects it.
+ */
+TEST_F(move_mount, beneath_rootfs_in_chroot_fail)
+{
+	int fd_tree, ret;
+	uint64_t chroot_id, clone_id;
+
+	rmdir("/chroot_dir");
+	ASSERT_EQ(mkdir("/chroot_dir", 0755), 0);
+
+	chroot_id = get_unique_mnt_id("/chroot_dir");
+	ASSERT_NE(chroot_id, 0);
+	ASSERT_EQ(self->orig_root_id, chroot_id);
+
+	ASSERT_EQ(chdir("/chroot_dir"), 0);
+	ASSERT_EQ(chroot("."), 0);
+
+	fd_tree = sys_open_tree(AT_FDCWD, "/",
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(fd_tree, 0);
+
+	clone_id = get_unique_mnt_id_fd(fd_tree);
+	ASSERT_NE(clone_id, 0);
+	ASSERT_NE(clone_id, chroot_id);
+
+	ASSERT_EQ(fchdir(fd_tree), 0);
+
+	/*
+	 * Should fail: fs->root.dentry (/chroot_dir) doesn't match
+	 * the mount's mnt_root (/).
+	 */
+	ret = sys_move_mount(fd_tree, "", AT_FDCWD, "/",
+			     MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EINVAL);
+
+	close(fd_tree);
+}
+
+/*
+ * Test that MOVE_MOUNT_BENEATH on rootfs succeeds when chroot'd into a
+ * separate tmpfs mount. The caller's root dentry matches the mount's
+ * mnt_root since it's a dedicated mount.
+ */
+TEST_F(move_mount, beneath_rootfs_in_chroot_success)
+{
+	int fd_tree, ret;
+	uint64_t chroot_id, clone_id, root_id;
+	struct statmount sm;
+
+	rmdir("/chroot_dir");
+	ASSERT_EQ(mkdir("/chroot_dir", 0755), 0);
+	ASSERT_EQ(mount("tmpfs", "/chroot_dir", "tmpfs", 0, NULL), 0);
+
+	chroot_id = get_unique_mnt_id("/chroot_dir");
+	ASSERT_NE(chroot_id, 0);
+
+	ASSERT_EQ(chdir("/chroot_dir"), 0);
+	ASSERT_EQ(chroot("."), 0);
+
+	ASSERT_EQ(get_unique_mnt_id("/"), chroot_id);
+
+	fd_tree = sys_open_tree(AT_FDCWD, "/",
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(fd_tree, 0);
+
+	clone_id = get_unique_mnt_id_fd(fd_tree);
+	ASSERT_NE(clone_id, 0);
+	ASSERT_NE(clone_id, chroot_id);
+
+	ASSERT_EQ(fchdir(fd_tree), 0);
+
+	ret = sys_move_mount(fd_tree, "", AT_FDCWD, "/",
+			     MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(fd_tree);
+
+	ASSERT_EQ(chroot("."), 0);
+
+	root_id = get_unique_mnt_id("/");
+	ASSERT_NE(root_id, 0);
+	ASSERT_EQ(root_id, clone_id);
+
+	ASSERT_EQ(statmount(chroot_id, 0, 0,
+			     STATMOUNT_MNT_BASIC, &sm, sizeof(sm), 0), 0);
+	ASSERT_EQ(sm.mnt_parent_id, clone_id);
+
+	ASSERT_EQ(umount2(".", MNT_DETACH), 0);
+}
+
+/*
+ * Test MNT_LOCKED transfer when mounting beneath rootfs in a user+mount
+ * namespace. After mount-beneath the new root gets MNT_LOCKED and the
+ * old root has MNT_LOCKED cleared so it can be unmounted.
+ */
+TEST_F(move_mount, beneath_rootfs_locked_transfer)
+{
+	int fd_tree, ret;
+	uint64_t clone_id, root_id;
+
+	ASSERT_EQ(setup_userns(), 0);
+
+	ASSERT_EQ(mount("", "/", NULL, MS_REC | MS_PRIVATE, NULL), 0);
+
+	fd_tree = sys_open_tree(AT_FDCWD, "/",
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
+				AT_RECURSIVE);
+	ASSERT_GE(fd_tree, 0);
+
+	clone_id = get_unique_mnt_id_fd(fd_tree);
+	ASSERT_NE(clone_id, 0);
+
+	ASSERT_EQ(fchdir(fd_tree), 0);
+
+	ret = sys_move_mount(fd_tree, "", AT_FDCWD, "/",
+			     MOVE_MOUNT_F_EMPTY_PATH |
+			     MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(fd_tree);
+
+	ASSERT_EQ(chroot("."), 0);
+
+	root_id = get_unique_mnt_id("/");
+	ASSERT_EQ(root_id, clone_id);
+
+	/*
+	 * The old root should be unmountable (MNT_LOCKED was
+	 * transferred to the clone). If MNT_LOCKED wasn't
+	 * cleared, this would fail with EINVAL.
+	 */
+	ASSERT_EQ(umount2(".", MNT_DETACH), 0);
+
+	/* Verify "/" is still the clone after detaching old root */
+	root_id = get_unique_mnt_id("/");
+	ASSERT_EQ(root_id, clone_id);
+}
+
+/*
+ * Test containment invariant: after mount-beneath rootfs in a user+mount
+ * namespace, the new root must be MNT_LOCKED. The lock transfer from the
+ * old root preserves containment -- the process cannot unmount the new root
+ * to escape the namespace.
+ */
+TEST_F(move_mount, beneath_rootfs_locked_containment)
+{
+	int fd_tree, ret;
+	uint64_t clone_id, root_id;
+
+	ASSERT_EQ(setup_userns(), 0);
+
+	ASSERT_EQ(mount("", "/", NULL, MS_REC | MS_PRIVATE, NULL), 0);
+
+	/* Sanity: rootfs must be locked in the new userns */
+	ASSERT_EQ(umount2("/", MNT_DETACH), -1);
+	ASSERT_EQ(errno, EINVAL);
+
+	fd_tree = sys_open_tree(AT_FDCWD, "/",
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
+				AT_RECURSIVE);
+	ASSERT_GE(fd_tree, 0);
+
+	clone_id = get_unique_mnt_id_fd(fd_tree);
+	ASSERT_NE(clone_id, 0);
+
+	ASSERT_EQ(fchdir(fd_tree), 0);
+
+	ret = sys_move_mount(fd_tree, "", AT_FDCWD, "/",
+			     MOVE_MOUNT_F_EMPTY_PATH |
+			     MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(fd_tree);
+
+	ASSERT_EQ(chroot("."), 0);
+
+	root_id = get_unique_mnt_id("/");
+	ASSERT_EQ(root_id, clone_id);
+
+	/* Detach old root (MNT_LOCKED was cleared from it) */
+	ASSERT_EQ(umount2(".", MNT_DETACH), 0);
+
+	/* Verify "/" is still the clone after detaching old root */
+	root_id = get_unique_mnt_id("/");
+	ASSERT_EQ(root_id, clone_id);
+
+	/*
+	 * The new root must be locked (MNT_LOCKED was transferred
+	 * from the old root). Attempting to unmount it must fail
+	 * with EINVAL, preserving the containment invariant.
+	 */
+	ASSERT_EQ(umount2("/", MNT_DETACH), -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+/*
+ * Test MNT_LOCKED transfer when mounting beneath a non-rootfs locked mount.
+ * Mounts created before unshare(CLONE_NEWUSER | CLONE_NEWNS) become locked
+ * in the new namespace. Mount-beneath transfers the lock from the displaced
+ * mount to the new mount, so the displaced mount can be unmounted.
+ */
+TEST_F(move_mount, beneath_non_rootfs_locked_transfer)
+{
+	int mnt_fd, ret;
+	uint64_t mnt_new_id, mnt_visible_id;
+
+	ASSERT_EQ(setup_locked_overmount(), 0);
+
+	mnt_fd = create_detached_tmpfs();
+	ASSERT_GE(mnt_fd, 0);
+
+	mnt_new_id = get_unique_mnt_id_fd(mnt_fd);
+	ASSERT_NE(mnt_new_id, 0);
+
+	/* Move mount beneath B (which is locked) */
+	ret = sys_move_mount(mnt_fd, "", AT_FDCWD, "/mnt_dir",
+			     MOVE_MOUNT_F_EMPTY_PATH |
+			     MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(mnt_fd);
+
+	/*
+	 * B should now be unmountable (MNT_LOCKED was transferred
+	 * to the new mount beneath it). If MNT_LOCKED wasn't
+	 * cleared from B, this would fail with EINVAL.
+	 */
+	ASSERT_EQ(umount2("/mnt_dir", MNT_DETACH), 0);
+
+	/* Verify the new mount is now visible */
+	mnt_visible_id = get_unique_mnt_id("/mnt_dir");
+	ASSERT_EQ(mnt_visible_id, mnt_new_id);
+}
+
+/*
+ * Test MNT_LOCKED containment when mounting beneath a non-rootfs mount
+ * that was locked during unshare(CLONE_NEWUSER | CLONE_NEWNS).
+ * Mounts created before unshare become locked in the new namespace.
+ * Mount-beneath transfers the lock, preserving containment: the new
+ * mount cannot be unmounted, but the displaced mount can.
+ */
+TEST_F(move_mount, beneath_non_rootfs_locked_containment)
+{
+	int mnt_fd, ret;
+	uint64_t mnt_new_id, mnt_visible_id;
+
+	ASSERT_EQ(setup_locked_overmount(), 0);
+
+	mnt_fd = create_detached_tmpfs();
+	ASSERT_GE(mnt_fd, 0);
+
+	mnt_new_id = get_unique_mnt_id_fd(mnt_fd);
+	ASSERT_NE(mnt_new_id, 0);
+
+	/*
+	 * Move new tmpfs beneath B at /mnt_dir.
+	 * Stack becomes: A -> new -> B
+	 * Lock transfers from B to new.
+	 */
+	ret = sys_move_mount(mnt_fd, "", AT_FDCWD, "/mnt_dir",
+			     MOVE_MOUNT_F_EMPTY_PATH |
+			     MOVE_MOUNT_BENEATH);
+	ASSERT_EQ(ret, 0);
+
+	close(mnt_fd);
+
+	/*
+	 * B lost MNT_LOCKED -- unmounting it must succeed.
+	 * This reveals the new mount at /mnt_dir.
+	 */
+	ASSERT_EQ(umount2("/mnt_dir", MNT_DETACH), 0);
+
+	/* Verify the new mount is now visible */
+	mnt_visible_id = get_unique_mnt_id("/mnt_dir");
+	ASSERT_EQ(mnt_visible_id, mnt_new_id);
+
+	/*
+	 * The new mount gained MNT_LOCKED -- unmounting it must
+	 * fail with EINVAL, preserving the containment invariant.
+	 */
+	ASSERT_EQ(umount2("/mnt_dir", MNT_DETACH), -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3



* Re: [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH
  2026-02-24  0:40 [PATCH RFC 0/3] move_mount: expand MOVE_MOUNT_BENEATH Christian Brauner
                   ` (2 preceding siblings ...)
  2026-02-24  0:40 ` [PATCH RFC 3/3] selftests/filesystems: add MOVE_MOUNT_BENEATH rootfs tests Christian Brauner
@ 2026-02-24  1:40 ` Askar Safin
  3 siblings, 0 replies; 5+ messages in thread
From: Askar Safin @ 2026-02-24  1:40 UTC (permalink / raw)
  To: brauner; +Cc: jack, linux-fsdevel, torvalds, viro

Christian Brauner <brauner@kernel.org>:
> The traditional approach to switching the rootfs involves pivot_root(2)
> or a chroot_fs_refs()-based mechanism that atomically updates fs->root
> for all tasks sharing the same fs_struct.

I think you meant "sharing the same cwd and root" here. The problem
with pivot_root is that it changes refs not only for tasks sharing
the same fs_struct (i.e. cloned with CLONE_FS), but also for all tasks
sharing the same cwd and root.

(I'm just doing some proofreading. I'm trying to help. I hope you are
not offended.)

-- 
Askar Safin

