[PATCH RFC 0/5] fs: allow to tuck mounts explicitly

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/5] fs: allow to tuck mounts explicitly
@ 2023-03-18 15:51 Christian Brauner
  2023-03-18 15:51 ` [PATCH RFC 1/5] fs: add path_mounted() Christian Brauner
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Christian Brauner @ 2023-03-18 15:51 UTC (permalink / raw)
  To: Al Viro, Seth Forshee; +Cc: linux-fsdevel, Christian Brauner (Microsoft)

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

       TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /       /       ext4    shared:1     29      1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

       TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
       /       /       ext4    shared:24 master:1  71      47

For people not too familar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

       TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

       TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the sytems is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The existing mechanism of tucked mounts can be exposed as a new option
when attaching a mount giving userspace the ability to seamlessly
upgrade a mount. Now, instead of only being created as a consequence of
mount propagation they can be created explicitly. Today, tucked mounts
are created in two scenarios:

(1) When a service or container is started in a new mount namespace and
    pivot_root()s into its new rootfs. The way this is done is by
    tucking the new rootfs under the old rootfs:

            fd_newroot = open("/var/lib/machines/fedora", ...);
            fd_oldroot = open("/", ...);
            fchdir(fd_newroot);
            pivot_root(".", ".");

    After the pivot_root(".", ".") call the new rootfs is tucked under
    the old rootfs which can then be unmounted to reveal the underlying
    mount:

            fchdir(fd_oldroot);
            umount2(".", MNT_DETACH);

    Since pivot_root() moves the caller into a new rootfs no mounts must
    be propagated out of the new rootfs as a consequence of the
    pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
    mounted on the same dentry.

    The easiest example for this is to create a new mount namespace. The
    following commands will create a mount namespace where the rootfs
    mount / will be a slave to the peer group of the host rootfs /
    mount's peer group. IOW, it will receive propagation from the host:

            mount --make-shared /
            unshare --mount --propagation=slave

    Now a new mount on the /mnt dentry in that mount namespace is
    created. (As it can be confusing it should be spelled out that the
    tmpfs mount on the /mnt dentry that was just created doesn't
    propagate back to the host because the rootfs mount / of the mount
    namespace isn't a peer of the host rootfs.):

            mount -t tmpfs tmpfs /mnt

            TARGET  SOURCE  FSTYPE  PROPAGATION
            └─/mnt  tmpfs   tmpfs

    Now another terminal in the host mount namespace can observe that
    the mount indeed hasn't propagated back to into the host mount
    namespace. A new mount can now be created on top of the /mnt dentry
    with the rootfs mount / as its parent:

            mount --bind /opt /mnt

            TARGET  SOURCE           FSTYPE  PROPAGATION
            └─/mnt  /dev/sda2[/opt]  ext4    shared:1

    The mount namespace that was created earlier can now observe that
    the bind mount created on the host has propagated into it:

            TARGET    SOURCE           FSTYPE  PROPAGATION
            └─/mnt    /dev/sda2[/opt]  ext4    master:1
              └─/mnt  tmpfs            tmpfs

    But instead of having been mounted on top of the tmpfs mount at the
    /mnt dentry the /opt mount has been mounted on top of the rootfs
    mount at the /mnt dentry. And the tmpfs mount has been remounted on
    top of the propagated /opt mount at the /opt dentry. So in other
    words, the propagated mount has been tucked under the preexisting
    mount in that mount namespace.

    Mount namespaces make this easy to illustrate but it's also easy to
    create tucked mounts in the same mount namespace (assuming a shared
    rootfs mount / with peer group id 1):

            mount --bind /opt /opt

            TARGET   SOURCE          FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt  /dev/sda2[/opt]  ext4    188     29         shared:1

    If another mount is mounted on top of the /opt mount at the /opt
    dentry:

            mount --bind /tmp /opt

    The following clunky mount tree will result:

            TARGET      SOURCE           FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt      /dev/sda2[/tmp]  ext4    405      29        shared:1
              └─/opt    /dev/sda2[/opt]  ext4    188     405        shared:1
                └─/opt  /dev/sda2[/tmp]  ext4    404     188        shared:1

    The /tmp mount is tucked both under the /opt mount and mounted on
    top of the /opt mount. This happens because the rootfs / and the
    /opt mount are shared mounts in the same peer group.

    When the new /tmp mount is supposed to be mounted at the /opt dentry
    then the /tmp mount first propagates to the root mount at the /opt
    dentry. But there already is the /opt mount mounted at the /opt
    dentry. So the old /opt mount at the /opt dentry will be mounted on
    top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
    is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
    is /opt which is what shows up as /opt under SOURCE). So again, a
    mount will be tucked under a preexisting mount.

    (Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
     shared rootfs is a good example of what could be referred to as
     mount explosion.)

The main point is that tucked mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was tucked beneath it.

In (2) where a tucked mount was created in the same mount namespace
unmounting the top mount would unmount both the top mount and the tucked
mount and in the process remount the original mount on top of the rootfs
mount / at the /opt dentry again. This again, is a result of mount
propagation only this time it's umount propagation. However, this can be
avoided by simply making the parent mount / of the @opt mount a private
or slave mount. Then the top mount and the original mount can be
unmounted to reveal the tucked mount.

The advantage here is that the proposed mechanism already exists and
that it is powerful enough to cover cases where mounts are supposed to
be updated with new versions. Crucially, it offers an important
flexibility. Namely that updates to a system may either be forced or can
be delayed and the umount of the top mount be left to a service if it is
a cooperative one.

This adds a new flag to move_mount() that allows to explicitly create a
tucked mount adhering to the following semantics

* Mounts cannot be tucked under the rootfs. This restriction encompasses
  the rootfs but also chroots via chroot() and pivot_root(). To tuck
  mounts under the rootfs or a chroot pivot_root() can be used as
  illustrated above.
* The source mount must be a private mount to force the kernel to
  allocate a new, unused peer group id. This isn't a required
  restriction but a voluntary one. It avoids repeating a semantical
  quirk that already exists today. If bind mounts which already have a
  peer group id are inserted into mount trees that have the same peer
  group id this can cause a lot of mount propagation events to be
  generated (For example, consider running mount --bind /opt /opt in a
  loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
  need to be able to unmount the top mount themselves.
  This also avoids a good deal of additional complexity. The umount
  would have to be propagated which would be another rather expensive
  operation. So namespace_lock() and lock_mount_hash() would potentially
  have to be held for a long time for both a mount and umount
  propagation. That should be avoided.
* The path to tuck under must be mounted and attached.
* Both the mount to tuck under and its parent must be in the caller's
  mount namespace and the caller must be able to mount in that mount
  namespace.
* The caller must be able to unmount the top mount to prove that they
  could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
  parent mount and the destination mount's mountpoint on the parent
  mount. Of course, if the parent of the destination mount and the
  destination mount are shared mounts in the same peer group and the
  mountpoint of the new mount to be mounted is a subdir of their
  ->mnt_root then both will receive a mount of /opt. That's probably
  easier to understand with an example. Assuming a standard shared
  rootfs /:

          mount --bind /opt /opt
          mount --bind /tmp /opt

  will cause the same mount tree as:

          mount --bind /opt /opt
          mount --tuck /tmp /opt

  because both / and /opt are shared mounts/peers in the same peer
  group and the /opt dentry is a subdirectory of both the parent's and
  the child's ->mnt_root. If a mount tree like that is created it almost
  always is an accident or abuse of mount propagation. Realistically
  what most people probably mean in this scenarios is:

          mount --bind /opt /opt
          mount --make-private /opt
          mount --make-shared /opt

  This forces the allocation of a new separate peer group for the /opt
  mount. Aferwards a mount --bind or mount --tuck actually makes sense
  as the / and /opt mount belong to different peer groups. Before that
  it's likely just confusion about what the user wanted to achieve.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
Christian Brauner (5):
      fs: add path_mounted()
      pnode: pass mountpoint directly
      fs: fix __lookup_mnt() documentation
      fs: use a for loop when locking a mount
      fs: allow to tuck mounts explicitly

 fs/namespace.c             | 328 ++++++++++++++++++++++++++++++++++++---------
 fs/pnode.c                 |  12 +-
 include/uapi/linux/mount.h |   3 +-
 3 files changed, 274 insertions(+), 69 deletions(-)
---
base-commit: eeac8ede17557680855031c6f305ece2378af326
change-id: 20230202-fs-move-mount-replace-a5a20f471e04

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH RFC 1/5] fs: add path_mounted()
  2023-03-18 15:51 [PATCH RFC 0/5] fs: allow to tuck mounts explicitly Christian Brauner
@ 2023-03-18 15:51 ` Christian Brauner
  2023-03-18 15:51 ` [PATCH RFC 2/5] pnode: pass mountpoint directly Christian Brauner
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2023-03-18 15:51 UTC (permalink / raw)
  To: Al Viro, Seth Forshee; +Cc: linux-fsdevel, Christian Brauner (Microsoft)

Add a small helper to check whether a path refers to the root of the
mount instead of open-coding this everywhere.

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/namespace.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index bc0f15257b49..154569fd7343 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1767,6 +1767,19 @@ bool may_mount(void)
 	return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 
+/**
+ * path_mounted - check whether path is mounted
+ * @path: path to check
+ *
+ * Determine whether @path refers to the root of a mount.
+ *
+ * Return: true if @path is the root of a mount, false if not.
+ */
+static inline bool path_mounted(const struct path *path)
+{
+	return path->mnt->mnt_root == path->dentry;
+}
+
 static void warn_mandlock(void)
 {
 	pr_warn_once("=======================================================\n"
@@ -1782,7 +1795,7 @@ static int can_umount(const struct path *path, int flags)
 
 	if (!may_mount())
 		return -EPERM;
-	if (path->dentry != path->mnt->mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 	if (!check_mnt(mnt))
 		return -EINVAL;
@@ -2367,7 +2380,7 @@ static int do_change_type(struct path *path, int ms_flags)
 	int type;
 	int err = 0;
 
-	if (path->dentry != path->mnt->mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	type = flags_to_propagation_type(ms_flags);
@@ -2646,7 +2659,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 	if (!check_mnt(mnt))
 		return -EINVAL;
 
-	if (path->dentry != mnt->mnt.mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	if (!can_change_locked_flags(mnt, mnt_flags))
@@ -2685,7 +2698,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	if (!check_mnt(mnt))
 		return -EINVAL;
 
-	if (path->dentry != path->mnt->mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	if (!can_change_locked_flags(mnt, mnt_flags))
@@ -2775,9 +2788,9 @@ static int do_set_group(struct path *from_path, struct path *to_path)
 
 	err = -EINVAL;
 	/* To and From paths should be mount roots */
-	if (from_path->dentry != from_path->mnt->mnt_root)
+	if (!path_mounted(from_path))
 		goto out;
-	if (to_path->dentry != to_path->mnt->mnt_root)
+	if (!path_mounted(to_path))
 		goto out;
 
 	/* Setting sharing groups is only allowed across same superblock */
@@ -2858,7 +2871,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	if (old->mnt.mnt_flags & MNT_LOCKED)
 		goto out;
 
-	if (old_path->dentry != old_path->mnt->mnt_root)
+	if (!path_mounted(old_path))
 		goto out;
 
 	if (d_is_dir(new_path->dentry) !=
@@ -2940,8 +2953,7 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
 	}
 
 	/* Refuse the same filesystem on the same mount point */
-	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
-	    path->mnt->mnt_root == path->dentry)
+	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path_mounted(path))
 		return -EBUSY;
 
 	if (d_is_symlink(newmnt->mnt.mnt_root))
@@ -3920,11 +3932,11 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	if (new_mnt == root_mnt || old_mnt == root_mnt)
 		goto out4; /* loop, on the same file system  */
 	error = -EINVAL;
-	if (root.mnt->mnt_root != root.dentry)
+	if (!path_mounted(&root))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(root_mnt))
 		goto out4; /* not attached */
-	if (new.mnt->mnt_root != new.dentry)
+	if (!path_mounted(&new))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(new_mnt))
 		goto out4; /* not attached */
@@ -4127,7 +4139,7 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
 	struct mount *mnt = real_mount(path->mnt);
 	int err = 0;
 
-	if (path->dentry != mnt->mnt.mnt_root)
+	if (!path_mounted(path))
 		return -EINVAL;
 
 	if (kattr->mnt_userns) {

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 2/5] pnode: pass mountpoint directly
  2023-03-18 15:51 [PATCH RFC 0/5] fs: allow to tuck mounts explicitly Christian Brauner
  2023-03-18 15:51 ` [PATCH RFC 1/5] fs: add path_mounted() Christian Brauner
@ 2023-03-18 15:51 ` Christian Brauner
  2023-03-18 15:51 ` [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation Christian Brauner
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2023-03-18 15:51 UTC (permalink / raw)
  To: Al Viro, Seth Forshee; +Cc: linux-fsdevel, Christian Brauner (Microsoft)

Currently, we use a global variable to stash the destination
mountpoint. All global variables are changed in propagate_one(). The
mountpoint variable is one of the few which doesn't change after
initialization. Instead, just pass the destination mountpoint directly
making it easy to verify directly in propagate_mnt() that the
destination mountpoint never changes.

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/pnode.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 468e4e65a615..3cede8b18c8b 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -214,7 +214,6 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 
 /* all accesses are serialized by namespace_sem */
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
-static struct mountpoint *mp;
 static struct hlist_head *list;
 
 static inline bool peers(struct mount *m1, struct mount *m2)
@@ -222,7 +221,7 @@ static inline bool peers(struct mount *m1, struct mount *m2)
 	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
 
-static int propagate_one(struct mount *m)
+static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
 {
 	struct mount *child;
 	int type;
@@ -230,7 +229,7 @@ static int propagate_one(struct mount *m)
 	if (IS_MNT_NEW(m))
 		return 0;
 	/* skip if mountpoint isn't covered by it */
-	if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
+	if (!is_subdir(dest_mp->m_dentry, m->mnt.mnt_root))
 		return 0;
 	if (peers(m, last_dest)) {
 		type = CL_MAKE_SHARED;
@@ -262,7 +261,7 @@ static int propagate_one(struct mount *m)
 	if (IS_ERR(child))
 		return PTR_ERR(child);
 	read_seqlock_excl(&mount_lock);
-	mnt_set_mountpoint(m, mp, child);
+	mnt_set_mountpoint(m, dest_mp, child);
 	if (m->mnt_master != dest_master)
 		SET_MNT_MARK(m->mnt_master);
 	read_sequnlock_excl(&mount_lock);
@@ -299,13 +298,12 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 	last_dest = dest_mnt;
 	first_source = source_mnt;
 	last_source = source_mnt;
-	mp = dest_mp;
 	list = tree_list;
 	dest_master = dest_mnt->mnt_master;
 
 	/* all peers of dest_mnt, except dest_mnt itself */
 	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
-		ret = propagate_one(n);
+		ret = propagate_one(n, dest_mp);
 		if (ret)
 			goto out;
 	}
@@ -316,7 +314,7 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 		/* everything in that slave group */
 		n = m;
 		do {
-			ret = propagate_one(n);
+			ret = propagate_one(n, dest_mp);
 			if (ret)
 				goto out;
 			n = next_peer(n);

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation
  2023-03-18 15:51 [PATCH RFC 0/5] fs: allow to tuck mounts explicitly Christian Brauner
  2023-03-18 15:51 ` [PATCH RFC 1/5] fs: add path_mounted() Christian Brauner
  2023-03-18 15:51 ` [PATCH RFC 2/5] pnode: pass mountpoint directly Christian Brauner
@ 2023-03-18 15:51 ` Christian Brauner
  2023-04-21  6:28   ` Al Viro
  2023-03-18 15:52 ` [PATCH RFC 4/5] fs: use a for loop when locking a mount Christian Brauner
  2023-03-18 15:52 ` [PATCH RFC 5/5] fs: allow to tuck mounts explicitly Christian Brauner
  4 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2023-03-18 15:51 UTC (permalink / raw)
  To: Al Viro, Seth Forshee; +Cc: linux-fsdevel, Christian Brauner (Microsoft)

The comment on top of __lookup_mnt() states that it finds the first
mount implying that there could be multiple mounts mounted at the same
dentry with the same parent.

This was true on old kernels where __lookup_mnt() could encounter a
stack of child mounts such that each child had the same parent mount and
was mounted at the same dentry. These were called "shadow mounts" and
were created during mount propagation. So back then if a mount @m in the
destination propagation tree already had a child mount @p mounted at
@mp then any mount @n we propagated to @m at the same @mp would be
appended after the preexisting mount @p in @mount_hashtable.

This hasn't been the case for quite a while now and I don't see an
obvious way how such mount stacks could be created in another way. And
if that's possible it would invalidate assumptions made in other parts
of the code.

So for a long time on all relevant kernels the child-parent relationship
is unique per dentry. So given a child mount @c mounted at its parent
mount @p on dentry @mp means that @c is the only child mounted on
@p at @mp. Should a mount @m be propagated to @p on @mp then @m will be
mounted on @p at @mp and the preexisting child @c will be remounted on
top of @m at @m->mnt_root.

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/namespace.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 154569fd7343..42dc87f86f34 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -658,9 +658,16 @@ static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 	return false;
 }
 
-/*
- * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+/**
+ * __lookup_mnt - find child mount
+ * @mnt:	parent mount
+ * @dentry:	mountpoint
+ *
+ * If @mnt has a child mount mounted @dentry find and return it. If a
+ * mount is found it is unique, i.e., there are no shadow child mounts
+ * with @mnt as parent and mounted at @dentry.
+ *
+ * Return: The child of @mnt mounted @dentry or NULL if there is none.
  */
 struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 {

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation
  2023-03-18 15:51 ` [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation Christian Brauner
@ 2023-04-21  6:28   ` Al Viro
  2023-04-24 16:37     ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Al Viro @ 2023-04-21  6:28 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Seth Forshee, linux-fsdevel

On Sat, Mar 18, 2023 at 04:51:59PM +0100, Christian Brauner wrote:
> The comment on top of __lookup_mnt() states that it finds the first
> mount implying that there could be multiple mounts mounted at the same
> dentry with the same parent.
> 
> This was true on old kernels where __lookup_mnt() could encounter a
> stack of child mounts such that each child had the same parent mount and
> was mounted at the same dentry. These were called "shadow mounts" and
> were created during mount propagation. So back then if a mount @m in the
> destination propagation tree already had a child mount @p mounted at
> @mp then any mount @n we propagated to @m at the same @mp would be
> appended after the preexisting mount @p in @mount_hashtable.
> 
> This hasn't been the case for quite a while now and I don't see an
> obvious way how such mount stacks could be created in another way.

Not quite, actually - there's a nasty corner case where mnt_change_mountpoint()
would create those.  And your subsequent patch steps into the same fun.
Look: suppose the root of the tree you are feeding to attach_recursive_mnt()
has managed to grow a mount right on top of its root.  The same will be
reproduced in all its copies created by propagate_mnt().  Now, suppose
one of the slaves of the place where we are trying to mount it on already
has something mounted on it.  Well, we hit this:
                q = __lookup_mnt(&child->mnt_parent->mnt,
				 child->mnt_mountpoint);
		if (q)
			mnt_change_mountpoint(child, smp, q);
which will tuck the child (a copy we'd made) under q (existing mount on
top of the place that copy is for).  Result: 'q' overmounts the root of
'child' now.  So does the copy of whatever had been overmounting the
root of 'source_mnt'...

And yes, it can happen.  Consider e.g. do_loopback(); we have looked
up the mountpoint ('path'), we have looked up the subtree to copy
('old_path'), we had lock_mount(path) made sure that namespace_sem
is held *and* path is not overmounted (followed into whatever
overmounts that might have happened since we looked the mountpoint
up).  Now, think what happens if 'old_path' is also overmounted while
we are trying to get namespace_sem...

A similar scenario exists for do_move_mount(), and there it's in
a sense worse - there we have to cope with the possibility that
from_dfd is an O_PATH descriptor created by fsmount().  And I'm
not at all convinced that we can't arrange for automount point
to be there and be triggered by the time of move_mount(2)...
The reason it's worse is that here we can't just follow mounts
all the way down - we want to take the entire mount tree associated
with that descriptor.

> And
> if that's possible it would invalidate assumptions made in other parts
> of the code.

Details?  I'm not saying it's impossible - we might have a real bug in
that area.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation
  2023-04-21  6:28   ` Al Viro
@ 2023-04-24 16:37     ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2023-04-24 16:37 UTC (permalink / raw)
  To: Al Viro; +Cc: Seth Forshee, linux-fsdevel

On Fri, Apr 21, 2023 at 07:28:38AM +0100, Al Viro wrote:
> On Sat, Mar 18, 2023 at 04:51:59PM +0100, Christian Brauner wrote:
> > The comment on top of __lookup_mnt() states that it finds the first
> > mount implying that there could be multiple mounts mounted at the same
> > dentry with the same parent.
> > 
> > This was true on old kernels where __lookup_mnt() could encounter a
> > stack of child mounts such that each child had the same parent mount and
> > was mounted at the same dentry. These were called "shadow mounts" and
> > were created during mount propagation. So back then if a mount @m in the
> > destination propagation tree already had a child mount @p mounted at
> > @mp then any mount @n we propagated to @m at the same @mp would be
> > appended after the preexisting mount @p in @mount_hashtable.
> > 
> > This hasn't been the case for quite a while now and I don't see an
> > obvious way how such mount stacks could be created in another way.
> 
> Not quite, actually - there's a nasty corner case where mnt_change_mountpoint()
> would create those.  And your subsequent patch steps into the same fun.
> Look: suppose the root of the tree you are feeding to attach_recursive_mnt()
> has managed to grow a mount right on top of its root.  The same will be
> reproduced in all its copies created by propagate_mnt().  Now, suppose
> one of the slaves of the place where we are trying to mount it on already
> has something mounted on it.  Well, we hit this:
>                 q = __lookup_mnt(&child->mnt_parent->mnt,
> 				 child->mnt_mountpoint);
> 		if (q)
> 			mnt_change_mountpoint(child, smp, q);
> which will tuck the child (a copy we'd made) under q (existing mount on
> top of the place that copy is for).  Result: 'q' overmounts the root of
> 'child' now.  So does the copy of whatever had been overmounting the
> root of 'source_mnt'...
> 
> And yes, it can happen.  Consider e.g. do_loopback(); we have looked
> up the mountpoint ('path'), we have looked up the subtree to copy
> ('old_path'), we had lock_mount(path) made sure that namespace_sem
> is held *and* path is not overmounted (followed into whatever
> overmounts that might have happened since we looked the mountpoint
> up).  Now, think what happens if 'old_path' is also overmounted while
> we are trying to get namespace_sem...

So I think I understand what you're saying and here's one example:

        (1) mount --bind /mnt /opt
        (2) fd_from = open_tree("/opt", OPEN_TREE_CLONE)
        (3) mount --bind /tmp /opt
        (4) mount_move(fd_from, fd_somewhere_else, ...)

Where the mount of (3) happens:

* after path lookup for fd_from in (2)
* before namespace_lock() is acquired by (2) and open_detached_copy()
  is called

So then open_detached_copy() and consequently __do_loopack() will see
and copy both (1) and (3) mounted on top of it. So roughly:

           (0) mnt->mnt_root                   = "/"
            |  real_mount(mnt)->mnt_mountpoint = "/"
            |  real_mount(mnt)->mnt_id         = 10
            |
fd_from ----└─ (1) mnt->mnt_root                       = "/mnt"
                |  real_mount(mnt)->mnt_mountpoint     = "/opt"
                |  real_mount(mnt)->mnt_id             = 20
                |  real_mount(mnt)->mnt_parent->mnt_id = 10
                |
                |      /* overmounted */
                └─ (3) mnt->mnt_root                       = "/tmp"
                       real_mount(mnt)->mnt_mountpoint     = "/mnt"
                       real_mount(mnt)->mnt_id             = 30
                       real_mount(mnt)->mnt_parent->mnt_id = 20

Then, if mount propagation were to happen into e.g., another mount
namespace as part of (4) where there's another mount (5) already mounted
at our mountpoint:

        (5) mount --bind /mnt /opt

        (0') mnt->mnt_root                  = "/"
         |  real_mount(mnt)->mnt_mountpoint = "/"
         |  real_mount(mnt)->mnt_id         = 100
         |
         └─ (5) mnt->mnt_root                       = "/mnt";
                real_mount(mnt)->mnt_mountpoint     = "/opt";
                real_mount(mnt)->mnt_id             = 50
                real_mount(mnt)->mnt_parent->mnt_id = 100

then we'd create a new copy (1') of the mount tree at (1):

        (0') mnt->mnt_root                   = "/"
         |   real_mount(mnt)->mnt_mountpoint = "/"
         |   real_mount(mnt)->mnt_id         = 100
         |                                               
         └─ (1') mnt->mnt_root                        = "/mnt"
             ||   real_mount(mnt)->mnt_mountpoint     = "/opt"
             ||   real_mount(mnt)->mnt_id             = 200
             ||   real_mount(mnt)->mnt_parent->mnt_id = 100
             ||                                              
             |└─ (3') mnt->mnt_root                       = "/tmp"
             |        real_mount(mnt)->mnt_mountpoint     = "/mnt"
             |        real_mount(mnt)->mnt_id             = 300
             |        real_mount(mnt)->mnt_parent->mnt_id = 200
             |                                               
             └─- (5)  mnt->mnt_root                       = "/mnt";
                      real_mount(mnt)->mnt_mountpoint     = "/mnt";
                      real_mount(mnt)->mnt_id             = 50
                      real_mount(mnt)->mnt_parent->mnt_id = 200

So remounting (5) on top of (1') aka tucking (1') beneath (5).
Afterwards we will have both (3') and (5) with (1') as parent at the
same mountpoint. Ugh...

However, I think that this is only possible with attached source mounts
that get moved? For anonymous mounts we know that there cannot be any
mounts on top of them since neither direct mounts nor indirect mounts
aka mount propagation onto anonymous mounts is possible unless you have
automounts triggered on them, I think.

> 
> A similar scenario exists for do_move_mount(), and there it's in
> a sense worse - there we have to cope with the possibility that
> from_dfd is an O_PATH descriptor created by fsmount().  And I'm
> not at all convinced that we can't arrange for automount point
> to be there and be triggered by the time of move_mount(2)...

Thanks for the pointers about automounts. That was very helpful.

So right now, any file descriptor returned from fsmount() will refer to
an anonymous mount and will thus have an anonymous mount namespace
attached to mnt->mnt_ns.

Any explicit attempt to use move_mount(2) or mount(2) to add a mount on
top of an O_PATH file descriptor gotten from fsmount() will fail in
check_mnt(). And any implicit attempt to add a mount on top of an
anonymous mount
file descriptor/mount via mount propagation will fail as well on
anonymous mounts.

This is behavior that userspace explicitly relies upon (I know of at
least 4 big projects.) as they use fsmount() and
open_tree(OPEN_TREE_CLONE) file descriptors in security sensitive
contexts where they can't risk suddenly have mounts appear in locations
where they don't expect them.

Reading through the code right now, I think it's clear that automounts
can be triggered on fsmount() or open_tree(OPEN_TREE_CLONE) fds even
before such mounts have been attached via move_mount(2).

So when we stumble upon an automount point that is located in an
anonymous mount we call finish_automount(->d_automount(), path).

So we end up in do_add_mount():

        struct mount *parent = real_mount(path->mnt);

(Ignoring for a second that "parent" is a misleading name here...)
and we know that parent->mnt_ns is an anonymous mount namespace, so we
fail:

          if (unlikely(!check_mnt(parent))) {

but then succeed and get past the next check:

                  /* ... and for those we'd better have mountpoint still alive */
                  if (!parent->mnt_ns)
                          return -EINVAL;

I'm not sure if that's intentional.

In any case, I would very much prefer if we were to simply refuse
automounts on top of anonymous mounts as well.

Not just is it extremly unlikely to be a use-case but it would also be
consistent in refusing mounting on top of anonymous mounts as mentioned
above:

diff --git a/fs/namespace.c b/fs/namespace.c
index 6836e937ee61..bf9f4d36ab98 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3095,6 +3095,11 @@ int finish_automount(struct vfsmount *m, const struct path *path)
                goto discard_locked;
        }

+       if (!is_anon_ns(real_mount(path->mnt)->mnt_ns)) {
+               err = -EINVAL;
+               goto discard_locked;
+       }
+
        err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
        unlock_mount(mp);
        if (unlikely(err))

So we would require that automounts can only be triggered if the mount
they'll be mounted upon has been made visible in the filesystem.

This would also continue to provide guarantees to userspace when
operating on private mounts. What do you think?

> The reason it's worse is that here we can't just follow mounts
> all the way down - we want to take the entire mount tree associated
> with that descriptor.

I'm not clear yet why this is an issue. finish_automount() will call
__lookup_mount() and if anything is mounted on top path->mnt,dentry then
the automount is silently discarded under namespace_lock(), no? So there
should be mutual exclusion here, between move_mount(2) and
finish_automount().

> 
> > And
> > if that's possible it would invalidate assumptions made in other parts
> > of the code.
> 
> Details?  I'm not saying it's impossible - we might have a real bug in
> that area.

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 4/5] fs: use a for loop when locking a mount
  2023-03-18 15:51 [PATCH RFC 0/5] fs: allow to tuck mounts explicitly Christian Brauner
                   ` (2 preceding siblings ...)
  2023-03-18 15:51 ` [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation Christian Brauner
@ 2023-03-18 15:52 ` Christian Brauner
  2023-03-18 15:52 ` [PATCH RFC 5/5] fs: allow to tuck mounts explicitly Christian Brauner
  4 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2023-03-18 15:52 UTC (permalink / raw)
  To: Al Viro, Seth Forshee; +Cc: linux-fsdevel, Christian Brauner (Microsoft)

Currently, lock_mount() uses a goto to retry the lookup until it
succeeded in acquiring the namespace_lock() preventing the top mount
from being overmounted. While that's perfectly fine we want to lookup
the mountpoint on the parent of the top mount in later patches. So adapt
the code to make this easier to implement. Also, the for loop is
arguably a little cleaner and makes the code easier to follow. No
functional changes intended.

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/namespace.c | 50 +++++++++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 42dc87f86f34..7f22fcfd8eab 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2308,31 +2308,39 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 
 static struct mountpoint *lock_mount(struct path *path)
 {
-	struct vfsmount *mnt;
-	struct dentry *dentry = path->dentry;
-retry:
-	inode_lock(dentry->d_inode);
-	if (unlikely(cant_mount(dentry))) {
-		inode_unlock(dentry->d_inode);
-		return ERR_PTR(-ENOENT);
-	}
-	namespace_lock();
-	mnt = lookup_mnt(path);
-	if (likely(!mnt)) {
-		struct mountpoint *mp = get_mountpoint(dentry);
-		if (IS_ERR(mp)) {
-			namespace_unlock();
+	struct vfsmount *mnt = path->mnt;
+	struct dentry *dentry;
+	struct mountpoint *mp;
+
+	for (;;) {
+		dentry = path->dentry;
+		inode_lock(dentry->d_inode);
+		if (unlikely(cant_mount(dentry))) {
 			inode_unlock(dentry->d_inode);
-			return mp;
+			return ERR_PTR(-ENOENT);
 		}
+
+		namespace_lock();
+
+		mnt = lookup_mnt(path);
+		if (likely(!mnt))
+			break;
+
+		namespace_unlock();
+		inode_unlock(dentry->d_inode);
+		path_put(path);
+		path->mnt = mnt;
+		path->dentry = dget(mnt->mnt_root);
+	}
+
+	mp = get_mountpoint(dentry);
+	if (IS_ERR(mp)) {
+		namespace_unlock();
+		inode_unlock(dentry->d_inode);
 		return mp;
 	}
-	namespace_unlock();
-	inode_unlock(path->dentry->d_inode);
-	path_put(path);
-	path->mnt = mnt;
-	dentry = path->dentry = dget(mnt->mnt_root);
-	goto retry;
+
+	return mp;
 }
 
 static void unlock_mount(struct mountpoint *where)

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RFC 5/5] fs: allow to tuck mounts explicitly
  2023-03-18 15:51 [PATCH RFC 0/5] fs: allow to tuck mounts explicitly Christian Brauner
                   ` (3 preceding siblings ...)
  2023-03-18 15:52 ` [PATCH RFC 4/5] fs: use a for loop when locking a mount Christian Brauner
@ 2023-03-18 15:52 ` Christian Brauner
  2023-04-21  6:29   ` Al Viro
  4 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2023-03-18 15:52 UTC (permalink / raw)
  To: Al Viro, Seth Forshee; +Cc: linux-fsdevel, Christian Brauner (Microsoft)

Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

       TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /       /       ext4    shared:1     29      1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

       TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
       /       /       ext4    shared:24 master:1  71      47

For people not too familar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

       TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

       TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the sytems is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The existing mechanism of tucked mounts can be exposed as a new option
when attaching a mount giving userspace the ability to seamlessly
upgrade a mount. Now, instead of only being created as a consequence of
mount propagation they can be created explicitly. Today, tucked mounts
are created in two scenarios:

(1) When a service or container is started in a new mount namespace and
    pivot_root()s into its new rootfs. The way this is done is by
    tucking the new rootfs under the old rootfs:

            fd_newroot = open("/var/lib/machines/fedora", ...);
            fd_oldroot = open("/", ...);
            fchdir(fd_newroot);
            pivot_root(".", ".");

    After the pivot_root(".", ".") call the new rootfs is tucked under
    the old rootfs which can then be unmounted to reveal the underlying
    mount:

            fchdir(fd_oldroot);
            umount2(".", MNT_DETACH);

    Since pivot_root() moves the caller into a new rootfs no mounts must
    be propagated out of the new rootfs as a consequence of the
    pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
    mounted on the same dentry.

    The easiest example for this is to create a new mount namespace. The
    following commands will create a mount namespace where the rootfs
    mount / will be a slave to the peer group of the host rootfs /
    mount's peer group. IOW, it will receive propagation from the host:

            mount --make-shared /
            unshare --mount --propagation=slave

    Now a new mount on the /mnt dentry in that mount namespace is
    created. (As it can be confusing it should be spelled out that the
    tmpfs mount on the /mnt dentry that was just created doesn't
    propagate back to the host because the rootfs mount / of the mount
    namespace isn't a peer of the host rootfs.):

            mount -t tmpfs tmpfs /mnt

            TARGET  SOURCE  FSTYPE  PROPAGATION
            └─/mnt  tmpfs   tmpfs

    Now another terminal in the host mount namespace can observe that
    the mount indeed hasn't propagated back to into the host mount
    namespace. A new mount can now be created on top of the /mnt dentry
    with the rootfs mount / as its parent:

            mount --bind /opt /mnt

            TARGET  SOURCE           FSTYPE  PROPAGATION
            └─/mnt  /dev/sda2[/opt]  ext4    shared:1

    The mount namespace that was created earlier can now observe that
    the bind mount created on the host has propagated into it:

            TARGET    SOURCE           FSTYPE  PROPAGATION
            └─/mnt    /dev/sda2[/opt]  ext4    master:1
              └─/mnt  tmpfs            tmpfs

    But instead of having been mounted on top of the tmpfs mount at the
    /mnt dentry the /opt mount has been mounted on top of the rootfs
    mount at the /mnt dentry. And the tmpfs mount has been remounted on
    top of the propagated /opt mount at the /opt dentry. So in other
    words, the propagated mount has been tucked under the preexisting
    mount in that mount namespace.

    Mount namespaces make this easy to illustrate but it's also easy to
    create tucked mounts in the same mount namespace (assuming a shared
    rootfs mount / with peer group id 1):

            mount --bind /opt /opt

            TARGET   SOURCE          FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt  /dev/sda2[/opt]  ext4    188     29         shared:1

    If another mount is mounted on top of the /opt mount at the /opt
    dentry:

            mount --bind /tmp /opt

    The following clunky mount tree will result:

            TARGET      SOURCE           FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt      /dev/sda2[/tmp]  ext4    405      29        shared:1
              └─/opt    /dev/sda2[/opt]  ext4    188     405        shared:1
                └─/opt  /dev/sda2[/tmp]  ext4    404     188        shared:1

    The /tmp mount is tucked both under the /opt mount and mounted on
    top of the /opt mount. This happens because the rootfs / and the
    /opt mount are shared mounts in the same peer group.

    When the new /tmp mount is supposed to be mounted at the /opt dentry
    then the /tmp mount first propagates to the root mount at the /opt
    dentry. But there already is the /opt mount mounted at the /opt
    dentry. So the old /opt mount at the /opt dentry will be mounted on
    top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
    is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
    is /opt which is what shows up as /opt under SOURCE). So again, a
    mount will be tucked under a preexisting mount.

    (Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
     shared rootfs is a good example of what could be referred to as
     mount explosion.)

The main point is that tucked mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was tucked beneath it.

In (2) where a tucked mount was created in the same mount namespace
unmounting the top mount would unmount both the top mount and the tucked
mount and in the process remount the original mount on top of the rootfs
mount / at the /opt dentry again. This again, is a result of mount
propagation only this time it's umount propagation. However, this can be
avoided by simply making the parent mount / of the @opt mount a private
or slave mount. Then the top mount and the original mount can be
unmounted to reveal the tucked mount.

The advantage here is that the proposed mechanism already exists and
that it is powerful enough to cover cases where mounts are supposed to
be updated with new versions. Crucially, it offers an important
flexibility. Namely that updates to a system may either be forced or can
be delayed and the umount of the top mount be left to a service if it is
a cooperative one.

This adds a new flag to move_mount() that allows to explicitly create a
tucked mount adhering to the following semantics

* Mounts cannot be tucked under the rootfs. This restriction encompasses
  the rootfs but also chroots via chroot() and pivot_root(). To tuck
  mounts under the rootfs or a chroot pivot_root() can be used as
  illustrated above.
* The source mount must be a private mount to force the kernel to
  allocate a new, unused peer group id. This isn't a required
  restriction but a voluntary one. It avoids repeating a semantical
  quirk that already exists today. If bind mounts which already have a
  peer group id are inserted into mount trees that have the same peer
  group id this can cause a lot of mount propagation events to be
  generated (For example, consider running mount --bind /opt /opt in a
  loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
  need to be able to unmount the top mount themselves.
  This also avoids a good deal of additional complexity. The umount
  would have to be propagated which would be another rather expensive
  operation. So namespace_lock() and lock_mount_hash() would potentially
  have to be held for a long time for both a mount and umount
  propagation. That should be avoided.
* The path to tuck under must be mounted and attached.
* Both the mount to tuck under and its parent must be in the caller's
  mount namespace and the caller must be able to mount in that mount
  namespace.
* The caller must be able to unmount the top mount to prove that they
  could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
  parent mount and the destination mount's mountpoint on the parent
  mount. Of course, if the parent of the destination mount and the
  destination mount are shared mounts in the same peer group and the
  mountpoint of the new mount to be mounted is a subdir of their
  ->mnt_root then both will receive a mount of /opt. That's probably
  easier to understand with an example. Assuming a standard shared
  rootfs /:

          mount --bind /opt /opt
          mount --bind /tmp /opt

  will cause the same mount tree as:

          mount --bind /opt /opt
          mount --tuck /tmp /opt

  because both / and /opt are shared mounts/peers in the same peer
  group and the /opt dentry is a subdirectory of both the parent's and
  the child's ->mnt_root. If a mount tree like that is created it almost
  always is an accident or abuse of mount propagation. Realistically
  what most people probably mean in this scenarios is:

          mount --bind /opt /opt
          mount --make-private /opt
          mount --make-shared /opt

  This forces the allocation of a new separate peer group for the /opt
  mount. Aferwards a mount --bind or mount --tuck actually makes sense
  as the / and /opt mount belong to different peer groups. Before that
  it's likely just confusion about what the user wanted to achieve.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/namespace.c             | 231 ++++++++++++++++++++++++++++++++++++++++-----
 include/uapi/linux/mount.h |   3 +-
 2 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 7f22fcfd8eab..93f8902c6589 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -935,6 +935,62 @@ static void attach_mnt(struct mount *mnt,
 	__attach_mnt(mnt, parent);
 }
 
+/**
+ * mnt_tuck_mountpoint - tuck a mount beneath another one
+ *
+ * @tucked_mnt: the mount to tuck
+ * @top_mnt:	the mount to tuck @tucked_mnt under
+ * @tucked_mp:	the new mountpoint of @top_mnt on @tucked_mnt
+ *
+ * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and
+ * parent @top_mnt->mnt_parent and mount it on top of @tucked_mnt at
+ * @tucked_mp. And mount @tucked_mnt on the old parent and old
+ * mountpoint of @top_mnt.
+ *
+ * Note that we keep the reference count in tact when we remove @top_mnt
+ * from its old mountpoint and parent to prevent UAF issues. Once we've
+ * mounted @top_mnt on @tucked_mnt the reference count gets bumped once
+ * more. So make sure that we drop it to not leak the mount and
+ * mountpoint.
+ */
+static void mnt_tuck_mountpoint(struct mount *tucked_mnt, struct mount *top_mnt,
+				struct mountpoint *tucked_mp)
+{
+	struct mount *old_top_parent = top_mnt->mnt_parent;
+	struct mountpoint *old_top_mp;
+
+	old_top_mp = unhash_mnt(top_mnt);
+	attach_mnt(top_mnt, tucked_mnt, tucked_mp);
+	mnt_set_mountpoint(old_top_parent, old_top_mp, tucked_mnt);
+	put_mountpoint(old_top_mp);
+	mnt_add_count(old_top_parent, -1);
+}
+
+/**
+ * tuck_mnt - tuck a mount beneath another one, attach to
+ *            @mount_hashtable and parent's list of child mounts
+ *
+ * @tucked_mnt: the mount to tuck
+ * @top_mnt:	the mount to tuck @tucked_mnt under
+ * @tucked_mp:	the new mountpoint of @top_mnt on @tucked_mnt
+ *
+ * Remove @top_mnt from its current parent and mountpoint and mount it
+ * on @tucked_mp on @tucked_mnt, and mount @tucked_mnt on the old
+ * parent and old mountpoint of @top_mnt. Finally, attach @tucked_mnt
+ * mount to @mnt_hashtable and @tucked_mnt->mnt_parent->mnt_mounts.
+ *
+ * Note, when we call __attach_mnt() we've already mounted @tucked_mnt
+ * on top of @top_mnt's old parent so @tucked_mnt->mnt_parent will point
+ * to the correct parent.
+ */
+static void tuck_mnt(struct mount *tucked_mnt,
+		     struct mount *top_mnt,
+		     struct mountpoint *tucked_mp)
+{
+	mnt_tuck_mountpoint(tucked_mnt, top_mnt, tucked_mp);
+	__attach_mnt(tucked_mnt, tucked_mnt->mnt_parent);
+}
+
 void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
 {
 	struct mountpoint *old_mp = mnt->mnt_mp;
@@ -2154,12 +2210,16 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
 	return 0;
 }
 
+typedef enum mnt_tree_flags_t {
+	MNT_TREE_MOVE	= BIT(0),
+	MNT_TREE_TUCK	= BIT(1),
+} mnt_tree_flags_t;
+
 /*
  *  @source_mnt : mount tree to be attached
- *  @nd         : place the mount tree @source_mnt is attached
- *  @parent_nd  : if non-null, detach the source_mnt from its parent and
- *  		   store the parent mount and mountpoint dentry.
- *  		   (done when source_mnt is moved)
+ *  @top_mnt	: mount that @source_mnt will be mounted on or tucked under
+ *  @dest_mp	: the mountpoint @source_mnt will be mounted at
+ *  @flags	: modify how @source_mnt is supposed to be attached
  *
  *  NOTE: in the table below explains the semantics when a source mount
  *  of a given type is attached to a destination mount of a given type.
@@ -2218,17 +2278,18 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
  * in allocations.
  */
 static int attach_recursive_mnt(struct mount *source_mnt,
-			struct mount *dest_mnt,
-			struct mountpoint *dest_mp,
-			bool moving)
+				struct mount *top_mnt,
+				struct mountpoint *dest_mp,
+				mnt_tree_flags_t flags)
 {
 	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
 	HLIST_HEAD(tree_list);
-	struct mnt_namespace *ns = dest_mnt->mnt_ns;
+	struct mnt_namespace *ns = top_mnt->mnt_ns;
 	struct mountpoint *smp;
-	struct mount *child, *p;
+	struct mount *child, *dest_mnt, *p;
 	struct hlist_node *n;
-	int err;
+	int err = 0;
+	bool moving = flags & MNT_TREE_MOVE, tuck = flags & MNT_TREE_TUCK;
 
 	/* Preallocate a mountpoint in case the new mounts need
 	 * to be tucked under other mounts.
@@ -2244,29 +2305,48 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 			goto out;
 	}
 
+	if (tuck)
+		dest_mnt = top_mnt->mnt_parent;
+	else
+		dest_mnt = top_mnt;
+
 	if (IS_MNT_SHARED(dest_mnt)) {
 		err = invent_group_ids(source_mnt, true);
 		if (err)
 			goto out;
 		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
-		lock_mount_hash();
-		if (err)
-			goto out_cleanup_ids;
+	}
+	lock_mount_hash();
+	if (err)
+		goto out_cleanup_ids;
+
+	/* Recheck with lock_mount_hash() held. */
+	if (tuck && IS_MNT_LOCKED(top_mnt)) {
+		err = -EINVAL;
+		goto out_cleanup_ids;
+	}
+
+	if (IS_MNT_SHARED(dest_mnt)) {
 		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
 			set_mnt_shared(p);
-	} else {
-		lock_mount_hash();
 	}
+
 	if (moving) {
 		unhash_mnt(source_mnt);
-		attach_mnt(source_mnt, dest_mnt, dest_mp);
+		if (tuck)
+			tuck_mnt(source_mnt, top_mnt, smp);
+		else
+			attach_mnt(source_mnt, dest_mnt, dest_mp);
 		touch_mnt_namespace(source_mnt->mnt_ns);
 	} else {
 		if (source_mnt->mnt_ns) {
 			/* move from anon - the caller will destroy */
 			list_del_init(&source_mnt->mnt_ns->list);
 		}
-		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
+		if (tuck)
+			mnt_tuck_mountpoint(source_mnt, top_mnt, smp);
+		else
+			mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
 		commit_tree(source_mnt);
 	}
 
@@ -2306,14 +2386,35 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 	return err;
 }
 
-static struct mountpoint *lock_mount(struct path *path)
+/**
+ * lock_mount_mountpoint - lock mount and mountpoint
+ * @path: target path
+ * @tuck: whether we intend to tuck a mount beneath @path
+ *
+ * Follow the mount stack on @path until the top mount is found.
+ *
+ * If we intend to mount on top of @path->mnt acquire the inode_lock()
+ * for the top mount's ->mnt_root to protect against concurrent removal
+ * of our prospective mountpoint from another mount namespace.
+ *
+ * If we intend to tuck beneath the top mount @m acquire the
+ * inode_lock() on @m's mountpoint @mp on @m->mnt_parent. Otherwise we
+ * risk racing with someone who unlinked @mp from another mount
+ * namespace where @m doesn't have a child mount mounted @mp. We don't
+ * care if @m->mnt_root/@path->dentry is removed (as long as
+ * @path->dentry isn't equal to @m->mnt_mountpoint of course).
+ *
+ * Return: Either the target mountpoint on the top mount or the top
+ *         mount's mountpoint.
+ */
+static struct mountpoint *lock_mount_mountpoint(struct path *path, bool tuck)
 {
 	struct vfsmount *mnt = path->mnt;
 	struct dentry *dentry;
 	struct mountpoint *mp;
 
 	for (;;) {
-		dentry = path->dentry;
+		dentry = tuck ? real_mount(mnt)->mnt_mountpoint : path->dentry;
 		inode_lock(dentry->d_inode);
 		if (unlikely(cant_mount(dentry))) {
 			inode_unlock(dentry->d_inode);
@@ -2343,6 +2444,11 @@ static struct mountpoint *lock_mount(struct path *path)
 	return mp;
 }
 
+static inline struct mountpoint *lock_mount(struct path *path)
+{
+	return lock_mount_mountpoint(path, false);
+}
+
 static void unlock_mount(struct mountpoint *where)
 {
 	struct dentry *dentry = where->m_dentry;
@@ -2364,7 +2470,7 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 	      d_is_dir(mnt->mnt.mnt_root))
 		return -ENOTDIR;
 
-	return attach_recursive_mnt(mnt, p, mp, false);
+	return attach_recursive_mnt(mnt, p, mp, 0);
 }
 
 /*
@@ -2849,7 +2955,64 @@ static int do_set_group(struct path *from_path, struct path *to_path)
 	return err;
 }
 
-static int do_move_mount(struct path *old_path, struct path *new_path)
+/**
+ * can_tuck_mount - check that we can tuck a mount
+ * @from: mount to tuck under
+ * @to:   mount under which to tuck
+ *
+ * - Make sure that the mount to tuck under isn't a shared mount so we
+ *   force the kernel to allocate a new peer group id. This simplifies
+ *   the mount trees that can be created and limits propagation events
+ *   in cases where @to, and/or @to->mnt_parent are in the same peer
+ *   group. Something that's a nuisance already today.
+ * - Make sure that @to->dentry is actually the root of a mount under
+ *   which we can tuck another mount.
+ * - Make sure that nothing can be tucked under the caller's current
+ *   root or the rootfs of the namespace.
+ * - Make sure that the caller can unmount the topmost mount ensuring
+ *   that the caller could reveal the underlying mountpoint.
+ *
+ * Return: On success 0, and on error a negative error code is returned.
+ */
+static int can_tuck_mount(struct path *from, struct path *to)
+{
+	struct mount *mnt_from = real_mount(from->mnt),
+		     *mnt_to = real_mount(to->mnt);
+
+	if (!check_mnt(mnt_to))
+		return -EINVAL;
+
+	if (!mnt_has_parent(mnt_to))
+		return -EINVAL;
+
+	if (IS_MNT_SHARED(mnt_from))
+		return -EINVAL;
+
+	if (!path_mounted(to))
+		return -EINVAL;
+
+	if (mnt_from == mnt_to)
+		return -EINVAL;
+
+	/*
+	 * Tucking a mount beneath the rootfs only makes sense when the
+	 * tuck semantics of pivot_root(".", ".") are used.
+	 */
+	if (&mnt_to->mnt == current->fs->root.mnt)
+		return -EINVAL;
+	if (mnt_to->mnt_parent == current->nsproxy->mnt_ns->root)
+		return -EINVAL;
+
+	for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
+		if (p == mnt_to)
+			return -EINVAL;
+
+	/* Ensure the caller could reveal the underlying mount. */
+	return can_umount(to, 0);
+}
+
+static int do_move_mount(struct path *old_path, struct path *new_path,
+			 bool tuck)
 {
 	struct mnt_namespace *ns;
 	struct mount *p;
@@ -2858,8 +3021,9 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	struct mountpoint *mp, *old_mp;
 	int err;
 	bool attached;
+	mnt_tree_flags_t flags = 0;
 
-	mp = lock_mount(new_path);
+	mp = lock_mount_mountpoint(new_path, tuck);
 	if (IS_ERR(mp))
 		return PTR_ERR(mp);
 
@@ -2867,9 +3031,20 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 	p = real_mount(new_path->mnt);
 	parent = old->mnt_parent;
 	attached = mnt_has_parent(old);
+	if (attached)
+		flags |= MNT_TREE_MOVE;
 	old_mp = old->mnt_mp;
 	ns = old->mnt_ns;
 
+	if (tuck) {
+		err = can_tuck_mount(old_path, new_path);
+		if (err)
+			goto out;
+
+		p = p->mnt_parent;
+		flags |= MNT_TREE_TUCK;
+	}
+
 	err = -EINVAL;
 	/* The mountpoint must be in our namespace. */
 	if (!check_mnt(p))
@@ -2910,8 +3085,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 		if (p == old)
 			goto out;
 
-	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
-				   attached);
+	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp, flags);
 	if (err)
 		goto out;
 
@@ -2943,7 +3117,7 @@ static int do_move_mount_old(struct path *path, const char *old_name)
 	if (err)
 		return err;
 
-	err = do_move_mount(&old_path, path);
+	err = do_move_mount(&old_path, path, false);
 	path_put(&old_path);
 	return err;
 }
@@ -3807,6 +3981,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (flags & ~MOVE_MOUNT__MASK)
 		return -EINVAL;
 
+	if ((flags & (MOVE_MOUNT_TUCK | MOVE_MOUNT_SET_GROUP)) ==
+	    (MOVE_MOUNT_TUCK | MOVE_MOUNT_SET_GROUP))
+		return -EINVAL;
+
 	/* If someone gives a pathname, they aren't permitted to move
 	 * from an fd that requires unmount as we can't get at the flag
 	 * to clear it afterwards.
@@ -3836,7 +4014,8 @@ SYSCALL_DEFINE5(move_mount,
 	if (flags & MOVE_MOUNT_SET_GROUP)
 		ret = do_set_group(&from_path, &to_path);
 	else
-		ret = do_move_mount(&from_path, &to_path);
+		ret = do_move_mount(&from_path, &to_path,
+				    (flags & MOVE_MOUNT_TUCK));
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 4d93967f8aea..751089e3e0bd 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -74,7 +74,8 @@
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
 #define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
-#define MOVE_MOUNT__MASK		0x00000177
+#define MOVE_MOUNT_TUCK			0x00000200 /* tuck mount */
+#define MOVE_MOUNT__MASK		0x00000377
 
 /*
  * fsopen() flags.

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 5/5] fs: allow to tuck mounts explicitly
  2023-03-18 15:52 ` [PATCH RFC 5/5] fs: allow to tuck mounts explicitly Christian Brauner
@ 2023-04-21  6:29   ` Al Viro
  2023-04-24 17:36     ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Al Viro @ 2023-04-21  6:29 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Seth Forshee, linux-fsdevel

On Sat, Mar 18, 2023 at 04:52:01PM +0100, Christian Brauner wrote:

> +/**
> + * mnt_tuck_mountpoint - tuck a mount beneath another one
> + *
> + * @tucked_mnt: the mount to tuck
> + * @top_mnt:	the mount to tuck @tucked_mnt under
> + * @tucked_mp:	the new mountpoint of @top_mnt on @tucked_mnt
> + *
> + * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and
> + * parent @top_mnt->mnt_parent and mount it on top of @tucked_mnt at
> + * @tucked_mp. And mount @tucked_mnt on the old parent and old
> + * mountpoint of @top_mnt.
> + *
> + * Note that we keep the reference count in tact when we remove @top_mnt
> + * from its old mountpoint and parent to prevent UAF issues. Once we've
> + * mounted @top_mnt on @tucked_mnt the reference count gets bumped once
> + * more. So make sure that we drop it to not leak the mount and
> + * mountpoint.
> + */
> +static void mnt_tuck_mountpoint(struct mount *tucked_mnt, struct mount *top_mnt,
> +				struct mountpoint *tucked_mp)
> +{
> +	struct mount *old_top_parent = top_mnt->mnt_parent;
> +	struct mountpoint *old_top_mp;
> +
> +	old_top_mp = unhash_mnt(top_mnt);
> +	attach_mnt(top_mnt, tucked_mnt, tucked_mp);
> +	mnt_set_mountpoint(old_top_parent, old_top_mp, tucked_mnt);
> +	put_mountpoint(old_top_mp);
> +	mnt_add_count(old_top_parent, -1);
> +}

Umm...  And if something is mounted on top of your tucked_mnt?  Right
on top of its root, I mean...  BTW, why not make it

static void mnt_tuck_mountpoint(struct mount *tucked_mnt, struct mount *top_mnt,
                                struct mountpoint *tucked_mp)
{
        struct mount *old_parent = top_mnt->mnt_parent;
        struct mountpoint *old_mp = top_mnt->mnt_mp;

        mnt_set_mountpoint(old_parent, old_mp, tucked_mnt);
        mnt_change_mountpoint(tucked_mnt, tucked_mp, top_mnt);
}

I mean, look at the existing caller of mnt_change_mountpoint() nearby -
that's precisely how that "tucking" is done right now.  And I'm rather
afraid that it's vulnerable to the same problem - if the tree we are
copying has the root of its root mount overmounted by something...

> +static void tuck_mnt(struct mount *tucked_mnt,
> +		     struct mount *top_mnt,
> +		     struct mountpoint *tucked_mp)
> +{
> +	mnt_tuck_mountpoint(tucked_mnt, top_mnt, tucked_mp);
> +	__attach_mnt(tucked_mnt, tucked_mnt->mnt_parent);
> +}

Not sure it's worth bothering with - only one caller and it might be
cleaner to expand attach_mnt() there, turning it into set or tuck,
followed by unconditional __attach...

> +typedef enum mnt_tree_flags_t {
> +	MNT_TREE_MOVE	= BIT(0),
> +	MNT_TREE_TUCK	= BIT(1),
> +} mnt_tree_flags_t;

What combinations are possible?

>  static int attach_recursive_mnt(struct mount *source_mnt,
> -			struct mount *dest_mnt,
> -			struct mountpoint *dest_mp,
> -			bool moving)
> +				struct mount *top_mnt,
> +				struct mountpoint *dest_mp,
> +				mnt_tree_flags_t flags)
>  {
>  	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
>  	HLIST_HEAD(tree_list);
> -	struct mnt_namespace *ns = dest_mnt->mnt_ns;
> +	struct mnt_namespace *ns = top_mnt->mnt_ns;
>  	struct mountpoint *smp;
> -	struct mount *child, *p;
> +	struct mount *child, *dest_mnt, *p;
>  	struct hlist_node *n;
> -	int err;
> +	int err = 0;
> +	bool moving = flags & MNT_TREE_MOVE, tuck = flags & MNT_TREE_TUCK;
>  
>  	/* Preallocate a mountpoint in case the new mounts need
>  	 * to be tucked under other mounts.
> @@ -2244,29 +2305,48 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  			goto out;
>  	}
>  
> +	if (tuck)
> +		dest_mnt = top_mnt->mnt_parent;
> +	else
> +		dest_mnt = top_mnt;
> +
>  	if (IS_MNT_SHARED(dest_mnt)) {
>  		err = invent_group_ids(source_mnt, true);
>  		if (err)
>  			goto out;
>  		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
> -		lock_mount_hash();
> -		if (err)
> -			goto out_cleanup_ids;
> +	}
> +	lock_mount_hash();
> +	if (err)
> +		goto out_cleanup_ids;
> +
> +	/* Recheck with lock_mount_hash() held. */
> +	if (tuck && IS_MNT_LOCKED(top_mnt)) {
> +		err = -EINVAL;
> +		goto out_cleanup_ids;
> +	}
> +
> +	if (IS_MNT_SHARED(dest_mnt)) {
>  		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
>  			set_mnt_shared(p);
> -	} else {
> -		lock_mount_hash();
>  	}
> +
>  	if (moving) {
>  		unhash_mnt(source_mnt);
> -		attach_mnt(source_mnt, dest_mnt, dest_mp);
> +		if (tuck)
> +			tuck_mnt(source_mnt, top_mnt, smp);
> +		else
> +			attach_mnt(source_mnt, dest_mnt, dest_mp);
>  		touch_mnt_namespace(source_mnt->mnt_ns);
>  	} else {
>  		if (source_mnt->mnt_ns) {
>  			/* move from anon - the caller will destroy */
>  			list_del_init(&source_mnt->mnt_ns->list);
>  		}
> -		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> +		if (tuck)
> +			mnt_tuck_mountpoint(source_mnt, top_mnt, smp);
> +		else
> +			mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
>  		commit_tree(source_mnt);
>  	}
>  
> @@ -2306,14 +2386,35 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  	return err;
>  }
>  
> -static struct mountpoint *lock_mount(struct path *path)
> +/**
> + * lock_mount_mountpoint - lock mount and mountpoint
> + * @path: target path
> + * @tuck: whether we intend to tuck a mount beneath @path
> + *
> + * Follow the mount stack on @path until the top mount is found.
> + *
> + * If we intend to mount on top of @path->mnt acquire the inode_lock()
> + * for the top mount's ->mnt_root to protect against concurrent removal
> + * of our prospective mountpoint from another mount namespace.
> + *
> + * If we intend to tuck beneath the top mount @m acquire the
> + * inode_lock() on @m's mountpoint @mp on @m->mnt_parent. Otherwise we
> + * risk racing with someone who unlinked @mp from another mount
> + * namespace where @m doesn't have a child mount mounted @mp. We don't
> + * care if @m->mnt_root/@path->dentry is removed (as long as
> + * @path->dentry isn't equal to @m->mnt_mountpoint of course).
> + *
> + * Return: Either the target mountpoint on the top mount or the top
> + *         mount's mountpoint.
> + */
> +static struct mountpoint *lock_mount_mountpoint(struct path *path, bool tuck)
>  {
>  	struct vfsmount *mnt = path->mnt;
>  	struct dentry *dentry;
>  	struct mountpoint *mp;
>  
>  	for (;;) {
> -		dentry = path->dentry;
> +		dentry = tuck ? real_mount(mnt)->mnt_mountpoint : path->dentry;
>  		inode_lock(dentry->d_inode);

What happens if another mount --move changes ->mnt_mountpoint of that sucker
just as we fetch it and we end up with inode_lock(dentry->d_inode) of a dentry
that is not pinned by anything?  path->dentry *is* pinned, so it's safe to
work with; this one, OTOH...

> +/**
> + * can_tuck_mount - check that we can tuck a mount
> + * @from: mount to tuck under
> + * @to:   mount under which to tuck
> + *
> + * - Make sure that the mount to tuck under isn't a shared mount so we
> + *   force the kernel to allocate a new peer group id. This simplifies
> + *   the mount trees that can be created and limits propagation events
> + *   in cases where @to, and/or @to->mnt_parent are in the same peer
> + *   group. Something that's a nuisance already today.
> + * - Make sure that @to->dentry is actually the root of a mount under
> + *   which we can tuck another mount.
> + * - Make sure that nothing can be tucked under the caller's current
> + *   root or the rootfs of the namespace.
> + * - Make sure that the caller can unmount the topmost mount ensuring
> + *   that the caller could reveal the underlying mountpoint.
> + *
> + * Return: On success 0, and on error a negative error code is returned.
> + */
> +static int can_tuck_mount(struct path *from, struct path *to)
> +{
> +	struct mount *mnt_from = real_mount(from->mnt),
> +		     *mnt_to = real_mount(to->mnt);
> +
> +	if (!check_mnt(mnt_to))
> +		return -EINVAL;
> +
> +	if (!mnt_has_parent(mnt_to))
> +		return -EINVAL;
> +
> +	if (IS_MNT_SHARED(mnt_from))
> +		return -EINVAL;

What if it's not shared, but gets propagation from whatever we
are going to attach it to?

> +	if (!path_mounted(to))
> +		return -EINVAL;
> +
> +	if (mnt_from == mnt_to)
> +		return -EINVAL;
> +
> +	/*
> +	 * Tucking a mount beneath the rootfs only makes sense when the
> +	 * tuck semantics of pivot_root(".", ".") are used.
> +	 */
> +	if (&mnt_to->mnt == current->fs->root.mnt)
> +		return -EINVAL;
> +	if (mnt_to->mnt_parent == current->nsproxy->mnt_ns->root)
> +		return -EINVAL;
> +
> +	for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
> +		if (p == mnt_to)
> +			return -EINVAL;

Umm...  Can we have !mnt_has_parent(mnt_from), BTW?

> +	/* Ensure the caller could reveal the underlying mount. */
> +	return can_umount(to, 0);

That's really odd.  It duplicates quite a bit of the tests above and
the whole thing is rather hard to follow, especially since you've got
tests done after the can_tuck_mount() call in do_move_mount()...

> +}
> +
> +static int do_move_mount(struct path *old_path, struct path *new_path,
> +			 bool tuck)
>  {
>  	struct mnt_namespace *ns;
>  	struct mount *p;
> @@ -2858,8 +3021,9 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
>  	struct mountpoint *mp, *old_mp;
>  	int err;
>  	bool attached;
> +	mnt_tree_flags_t flags = 0;
>  
> -	mp = lock_mount(new_path);
> +	mp = lock_mount_mountpoint(new_path, tuck);
>  	if (IS_ERR(mp))
>  		return PTR_ERR(mp);
>  
> @@ -2867,9 +3031,20 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
>  	p = real_mount(new_path->mnt);
>  	parent = old->mnt_parent;
>  	attached = mnt_has_parent(old);
> +	if (attached)
> +		flags |= MNT_TREE_MOVE;
>  	old_mp = old->mnt_mp;
>  	ns = old->mnt_ns;
>  
> +	if (tuck) {
> +		err = can_tuck_mount(old_path, new_path);
> +		if (err)
> +			goto out;
> +
> +		p = p->mnt_parent;
> +		flags |= MNT_TREE_TUCK;
> +	}
> +
>  	err = -EINVAL;
>  	/* The mountpoint must be in our namespace. */
>  	if (!check_mnt(p))

Why not check it first?  That'd kill the corresponding
check_mnt(to) in can_tuck_mount() (check_mnt() on parent is
the same as on child).  Confused...

> @@ -2910,8 +3085,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
>  		if (p == old)
>  			goto out;
>  
> -	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> -				   attached);
> +	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp, flags);
>  	if (err)
>  		goto out;
>  
> @@ -2943,7 +3117,7 @@ static int do_move_mount_old(struct path *path, const char *old_name)
>  	if (err)
>  		return err;
>  
> -	err = do_move_mount(&old_path, path);
> +	err = do_move_mount(&old_path, path, false);
>  	path_put(&old_path);
>  	return err;
>  }
> @@ -3807,6 +3981,10 @@ SYSCALL_DEFINE5(move_mount,
>  	if (flags & ~MOVE_MOUNT__MASK)
>  		return -EINVAL;
>  
> +	if ((flags & (MOVE_MOUNT_TUCK | MOVE_MOUNT_SET_GROUP)) ==
> +	    (MOVE_MOUNT_TUCK | MOVE_MOUNT_SET_GROUP))
> +		return -EINVAL;
> +
>  	/* If someone gives a pathname, they aren't permitted to move
>  	 * from an fd that requires unmount as we can't get at the flag
>  	 * to clear it afterwards.
> @@ -3836,7 +4014,8 @@ SYSCALL_DEFINE5(move_mount,
>  	if (flags & MOVE_MOUNT_SET_GROUP)
>  		ret = do_set_group(&from_path, &to_path);
>  	else
> -		ret = do_move_mount(&from_path, &to_path);
> +		ret = do_move_mount(&from_path, &to_path,
> +				    (flags & MOVE_MOUNT_TUCK));

OK, so you want to be able to use that on an anonymous tree, then?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC 5/5] fs: allow to tuck mounts explicitly
  2023-04-21  6:29   ` Al Viro
@ 2023-04-24 17:36     ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2023-04-24 17:36 UTC (permalink / raw)
  To: Al Viro; +Cc: Seth Forshee, linux-fsdevel

On Fri, Apr 21, 2023 at 07:29:58AM +0100, Al Viro wrote:
> On Sat, Mar 18, 2023 at 04:52:01PM +0100, Christian Brauner wrote:
> 
> > +/**
> > + * mnt_tuck_mountpoint - tuck a mount beneath another one
> > + *
> > + * @tucked_mnt: the mount to tuck
> > + * @top_mnt:	the mount to tuck @tucked_mnt under
> > + * @tucked_mp:	the new mountpoint of @top_mnt on @tucked_mnt
> > + *
> > + * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and
> > + * parent @top_mnt->mnt_parent and mount it on top of @tucked_mnt at
> > + * @tucked_mp. And mount @tucked_mnt on the old parent and old
> > + * mountpoint of @top_mnt.
> > + *
> > + * Note that we keep the reference count in tact when we remove @top_mnt
> > + * from its old mountpoint and parent to prevent UAF issues. Once we've
> > + * mounted @top_mnt on @tucked_mnt the reference count gets bumped once
> > + * more. So make sure that we drop it to not leak the mount and
> > + * mountpoint.
> > + */
> > +static void mnt_tuck_mountpoint(struct mount *tucked_mnt, struct mount *top_mnt,
> > +				struct mountpoint *tucked_mp)
> > +{
> > +	struct mount *old_top_parent = top_mnt->mnt_parent;
> > +	struct mountpoint *old_top_mp;
> > +
> > +	old_top_mp = unhash_mnt(top_mnt);
> > +	attach_mnt(top_mnt, tucked_mnt, tucked_mp);
> > +	mnt_set_mountpoint(old_top_parent, old_top_mp, tucked_mnt);
> > +	put_mountpoint(old_top_mp);
> > +	mnt_add_count(old_top_parent, -1);
> > +}
> 
> Umm...  And if something is mounted on top of your tucked_mnt?  Right
> on top of its root, I mean...  BTW, why not make it
> 
> static void mnt_tuck_mountpoint(struct mount *tucked_mnt, struct mount *top_mnt,
>                                 struct mountpoint *tucked_mp)
> {
>         struct mount *old_parent = top_mnt->mnt_parent;
>         struct mountpoint *old_mp = top_mnt->mnt_mp;
> 
>         mnt_set_mountpoint(old_parent, old_mp, tucked_mnt);
>         mnt_change_mountpoint(tucked_mnt, tucked_mp, top_mnt);
> }
> 
> I mean, look at the existing caller of mnt_change_mountpoint() nearby -
> that's precisely how that "tucking" is done right now.  And I'm rather
> afraid that it's vulnerable to the same problem - if the tree we are
> copying has the root of its root mount overmounted by something...

Are you just trying to point out that this would create shadow mounts - aka
discussion in the previous mail - or is there something else you're getting at?

If it's shadow mounts then this could be restricted to non-overmounted source
mounts aka do a lookup_mnt(from_path) and if it doesn't return NULL refuse
creating a tucked mount with @from_path?

> 
> > +static void tuck_mnt(struct mount *tucked_mnt,
> > +		     struct mount *top_mnt,
> > +		     struct mountpoint *tucked_mp)
> > +{
> > +	mnt_tuck_mountpoint(tucked_mnt, top_mnt, tucked_mp);
> > +	__attach_mnt(tucked_mnt, tucked_mnt->mnt_parent);
> > +}
> 
> Not sure it's worth bothering with - only one caller and it might be
> cleaner to expand attach_mnt() there, turning it into set or tuck,
> followed by unconditional __attach...

Yeah, sure.

> 
> > +typedef enum mnt_tree_flags_t {
> > +	MNT_TREE_MOVE	= BIT(0),
> > +	MNT_TREE_TUCK	= BIT(1),

I've renamed all that in v2 btw to MNT_TREE_BENEATH/MOVE_MOUNT_BENEATH
so as to avoid the inevitable mount --f... typo. Also I've been told I
should've urban-dictionaried it. I haven't yet done so. But there might be
additional fun I'm currently unaware of...

> > +} mnt_tree_flags_t;
> 
> What combinations are possible?

Both can be specified together or alone.

> 
> >  static int attach_recursive_mnt(struct mount *source_mnt,
> > -			struct mount *dest_mnt,
> > -			struct mountpoint *dest_mp,
> > -			bool moving)
> > +				struct mount *top_mnt,
> > +				struct mountpoint *dest_mp,
> > +				mnt_tree_flags_t flags)
> >  {
> >  	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
> >  	HLIST_HEAD(tree_list);
> > -	struct mnt_namespace *ns = dest_mnt->mnt_ns;
> > +	struct mnt_namespace *ns = top_mnt->mnt_ns;
> >  	struct mountpoint *smp;
> > -	struct mount *child, *p;
> > +	struct mount *child, *dest_mnt, *p;
> >  	struct hlist_node *n;
> > -	int err;
> > +	int err = 0;
> > +	bool moving = flags & MNT_TREE_MOVE, tuck = flags & MNT_TREE_TUCK;
> >  
> >  	/* Preallocate a mountpoint in case the new mounts need
> >  	 * to be tucked under other mounts.
> > @@ -2244,29 +2305,48 @@ static int attach_recursive_mnt(struct mount *source_mnt,
> >  			goto out;
> >  	}
> >  
> > +	if (tuck)
> > +		dest_mnt = top_mnt->mnt_parent;
> > +	else
> > +		dest_mnt = top_mnt;
> > +
> >  	if (IS_MNT_SHARED(dest_mnt)) {
> >  		err = invent_group_ids(source_mnt, true);
> >  		if (err)
> >  			goto out;
> >  		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
> > -		lock_mount_hash();
> > -		if (err)
> > -			goto out_cleanup_ids;
> > +	}
> > +	lock_mount_hash();
> > +	if (err)
> > +		goto out_cleanup_ids;
> > +
> > +	/* Recheck with lock_mount_hash() held. */
> > +	if (tuck && IS_MNT_LOCKED(top_mnt)) {
> > +		err = -EINVAL;
> > +		goto out_cleanup_ids;
> > +	}
> > +
> > +	if (IS_MNT_SHARED(dest_mnt)) {
> >  		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
> >  			set_mnt_shared(p);
> > -	} else {
> > -		lock_mount_hash();
> >  	}
> > +
> >  	if (moving) {
> >  		unhash_mnt(source_mnt);
> > -		attach_mnt(source_mnt, dest_mnt, dest_mp);
> > +		if (tuck)
> > +			tuck_mnt(source_mnt, top_mnt, smp);
> > +		else
> > +			attach_mnt(source_mnt, dest_mnt, dest_mp);
> >  		touch_mnt_namespace(source_mnt->mnt_ns);
> >  	} else {
> >  		if (source_mnt->mnt_ns) {
> >  			/* move from anon - the caller will destroy */
> >  			list_del_init(&source_mnt->mnt_ns->list);
> >  		}
> > -		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> > +		if (tuck)
> > +			mnt_tuck_mountpoint(source_mnt, top_mnt, smp);
> > +		else
> > +			mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
> >  		commit_tree(source_mnt);
> >  	}
> >  
> > @@ -2306,14 +2386,35 @@ static int attach_recursive_mnt(struct mount *source_mnt,
> >  	return err;
> >  }
> >  
> > -static struct mountpoint *lock_mount(struct path *path)
> > +/**
> > + * lock_mount_mountpoint - lock mount and mountpoint
> > + * @path: target path
> > + * @tuck: whether we intend to tuck a mount beneath @path
> > + *
> > + * Follow the mount stack on @path until the top mount is found.
> > + *
> > + * If we intend to mount on top of @path->mnt acquire the inode_lock()
> > + * for the top mount's ->mnt_root to protect against concurrent removal
> > + * of our prospective mountpoint from another mount namespace.
> > + *
> > + * If we intend to tuck beneath the top mount @m acquire the
> > + * inode_lock() on @m's mountpoint @mp on @m->mnt_parent. Otherwise we
> > + * risk racing with someone who unlinked @mp from another mount
> > + * namespace where @m doesn't have a child mount mounted @mp. We don't
> > + * care if @m->mnt_root/@path->dentry is removed (as long as
> > + * @path->dentry isn't equal to @m->mnt_mountpoint of course).
> > + *
> > + * Return: Either the target mountpoint on the top mount or the top
> > + *         mount's mountpoint.
> > + */
> > +static struct mountpoint *lock_mount_mountpoint(struct path *path, bool tuck)
> >  {
> >  	struct vfsmount *mnt = path->mnt;
> >  	struct dentry *dentry;
> >  	struct mountpoint *mp;
> >  
> >  	for (;;) {
> > -		dentry = path->dentry;
> > +		dentry = tuck ? real_mount(mnt)->mnt_mountpoint : path->dentry;
> >  		inode_lock(dentry->d_inode);
> 
> What happens if another mount --move changes ->mnt_mountpoint of that sucker

Groan, right.

> just as we fetch it and we end up with inode_lock(dentry->d_inode) of a dentry
> that is not pinned by anything?  path->dentry *is* pinned, so it's safe to
> work with; this one, OTOH...

So taking

if (beneath) {
        read_seqlock_excl(&mount_lock);
        dentry = dget(real_mount(mnt)->mnt_mountpoint);
        read_sequnlock_excl(&mount_lock);

and checking under namespace_lock() that mnt->mnt_mountpoint == dentry should
be enough to protect against this, no?

> 
> > +/**
> > + * can_tuck_mount - check that we can tuck a mount
> > + * @from: mount to tuck under
> > + * @to:   mount under which to tuck
> > + *
> > + * - Make sure that the mount to tuck under isn't a shared mount so we
> > + *   force the kernel to allocate a new peer group id. This simplifies
> > + *   the mount trees that can be created and limits propagation events
> > + *   in cases where @to, and/or @to->mnt_parent are in the same peer
> > + *   group. Something that's a nuisance already today.
> > + * - Make sure that @to->dentry is actually the root of a mount under
> > + *   which we can tuck another mount.
> > + * - Make sure that nothing can be tucked under the caller's current
> > + *   root or the rootfs of the namespace.
> > + * - Make sure that the caller can unmount the topmost mount ensuring
> > + *   that the caller could reveal the underlying mountpoint.
> > + *
> > + * Return: On success 0, and on error a negative error code is returned.
> > + */
> > +static int can_tuck_mount(struct path *from, struct path *to)
> > +{
> > +	struct mount *mnt_from = real_mount(from->mnt),
> > +		     *mnt_to = real_mount(to->mnt);
> > +
> > +	if (!check_mnt(mnt_to))
> > +		return -EINVAL;
> > +
> > +	if (!mnt_has_parent(mnt_to))
> > +		return -EINVAL;
> > +
> > +	if (IS_MNT_SHARED(mnt_from))
> > +		return -EINVAL;
> 
> What if it's not shared, but gets propagation from whatever we
> are going to attach it to?

Afaict, that should be ok. In fact, even having mnt_from be a shared
mount should be ok. The reason for restricting it to non-shared mounts
is outlined above in the documentation:

- Make sure that the mount to tuck under isn't a shared mount so we
  force the kernel to allocate a new peer group id. This simplifies
  the mount trees that can be created and limits propagation events
  in cases where @to, and/or @to->mnt_parent are in the same peer
  group. Something that's a nuisance already today.

If mnt_from isn't shared and @mnt_to is shared then @mnt_from will be
turned into a shared mount with a new peer group id. Sure, @mnt_from can
have child mounts that receive propagation from @mnt_to. But all of that
should be ok.

> 
> > +	if (!path_mounted(to))
> > +		return -EINVAL;
> > +
> > +	if (mnt_from == mnt_to)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Tucking a mount beneath the rootfs only makes sense when the
> > +	 * tuck semantics of pivot_root(".", ".") are used.
> > +	 */
> > +	if (&mnt_to->mnt == current->fs->root.mnt)
> > +		return -EINVAL;
> > +	if (mnt_to->mnt_parent == current->nsproxy->mnt_ns->root)
> > +		return -EINVAL;
> > +
> > +	for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
> > +		if (p == mnt_to)
> > +			return -EINVAL;
> 
> Umm...  Can we have !mnt_has_parent(mnt_from), BTW?
> 
> > +	/* Ensure the caller could reveal the underlying mount. */
> > +	return can_umount(to, 0);
> 
> That's really odd.  It duplicates quite a bit of the tests above and
> the whole thing is rather hard to follow, especially since you've got
> tests done after the can_tuck_mount() call in do_move_mount()...

I wanted to have permission checks explicitly and visible open-coded and
I simply thought that the duplication in can_umount() wouldn't matter...
I can get rid of that.

> 
> > +}
> > +
> > +static int do_move_mount(struct path *old_path, struct path *new_path,
> > +			 bool tuck)
> >  {
> >  	struct mnt_namespace *ns;
> >  	struct mount *p;
> > @@ -2858,8 +3021,9 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> >  	struct mountpoint *mp, *old_mp;
> >  	int err;
> >  	bool attached;
> > +	mnt_tree_flags_t flags = 0;
> >  
> > -	mp = lock_mount(new_path);
> > +	mp = lock_mount_mountpoint(new_path, tuck);
> >  	if (IS_ERR(mp))
> >  		return PTR_ERR(mp);
> >  
> > @@ -2867,9 +3031,20 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> >  	p = real_mount(new_path->mnt);
> >  	parent = old->mnt_parent;
> >  	attached = mnt_has_parent(old);
> > +	if (attached)
> > +		flags |= MNT_TREE_MOVE;
> >  	old_mp = old->mnt_mp;
> >  	ns = old->mnt_ns;
> >  
> > +	if (tuck) {
> > +		err = can_tuck_mount(old_path, new_path);
> > +		if (err)
> > +			goto out;
> > +
> > +		p = p->mnt_parent;
> > +		flags |= MNT_TREE_TUCK;
> > +	}
> > +
> >  	err = -EINVAL;
> >  	/* The mountpoint must be in our namespace. */
> >  	if (!check_mnt(p))
> 
> Why not check it first?  That'd kill the corresponding
> check_mnt(to) in can_tuck_mount() (check_mnt() on parent is
> the same as on child).  Confused...

Sure.

> 
> > @@ -2910,8 +3085,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> >  		if (p == old)
> >  			goto out;
> >  
> > -	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
> > -				   attached);
> > +	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp, flags);
> >  	if (err)
> >  		goto out;
> >  
> > @@ -2943,7 +3117,7 @@ static int do_move_mount_old(struct path *path, const char *old_name)
> >  	if (err)
> >  		return err;
> >  
> > -	err = do_move_mount(&old_path, path);
> > +	err = do_move_mount(&old_path, path, false);
> >  	path_put(&old_path);
> >  	return err;
> >  }
> > @@ -3807,6 +3981,10 @@ SYSCALL_DEFINE5(move_mount,
> >  	if (flags & ~MOVE_MOUNT__MASK)
> >  		return -EINVAL;
> >  
> > +	if ((flags & (MOVE_MOUNT_TUCK | MOVE_MOUNT_SET_GROUP)) ==
> > +	    (MOVE_MOUNT_TUCK | MOVE_MOUNT_SET_GROUP))
> > +		return -EINVAL;
> > +
> >  	/* If someone gives a pathname, they aren't permitted to move
> >  	 * from an fd that requires unmount as we can't get at the flag
> >  	 * to clear it afterwards.
> > @@ -3836,7 +4014,8 @@ SYSCALL_DEFINE5(move_mount,
> >  	if (flags & MOVE_MOUNT_SET_GROUP)
> >  		ret = do_set_group(&from_path, &to_path);
> >  	else
> > -		ret = do_move_mount(&from_path, &to_path);
> > +		ret = do_move_mount(&from_path, &to_path,
> > +				    (flags & MOVE_MOUNT_TUCK));
> 
> OK, so you want to be able to use that on an anonymous tree, then?

Yeah, absolutely. Anonymous mounts are an absolute _delight_ and super nice for
service managers and containers that work a lot with mounts and need guarantees
that stuff can't just get overmounted without their knowledge.

Idk if you follow userspace at all but I've been pushing for util-linux
to completely adapt the new mount api as part of idmapped mount support.
That work will land in util-linux 2.39 (should be released in the next weeks).
So there we'll deal with a lot of anonymous mounts going forward.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-04-24 17:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-18 15:51 [PATCH RFC 0/5] fs: allow to tuck mounts explicitly Christian Brauner
2023-03-18 15:51 ` [PATCH RFC 1/5] fs: add path_mounted() Christian Brauner
2023-03-18 15:51 ` [PATCH RFC 2/5] pnode: pass mountpoint directly Christian Brauner
2023-03-18 15:51 ` [PATCH RFC 3/5] fs: fix __lookup_mnt() documentation Christian Brauner
2023-04-21  6:28   ` Al Viro
2023-04-24 16:37     ` Christian Brauner
2023-03-18 15:52 ` [PATCH RFC 4/5] fs: use a for loop when locking a mount Christian Brauner
2023-03-18 15:52 ` [PATCH RFC 5/5] fs: allow to tuck mounts explicitly Christian Brauner
2023-04-21  6:29   ` Al Viro
2023-04-24 17:36     ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).