[RFC] move_mount(2): still breakage around new mount detection

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] move_mount(2): still breakage around new mount detection
@ 2025-04-28  6:30 Al Viro
  2025-04-28  7:03 ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2025-04-28  6:30 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

	do_move_mount() sets MNTNS_PROPAGATING on move from
anon ns; AFAICS, the cause is that we want everything in
the original copy to be treated as new mounts while handling
propagation.  We could go through the subtree and clear
->mnt_ns before propagation, but then we'd need to restore
them on failure, so you mark the source anon namespace
and make the check for new mount look at that as well.

	Fair enough, but...  you clear that flag exactly in
the case when doing that is absolutely pointless - 
        if (err)
                goto out;

        if (is_anon_ns(ns))
                ns->mntns_flags &= ~MNTNS_PROPAGATING;

        /* if the mount is moved, it should no longer be expire
         * automatically */
        list_del_init(&old->mnt_expire);
        if (attached)
                put_mountpoint(old_mp);
out:
        unlock_mount(mp);
        if (!err) {
                if (attached) {
                        mntput_no_expire(parent);
                } else {
                        /* Make sure we notice when we leak mounts. */
                        VFS_WARN_ON_ONCE(!mnt_ns_empty(ns));
                        free_mnt_ns(ns);
                }
        }
And in the beginning you have
        ns = old->mnt_ns;
...
        if (!(attached ? check_mnt(old) : is_anon_ns(ns)))
                goto out;
with check_mnt() and is_anon_ns() being mutually exclusive.

So you clear that flag *only* if err is 0 and attached is false.
Which is to say, in case when you will hit free_mnt_ns(ns).

And you do leave it stuck on failure, which is an observable
bug, AFAICS.

Look:
	mkdir foo
	mkdir blah
	mount --make-private foo
	mount --bind foo foo
	mount --make-shared foo
	fd = open_tree(AT_FDCWD, "foo", OPEN_TREE_CLONE)
	mkdir foo/bar
	mount -t tmpfs none foo/bar
	echo "hedgehog can never be buggered at all" > foo/bar/baz
	move_mount(fd, "", AT_FDCWD, "blah", MOVE_MOUNT_F_EMPTY_PATH)
you get tmpfs showing up at blah/bar, as expected, with
	cat blah/bar/baz
quoting PTerry.  Now, add a *failing* move_mount() attempt right after that
open_tree() - e.g. insert
	move_mount(fd, "", AT_FDCWD, "/dev/null", MOVE_MOUNT_F_EMPTY_PATH)
right before mkdir foo/bar; that move_mount() will fail since source is
a directory and destination isn't.  But... when the second move_mount()
succeeds, there won't be anything mounted on blah/bar.

See the problem?  FWIW, actual reproducer:

#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

char s[] = "hedgehog can never be buggered at all\n";
main()
{
	int fd, fd2;

	mkdir("foo", 0700);
	mkdir("blah", 0700);
	mount(NULL, "foo", NULL, MS_PRIVATE, NULL);
	mount("foo", "foo", NULL, MS_BIND, NULL);
	mount(NULL, "foo", NULL, MS_SHARED, NULL);
        fd = open_tree(AT_FDCWD, "foo", OPEN_TREE_CLONE);
        mkdir("foo/bar", 0700);

#if 1
	printf("this move_mount() will fail\n");
        if (move_mount(fd, "", AT_FDCWD, "/dev/null", MOVE_MOUNT_F_EMPTY_PATH) == 0)
		return -1;
	perror("first move_mount");
#endif

	mount("none", "foo/bar", "tmpfs", 0, NULL);
	fd2 = creat("foo/bar/baz", 0600);
	write(fd2, s, strlen(s));
	close(fd2);
	printf("this move_mount() should succeed... ");
        if (move_mount(fd, "", AT_FDCWD, "blah", MOVE_MOUNT_F_EMPTY_PATH) < 0)
		perror("failed");
	else
		printf("it has\n");
        system("cat blah/bar/baz");
	umount2("foo", MNT_DETACH);
	umount2("blah", 0);
	return 0;
}

Replace #if 1 with #if 0 and compare results...

The minimal fix would be to move that
        if (is_anon_ns(ns))
                ns->mntns_flags &= ~MNTNS_PROPAGATING;

several lines down, right after out:; that's the easiest to backport.
However, I'm rather dubious about the flag thing - at any time we should
have at most one ns so marked, right?  And we only care about it in
propagate_mnt(), where we are serialized under namespace_lock.
So why not simply remember the anon_ns we would've marked and compare
->mnt_ns with it instead of dereferencing and checking for flag?

IOW, what's wrong with the following?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/mount.h b/fs/mount.h
index 7aecf2a60472..ad7173037924 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -7,10 +7,6 @@
 
 extern struct list_head notify_list;
 
-typedef __u32 __bitwise mntns_flags_t;
-
-#define MNTNS_PROPAGATING	((__force mntns_flags_t)(1 << 0))
-
 struct mnt_namespace {
 	struct ns_common	ns;
 	struct mount *	root;
@@ -37,7 +33,6 @@ struct mnt_namespace {
 	struct rb_node		mnt_ns_tree_node; /* node in the mnt_ns_tree */
 	struct list_head	mnt_ns_list; /* entry in the sequential list of mounts namespace */
 	refcount_t		passive; /* number references not pinning @mounts */
-	mntns_flags_t		mntns_flags;
 } __randomize_layout;
 
 struct mnt_pcp {
diff --git a/fs/namespace.c b/fs/namespace.c
index eba4748388b1..191f88efc6ef 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3656,14 +3656,6 @@ static int do_move_mount(struct path *old_path,
 		 */
 		if ((is_anon_ns(p->mnt_ns) && ns == p->mnt_ns))
 			goto out;
-
-		/*
-		 * If this is an anonymous mount tree ensure that mount
-		 * propagation can detect mounts that were just
-		 * propagated to the target mount tree so we don't
-		 * propagate onto them.
-		 */
-		ns->mntns_flags |= MNTNS_PROPAGATING;
 	} else if (is_anon_ns(p->mnt_ns)) {
 		/*
 		 * Don't allow moving an attached mount tree to an
@@ -3714,9 +3706,6 @@ static int do_move_mount(struct path *old_path,
 	if (err)
 		goto out;
 
-	if (is_anon_ns(ns))
-		ns->mntns_flags &= ~MNTNS_PROPAGATING;
-
 	/* if the mount is moved, it should no longer be expire
 	 * automatically */
 	list_del_init(&old->mnt_expire);
diff --git a/fs/pnode.c b/fs/pnode.c
index 7a062a5de10e..3285aeb25f38 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -13,6 +13,18 @@
 #include "internal.h"
 #include "pnode.h"
 
+static struct mnt_namespace *source_anon;
+static inline bool IS_MNT_PROPAGATED(const struct mount *m)
+{
+	/*
+	 * If this is an anonymous mount tree ensure that mount
+	 * propagation can detect mounts that were just
+	 * propagated to the target mount tree so we don't
+	 * propagate onto them.
+	 */
+	return !m->mnt_ns || m->mnt_ns == source_anon;
+}
+
 /* return the next shared peer mount of @p */
 static inline struct mount *next_peer(struct mount *p)
 {
@@ -300,6 +312,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 	last_source = source_mnt;
 	list = tree_list;
 	dest_master = dest_mnt->mnt_master;
+	source_anon = source_mnt->mnt_ns;
+	if (source_anon && !is_anon_ns(source_anon))
+		source_anon = NULL;
 
 	/* all peers of dest_mnt, except dest_mnt itself */
 	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
diff --git a/fs/pnode.h b/fs/pnode.h
index ddafe0d087ca..ba28110c87d2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -12,7 +12,6 @@
 
 #define IS_MNT_SHARED(m) ((m)->mnt.mnt_flags & MNT_SHARED)
 #define IS_MNT_SLAVE(m) ((m)->mnt_master)
-#define IS_MNT_PROPAGATED(m) (!(m)->mnt_ns || ((m)->mnt_ns->mntns_flags & MNTNS_PROPAGATING))
 #define CLEAR_MNT_SHARED(m) ((m)->mnt.mnt_flags &= ~MNT_SHARED)
 #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-28  6:30 [RFC] move_mount(2): still breakage around new mount detection Al Viro
@ 2025-04-28  7:03 ` Al Viro
  2025-04-28  8:50   ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2025-04-28  7:03 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Apr 28, 2025 at 07:30:56AM +0100, Al Viro wrote:
> have at most one ns so marked, right?  And we only care about it in
> propagate_mnt(), where we are serialized under namespace_lock.
> So why not simply remember the anon_ns we would've marked and compare
> ->mnt_ns with it instead of dereferencing and checking for flag?
> 
> IOW, what's wrong with the following?

Hmm... You also have propagation_would_overmount() (from
can_move_mount_beneath()) checking it...  IDGI.

For that predicate to trigger in there you need source
anon ns - you won't see NULL ->mnt_ns there.  So...
mnt_from is the absolute root of anon ns, target is
*not* in that anon ns (either it's in our current namespace,
or in a different anon ns).  IOW, in
        if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
		return -EINVAL;
IS_MNT_PROPAGATED() will be false (mnt_to has unmarked namespace)
and in
        if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
		return -EINVAL;
IS_MNT_PROPAGATED() is true.  So basically, we can drop that
check inf propagation_would_overmount() and take it to
can_move_mount_beneath(), turning the second check into
        if (check_mnt(mnt_from) &&
	    propagation_would_overmount(parent_mnt_to, mnt_from, mp))
		    return -EINVAL;
since mnt_from is either ours or root of anon and the check removed
from propagation_would_overmount() had it return false in "mnt_from
is root of anon" case.

And we obviously need it cleared at the end of propagate_mnt(),
yielding the patch below.  Do you see any other problems?

diff --git a/fs/mount.h b/fs/mount.h
index 7aecf2a60472..ad7173037924 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -7,10 +7,6 @@
 
 extern struct list_head notify_list;
 
-typedef __u32 __bitwise mntns_flags_t;
-
-#define MNTNS_PROPAGATING	((__force mntns_flags_t)(1 << 0))
-
 struct mnt_namespace {
 	struct ns_common	ns;
 	struct mount *	root;
@@ -37,7 +33,6 @@ struct mnt_namespace {
 	struct rb_node		mnt_ns_tree_node; /* node in the mnt_ns_tree */
 	struct list_head	mnt_ns_list; /* entry in the sequential list of mounts namespace */
 	refcount_t		passive; /* number references not pinning @mounts */
-	mntns_flags_t		mntns_flags;
 } __randomize_layout;
 
 struct mnt_pcp {
diff --git a/fs/namespace.c b/fs/namespace.c
index eba4748388b1..3061f1b04d4c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3556,7 +3556,8 @@ static int can_move_mount_beneath(const struct path *from,
 	 * @mnt_from itself. This defeats the whole purpose of mounting
 	 * @mnt_from beneath @mnt_to.
 	 */
-	if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+	if (check_mnt(mnt_from) &&
+	    propagation_would_overmount(parent_mnt_to, mnt_from, mp))
 		return -EINVAL;
 
 	return 0;
@@ -3656,14 +3657,6 @@ static int do_move_mount(struct path *old_path,
 		 */
 		if ((is_anon_ns(p->mnt_ns) && ns == p->mnt_ns))
 			goto out;
-
-		/*
-		 * If this is an anonymous mount tree ensure that mount
-		 * propagation can detect mounts that were just
-		 * propagated to the target mount tree so we don't
-		 * propagate onto them.
-		 */
-		ns->mntns_flags |= MNTNS_PROPAGATING;
 	} else if (is_anon_ns(p->mnt_ns)) {
 		/*
 		 * Don't allow moving an attached mount tree to an
@@ -3714,9 +3707,6 @@ static int do_move_mount(struct path *old_path,
 	if (err)
 		goto out;
 
-	if (is_anon_ns(ns))
-		ns->mntns_flags &= ~MNTNS_PROPAGATING;
-
 	/* if the mount is moved, it should no longer be expire
 	 * automatically */
 	list_del_init(&old->mnt_expire);
diff --git a/fs/pnode.c b/fs/pnode.c
index 7a062a5de10e..26d0482fe017 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -13,6 +13,18 @@
 #include "internal.h"
 #include "pnode.h"
 
+static struct mnt_namespace *source_anon;
+static inline bool IS_MNT_PROPAGATED(const struct mount *m)
+{
+	/*
+	 * If this is an anonymous mount tree ensure that mount
+	 * propagation can detect mounts that were just
+	 * propagated to the target mount tree so we don't
+	 * propagate onto them.
+	 */
+	return !m->mnt_ns || m->mnt_ns == source_anon;
+}
+
 /* return the next shared peer mount of @p */
 static inline struct mount *next_peer(struct mount *p)
 {
@@ -300,6 +312,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 	last_source = source_mnt;
 	list = tree_list;
 	dest_master = dest_mnt->mnt_master;
+	source_anon = source_mnt->mnt_ns;
+	if (source_anon && !is_anon_ns(source_anon))
+		source_anon = NULL;
 
 	/* all peers of dest_mnt, except dest_mnt itself */
 	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
@@ -328,6 +343,7 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 			CLEAR_MNT_MARK(m->mnt_master);
 	}
 	read_sequnlock_excl(&mount_lock);
+	source_anon = NULL;
 	return ret;
 }
 
@@ -380,9 +396,6 @@ bool propagation_would_overmount(const struct mount *from,
 	if (!IS_MNT_SHARED(from))
 		return false;
 
-	if (IS_MNT_PROPAGATED(to))
-		return false;
-
 	if (to->mnt.mnt_root != mp->m_dentry)
 		return false;
 
diff --git a/fs/pnode.h b/fs/pnode.h
index ddafe0d087ca..ba28110c87d2 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -12,7 +12,6 @@
 
 #define IS_MNT_SHARED(m) ((m)->mnt.mnt_flags & MNT_SHARED)
 #define IS_MNT_SLAVE(m) ((m)->mnt_master)
-#define IS_MNT_PROPAGATED(m) (!(m)->mnt_ns || ((m)->mnt_ns->mntns_flags & MNTNS_PROPAGATING))
 #define CLEAR_MNT_SHARED(m) ((m)->mnt.mnt_flags &= ~MNT_SHARED)
 #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-28  7:03 ` Al Viro
@ 2025-04-28  8:50   ` Christian Brauner
  2025-04-28 18:53     ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2025-04-28  8:50 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Apr 28, 2025 at 08:03:53AM +0100, Al Viro wrote:
> On Mon, Apr 28, 2025 at 07:30:56AM +0100, Al Viro wrote:
> > have at most one ns so marked, right?  And we only care about it in
> > propagate_mnt(), where we are serialized under namespace_lock.
> > So why not simply remember the anon_ns we would've marked and compare
> > ->mnt_ns with it instead of dereferencing and checking for flag?
> > 
> > IOW, what's wrong with the following?
> 
> Hmm... You also have propagation_would_overmount() (from
> can_move_mount_beneath()) checking it...  IDGI.
> 
> For that predicate to trigger in there you need source
> anon ns - you won't see NULL ->mnt_ns there.  So...
> mnt_from is the absolute root of anon ns, target is
> *not* in that anon ns (either it's in our current namespace,
> or in a different anon ns).  IOW, in
>         if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
> 		return -EINVAL;
> IS_MNT_PROPAGATED() will be false (mnt_to has unmarked namespace)
> and in
>         if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
> 		return -EINVAL;
> IS_MNT_PROPAGATED() is true.  So basically, we can drop that
> check inf propagation_would_overmount() and take it to
> can_move_mount_beneath(), turning the second check into
>         if (check_mnt(mnt_from) &&
> 	    propagation_would_overmount(parent_mnt_to, mnt_from, mp))
> 		    return -EINVAL;
> since mnt_from is either ours or root of anon and the check removed
> from propagation_would_overmount() had it return false in "mnt_from
> is root of anon" case.
> 
> And we obviously need it cleared at the end of propagate_mnt(),
> yielding the patch below.  Do you see any other problems?
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 7aecf2a60472..ad7173037924 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -7,10 +7,6 @@
>  
>  extern struct list_head notify_list;
>  
> -typedef __u32 __bitwise mntns_flags_t;
> -
> -#define MNTNS_PROPAGATING	((__force mntns_flags_t)(1 << 0))
> -
>  struct mnt_namespace {
>  	struct ns_common	ns;
>  	struct mount *	root;
> @@ -37,7 +33,6 @@ struct mnt_namespace {
>  	struct rb_node		mnt_ns_tree_node; /* node in the mnt_ns_tree */
>  	struct list_head	mnt_ns_list; /* entry in the sequential list of mounts namespace */
>  	refcount_t		passive; /* number references not pinning @mounts */
> -	mntns_flags_t		mntns_flags;
>  } __randomize_layout;
>  
>  struct mnt_pcp {
> diff --git a/fs/namespace.c b/fs/namespace.c
> index eba4748388b1..3061f1b04d4c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3556,7 +3556,8 @@ static int can_move_mount_beneath(const struct path *from,
>  	 * @mnt_from itself. This defeats the whole purpose of mounting
>  	 * @mnt_from beneath @mnt_to.
>  	 */
> -	if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
> +	if (check_mnt(mnt_from) &&
> +	    propagation_would_overmount(parent_mnt_to, mnt_from, mp))
>  		return -EINVAL;
>  
>  	return 0;
> @@ -3656,14 +3657,6 @@ static int do_move_mount(struct path *old_path,
>  		 */
>  		if ((is_anon_ns(p->mnt_ns) && ns == p->mnt_ns))
>  			goto out;
> -
> -		/*
> -		 * If this is an anonymous mount tree ensure that mount
> -		 * propagation can detect mounts that were just
> -		 * propagated to the target mount tree so we don't
> -		 * propagate onto them.
> -		 */
> -		ns->mntns_flags |= MNTNS_PROPAGATING;
>  	} else if (is_anon_ns(p->mnt_ns)) {
>  		/*
>  		 * Don't allow moving an attached mount tree to an
> @@ -3714,9 +3707,6 @@ static int do_move_mount(struct path *old_path,
>  	if (err)
>  		goto out;
>  
> -	if (is_anon_ns(ns))
> -		ns->mntns_flags &= ~MNTNS_PROPAGATING;
> -
>  	/* if the mount is moved, it should no longer be expire
>  	 * automatically */
>  	list_del_init(&old->mnt_expire);
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 7a062a5de10e..26d0482fe017 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -13,6 +13,18 @@
>  #include "internal.h"
>  #include "pnode.h"
>  
> +static struct mnt_namespace *source_anon;
> +static inline bool IS_MNT_PROPAGATED(const struct mount *m)
> +{
> +	/*
> +	 * If this is an anonymous mount tree ensure that mount
> +	 * propagation can detect mounts that were just
> +	 * propagated to the target mount tree so we don't
> +	 * propagate onto them.
> +	 */
> +	return !m->mnt_ns || m->mnt_ns == source_anon;
> +}
> +
>  /* return the next shared peer mount of @p */
>  static inline struct mount *next_peer(struct mount *p)
>  {
> @@ -300,6 +312,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
>  	last_source = source_mnt;
>  	list = tree_list;
>  	dest_master = dest_mnt->mnt_master;
> +	source_anon = source_mnt->mnt_ns;
> +	if (source_anon && !is_anon_ns(source_anon))
> +		source_anon = NULL;
>  
>  	/* all peers of dest_mnt, except dest_mnt itself */
>  	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
> @@ -328,6 +343,7 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
>  			CLEAR_MNT_MARK(m->mnt_master);
>  	}
>  	read_sequnlock_excl(&mount_lock);
> +	source_anon = NULL;

I'm not fond of the global variable. I would generally agree with you if
that were really performance sensitive but this really isn't.

I'll have more uses for the flags member very soon as I will make it
possible to list mounts in anonymous mount namespaces because it
confuses userspace to no end that they can't list detached mount trees.

So anonymous mount namespaces will simply get a mount namespace id just
like any other mount namespace and simply be discerned by a flag.

Thanks for going through this. I appreciate it.

The check_mnt() simplification is good though.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-28  8:50   ` Christian Brauner
@ 2025-04-28 18:53     ` Al Viro
  2025-04-29  4:03       ` Al Viro
                         ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Al Viro @ 2025-04-28 18:53 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Apr 28, 2025 at 10:50:53AM +0200, Christian Brauner wrote:

> I'm not fond of the global variable. I would generally agree with you if
> that were really performance sensitive but this really isn't.

Up to you; propagation calculations *are* hard-serialized (on namespace_sem)
and changing that is too much pain to consider, so I have no problem with
globals in that specific case (note several such in propagate_mnt()
machinery; that was a deliberate decision to avoid shitloads of arguments
that would have to be passed around otherwise), but...

Anyway, minimal fix is to shift clearing the flag, as below.
Longer term I'd rather shift setting and clearing it down into
propagate_mnt() (and dropped check from propagation_would_overmount(),
with corresponding change to can_move_mount_beneath()).

It's really "for the purposes of this mount propagation event treat all
mounts in that namespace as 'new'", so the smaller scope that thing has
the easier it is to reason about...

> I'll have more uses for the flags member very soon as I will make it
> possible to list mounts in anonymous mount namespaces because it
> confuses userspace to no end that they can't list detached mount trees.
> 
> So anonymous mount namespaces will simply get a mount namespace id just
> like any other mount namespace and simply be discerned by a flag.
> 
> Thanks for going through this. I appreciate it.
> 
> The check_mnt() simplification is good though.

FWIW, I've a series of cleanups falling out of audit of struct mount
handling; it's still growing, but I'll post the stable parts for review
tonight or tomorrow...

--------
[PATCH] do_move_mount(): don't leak MNTNS_PROPAGATING on failures

as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.

Fixes: 064fe6e233e8 "mount: handle mount propagation for detached mount trees" v6.15+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index eba4748388b1..8b8348ee5a55 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3714,15 +3714,14 @@ static int do_move_mount(struct path *old_path,
 	if (err)
 		goto out;

-	if (is_anon_ns(ns))
-		ns->mntns_flags &= ~MNTNS_PROPAGATING;
-
 	/* if the mount is moved, it should no longer be expire
 	 * automatically */
 	list_del_init(&old->mnt_expire);
 	if (attached)
 		put_mountpoint(old_mp);
 out:
+	if (is_anon_ns(ns))
+		ns->mntns_flags &= ~MNTNS_PROPAGATING;
 	unlock_mount(mp);
 	if (!err) {
 		if (attached) {

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-28 18:53     ` Al Viro
@ 2025-04-29  4:03       ` Al Viro
  2025-04-29  5:10         ` Al Viro
  2025-04-29  7:56         ` Christian Brauner
  2025-04-29  7:52       ` Christian Brauner
  2025-05-08  5:56       ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Al Viro
  2 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2025-04-29  4:03 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:

> FWIW, I've a series of cleanups falling out of audit of struct mount
> handling; it's still growing, but I'll post the stable parts for review
> tonight or tomorrow...

_Another_ fun one, this time around do_umount().  Take a look
at this chunk in mntput_no_expire():
        lock_mount_hash();
	/*
	 * make sure that if __legitimize_mnt() has not seen us grab
	 * mount_lock, we'll see their refcount increment here.
	 */
	smp_mb();
	mnt_add_count(mnt, -1);
	count = mnt_get_count(mnt);

... and note that we do *not* have such a barrier in do_umount(), between
        lock_mount_hash();
and
                shrink_submounts(mnt);
		retval = -EBUSY;
		if (!propagate_mount_busy(mnt, 2)) {
making it possible to __legitimize_mnt() fail to see lock_mount_hash() in
do_umount(), with do_umount() not noticing the increment of refcount done
by __legitimize_mnt().  It is considerably harder to hit, but I wouldn't
bet on it being impossible...

The sky is not falling (the worst we'll get is a successful sync umount(2)
ending up like a lazy one would; sucks if you see that umount(2) has succeeded
and e.g. pull a USB stick out, of course, but...)

But AFAICS we need a barrier here, to make sure that either legitimize_mnt()
fails seqcount check, grabs mount_lock, sees MNT_SYNC_UMOUNT and quitely
decrements refcount and buggers off or umount(2) sees the increment in
legitimize_mnt() and fails with -EBUSY.

It's really the same situation as with mntput_no_expire(), except that
there the corresponding flag is MNT_DOOMED...

[PATCH] do_umount(): add missing barrier before refcount checks in sync case

do_umount() analogue of the race fixed in 119e1ef80ecf "fix
__legitimize_mnt()/mntput() race".  Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment.  Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit).  Still a bug...

Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index eba4748388b1..d8a344d0a80a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -787,7 +787,7 @@ int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 		return 0;
 	mnt = real_mount(bastard);
 	mnt_add_count(mnt, 1);
-	smp_mb();			// see mntput_no_expire()
+	smp_mb();		// see mntput_no_expire() and do_umount()
 	if (likely(!read_seqretry(&mount_lock, seq)))
 		return 0;
 	lock_mount_hash();
@@ -2044,6 +2044,7 @@ static int do_umount(struct mount *mnt, int flags)
 			umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
+		smp_mb(); // paired with __legitimize_mnt()
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-29  4:03       ` Al Viro
@ 2025-04-29  5:10         ` Al Viro
  2025-04-29  5:27           ` Al Viro
                             ` (2 more replies)
  2025-04-29  7:56         ` Christian Brauner
  1 sibling, 3 replies; 29+ messages in thread
From: Al Viro @ 2025-04-29  5:10 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Tue, Apr 29, 2025 at 05:03:58AM +0100, Al Viro wrote:
> On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> 
> > FWIW, I've a series of cleanups falling out of audit of struct mount
> > handling; it's still growing, but I'll post the stable parts for review
> > tonight or tomorrow...
> 
> _Another_ fun one, this time around do_umount().

... and more, from 620c266f3949 "fhandle: relax open_by_handle_at()
permission checks" - just what is protecting has_locked_children()
use there?  We are, after all, iterating through ->mnt_mounts -
with no locks whatsoever.  Not to mention the fun question regarding
the result (including the bits sensitive to is_mounted()) remaining
valid by the time you get through exportfs_decode_fh_raw() (and no,
you can't hold any namespace_sem over it - no IO allowed under
that, so we'll need to recheck after that point)...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-29  5:10         ` Al Viro
@ 2025-04-29  5:27           ` Al Viro
  2025-04-29  8:21           ` Christian Brauner
  2025-05-05  5:08           ` Al Viro
  2 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2025-04-29  5:27 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Tue, Apr 29, 2025 at 06:10:54AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2025 at 05:03:58AM +0100, Al Viro wrote:
> > On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> > 
> > > FWIW, I've a series of cleanups falling out of audit of struct mount
> > > handling; it's still growing, but I'll post the stable parts for review
> > > tonight or tomorrow...
> > 
> > _Another_ fun one, this time around do_umount().
> 
> ... and more, from 620c266f3949 "fhandle: relax open_by_handle_at()
> permission checks" - just what is protecting has_locked_children()
> use there?  We are, after all, iterating through ->mnt_mounts -
> with no locks whatsoever.  Not to mention the fun question regarding
> the result (including the bits sensitive to is_mounted()) remaining
> valid by the time you get through exportfs_decode_fh_raw() (and no,
> you can't hold any namespace_sem over it - no IO allowed under
> that, so we'll need to recheck after that point)...

Anyway, I've pushed the obvious fixes to viro/vfs.git#fixes for now;
there's more coming (mnt_flags misuses, basically), but that'll have
to wait until I get some sleep...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-28 18:53     ` Al Viro
  2025-04-29  4:03       ` Al Viro
@ 2025-04-29  7:52       ` Christian Brauner
  2025-05-08  5:56       ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Al Viro
  2 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-04-29  7:52 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> On Mon, Apr 28, 2025 at 10:50:53AM +0200, Christian Brauner wrote:
> 
> > I'm not fond of the global variable. I would generally agree with you if
> > that were really performance sensitive but this really isn't.
> 
> Up to you; propagation calculations *are* hard-serialized (on namespace_sem)
> and changing that is too much pain to consider, so I have no problem with
> globals in that specific case (note several such in propagate_mnt()
> machinery; that was a deliberate decision to avoid shitloads of arguments
> that would have to be passed around otherwise), but...

I know, I know. I've seen Ram's code and touched it enough. The globals
have been very confusing at times though especially because they change
across multiple function calls. But anyway.

> 
> Anyway, minimal fix is to shift clearing the flag, as below.
> Longer term I'd rather shift setting and clearing it down into
> propagate_mnt() (and dropped check from propagation_would_overmount(),
> with corresponding change to can_move_mount_beneath()).
> 
> It's really "for the purposes of this mount propagation event treat all
> mounts in that namespace as 'new'", so the smaller scope that thing has
> the easier it is to reason about...

I had a similar thought but my reasoning has been that if it's as close
to the system call interface as possible then it's very very obvious
where this happens and how it's cleared. IOW, the bigger scope felt
actually easier in this case.

> 
> > I'll have more uses for the flags member very soon as I will make it
> > possible to list mounts in anonymous mount namespaces because it
> > confuses userspace to no end that they can't list detached mount trees.
> > 
> > So anonymous mount namespaces will simply get a mount namespace id just
> > like any other mount namespace and simply be discerned by a flag.
> > 
> > Thanks for going through this. I appreciate it.
> > 
> > The check_mnt() simplification is good though.
> 
> FWIW, I've a series of cleanups falling out of audit of struct mount

Seems good.

> handling; it's still growing, but I'll post the stable parts for review
> tonight or tomorrow...
> 
> --------
> [PATCH] do_move_mount(): don't leak MNTNS_PROPAGATING on failures
> 
> as it is, a failed move_mount(2) from anon namespace breaks
> all further propagation into that namespace, including normal
> mounts in non-anon namespaces that would otherwise propagate
> there.
> 
> Fixes: 064fe6e233e8 "mount: handle mount propagation for detached mount trees" v6.15+
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> index eba4748388b1..8b8348ee5a55 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3714,15 +3714,14 @@ static int do_move_mount(struct path *old_path,
>  	if (err)
>  		goto out;
>  
> -	if (is_anon_ns(ns))
> -		ns->mntns_flags &= ~MNTNS_PROPAGATING;
> -
>  	/* if the mount is moved, it should no longer be expire
>  	 * automatically */
>  	list_del_init(&old->mnt_expire);
>  	if (attached)
>  		put_mountpoint(old_mp);
>  out:
> +	if (is_anon_ns(ns))
> +		ns->mntns_flags &= ~MNTNS_PROPAGATING;

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-29  4:03       ` Al Viro
  2025-04-29  5:10         ` Al Viro
@ 2025-04-29  7:56         ` Christian Brauner
  2025-04-29 12:27           ` Al Viro
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2025-04-29  7:56 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Tue, Apr 29, 2025 at 05:03:58AM +0100, Al Viro wrote:
> On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> 
> > FWIW, I've a series of cleanups falling out of audit of struct mount
> > handling; it's still growing, but I'll post the stable parts for review
> > tonight or tomorrow...
> 
> _Another_ fun one, this time around do_umount().  Take a look
> at this chunk in mntput_no_expire():
>         lock_mount_hash();
> 	/*
> 	 * make sure that if __legitimize_mnt() has not seen us grab
> 	 * mount_lock, we'll see their refcount increment here.
> 	 */
> 	smp_mb();
> 	mnt_add_count(mnt, -1);
> 	count = mnt_get_count(mnt);
> 
> ... and note that we do *not* have such a barrier in do_umount(), between
>         lock_mount_hash();
> and
>                 shrink_submounts(mnt);
> 		retval = -EBUSY;
> 		if (!propagate_mount_busy(mnt, 2)) {
> making it possible to __legitimize_mnt() fail to see lock_mount_hash() in
> do_umount(), with do_umount() not noticing the increment of refcount done
> by __legitimize_mnt().  It is considerably harder to hit, but I wouldn't
> bet on it being impossible...

Most of these issues are almost impossible to hit in real workloads or
it's so rare that it doesn't matter. This one in particular seems like a
really uninteresting one. I mean, yes we should probably add that
barrier there but also nobody would care if we didn't.

> 
> The sky is not falling (the worst we'll get is a successful sync umount(2)
> ending up like a lazy one would; sucks if you see that umount(2) has succeeded
> and e.g. pull a USB stick out, of course, but...)
> 
> But AFAICS we need a barrier here, to make sure that either legitimize_mnt()
> fails seqcount check, grabs mount_lock, sees MNT_SYNC_UMOUNT and quitely
> decrements refcount and buggers off or umount(2) sees the increment in
> legitimize_mnt() and fails with -EBUSY.
> 
> It's really the same situation as with mntput_no_expire(), except that
> there the corresponding flag is MNT_DOOMED...
> 
> [PATCH] do_umount(): add missing barrier before refcount checks in sync case
>     
> do_umount() analogue of the race fixed in 119e1ef80ecf "fix
> __legitimize_mnt()/mntput() race".  Here we want to make sure that
> if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
> notice their refcount increment.  Harder to hit than mntput_no_expire()
> one, fortunately, and consequences are milder (sync umount acting
> like umount -l on a rare race with RCU pathwalk hitting at just the
> wrong time instead of use-after-free galore mntput_no_expire()
> counterpart used to be hit).  Still a bug...
>  
> Fixes: 48a066e72d97 ("RCU'd vsfmounts")

That's an ancient bug at that...

> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

> diff --git a/fs/namespace.c b/fs/namespace.c
> index eba4748388b1..d8a344d0a80a 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -787,7 +787,7 @@ int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
>  		return 0;
>  	mnt = real_mount(bastard);
>  	mnt_add_count(mnt, 1);
> -	smp_mb();			// see mntput_no_expire()
> +	smp_mb();		// see mntput_no_expire() and do_umount()
>  	if (likely(!read_seqretry(&mount_lock, seq)))
>  		return 0;
>  	lock_mount_hash();
> @@ -2044,6 +2044,7 @@ static int do_umount(struct mount *mnt, int flags)
>  			umount_tree(mnt, UMOUNT_PROPAGATE);
>  		retval = 0;
>  	} else {
> +		smp_mb(); // paired with __legitimize_mnt()
>  		shrink_submounts(mnt);
>  		retval = -EBUSY;
>  		if (!propagate_mount_busy(mnt, 2)) {

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-29  5:10         ` Al Viro
  2025-04-29  5:27           ` Al Viro
@ 2025-04-29  8:21           ` Christian Brauner
  2025-05-05  5:08           ` Al Viro
  2 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-04-29  8:21 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Tue, Apr 29, 2025 at 06:10:54AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2025 at 05:03:58AM +0100, Al Viro wrote:
> > On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> > 
> > > FWIW, I've a series of cleanups falling out of audit of struct mount
> > > handling; it's still growing, but I'll post the stable parts for review
> > > tonight or tomorrow...
> > 
> > _Another_ fun one, this time around do_umount().
> 
> ... and more, from 620c266f3949 "fhandle: relax open_by_handle_at()
> permission checks" - just what is protecting has_locked_children()
> use there?  We are, after all, iterating through ->mnt_mounts -
> with no locks whatsoever.  Not to mention the fun question regarding

Mea culpa. That was clearly an obvious accident.
It's a not very subtle oversight.

> the result (including the bits sensitive to is_mounted()) remaining
> valid by the time you get through exportfs_decode_fh_raw() (and no,
> you can't hold any namespace_sem over it - no IO allowed under

I know.

> that, so we'll need to recheck after that point)...

Why bother rechecking? It's literally just to prove that the caller
could theoretically get an unobstructed view by shedding mounts if they
wanted to. This is an approximation _at best_.

The caller obviously cannot switch mount namespace so there's no risk of
suddenly a bunch of locked mounts popping up. The only way we could
suddenly have a locked mount appear is if someone propagated a mount
tree into the callers user+mount namespace from a different user+mount
namespace and some mounts were about to expire. And it even has to be an
actual tree and not just a single mount.

So honestly, the one check is just fine. It's imperfect as it is and
it's for the special users that really need this. It's blocked in nearly
all container runtimes by default anyway because it can be abused to
open any crap on the system. This is literally fringe but needed
functionality.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-29  7:56         ` Christian Brauner
@ 2025-04-29 12:27           ` Al Viro
  0 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2025-04-29 12:27 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Tue, Apr 29, 2025 at 09:56:14AM +0200, Christian Brauner wrote:
> On Tue, Apr 29, 2025 at 05:03:58AM +0100, Al Viro wrote:

> > by __legitimize_mnt().  It is considerably harder to hit, but I wouldn't
> > bet on it being impossible...
> 
> Most of these issues are almost impossible to hit in real workloads or
> it's so rare that it doesn't matter. This one in particular seems like a
> really uninteresting one. I mean, yes we should probably add that
> barrier there but also nobody would care if we didn't.

Rule of the thumb: whenever you see a barrier, the need to understand
everything that's going on with the function goes up a _lot_.
Especially when one of those suckers is in a seriously hot codepath
(and __legitimize_mnt() qualifies).  I certainly agree that this one is
not critical - said so right in commit message, but one thing we need in
the area is to review the locking; not only the majority of comments are
badly obsolete (anything that mentions vfsmount_lock, for starters), but
we also have fun questions about the mount_lock users - which ones need
to touch seqcount component (and full lock_mount_hash()) and which ony
need the spinlock side.  Same for RCU delays - witness the thread that
got me started on reviewing the locking/refcounting/lifetimes in there.

Another piece of fun: there are grades of struct mount accessibility;
to a very limited extent it becomes reachable as soon as we put it into
the per-superblock list, even before the damn thing is returned to caller
of clone_mnt().  There's only one thing that can get to them that way -
sb_prepare_remount_readonly().  But that includes modifications of
->mnt_flags - setting and clearing MNT_WRITE_HOLD.  Which makes
CLEAR_MNT_SHARED(mnt) in clone_mnt() (as well as set_mnt_shared(mnt)
in CL_MAKE_SHARED case) racy, and not harmlessly so.  This one is
trivial to fix (set ->mnt_flags before attaching to superblock and
inserting into the list), but there are several places in similar spirit
where reordering doesn't solve the problem (e.g. fs/overlayfs/super.c
playing with the flags on clone_private_mnt() results).  I *really*
don't want to export mount_lock and only slightly less so - a generic
helper for adding to/removing from flags, so sane API needed in
this case is an interesting question...

I'm putting together documentation on struct mount, will post it
once it's reasonably complete.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-04-29  5:10         ` Al Viro
  2025-04-29  5:27           ` Al Viro
  2025-04-29  8:21           ` Christian Brauner
@ 2025-05-05  5:08           ` Al Viro
  2025-05-05 14:20             ` Christian Brauner
  2 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2025-05-05  5:08 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Tue, Apr 29, 2025 at 06:10:54AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2025 at 05:03:58AM +0100, Al Viro wrote:
> > On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> > 
> > > FWIW, I've a series of cleanups falling out of audit of struct mount
> > > handling; it's still growing, but I'll post the stable parts for review
> > > tonight or tomorrow...
> > 
> > _Another_ fun one, this time around do_umount().
> 
> ... and more, from 620c266f3949 "fhandle: relax open_by_handle_at()
> permission checks" - just what is protecting has_locked_children()
> use there?  We are, after all, iterating through ->mnt_mounts -
> with no locks whatsoever.  Not to mention the fun question regarding
> the result (including the bits sensitive to is_mounted()) remaining
> valid by the time you get through exportfs_decode_fh_raw() (and no,
> you can't hold any namespace_sem over it - no IO allowed under
> that, so we'll need to recheck after that point)...

FWIW, looking at do_move_mounts(): some of the tests look odd.

        if (is_anon_ns(ns)) {
                /*
                 * Ending up with two files referring to the root of the
                 * same anonymous mount namespace would cause an error
                 * as this would mean trying to move the same mount
                 * twice into the mount tree which would be rejected
                 * later. But be explicit about it right here.
                 */
                if ((is_anon_ns(p->mnt_ns) && ns == p->mnt_ns))
                        goto out;

Why are we checking is_anon_ns(p->mnt_ns) here?  If ns is equal
to p->mnt_ns, we have just verified that is_anon_ns() is true for it;
if it is not, there's no point bothering with is_anon_ns() since
conjunction is false anyway.  And it's not as if that comparison
had been unsafe to calculate if is_anon_ns(p->mnt_ns) is false...

Looks really really confusing - is there a typo somewhere?  Why
not simply
        if (is_anon_ns(ns)) {
		/*
		 * Can't move the root of namespace into the same
		 * namespace.  Reject that early.
		 */
		if (ns == p->mnt)
			goto out;
What am I missing here?

Another odd thing: what's the point rejecting move of /foo/bar/baz/ beneath
/foo?  What's wrong with doing that?  _IF_ that's really intended, it needs
at least a comment spelling that out.  TBH, for quite a while I'd been
staring at that wondering WTF do you duplicate the common check for target
not being a descendent of source, but with different error value.  Until
spotting that the check is about _source_ being a descendent of target
rather than the other way round...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC] move_mount(2): still breakage around new mount detection
  2025-05-05  5:08           ` Al Viro
@ 2025-05-05 14:20             ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-05-05 14:20 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

>         if (is_anon_ns(ns)) {
> 		/*
> 		 * Can't move the root of namespace into the same
> 		 * namespace.  Reject that early.
> 		 */
> 		if (ns == p->mnt)
> 			goto out;
> What am I missing here?

Seems good to me. Thanks.

> Another odd thing: what's the point rejecting move of /foo/bar/baz/ beneath
> /foo?  What's wrong with doing that?  _IF_ that's really intended, it needs
> at least a comment spelling that out.  TBH, for quite a while I'd been
> staring at that wondering WTF do you duplicate the common check for target
> not being a descendent of source, but with different error value.  Until
> spotting that the check is about _source_ being a descendent of target
> rather than the other way round...

You mean:

          for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
                  if (p == mnt_to)
                          return -EINVAL;

I wasn't sure whether the resulting mount tree was sane under all
conditions for the service configuration update use-case. Feel free to
remove it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection)
  2025-04-28 18:53     ` Al Viro
  2025-04-29  4:03       ` Al Viro
  2025-04-29  7:52       ` Christian Brauner
@ 2025-05-08  5:56       ` Al Viro
  2025-05-08 19:59         ` Al Viro
  2025-05-09 11:06         ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Christian Brauner
  2 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2025-05-08  5:56 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:

> Up to you; propagation calculations *are* hard-serialized (on namespace_sem)
> and changing that is too much pain to consider, so I have no problem with
> globals in that specific case (note several such in propagate_mnt()
> machinery; that was a deliberate decision to avoid shitloads of arguments
> that would have to be passed around otherwise), but...

OK, now I finally understand what felt fishy about either solution.

Back when the checks had been IS_MNT_NEW, we were guaranteed that
anything on the the slave lists of new mount would be new as well.

No amount of copy_tree() could change ->mnt_master of existing mounts,
so anything predating the beginning of propagate_mnt() would still
have ->mnt_master pointing to old mounts - no operations other than
copy_tree() had been done since we have taken namespace_sem.

That's where your IS_MNT_PROPAGATED breaks.  It mixes "nothing useful
to be found in this direction" with "don't mount anything on this one".
And these are not the same now.

Suppose you have mounts A, B and C, A propagating to B, B - to C.

If you made B private, propagation would go directly from A to C,
and mount on A/foo would result in a copy on C/foo.

Suppose you've done open_tree B with OPEN_TREE_CLONE before making
B private.  After open_tree your propagation graph is
	A -> [B <-> B'] -> C
with new mount B' being in your anon_ns.  Making B private leaves you
with
	A -> B' -> C
and mount on A/foo still propagates to C/foo, along with foo in your
anon_ns.

So far, so good, but what happens if you move_mount the root of your
anon_ns to A/foo?  Sure, you want to suppress copying it to foo in B',
but you will end suppressing the copy on C/foo as well.  propagation_next()
will not visit C at all - when it reaches B', it'll see IS_MNT_PROPAGATED
and refuse to look what B' might be propagating to.

IOW, IS_MNT_PROPAGATED in propagate_one() is fine, but in propagation_next(),
skip_propagation_subtree() and next_group() we really need IS_MNT_NEW.
And the check in propagate_one() should be

	/* skip ones added by this propagate_mnt() */
	if (IS_MNT_NEW(m))
                return 0;
        /* skip if mountpoint is outside of subtree seen in m */
	if (!is_subdir(dest_mp->m_dentry, m->mnt.mnt_root))
		return 0;
	/* skip if m is in the anon_ns we are emptying */
	if (m->mnt_ns->mntns_flags & MNTNS_PROPAGATING)
		return 0;
That part of check is really about the validity of this particular
location, not the cutoff for further propagation.  IS_MNT_NEW(),
OTOH, is a hard cutoff.

FWIW, I would take the last remaining IS_MNT_PROPAGATED() (in
propagation_would_overmount()) as discussed in this thread -
with
-       if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+       if (check_mnt(mnt_from) &&
+           propagation_would_overmount(parent_mnt_to, mnt_from, mp))
in can_move_mount_beneath() and lose the one in propagation_would_overmount()

I'll cook something along those lines (on top of "do_move_mount(): don't
leak MNTNS_PROPAGATING on failures") and if it survives overnight tests
post it tomorrow^Win the morning...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection)
  2025-05-08  5:56       ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Al Viro
@ 2025-05-08 19:59         ` Al Viro
  2025-05-08 20:00           ` [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock Al Viro
                             ` (3 more replies)
  2025-05-09 11:06         ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Christian Brauner
  1 sibling, 4 replies; 29+ messages in thread
From: Al Viro @ 2025-05-08 19:59 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 08, 2025 at 06:56:10AM +0100, Al Viro wrote:

> I'll cook something along those lines (on top of "do_move_mount(): don't
> leak MNTNS_PROPAGATING on failures") and if it survives overnight tests
> post it tomorrow^Win the morning...

Got it.  Pushed out to viro/vfs.git#fixes; just 4 commits there at the moment -

1) __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
2) do_umount(): add missing barrier before refcount checks in sync case
3) do_move_mount(): don't leak MNTNS_PROPAGATING on failures
4) fix IS_MNT_PROPAGATING uses

I've got reliable reproducers for #3 and #4, but they'll probably need to
be massaged for kselftests - any help with that would be very welcome.

Patches and those two reproducers in followups; please review and test.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
  2025-05-08 19:59         ` Al Viro
@ 2025-05-08 20:00           ` Al Viro
  2025-05-09 11:02             ` Christian Brauner
  2025-05-08 20:01           ` [PATCH 2/4] do_umount(): add missing barrier before refcount checks in sync case Al Viro
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2025-05-08 20:00 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().

Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 98a5cd756e9a..eba4748388b1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -790,12 +790,8 @@ int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 	smp_mb();			// see mntput_no_expire()
 	if (likely(!read_seqretry(&mount_lock, seq)))
 		return 0;
-	if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
-		mnt_add_count(mnt, -1);
-		return 1;
-	}
 	lock_mount_hash();
-	if (unlikely(bastard->mnt_flags & MNT_DOOMED)) {
+	if (unlikely(bastard->mnt_flags & (MNT_SYNC_UMOUNT | MNT_DOOMED))) {
 		mnt_add_count(mnt, -1);
 		unlock_mount_hash();
 		return 1;
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/4] do_umount(): add missing barrier before refcount checks in sync case
  2025-05-08 19:59         ` Al Viro
  2025-05-08 20:00           ` [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock Al Viro
@ 2025-05-08 20:01           ` Al Viro
  2025-05-09 11:02             ` Christian Brauner
  2025-05-08 20:02           ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Al Viro
  2025-05-08 20:02           ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Al Viro
  3 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2025-05-08 20:01 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

do_umount() analogue of the race fixed in 119e1ef80ecf "fix
__legitimize_mnt()/mntput() race".  Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment.  Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit).  Still a bug...

Fixes: 48a066e72d97 ("RCU'd vfsmounts")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index eba4748388b1..d8a344d0a80a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -787,7 +787,7 @@ int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 		return 0;
 	mnt = real_mount(bastard);
 	mnt_add_count(mnt, 1);
-	smp_mb();			// see mntput_no_expire()
+	smp_mb();		// see mntput_no_expire() and do_umount()
 	if (likely(!read_seqretry(&mount_lock, seq)))
 		return 0;
 	lock_mount_hash();
@@ -2044,6 +2044,7 @@ static int do_umount(struct mount *mnt, int flags)
 			umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
+		smp_mb(); // paired with __legitimize_mnt()
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  2025-05-08 19:59         ` Al Viro
  2025-05-08 20:00           ` [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock Al Viro
  2025-05-08 20:01           ` [PATCH 2/4] do_umount(): add missing barrier before refcount checks in sync case Al Viro
@ 2025-05-08 20:02           ` Al Viro
  2025-05-08 20:03             ` reproducer for "do_move_mount(): don't leak MNTNS_PROPAGATING on failures" Al Viro
                               ` (2 more replies)
  2025-05-08 20:02           ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Al Viro
  3 siblings, 3 replies; 29+ messages in thread
From: Al Viro @ 2025-05-08 20:02 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d8a344d0a80a..04a9bb9f31fa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3715,15 +3715,14 @@ static int do_move_mount(struct path *old_path,
 	if (err)
 		goto out;
 
-	if (is_anon_ns(ns))
-		ns->mntns_flags &= ~MNTNS_PROPAGATING;
-
 	/* if the mount is moved, it should no longer be expire
 	 * automatically */
 	list_del_init(&old->mnt_expire);
 	if (attached)
 		put_mountpoint(old_mp);
 out:
+	if (is_anon_ns(ns))
+		ns->mntns_flags &= ~MNTNS_PROPAGATING;
 	unlock_mount(mp);
 	if (!err) {
 		if (attached) {
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/4] fix IS_MNT_PROPAGATING uses
  2025-05-08 19:59         ` Al Viro
                             ` (2 preceding siblings ...)
  2025-05-08 20:02           ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Al Viro
@ 2025-05-08 20:02           ` Al Viro
  2025-05-08 20:04             ` reproducer for "fix IS_MNT_PROPAGATING uses" Al Viro
  2025-05-09 11:01             ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Christian Brauner
  3 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2025-05-08 20:02 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself.  What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.

When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).

Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
	A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
	A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
	A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'.  Trying to move_mount() our detached tree on subdirectory
in A should have
	* moved B' on that subdirectory in A
	* skipped the corresponding subdirectory in B' itself
	* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.

IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount().  The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c |  3 ++-
 fs/pnode.c     | 17 +++++++++--------
 fs/pnode.h     |  2 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 04a9bb9f31fa..1b466c54a357 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3557,7 +3557,8 @@ static int can_move_mount_beneath(const struct path *from,
 	 * @mnt_from itself. This defeats the whole purpose of mounting
 	 * @mnt_from beneath @mnt_to.
 	 */
-	if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+	if (check_mnt(mnt_from) &&
+	    propagation_would_overmount(parent_mnt_to, mnt_from, mp))
 		return -EINVAL;
 
 	return 0;
diff --git a/fs/pnode.c b/fs/pnode.c
index 7a062a5de10e..fb77427df39e 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -150,7 +150,7 @@ static struct mount *propagation_next(struct mount *m,
 					 struct mount *origin)
 {
 	/* are there any slaves of this mount? */
-	if (!IS_MNT_PROPAGATED(m) && !list_empty(&m->mnt_slave_list))
+	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
 		return first_slave(m);
 
 	while (1) {
@@ -174,7 +174,7 @@ static struct mount *skip_propagation_subtree(struct mount *m,
 	 * Advance m such that propagation_next will not return
 	 * the slaves of m.
 	 */
-	if (!IS_MNT_PROPAGATED(m) && !list_empty(&m->mnt_slave_list))
+	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
 		m = last_slave(m);
 
 	return m;
@@ -185,7 +185,7 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 	while (1) {
 		while (1) {
 			struct mount *next;
-			if (!IS_MNT_PROPAGATED(m) && !list_empty(&m->mnt_slave_list))
+			if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
 				return first_slave(m);
 			next = next_peer(m);
 			if (m->mnt_group_id == origin->mnt_group_id) {
@@ -226,11 +226,15 @@ static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
 	struct mount *child;
 	int type;
 	/* skip ones added by this propagate_mnt() */
-	if (IS_MNT_PROPAGATED(m))
+	if (IS_MNT_NEW(m))
 		return 0;
-	/* skip if mountpoint isn't covered by it */
+	/* skip if mountpoint isn't visible in m */
 	if (!is_subdir(dest_mp->m_dentry, m->mnt.mnt_root))
 		return 0;
+	/* skip if m is in the anon_ns we are emptying */
+	if (m->mnt_ns->mntns_flags & MNTNS_PROPAGATING)
+		return 0;
+
 	if (peers(m, last_dest)) {
 		type = CL_MAKE_SHARED;
 	} else {
@@ -380,9 +384,6 @@ bool propagation_would_overmount(const struct mount *from,
 	if (!IS_MNT_SHARED(from))
 		return false;
 
-	if (IS_MNT_PROPAGATED(to))
-		return false;
-
 	if (to->mnt.mnt_root != mp->m_dentry)
 		return false;
 
diff --git a/fs/pnode.h b/fs/pnode.h
index ddafe0d087ca..34b6247af01d 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -12,7 +12,7 @@
 
 #define IS_MNT_SHARED(m) ((m)->mnt.mnt_flags & MNT_SHARED)
 #define IS_MNT_SLAVE(m) ((m)->mnt_master)
-#define IS_MNT_PROPAGATED(m) (!(m)->mnt_ns || ((m)->mnt_ns->mntns_flags & MNTNS_PROPAGATING))
+#define IS_MNT_NEW(m) (!(m)->mnt_ns)
 #define CLEAR_MNT_SHARED(m) ((m)->mnt.mnt_flags &= ~MNT_SHARED)
 #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* reproducer for "do_move_mount(): don't leak MNTNS_PROPAGATING on failures"
  2025-05-08 20:02           ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Al Viro
@ 2025-05-08 20:03             ` Al Viro
  2025-05-09 11:02             ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Christian Brauner
  2025-05-13 11:03             ` Lai, Yi
  2 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2025-05-08 20:03 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

#include <string.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <stdio.h>

static int tmpfs(const char *name)
{
	return mount("none", name, "tmpfs", 0, NULL);
}
static int change(const char *name, int how)
{
	return mount(NULL, name, NULL, how, NULL);
}
static int bind(const char *from, const char *to)
{
	return mount(from, to, NULL, MS_BIND, NULL);
}
static _Bool exists(const char *name)
{
	return access(name, F_OK) != -1;
}

void playground(void)
{
	mkdir("/tmp/foo", 0700);
	tmpfs("/tmp/foo");
	change("/tmp/foo", MS_PRIVATE);
	chdir("/tmp/foo");
}

void cleanup(int fd)
{
	close(fd);
	chdir("/tmp");
	umount2("/tmp/foo", MNT_DETACH);
	rmdir("/tmp/foo");
}

main()
{
	playground();

	mkdir("A", 0700);
        mkdir("A/subdir", 0700);
	mkdir("B", 0700);
	bind("A", "A");
	change("A", MS_SHARED);

        int fd = open_tree(AT_FDCWD, "A", OPEN_TREE_CLONE);

	// this move_mount should fail (directory on top of non-directory)
        if (move_mount(fd, "", AT_FDCWD, "/dev/null", MOVE_MOUNT_F_EMPTY_PATH) == 0) {
		printf("unexpected success of first move_mount()\n");
		cleanup(fd);
		return -1;
	}
	// this should propagate into detached tree
	tmpfs("A/subdir");
	mkdir("A/subdir/foo", 0700);
	// move detached tree in, so we could check it
	if (move_mount(fd, "", AT_FDCWD, "B", MOVE_MOUNT_F_EMPTY_PATH) != 0) {
		printf("unexpected failure of the second move_mount()\n");
		cleanup(fd);
	}
	if (!exists("B/subdir/foo")) {
		printf("failed to propagate into detached tree\n");
		cleanup(fd);
		return -1;
	}

	printf("success\n");
	cleanup(fd);
	return 0;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* reproducer for "fix IS_MNT_PROPAGATING uses"
  2025-05-08 20:02           ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Al Viro
@ 2025-05-08 20:04             ` Al Viro
  2025-05-09 11:01             ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Christian Brauner
  1 sibling, 0 replies; 29+ messages in thread
From: Al Viro @ 2025-05-08 20:04 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds

#include <string.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <stdio.h>

static int tmpfs(const char *name)
{
	return mount("none", name, "tmpfs", 0, NULL);
}
static int change(const char *name, int how)
{
	return mount(NULL, name, NULL, how, NULL);
}
static int bind(const char *from, const char *to)
{
	return mount(from, to, NULL, MS_BIND, NULL);
}
static _Bool exists(const char *name)
{
	return access(name, F_OK) != -1;
}

int playground(void)
{
	mkdir("/tmp/foo", 0700);
	tmpfs("/tmp/foo");
	change("/tmp/foo", MS_PRIVATE);
	chdir("/tmp/foo");
}

void cleanup(int fd)
{
	close(fd);
	chdir("/tmp");
	umount2("/tmp/foo", MNT_DETACH);
	rmdir("/tmp/foo");
}

int setup(void)
{
	int fd;

	playground();

	mkdir("A", 0700);
	mkdir("B", 0700);
	mkdir("C", 0700);

	tmpfs("A");
	change("A", MS_SHARED);
	bind("A", "B");
	change("B", MS_SLAVE);
	change("B", MS_SHARED);
	bind("B", "C");
	change("C", MS_SLAVE);

	mkdir("A/foo", 0700);
	mkdir("A/subdir", 0700);
	fd = open_tree(AT_FDCWD, "B", OPEN_TREE_CLONE);
	change("B", MS_PRIVATE);
	return fd;
}

main()
{
	int fd;

	fd = setup();
	bind("B", "A/subdir");
	printf("bind propagated to C: %s\n",
		exists("C/subdir/foo") ? "yes" : "no");
	cleanup(fd);

	fd = setup();
	move_mount(AT_FDCWD, "B", AT_FDCWD, "A/subdir", MOVE_MOUNT_F_EMPTY_PATH);
	printf("move_mount propagated to C: %s\n",
		exists("C/subdir/foo") ? "yes" : "no");
	cleanup(fd);

	fd = setup();
	move_mount(fd, "", AT_FDCWD, "A/subdir", MOVE_MOUNT_F_EMPTY_PATH);
	printf("move_mount (fd) propagated to C: %s\n",
		exists("C/subdir/foo") ? "yes" : "no");
	cleanup(fd);

	return 0;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] fix IS_MNT_PROPAGATING uses
  2025-05-08 20:02           ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Al Viro
  2025-05-08 20:04             ` reproducer for "fix IS_MNT_PROPAGATING uses" Al Viro
@ 2025-05-09 11:01             ` Christian Brauner
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-05-09 11:01 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 08, 2025 at 09:02:42PM +0100, Al Viro wrote:
> propagate_mnt() does not attach anything to mounts created during
> propagate_mnt() itself.  What's more, anything on ->mnt_slave_list
> of such new mount must also be new, so we don't need to even look
> there.
> 
> When move_mount() had been introduced, we've got an additional
> class of mounts to skip - if we are moving from anon namespace,
> we do not want to propagate to mounts we are moving (i.e. all
> mounts in that anon namespace).
> 
> Unfortunately, the part about "everything on their ->mnt_slave_list
> will also be ignorable" is not true - if we have propagation graph
> 	A -> B -> C
> and do OPEN_TREE_CLONE open_tree() of B, we get
> 	A -> [B <-> B'] -> C
> as propagation graph, where B' is a clone of B in our detached tree.
> Making B private will result in
> 	A -> B' -> C
> C still gets propagation from A, as it would after making B private
> if we hadn't done that open_tree(), but now the propagation goes
> through B'.  Trying to move_mount() our detached tree on subdirectory
> in A should have
> 	* moved B' on that subdirectory in A
> 	* skipped the corresponding subdirectory in B' itself
> 	* copied B' on the corresponding subdirectory in C.
> As it is, the logics in propagation_next() and friends ends up
> skipping propagation into C, since it doesn't consider anything
> downstream of B'.
> 
> IOW, walking the propagation graph should only skip the ->mnt_slave_list
> of new mounts; the only places where the check for "in that one
> anon namespace" are applicable are propagate_one() (where we should
> treat that as the same kind of thing as "mountpoint we are looking
> at is not visible in the mount we are looking at") and
> propagation_would_overmount().  The latter is better dealt with
> in the caller (can_move_mount_beneath()); on the first call of
> propagation_would_overmount() the test is always false, on the
> second it is always true in "move from anon namespace" case and
> always false in "move within our namespace" one, so it's easier
> to just use check_mnt() before bothering with the second call and
> be done with that.
> 
> Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Thanks, looks good.
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  2025-05-08 20:02           ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Al Viro
  2025-05-08 20:03             ` reproducer for "do_move_mount(): don't leak MNTNS_PROPAGATING on failures" Al Viro
@ 2025-05-09 11:02             ` Christian Brauner
  2025-05-13 11:03             ` Lai, Yi
  2 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-05-09 11:02 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 08, 2025 at 09:02:11PM +0100, Al Viro wrote:
> as it is, a failed move_mount(2) from anon namespace breaks
> all further propagation into that namespace, including normal
> mounts in non-anon namespaces that would otherwise propagate
> there.
> 
> Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Thanks!
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] do_umount(): add missing barrier before refcount checks in sync case
  2025-05-08 20:01           ` [PATCH 2/4] do_umount(): add missing barrier before refcount checks in sync case Al Viro
@ 2025-05-09 11:02             ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-05-09 11:02 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 08, 2025 at 09:01:37PM +0100, Al Viro wrote:
> do_umount() analogue of the race fixed in 119e1ef80ecf "fix
> __legitimize_mnt()/mntput() race".  Here we want to make sure that
> if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
> notice their refcount increment.  Harder to hit than mntput_no_expire()
> one, fortunately, and consequences are milder (sync umount acting
> like umount -l on a rare race with RCU pathwalk hitting at just the
> wrong time instead of use-after-free galore mntput_no_expire()
> counterpart used to be hit).  Still a bug...
> 
> Fixes: 48a066e72d97 ("RCU'd vfsmounts")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Thanks!
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
  2025-05-08 20:00           ` [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock Al Viro
@ 2025-05-09 11:02             ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-05-09 11:02 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 08, 2025 at 09:00:53PM +0100, Al Viro wrote:
> ... or we risk stealing final mntput from sync umount - raising mnt_count
> after umount(2) has verified that victim is not busy, but before it
> has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
> that it's safe to quietly undo mnt_count increment and leaves dropping
> the reference to caller, where it'll be a full-blown mntput().
> 
> Check under mount_lock is needed; leaving the current one done before
> taking that makes no sense - it's nowhere near common enough to bother
> with.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Thanks!
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection)
  2025-05-08  5:56       ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Al Viro
  2025-05-08 19:59         ` Al Viro
@ 2025-05-09 11:06         ` Christian Brauner
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-05-09 11:06 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 08, 2025 at 06:56:10AM +0100, Al Viro wrote:
> On Mon, Apr 28, 2025 at 07:53:18PM +0100, Al Viro wrote:
> 
> > Up to you; propagation calculations *are* hard-serialized (on namespace_sem)
> > and changing that is too much pain to consider, so I have no problem with
> > globals in that specific case (note several such in propagate_mnt()
> > machinery; that was a deliberate decision to avoid shitloads of arguments
> > that would have to be passed around otherwise), but...
> 
> OK, now I finally understand what felt fishy about either solution.
> 
> Back when the checks had been IS_MNT_NEW, we were guaranteed that
> anything on the the slave lists of new mount would be new as well.
> 
> No amount of copy_tree() could change ->mnt_master of existing mounts,
> so anything predating the beginning of propagate_mnt() would still
> have ->mnt_master pointing to old mounts - no operations other than
> copy_tree() had been done since we have taken namespace_sem.

Yes.

> 
> That's where your IS_MNT_PROPAGATED breaks.  It mixes "nothing useful
> to be found in this direction" with "don't mount anything on this one".
> And these are not the same now.
> 
> Suppose you have mounts A, B and C, A propagating to B, B - to C.
> 
> If you made B private, propagation would go directly from A to C,
> and mount on A/foo would result in a copy on C/foo.
> 
> Suppose you've done open_tree B with OPEN_TREE_CLONE before making
> B private.  After open_tree your propagation graph is
> 	A -> [B <-> B'] -> C
> with new mount B' being in your anon_ns.  Making B private leaves you
> with

I cannot describe how much I hate mount propagation and how much I would
like to burn it from the face of this earth.

> 	A -> B' -> C
> and mount on A/foo still propagates to C/foo, along with foo in your
> anon_ns.
> 
> So far, so good, but what happens if you move_mount the root of your
> anon_ns to A/foo?  Sure, you want to suppress copying it to foo in B',
> but you will end suppressing the copy on C/foo as well.  propagation_next()
> will not visit C at all - when it reaches B', it'll see IS_MNT_PROPAGATED
> and refuse to look what B' might be propagating to.

I regret that I reenabled propagation into anonymous mount namepaces in
the first place.

> 
> IOW, IS_MNT_PROPAGATED in propagate_one() is fine, but in propagation_next(),
> skip_propagation_subtree() and next_group() we really need IS_MNT_NEW.
> And the check in propagate_one() should be
> 
> 	/* skip ones added by this propagate_mnt() */
> 	if (IS_MNT_NEW(m))
>                 return 0;
>         /* skip if mountpoint is outside of subtree seen in m */
> 	if (!is_subdir(dest_mp->m_dentry, m->mnt.mnt_root))
> 		return 0;
> 	/* skip if m is in the anon_ns we are emptying */
> 	if (m->mnt_ns->mntns_flags & MNTNS_PROPAGATING)
> 		return 0;
> That part of check is really about the validity of this particular
> location, not the cutoff for further propagation.  IS_MNT_NEW(),
> OTOH, is a hard cutoff.
> 
> FWIW, I would take the last remaining IS_MNT_PROPAGATED() (in
> propagation_would_overmount()) as discussed in this thread -
> with
> -       if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
> +       if (check_mnt(mnt_from) &&
> +           propagation_would_overmount(parent_mnt_to, mnt_from, mp))
> in can_move_mount_beneath() and lose the one in propagation_would_overmount()

Yes.

> 
> I'll cook something along those lines (on top of "do_move_mount(): don't
> leak MNTNS_PROPAGATING on failures") and if it survives overnight tests
> post it tomorrow^Win the morning...

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  2025-05-08 20:02           ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Al Viro
  2025-05-08 20:03             ` reproducer for "do_move_mount(): don't leak MNTNS_PROPAGATING on failures" Al Viro
  2025-05-09 11:02             ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Christian Brauner
@ 2025-05-13 11:03             ` Lai, Yi
  2025-05-13 12:08               ` Al Viro
  2 siblings, 1 reply; 29+ messages in thread
From: Lai, Yi @ 2025-05-13 11:03 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Linus Torvalds, yi1.lai

Hi Al Viro,

Greetings!

I used Syzkaller and found that there is general protection fault in do_move_mount in linux v6.15-rc6.

After bisection and the first bad commit is:
"
267fc3a06a37 do_move_mount(): don't leak MNTNS_PROPAGATING on failures
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250513_095133_do_move_mount/bzImage_82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250513_095133_do_move_mount/bzImage_82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3

"
[   17.307248] ==================================================================
[   17.307611] BUG: KASAN: slab-use-after-free in ip6_route_info_create+0xb84/0xc30
[   17.307993] Read of size 1 at addr ffff8880100b8a94 by task repro/727
[   17.308291] 
[   17.308389] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.15.0-rc4-next-20250428-33035b665157 #1 PREEMPT(voluntary) 
[   17.308397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.308405] Call Trace:
[   17.308412]  <TASK>
[   17.308414]  dump_stack_lvl+0xea/0x150
[   17.308439]  print_report+0xce/0x660
[   17.308469]  ? ip6_route_info_create+0xb84/0xc30
[   17.308475]  ? kasan_complete_mode_report_info+0x80/0x200
[   17.308482]  ? ip6_route_info_create+0xb84/0xc30
[   17.308489]  kasan_report+0xd6/0x110
[   17.308496]  ? ip6_route_info_create+0xb84/0xc30
[   17.308504]  __asan_report_load1_noabort+0x18/0x20
[   17.308509]  ip6_route_info_create+0xb84/0xc30
[   17.308516]  ip6_route_add+0x32/0x320
[   17.308524]  ipv6_route_ioctl+0x414/0x5a0
[   17.308530]  ? __pfx_ipv6_route_ioctl+0x10/0x10
[   17.308539]  ? __might_fault+0xf1/0x1b0
[   17.308556]  inet6_ioctl+0x265/0x2b0
[   17.308568]  ? __pfx_inet6_ioctl+0x10/0x10
[   17.308573]  ? do_anonymous_page+0x4b5/0x1b30
[   17.308579]  ? register_lock_class+0x49/0x4b0
[   17.308597]  ? __sanitizer_cov_trace_switch+0x58/0xa0
[   17.308616]  sock_do_ioctl+0xde/0x260
[   17.308628]  ? __pfx_sock_do_ioctl+0x10/0x10
[   17.308634]  ? __lock_acquire+0x410/0x2260
[   17.308640]  ? __lock_acquire+0x410/0x2260
[   17.308649]  ? __sanitizer_cov_trace_switch+0x58/0xa0
[   17.308656]  sock_ioctl+0x23e/0x6a0
[   17.308665]  ? __pfx_sock_ioctl+0x10/0x10
[   17.308671]  ? __this_cpu_preempt_check+0x21/0x30
[   17.308683]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[   17.308694]  ? lockdep_hardirqs_on+0x89/0x110
[   17.308703]  ? trace_hardirqs_on+0x51/0x60
[   17.308717]  ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[   17.308723]  ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[   17.308729]  ? ktime_get_coarse_real_ts64+0xad/0xf0
[   17.308737]  ? __pfx_sock_ioctl+0x10/0x10
[   17.308744]  __x64_sys_ioctl+0x1bc/0x220
[   17.308765]  x64_sys_call+0x122e/0x2150
[   17.308774]  do_syscall_64+0x6d/0x150
[   17.308783]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   17.308789] RIP: 0033:0x7f75a8c3ee5d
[   17.308797] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   17.308803] RSP: 002b:00007ffe7620af68 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[   17.308814] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f75a8c3ee5d
[   17.308818] RDX: 00000000200015c0 RSI: 000000000000890b RDI: 0000000000000003
[   17.308821] RBP: 00007ffe7620af80 R08: 0000000000000800 R09: 0000000000000800
[   17.308825] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe7620b098
[   17.308828] R13: 0000000000401136 R14: 0000000000403e08 R15: 00007f75a8fc3000
[   17.308835]  </TASK>
[   17.308837] 
[   17.320668] Allocated by task 653:
[   17.320836]  kasan_save_stack+0x2c/0x60
[   17.321028]  kasan_save_track+0x18/0x40
[   17.321217]  kasan_save_alloc_info+0x3c/0x50
[   17.321430]  __kasan_slab_alloc+0x62/0x80
[   17.321627]  kmem_cache_alloc_noprof+0x13d/0x430
[   17.321855]  getname_kernel+0x5c/0x390
[   17.322044]  kern_path+0x29/0x90
[   17.322203]  unix_find_other+0x11b/0x880
[   17.322395]  unix_stream_connect+0x4f5/0x1a50
[   17.322604]  __sys_connect_file+0x159/0x1d0
[   17.322805]  __sys_connect+0x176/0x1b0
[   17.322986]  __x64_sys_connect+0x7b/0xc0
[   17.323180]  x64_sys_call+0x1bc7/0x2150
[   17.323371]  do_syscall_64+0x6d/0x150
[   17.323555]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   17.323793] 
[   17.323876] Freed by task 653:
[   17.324024]  kasan_save_stack+0x2c/0x60
[   17.324212]  kasan_save_track+0x18/0x40
[   17.324405]  kasan_save_free_info+0x3f/0x60
[   17.324606]  __kasan_slab_free+0x3d/0x60
[   17.324799]  kmem_cache_free+0x2ea/0x520
[   17.324987]  putname.part.0+0x132/0x180
[   17.325175]  kern_path+0x74/0x90
[   17.325335]  unix_find_other+0x11b/0x880
[   17.325526]  unix_stream_connect+0x4f5/0x1a50
[   17.325736]  __sys_connect_file+0x159/0x1d0
[   17.325941]  __sys_connect+0x176/0x1b0
[   17.326122]  __x64_sys_connect+0x7b/0xc0
[   17.326316]  x64_sys_call+0x1bc7/0x2150
[   17.326504]  do_syscall_64+0x6d/0x150
[   17.326687]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   17.326929] 
[   17.327013] The buggy address belongs to the object at ffff8880100b8000
[   17.327013]  which belongs to the cache names_cache of size 4096
[   17.327572] The buggy address is located 2708 bytes inside of
[   17.327572]  freed 4096-byte region [ffff8880100b8000, ffff8880100b9000)
[   17.328121] 
[   17.328204] The buggy address belongs to the physical page:
[   17.328461] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100b8
[   17.328831] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   17.329184] flags: 0xfffffc0000040(head|node=0|zone=1|lastcpupid=0x1fffff)
[   17.329508] page_type: f5(slab)
[   17.329670] raw: 000fffffc0000040 ffff88800d72cdc0 dead000000000100 dead000000000122
[   17.330022] raw: 0000000000000000 0000000000070007 00000000f5000000 0000000000000000
[   17.330381] head: 000fffffc0000040 ffff88800d72cdc0 dead000000000100 dead000000000122
[   17.330738] head: 0000000000000000 0000000000070007 00000000f5000000 0000000000000000
[   17.331094] head: 000fffffc0000003 ffffea0000402e01 00000000ffffffff 00000000ffffffff
[   17.331454] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[   17.331807] page dumped because: kasan: bad access detected
[   17.332066] 
[   17.332150] Memory state around the buggy address:
[   17.332374]  ffff8880100b8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   17.332703]  ffff8880100b8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   17.333031] >ffff8880100b8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   17.333361]                          ^
[   17.333545]  ffff8880100b8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   17.333874]  ffff8880100b8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   17.334201] ==================================================================
"

Hope this cound be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

On Thu, May 08, 2025 at 09:02:11PM +0100, Al Viro wrote:
> as it is, a failed move_mount(2) from anon namespace breaks
> all further propagation into that namespace, including normal
> mounts in non-anon namespaces that would otherwise propagate
> there.
> 
> Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/namespace.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d8a344d0a80a..04a9bb9f31fa 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3715,15 +3715,14 @@ static int do_move_mount(struct path *old_path,
>  	if (err)
>  		goto out;
>  
> -	if (is_anon_ns(ns))
> -		ns->mntns_flags &= ~MNTNS_PROPAGATING;
> -
>  	/* if the mount is moved, it should no longer be expire
>  	 * automatically */
>  	list_del_init(&old->mnt_expire);
>  	if (attached)
>  		put_mountpoint(old_mp);
>  out:
> +	if (is_anon_ns(ns))
> +		ns->mntns_flags &= ~MNTNS_PROPAGATING;
>  	unlock_mount(mp);
>  	if (!err) {
>  		if (attached) {
> -- 
> 2.39.5
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  2025-05-13 11:03             ` Lai, Yi
@ 2025-05-13 12:08               ` Al Viro
  2025-05-13 14:33                 ` Lai, Yi
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2025-05-13 12:08 UTC (permalink / raw)
  To: Lai, Yi; +Cc: Christian Brauner, linux-fsdevel, Linus Torvalds, yi1.lai

On Tue, May 13, 2025 at 07:03:14PM +0800, Lai, Yi wrote:
> Hi Al Viro,
> 
> Greetings!
> 
> I used Syzkaller and found that there is general protection fault in do_move_mount in linux v6.15-rc6.
> 
> After bisection and the first bad commit is:
> "
> 267fc3a06a37 do_move_mount(): don't leak MNTNS_PROPAGATING on failures
> "
> 
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250513_095133_do_move_mount/bzImage_82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250513_095133_do_move_mount/bzImage_82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3

Are you sure that stack traces are from the same reproducer?  Because they
look nothing like what it's doing...

I'm pretty sure I see the problem there, but I don't see how it could
fail to oops right in do_move_mount() itself if triggered...

As a quick check, could you see if the same kernel + diff below still
gives the same report?

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b466c54a357..a5983726e51d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3722,7 +3722,7 @@ static int do_move_mount(struct path *old_path,
 	if (attached)
 		put_mountpoint(old_mp);
 out:
-	if (is_anon_ns(ns))
+	if (!IS_ERR_OR_NULL(ns) && is_anon_ns(ns))
 		ns->mntns_flags &= ~MNTNS_PROPAGATING;
 	unlock_mount(mp);
 	if (!err) {

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  2025-05-13 12:08               ` Al Viro
@ 2025-05-13 14:33                 ` Lai, Yi
  0 siblings, 0 replies; 29+ messages in thread
From: Lai, Yi @ 2025-05-13 14:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Linus Torvalds, yi1.lai

On Tue, May 13, 2025 at 01:08:58PM +0100, Al Viro wrote:
> On Tue, May 13, 2025 at 07:03:14PM +0800, Lai, Yi wrote:
> > Hi Al Viro,
> > 
> > Greetings!
> > 
> > I used Syzkaller and found that there is general protection fault in do_move_mount in linux v6.15-rc6.
> > 
> > After bisection and the first bad commit is:
> > "
> > 267fc3a06a37 do_move_mount(): don't leak MNTNS_PROPAGATING on failures
> > "
> > 
> > All detailed into can be found at:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount
> > Syzkaller repro code:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.c
> > Syzkaller repro syscall steps:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.prog
> > Syzkaller report:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/repro.report
> > Kconfig(make olddefconfig):
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/kconfig_origin
> > Bisect info:
> > https://github.com/laifryiee/syzkaller_logs/tree/main/250513_095133_do_move_mount/bisect_info.log
> > bzImage:
> > https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250513_095133_do_move_mount/bzImage_82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
> > Issue dmesg:
> > https://github.com/laifryiee/syzkaller_logs/blob/main/250513_095133_do_move_mount/bzImage_82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
> 
> Are you sure that stack traces are from the same reproducer?  Because they
> look nothing like what it's doing...
>

Yes. The reproducer causes the OOP in do_move_mount().
> I'm pretty sure I see the problem there, but I don't see how it could
> fail to oops right in do_move_mount() itself if triggered...
> 
> As a quick check, could you see if the same kernel + diff below still
> gives the same report?
> 

After applying the diff, the issue cannot be reproduced.
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 1b466c54a357..a5983726e51d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3722,7 +3722,7 @@ static int do_move_mount(struct path *old_path,
>  	if (attached)
>  		put_mountpoint(old_mp);
>  out:
> -	if (is_anon_ns(ns))
> +	if (!IS_ERR_OR_NULL(ns) && is_anon_ns(ns))
>  		ns->mntns_flags &= ~MNTNS_PROPAGATING;
>  	unlock_mount(mp);
>  	if (!err) {

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-05-13 14:43 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-28  6:30 [RFC] move_mount(2): still breakage around new mount detection Al Viro
2025-04-28  7:03 ` Al Viro
2025-04-28  8:50   ` Christian Brauner
2025-04-28 18:53     ` Al Viro
2025-04-29  4:03       ` Al Viro
2025-04-29  5:10         ` Al Viro
2025-04-29  5:27           ` Al Viro
2025-04-29  8:21           ` Christian Brauner
2025-05-05  5:08           ` Al Viro
2025-05-05 14:20             ` Christian Brauner
2025-04-29  7:56         ` Christian Brauner
2025-04-29 12:27           ` Al Viro
2025-04-29  7:52       ` Christian Brauner
2025-05-08  5:56       ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Al Viro
2025-05-08 19:59         ` Al Viro
2025-05-08 20:00           ` [PATCH 1/4] __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock Al Viro
2025-05-09 11:02             ` Christian Brauner
2025-05-08 20:01           ` [PATCH 2/4] do_umount(): add missing barrier before refcount checks in sync case Al Viro
2025-05-09 11:02             ` Christian Brauner
2025-05-08 20:02           ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Al Viro
2025-05-08 20:03             ` reproducer for "do_move_mount(): don't leak MNTNS_PROPAGATING on failures" Al Viro
2025-05-09 11:02             ` [PATCH 3/4] do_move_mount(): don't leak MNTNS_PROPAGATING on failures Christian Brauner
2025-05-13 11:03             ` Lai, Yi
2025-05-13 12:08               ` Al Viro
2025-05-13 14:33                 ` Lai, Yi
2025-05-08 20:02           ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Al Viro
2025-05-08 20:04             ` reproducer for "fix IS_MNT_PROPAGATING uses" Al Viro
2025-05-09 11:01             ` [PATCH 4/4] fix IS_MNT_PROPAGATING uses Christian Brauner
2025-05-09 11:06         ` more breakage there (was Re: [RFC] move_mount(2): still breakage around new mount detection) Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).