linux-fsdevel.vger.kernel.org archive mirror
* [BUG] propagate_umount() breakage
@ 2025-05-11 23:27 Al Viro
  2025-05-12  4:50 ` Eric W. Biederman
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2025-05-11 23:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

reproducer:
------------------------------------------------------------
# create a playground
mkdir /tmp/foo
mount -t tmpfs none /tmp/foo
mount --make-private /tmp/foo
cd /tmp/foo

# set one-way propagation from A to B
mkdir A
mkdir B
mount -t tmpfs none A
mount --make-shared A
mount --bind A B
mount --make-slave B

# A/1 -> B/1, A/1/2 -> B/1/2
mkdir A/1
mount -t tmpfs none A/1
mkdir A/1/2
mount -t tmpfs none A/1/2

# overmount the entire B/1/2
mount -t tmpfs none B/1/2

# make sure it's busy - set a mount at B/1/2/x
mkdir B/1/2/x
mount -t tmpfs none B/1/2/x

stat B/1/x # shouldn't exist

umount -l A/1

stat B/1/x # ... and now it does
------------------------------------------------------------

What happens is that the mounts on B/1 and B/1/2 had been considered
victims - and taken out; since the overmount on top of B/1/2
overmounted the root of the first mount on B/1/2, it got
reparented - all the way up to B/1.

Correct behaviour would be to have B/1 left in place and the upper
mount at B/1/2 reparented once.

As an aside, that's a catch from several days of attempts to prove
correctness of propagate_umount(); I'm still not sure there's
nothing else wrong with it.  _Maybe_ it's the only problem in
there, but reconstructing the proof of correctness has turned
out to be a real bitch ;-/

I seriously suspect that a lot of headache comes from trying
to combine collecting the full set of potential victims with
deciding what can and what can not be taken out - gathering
all of them first would simplify things.  The first pass would have
logic similar to your variant, but without the __propagate_umount()
part[*]

After the set is collected, we could go through it, doing
something along the lines of
	how = 0
	for each child in children(m)
		if child in set
			continue
		how = 1
		if child is not mounted on root
			how = 2
			break
	if how == 2
		kick_out_ancestors(m)
		remove m itself from set // needs to cooperate with outer loop
	else if how == 1
		for (p = m; p in set && p is mounted on root; p = p->mnt_parent)
			;
		if p in set
			kick_out_ancestors(p)
	else if children(m) is empty && m is not locked	// to optimize things a bit
		commit to unmounting m (again, needs to cooperate with the outer loop)

"Cooperate with the outer loop" might mean something like
having this per-element work leave removal of its argument to
caller and report whether its argument needs to be removed.
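To make the per-element classification concrete, here is a minimal Python
model of the "how" computation above, run against the topology from the
reproducer.  Mount, on_parent_root and classify() are names invented for
this sketch - none of this is fs/pnode.c code:

```python
# Toy model of the per-element pass sketched above; all names invented.
class Mount:
    def __init__(self, name, parent=None, on_parent_root=False):
        self.name = name
        self.parent = parent
        # True if this mount overmounts the root of its parent
        self.on_parent_root = on_parent_root
        self.children = []
        if parent is not None:
            parent.children.append(self)

def classify(m, candidates):
    """Return 'how' as in the pseudocode: 0 - no surviving children;
    1 - surviving children, but all of them overmount our root;
    2 - some surviving child is mounted strictly inside m."""
    how = 0
    for child in m.children:
        if child in candidates:
            continue
        how = 1
        if not child.on_parent_root:
            how = 2
            break
    return how

# The propagated side of the reproducer: the copies at B/1 and B/1/2
# are candidates; the overmount of B/1/2 and the mount at B/1/2/x are not.
b1 = Mount("B/1")
b12 = Mount("B/1/2", parent=b1)
over = Mount("overmount of B/1/2", parent=b12, on_parent_root=True)
x = Mount("B/1/2/x", parent=over)

candidates = {b1, b12}
print(classify(b1, candidates))    # 0: only child b12 is also a candidate
print(classify(b12, candidates))   # 1: the overmount survives, on our root
print(classify(over, {over}))      # 2: x survives strictly inside
```

how == 1 at the B/1/2 copy is the reproducer's situation: its in-set
ancestors (here B/1) have to be kicked out of the set so that the
surviving overmount can be reparented to something that remains mounted.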

After that we'd be left with everything still in the set
having no out-of-set children that would be obstacles.
The only thing that remains after that is MNT_LOCKED and
that's as simple as
	while set is not empty
		m = first element of set
		for (p = m; p is locked && p in set; p = p->mnt_parent)
			;
		if p not in set {
			if p is not committed to unmount
				remove everything from m to p from set
				continue
		} else {
			p = p->mnt_parent
		}
		commit everything from m to p to unmount, removing from set
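The loop above can be played out on a toy example; handle_locked_pass()
and the Mount class are invented for this sketch, and "committed" models
"committed to unmount":

```python
# Toy rendering of the MNT_LOCKED pass above; all names are invented.
class Mount:
    def __init__(self, name, parent=None, locked=False):
        self.name = name
        self.parent = parent
        self.locked = locked

def handle_locked_pass(candidates, committed):
    """candidates: mounts still in the set; committed: mounts already
    committed to unmount.  Mutates both, as the pseudocode does."""
    while candidates:
        m = next(iter(candidates))
        p = m
        while p.locked and p in candidates:
            p = p.parent
        if p not in candidates:
            if p not in committed:
                # locked chain hangs off a surviving mount - keep it mounted
                q = m
                while q is not p:
                    candidates.discard(q)
                    q = q.parent
                continue
        else:
            p = p.parent     # take the unlocked candidate out as well
        q = m                # commit everything from m up to (excluding) p
        while q is not p:
            committed.add(q)
            candidates.discard(q)
            q = q.parent

root = Mount("root")                       # survives
a = Mount("a", parent=root)                # unlocked candidate
b = Mount("b", parent=a, locked=True)      # locked on top of a candidate
c = Mount("c", parent=root, locked=True)   # locked on a surviving mount

candidates, committed = {a, b, c}, set()
handle_locked_pass(candidates, committed)
print(sorted(m.name for m in committed))   # ['a', 'b']
```

The locked b goes because its parent a is an unlocked candidate; the
locked c stays mounted because its anchor survives - which is exactly the
non-revealing constraint at work.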

I'll try to put something of that sort together, along with
a detailed explanation of what it's doing - in D/f/*, rather than
burying it in commit messages, and along with "read and update
D/f/... if you ever touch this function" in fs/pnode.c itself;
this fun is not something I would like to repeat several years
down the road ;-/

We *REALLY* need a good set of regression tests for that stuff.
If you have anything along those lines sitting somewhere, please
post a reference.  The same goes for everybody else who might
have something in that general area.


[*] well, that and with fixed skip_propagation_subtree() logic; it's
easier to combine it with propagation_next() rather than trying to set
things up so that the next call of propagation_next() would DTRT -
it's less work, and yours actually has a corner case if the last element
of ->mnt_slave_list has a non-empty ->mnt_slave_list itself.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] propagate_umount() breakage
  2025-05-11 23:27 [BUG] propagate_umount() breakage Al Viro
@ 2025-05-12  4:50 ` Eric W. Biederman
  2025-05-13  3:56   ` Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2025-05-12  4:50 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

Al Viro <viro@zeniv.linux.org.uk> writes:

> reproducer:
> ------------------------------------------------------------
> # create a playground
> mkdir /tmp/foo
> mount -t tmpfs none /tmp/foo
> mount --make-private /tmp/foo
> cd /tmp/foo
>
> # set one-way propagation from A to B
> mkdir A
> mkdir B
> mount -t tmpfs none A
> mount --make-shared A
> mount --bind A B
> mount --make-slave B
>
> # A/1 -> B/1, A/1/2 -> B/1/2
> mkdir A/1
> mount -t tmpfs none A/1
> mkdir A/1/2
> mount -t tmpfs none A/1/2
>
> # overmount the entire B/1/2
> mount -t tmpfs none B/1/2
>
> # make sure it's busy - set a mount at B/1/2/x
> mkdir B/1/2/x
> mount -t tmpfs none B/1/2/x
>
> stat B/1/x # shouldn't exist
>
> umount -l A/1
>
> stat B/1/x # ... and now it does
> ------------------------------------------------------------
>
> What happens is that mounts on B/1 and B/1/2 had been considered
> as victims - and taken out, since the overmount on top of B/1/2
> overmounted the root of the first mount on B/1/2 and it got
> reparented - all the way to B/1.

Yes, that behavior is incorrect since it causes a userspace visible
change on where the mount is visible.

> Correct behaviour would be to have B/1 left in place and upper
> B/1/2 to be reparented once.

As I read __propagate_umount that is what is trying to be implemented.

I am a bit mystified why the semantics aren't simply to lazily umount
(aka MNT_DETACH) that overmount.  But that is not what the code is
trying to do.  It probably isn't worth considering a change in semantics
at this point.

It looks like the challenge is in the __propagate_umount loop.
If I am reading things correctly:
- __propagate_umount recognizes that B/1/2 can be unmounted from the
  overmount.
- The code then considers B/1, the parent of B/1/2.
- When considering B/1 there is only one child, B/1/2, which has been
  unmounted and has already been marked to be unmounted, so it
  is ignored (sigh).

So a minimal fix would go up the mount pile to see if there is anything
that must remain.  Or probably use a bit like MNT_MARK to
recognize that there is an overmount remaining, so that despite a child
being unmounted it still counts as if it were mounted.

> As an aside, that's a catch from several days of attempts to prove
> correctness of propagate_umount(); I'm still not sure there's
> nothing else wrong with it.  _Maybe_ it's the only problem in
> there, but reconstructing the proof of correctness has turned
> out to be a real bitch ;-/
>
> I seriously suspect that a lot of headache comes from trying
> to combine collecting the full set of potential victims with
> deciding what can and what can not be taken out - gathering
> all of them first would simplify things.  First pass would've
> logics similar to your variant, but without __propagate_umount()
> part[*]

This is "here be dragons" territory.

Without care it is easy for the code to go non-linear in
the number of mounts.

That said I don't see any immediate problem with a first pass
without my __propagate_umount.

As I read the current code the __propagate_umount loop is just
about propagating the information up from the leaves.

> After the set is collected, we could go through it, doing the
> something along the lines of
> 	how = 0
> 	for each child in children(m)
> 		if child in set
> 			continue
> 		how = 1
> 		if child is not mounted on root
> 			how = 2
> 			break
> 	if how == 2
> 		kick_out_ancestors(m)
> 		remove m itself from set // needs to cooperate with outer loop
> 	else if how == 1
> 		for (p = m; p in set && p is mounted on root; p = p->mnt_parent)
> 			;
> 		if p in set
> 			kick_out_ancestors(p)
> 	else if children(m) is empty && m is not locked	// to optimize things a bit
> 		commit to unmounting m (again, needs to cooperate with the outer loop)
>
> "Cooperate with the outer loop" might mean something like
> having this per-element work leave removal of its argument to
> caller and report whether its argument needs to be removed.
>
> After that we'd be left with everything still in the set
> having no out-of-set children that would be obstacles.
> The only thing that remains after that is MNT_LOCKED and
> that's as simple as
> 	while set is not empty
> 		m = first element of set
> 		for (p = m; p is locked && p in set; p = p->mnt_parent)
> 			;
> 		if p not in set {
> 			if p is not committed to unmount
> 				remove everything from m to p from set
> 				continue
> 		} else {
> 			p = p->mnt_parent
> 		}
> 		commit everything from m to p to unmount, removing from set
>
> I'll try to put something of that sort together, along with
> detailed explanation of what it's doing - in D/f/*, rather than
> buring it in commit messages, and along with "read and update
> D/f/... if you are ever touch this function" in fs/pnode.c itself;
> this fun is not something I would like to repeat several years
> down the road ;-/

I think I understand what you are saying.  But I will have to see the
actual code.

> We *REALLY* need a good set of regression tests for that stuff.
> If you have anything along those lines sitting somewhere, please
> post a reference.  The same goes for everybody else who might
> have something in that general area.

I will have to look.  As I recall everything I have is completely manual
but it could be a starting point at least.

> [*] well, that and with fixed skip_propagation_subtree() logics; it's
> easier to combine it with propagation_next() rather than trying to set
> the things up so that the next call of propagation_next() would DTRT -
> it's less work and yours actually has a corner case if the last element
> of ->mnt_slave_list has non-empty ->mnt_slave_list itself.

I hope that helps a little,

Eric


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] propagate_umount() breakage
  2025-05-12  4:50 ` Eric W. Biederman
@ 2025-05-13  3:56   ` Al Viro
  2025-05-15 11:41     ` Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2025-05-13  3:56 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

On Sun, May 11, 2025 at 11:50:40PM -0500, Eric W. Biederman wrote:

> This is there be dragons talk.
> 
> With out care it is easy to get the code to go non-linear in
> the number of mounts.
> 
> That said I don't see any immediate problem with a first pass
> without my __propgate_umount.
> 
> As I read the current code the __propagate_umount loop is just
> about propagating the information up from the leaves.
> 
> > After the set is collected, we could go through it, doing the
> > something along the lines of
> > 	how = 0
> > 	for each child in children(m)
> > 		if child in set
> > 			continue
> > 		how = 1
> > 		if child is not mounted on root
> > 			how = 2
> > 			break
> > 	if how == 2
> > 		kick_out_ancestors(m)
> > 		remove m itself from set // needs to cooperate with outer loop
> > 	else if how == 1
> > 		for (p = m; p in set && p is mounted on root; p = p->mnt_parent)
> > 			;
> > 		if p in set
> > 			kick_out_ancestors(p)
> > 	else if children(m) is empty && m is not locked	// to optimize things a bit
> > 		commit to unmounting m (again, needs to cooperate with the outer loop)

Misses some valid candidates, unfortunately.  It's not hard to fix -
handle_overmounts(m)
        remove_it = false;
        non_vanishing = NULL; // something non-vanishing there
        for each child in m->mnt_mounts
                if child not in Candidates
                        non_vanishing = child;
                        if (!overmounts_root(child)) {
                                remove_it = true;
                                break;
                        }
        if (non_vanishing) {
                for (p = m; p in Candidates && !marked(p); p = p->mnt_parent) {
                        if (!overmounts_root(non_vanishing) && p != m)
                                Candidates -= {p}
			else
				mark(p);
                        non_vanishing = p;
                }
        } else if m->mnt_mounts is empty && m is not locked { // optimize a bit
                commit_to_unmounting(m);
                remove_it = true;
        }
        res = next(m) // NULL if at the end of Candidates
        if (remove_it) {
		unmark(m);
                Candidates -= {m}
	}
        return res
will do the right thing and it's linear in the number of mounts we must
look at.

And yes, it's going to be posted along with the proof of correctness -
I've had it with the amount of subtle corner cases in that area ;-/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] propagate_umount() breakage
  2025-05-13  3:56   ` Al Viro
@ 2025-05-15 11:41     ` Al Viro
  2025-05-15 11:47       ` Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2025-05-15 11:41 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

On Tue, May 13, 2025 at 04:56:22AM +0100, Al Viro wrote:

> And yes, it's going to be posted along with the proof of correctness -
> I've had it with the amount of subtle corner cases in that area ;-/

FWIW, it seems to survive testing; it's _not_ final - I'm still editing
the promised proof, but by this point it's stylistic work.  I hadn't
touched that kind of formal writing for quite a while, so the text is clumsy.
The code changes are pretty much in the final shape.  Current diff of
code (all in fs/pnode.*) relative to mainline:

diff --git a/fs/pnode.c b/fs/pnode.c
index fb77427df39e..32ecebbc1d19 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -24,11 +24,6 @@ static inline struct mount *first_slave(struct mount *p)
 	return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave);
 }
 
-static inline struct mount *last_slave(struct mount *p)
-{
-	return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave);
-}
-
 static inline struct mount *next_slave(struct mount *p)
 {
 	return list_entry(p->mnt_slave.next, struct mount, mnt_slave);
@@ -136,6 +131,23 @@ void change_mnt_propagation(struct mount *mnt, int type)
 	}
 }
 
+static struct mount *__propagation_next(struct mount *m,
+					 struct mount *origin)
+{
+	while (1) {
+		struct mount *master = m->mnt_master;
+
+		if (master == origin->mnt_master) {
+			struct mount *next = next_peer(m);
+			return (next == origin) ? NULL : next;
+		} else if (m->mnt_slave.next != &master->mnt_slave_list)
+			return next_slave(m);
+
+		/* back at master */
+		m = master;
+	}
+}
+
 /*
  * get the next mount in the propagation tree.
  * @m: the mount seen last
@@ -153,31 +165,26 @@ static struct mount *propagation_next(struct mount *m,
 	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
 		return first_slave(m);
 
-	while (1) {
-		struct mount *master = m->mnt_master;
-
-		if (master == origin->mnt_master) {
-			struct mount *next = next_peer(m);
-			return (next == origin) ? NULL : next;
-		} else if (m->mnt_slave.next != &master->mnt_slave_list)
-			return next_slave(m);
+	return __propagation_next(m, origin);
+}
 
-		/* back at master */
-		m = master;
-	}
+static inline bool peers(const struct mount *m1, const struct mount *m2)
+{
+	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
 
 static struct mount *skip_propagation_subtree(struct mount *m,
 						struct mount *origin)
 {
 	/*
-	 * Advance m such that propagation_next will not return
-	 * the slaves of m.
+	 * Advance m past everything that gets propagation from it.
 	 */
-	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
-		m = last_slave(m);
+	struct mount *p = __propagation_next(m, origin);
 
-	return m;
+	while (p && peers(m, p))
+		p = __propagation_next(p, origin);
+
+	return p;
 }
 
 static struct mount *next_group(struct mount *m, struct mount *origin)
@@ -216,11 +223,6 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct hlist_head *list;
 
-static inline bool peers(const struct mount *m1, const struct mount *m2)
-{
-	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
-}
-
 static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
 {
 	struct mount *child;
@@ -463,109 +465,144 @@ void propagate_mount_unlock(struct mount *mnt)
 	}
 }
 
-static void umount_one(struct mount *mnt, struct list_head *to_umount)
+static LIST_HEAD(to_umount);
+static LIST_HEAD(candidates);
+
+static inline struct mount *first_candidate(void)
 {
-	CLEAR_MNT_MARK(mnt);
-	mnt->mnt.mnt_flags |= MNT_UMOUNT;
-	list_del_init(&mnt->mnt_child);
-	list_del_init(&mnt->mnt_umounting);
-	move_from_ns(mnt, to_umount);
+	if (list_empty(&candidates))
+		return NULL;
+	return list_first_entry(&candidates, struct mount, mnt_umounting);
 }
 
-/*
- * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
- * parent propagates to.
- */
-static bool __propagate_umount(struct mount *mnt,
-			       struct list_head *to_umount,
-			       struct list_head *to_restore)
+static inline bool is_candidate(struct mount *m)
 {
-	bool progress = false;
-	struct mount *child;
-
-	/*
-	 * The state of the parent won't change if this mount is
-	 * already unmounted or marked as without children.
-	 */
-	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
-		goto out;
+	return !list_empty(&m->mnt_umounting);
+}
 
-	/* Verify topper is the only grandchild that has not been
-	 * speculatively unmounted.
-	 */
-	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
-			continue;
-		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
-			continue;
-		/* Found a mounted child */
-		goto children;
-	}
+static inline bool will_be_unmounted(struct mount *m)
+{
+	return m->mnt.mnt_flags & MNT_UMOUNT;
+}
 
-	/* Mark mounts that can be unmounted if not locked */
-	SET_MNT_MARK(mnt);
-	progress = true;
+static void umount_one(struct mount *m)
+{
+	m->mnt.mnt_flags |= MNT_UMOUNT;
+	list_del_init(&m->mnt_child);
+	move_from_ns(m, &to_umount);
+}
 
-	/* If a mount is without children and not locked umount it. */
-	if (!IS_MNT_LOCKED(mnt)) {
-		umount_one(mnt, to_umount);
-	} else {
-children:
-		list_move_tail(&mnt->mnt_umounting, to_restore);
-	}
-out:
-	return progress;
+static void remove_candidate(struct mount *m)
+{
+	CLEAR_MNT_MARK(m);
+	list_del_init(&m->mnt_umounting);
 }
 
-static void umount_list(struct list_head *to_umount,
-			struct list_head *to_restore)
+static void gather_candidates(struct list_head *set)
 {
-	struct mount *mnt, *child, *tmp;
-	list_for_each_entry(mnt, to_umount, mnt_list) {
-		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
-			/* topper? */
-			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
-				list_move_tail(&child->mnt_umounting, to_restore);
-			else
-				umount_one(child, to_umount);
+	LIST_HEAD(visited);
+	struct mount *m, *p, *q;
+
+	list_for_each_entry(m, set, mnt_list) {
+		if (is_candidate(m))
+			continue;
+		list_add(&m->mnt_umounting, &visited);
+		p = m->mnt_parent;
+		q = propagation_next(p, p);
+		while (q) {
+			struct mount *child = __lookup_mnt(&q->mnt,
+							   m->mnt_mountpoint);
+			if (child) {
+				struct list_head *head;
+
+				if (is_candidate(child)) {
+					q = skip_propagation_subtree(q, p);
+					continue;
+				}
+				if (will_be_unmounted(child))
+					head = &visited;
+				else
+					head = &candidates;
+				list_add(&child->mnt_umounting, head);
+			}
+			q = propagation_next(q, p);
 		}
 	}
+	while (!list_empty(&visited))	// empty visited
+		list_del_init(visited.next);
 }
 
-static void restore_mounts(struct list_head *to_restore)
+static struct mount *trim_one(struct mount *m)
 {
-	/* Restore mounts to a clean working state */
-	while (!list_empty(to_restore)) {
-		struct mount *mnt, *parent;
-		struct mountpoint *mp;
-
-		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
-		CLEAR_MNT_MARK(mnt);
-		list_del_init(&mnt->mnt_umounting);
-
-		/* Should this mount be reparented? */
-		mp = mnt->mnt_mp;
-		parent = mnt->mnt_parent;
-		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
-			mp = parent->mnt_mp;
-			parent = parent->mnt_parent;
+	bool remove_this = false, found = false, umount_this = false;
+	struct mount *n;
+	struct list_head *next;
+
+	list_for_each_entry(n, &m->mnt_mounts, mnt_child) {
+		if (!is_candidate(n)) {
+			found = true;
+			if (n->mnt_mountpoint != m->mnt.mnt_root) {
+				remove_this = true;
+				break;
+			}
 		}
-		if (parent != mnt->mnt_parent) {
-			mnt_change_mountpoint(parent, mp, mnt);
-			mnt_notify_add(mnt);
+	}
+	if (found) {
+		struct mount *this = m;
+		for (struct mount *p = this->mnt_parent;
+		     !IS_MNT_MARKED(this) && is_candidate(p);
+		     this = p, p = p->mnt_parent) {
+			SET_MNT_MARK(this);
+			if (this->mnt_mountpoint != p->mnt.mnt_root)
+				remove_candidate(p);
 		}
+	} else if (!IS_MNT_LOCKED(m) && list_empty(&m->mnt_mounts)) {
+		remove_this = true;
+		umount_this = true;
 	}
+	next = m->mnt_umounting.next;
+	if (remove_this) {
+		remove_candidate(m);
+		if (umount_this)
+			umount_one(m);
+	}
+	if (next == &candidates)
+		return NULL;
+
+	return list_entry(next, struct mount, mnt_umounting);
 }
 
-static void cleanup_umount_visitations(struct list_head *visited)
+static void handle_locked(struct mount *m)
 {
-	while (!list_empty(visited)) {
-		struct mount *mnt =
-			list_first_entry(visited, struct mount, mnt_umounting);
-		list_del_init(&mnt->mnt_umounting);
+	struct mount *cutoff = m, *p;
+
+	for (p = m; is_candidate(p); p = p->mnt_parent) {
+		remove_candidate(p);
+		if (!IS_MNT_LOCKED(p))
+			cutoff = p->mnt_parent;
+	}
+	if (will_be_unmounted(p))
+		cutoff = p;
+	while (m != cutoff) {
+		umount_one(m);
+		m = m->mnt_parent;
 	}
 }
 
+static void reparent(struct mount *m)
+{
+	struct mount *p = m;
+	struct mountpoint *mp;
+
+	do {
+		mp = p->mnt_mp;
+		p = p->mnt_parent;
+	} while (will_be_unmounted(p));
+
+	mnt_change_mountpoint(p, mp, m);
+	mnt_notify_add(m);
+}
+
 /*
  * collect all mounts that receive propagation from the mount in @list,
  * and return these additional mounts in the same list.
@@ -573,71 +610,27 @@ static void cleanup_umount_visitations(struct list_head *visited)
  *
  * vfsmount lock must be held for write
  */
-int propagate_umount(struct list_head *list)
+void propagate_umount(struct list_head *set)
 {
-	struct mount *mnt;
-	LIST_HEAD(to_restore);
-	LIST_HEAD(to_umount);
-	LIST_HEAD(visited);
+	struct mount *m;
 
-	/* Find candidates for unmounting */
-	list_for_each_entry_reverse(mnt, list, mnt_list) {
-		struct mount *parent = mnt->mnt_parent;
-		struct mount *m;
+	gather_candidates(set);
 
-		/*
-		 * If this mount has already been visited it is known that it's
-		 * entire peer group and all of their slaves in the propagation
-		 * tree for the mountpoint has already been visited and there is
-		 * no need to visit them again.
-		 */
-		if (!list_empty(&mnt->mnt_umounting))
-			continue;
+	// make it non-shifting
+	for (m = first_candidate(); m; m = trim_one(m))
+		;
 
-		list_add_tail(&mnt->mnt_umounting, &visited);
-		for (m = propagation_next(parent, parent); m;
-		     m = propagation_next(m, parent)) {
-			struct mount *child = __lookup_mnt(&m->mnt,
-							   mnt->mnt_mountpoint);
-			if (!child)
-				continue;
-
-			if (!list_empty(&child->mnt_umounting)) {
-				/*
-				 * If the child has already been visited it is
-				 * know that it's entire peer group and all of
-				 * their slaves in the propgation tree for the
-				 * mountpoint has already been visited and there
-				 * is no need to visit this subtree again.
-				 */
-				m = skip_propagation_subtree(m, parent);
-				continue;
-			} else if (child->mnt.mnt_flags & MNT_UMOUNT) {
-				/*
-				 * We have come across a partially unmounted
-				 * mount in a list that has not been visited
-				 * yet. Remember it has been visited and
-				 * continue about our merry way.
-				 */
-				list_add_tail(&child->mnt_umounting, &visited);
-				continue;
-			}
+	// ... and non-revealing
+	while (!list_empty(&candidates))
+		handle_locked(first_candidate());
 
-			/* Check the child and parents while progress is made */
-			while (__propagate_umount(child,
-						  &to_umount, &to_restore)) {
-				/* Is the parent a umount candidate? */
-				child = child->mnt_parent;
-				if (list_empty(&child->mnt_umounting))
-					break;
-			}
-		}
+	// now to_umount consists of all non-rejected candidates
+	// deal with reparenting of remaining overmounts on those
+	list_for_each_entry(m, &to_umount, mnt_list) {
+		while (!list_empty(&m->mnt_mounts)) // should be at most one
+			reparent(list_first_entry(&m->mnt_mounts,
+						  struct mount, mnt_child));
 	}
 
-	umount_list(&to_umount, &to_restore);
-	restore_mounts(&to_restore);
-	cleanup_umount_visitations(&visited);
-	list_splice_tail(&to_umount, list);
-
-	return 0;
+	list_splice_tail_init(&to_umount, set);
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 34b6247af01d..d84a397bfd43 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -39,7 +39,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct list_head *);
+void propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [BUG] propagate_umount() breakage
  2025-05-15 11:41     ` Al Viro
@ 2025-05-15 11:47       ` Al Viro
  2025-05-16  5:21         ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2025-05-15 11:47 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

On Thu, May 15, 2025 at 12:41:50PM +0100, Al Viro wrote:
> On Tue, May 13, 2025 at 04:56:22AM +0100, Al Viro wrote:
> 
> > And yes, it's going to be posted along with the proof of correctness -
> > I've had it with the amount of subtle corner cases in that area ;-/
> 
> FWIW, it seems to survive testing; it's _not_ final - I'm still editing
> the promised proof, but by this point it's stylistic work.  I hadn't
> touched that kind of formal writing for quite a while, so the text is clumsy.
> The code changes are pretty much in the final shape.  Current diff of
> code (all in fs/pnode.*) relative to mainline:

... and the current notes are below; they obviously need more editing ;-/

-----
Umount propagation starts with a set of mounts we are already going to
take out.  Ideally, we would like to add all downstream cognates to
that set - anything with the same mountpoint as one of the removed
mounts and with a parent that would receive events from the parent of that
mount.  However, there are some constraints the resulting set must
satisfy.

It is convenient to define several properties of sets of mounts:

1) A set S of mounts is non-shifting if for any mount X belonging
to S all subtrees mounted strictly inside of X (i.e. not overmounting
the root of X) contain only elements of S.

2) A set S is non-revealing if all locked mounts that belong to S have
parents that also belong to S.

3) A set S is closed if it contains all children of its elements.

The set of mounts taken out by umount(2) must be non-shifting and
non-revealing; the first constraint is what allows us to reparent
any remaining mounts and the second is what prevents the exposure
of any concealed mountpoints.
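For what it's worth, the two properties can be written as executable
predicates over a toy mount forest; everything below (Mount,
on_parent_root, the predicate names) is invented for illustration:

```python
# Hypothetical model of the two constraints; not kernel API.
class Mount:
    def __init__(self, name, parent=None, on_parent_root=False, locked=False):
        self.name = name
        self.parent = parent
        self.on_parent_root = on_parent_root  # overmounts the root of parent
        self.locked = locked
        self.children = []
        if parent is not None:
            parent.children.append(self)

def subtree(m):
    out, stack = [], [m]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def non_shifting(S):
    # every subtree mounted strictly inside an element of S lies in S
    return all(d in S
               for x in S
               for c in x.children if not c.on_parent_root
               for d in subtree(c))

def non_revealing(S):
    # every locked element of S has its parent in S as well
    return all(m.parent in S for m in S if m.locked)

# the B side of the reproducer
b1 = Mount("B/1")
b12 = Mount("B/1/2", parent=b1)                 # strictly inside B/1
over = Mount("overmount", parent=b12, on_parent_root=True)

print(non_shifting({b1}))        # False: surviving B/1/2 sits strictly inside
print(non_shifting({b1, b12}))   # False: the subtree at B/1/2 contains
                                 #        the surviving overmount
print(non_shifting({b12}))       # True: the overmount is on b12's root

# a locked mount may only go together with its parent
p = Mount("P")
l = Mount("P/l", parent=p, locked=True)
print(non_revealing({l}))        # False
print(non_revealing({p, l}))     # True
```

On the reproducer's topology this picks out exactly the behaviour the bug
report calls correct: {B/1, B/1/2} is not a valid victim set, while
{B/1/2} alone is, with the overmount left for reparenting.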

propagate_umount() takes the original set as an argument and tries to
extend that set.  The original set is a full subtree and its root is
unlocked; what matters is that it's closed and non-revealing.
The resulting set may not be closed; there might still be mounts outside
of that set, but only on top of stacks of root-overmounting elements
of the set.  They can be reparented to the place where the bottom of
the stack is attached to a mount that will survive.  NOTE: doing that
will violate the constraint that no more than one mount may have
the same parent/mountpoint pair; however, the caller (umount_tree())
will immediately remedy that - it may keep an unmounted element attached
to its parent, but only if the parent itself is unmounted.  Since all
conflicts created by reparenting have a common parent *not* in the
set and one side of the conflict (the bottom of the stack of overmounts)
is in the set, they will be resolved.  However, we rely upon umount_tree()
doing that pretty much immediately after the call of propagate_umount().

The algorithm is based on two statements:
	1) for any set S, there is a maximal non-shifting subset of S
and it can be calculated in O(#S) time.
	2) for any non-shifting set S, there is a maximal non-revealing
subset of S.  That subset is also non-shifting and it can be calculated
in O(#S) time.

		Finding candidates.

We are given a closed set U and we want to find all mounts that have
the same mountpoint as some mount m in U *and* whose parent receives
propagation from the parent of the same mount m.  Naive implementation
would be
	S = {}
	for each m in U
		add m to S
		p = parent(m)
		for each q in Propagation(p) - {p}
			child = look_up(q, mountpoint(m))
			if child
				add child to S
but that can lead to excessive work - there might be propagation among the
subtrees of U, in which case we'd end up examining the same candidates
many times.  Since propagation is transitive, the same will happen to
everything downstream of that candidate and it's not hard to construct
cases where the approach above leads to time quadratic in the actual
number of candidates.

Note that if we run into a candidate we'd already seen, it must've been
added on an earlier iteration of the outer loop - all additions made
during one iteration of the outer loop have different parents.  So
if we find a child already added to the set, we know that everything
in Propagation(parent(child)) with the same mountpoint has been already
added.
	S = {}
	for each m in U
		if m in S
			continue
		add m to S
		p = parent(m)
		q = propagation_next(p, p)
		while q
			child = look_up(q, mountpoint(m))
			if child
				if child in S
					q = skip_them(q, p)
					continue;
				add child to S
			q = propagation_next(q, p)
where
skip_them(q, p)
	keep walking Propagation(p) from q until we find something
	not in Propagation(q)

would get rid of that problem, but we need a sane implementation of
skip_them().  That's not hard to do - split propagation_next() into
"down into mnt_slave_list" and "forward-and-up" parts, with the
skip_them() being "repeat the forward-and-up part until we get NULL
or something that isn't a peer of the one we are skipping".

Note that there can be no absolute roots among the extra candidates -
they all come from mount lookups.  An absolute root among the original
set is _currently_ impossible, but it might be worth protecting
against.

		Maximal non-shifting subsets.

To prove (1), consider the following operation:
	Trim(S, T) = S - {x : there is a sequence (x_0,...,x_k), such that
				k >= 1,
				for all i < k  x_i is the parent of x_{i+1},
				for all i < k  x_i belongs to S,
				x_k does not belong to S,
				mountpoint(x_1) != root(x_0),
				x = x_0,
				x_{k-1} belongs to T}

In other words, we remove an element x if there is a chain of descendants
starting at x, _not_ overmounting the root of x, that leaves S right after
passing through some element of T.
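If it helps, the definition can be checked mechanically; below is a
brute-force Python model (representation is mine: mounts are strings,
on_root[m] stands for mountpoint(m) == root(parent(m))), applied to the
reproducer from the start of the thread:

```python
def trim(S, T, parent, on_root):
    # build the children map from the parent map
    children = {}
    for m, p in parent.items():
        children.setdefault(p, set()).add(m)

    def excluded(x):
        # look for a chain (x_0 = x, ..., x_k) with x_0..x_{k-1} in S,
        # x_k outside S, x_{k-1} in T, and x_1 strictly inside x_0
        def dfs(node, first_step):
            for c in children.get(node, ()):
                if first_step and on_root.get(c):
                    continue        # x_1 must not overmount root of x_0
                if c not in S:
                    if node in T:   # chain leaves S right after T
                        return True
                elif dfs(c, False):
                    return True
            return False
        return dfs(x, True)

    return {x for x in S if not excluded(x)}

# the reproducer: B/1 and the lower B/1/2 are candidates; the overmount
# on B/1/2 and the mount at B/1/2/x are not
parent = {"B1/2": "B1", "over": "B1/2", "x": "over"}
on_root = {"B1/2": False, "over": True, "x": False}
S = {"B1", "B1/2"}
```

trim(S, S, parent, on_root) comes out as {"B1/2"}: B/1 is forced to stay,
the lower B/1/2 can go, which is the correct behaviour for the reproducer;
with T disjoint from S nothing is removed, which is property (D).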

Properties:

	A) for any element x that belongs to S - Trim(S, T) there is
a subtree mounted strictly inside x and containing some elements that
do not belong to S.  Directly from the definition - the subtree consisting
of x_1 and all its descendants satisfies our conditions.

	B) if U is a non-shifting subset of S, U is a subset of Trim(S, T).
Indeed, suppose U is not a subset of Trim(S, T).  Then there is an element
u that belongs both to U and S - Trim(S, T).  By (A) there must be a subtree
mounted strictly inside u and containing elements that do not belong to S.
But U is non-shifting by assumption, so all elements of such subtree
will belong to U, which is a subset of S.  Contradiction.

	C) Trim(Trim(S, T1), T2) = Trim(S, union of T1 and T2).
[would be easier with pictures... ;-/]

First, let's show that Trim(Trim(S, T1), T2) is a subset of
Trim(S, union of T1 and T2).  Suppose there is an element x
of Trim(Trim(S, T1), T2) that does not belong to
Trim(S, union of T1 and T2).  Since Trim(Trim(S, T1), T2) is
a subset of Trim(S, T1), x must be an element of Trim(S, T1).

Since it doesn't belong to Trim(S, union of T1 and T2), there is a
chain of descent (x_0 = x, x_1, ..., x_k) such that
	x_1 does not overmount root of x_0
	x_0, ..., x_{k-1} all belong to S
	x_k doesn't belong to S
	x_{k-1} belongs to T1 or to T2.
If some of x_0,...,x_{k-1} do not belong to Trim(S, T1), let x_i
be the first such element.  Since x is an element of Trim(S, T1),
i can't be 0.  But then concatenation of (x_0, ..., x_{i-1})
with the chain that has led to the exclusion of x_i from Trim(S, T1)
would yield a chain excluding x itself from Trim(S, T1).  Since
x belongs to Trim(S, T1), we have shown that
	x_0, ..., x_{k-1} all belong to Trim(S, T1)
Since our chain does not exclude x from Trim(S, T1), x_{k-1} can't
be an element of T1.
So we have
	x_{k-1} belongs to T2
and
	x_k doesn't belong to S, let alone Trim(S, T1)
i.e. our chain excludes x from Trim(Trim(S, T1), T2).  Contradiction.

To prove the inclusion in opposite direction, suppose there is an
element x that belongs to Trim(S, union of T1 and T2) but not to
Trim(Trim(S, T1), T2).  x must belong to Trim(S, T1), since any chain
that would exclude it from Trim(S, T1) would've excluded it from
Trim(S, union of T1 and T2) as well.

Since x belongs to Trim(S, T1), but not Trim(Trim(S, T1), T2), there
must be a chain of descent (x_0 = x, x_1, ..., x_k) such that
	x_1 does not overmount root of x_0
	x_0, ..., x_{k-1} all belong to Trim(S, T1)
	x_k doesn't belong to Trim(S, T1)
	x_{k-1} belongs to T2.
x_k must belong to S, otherwise that chain would've excluded x
from Trim(S, union of T1 and T2).  But concatenation of
(x_0, ..., x_{k-1}) with the chain that excluded x_k from
Trim(S, T1) will yield a chain that excludes x from
Trim(S, union of T1 and T2).  Contradiction.

	D) Trim(S, T) = S when S does not intersect with T.
Trivial from the definition - removing an element requires a chain
whose next-to-last element belongs both to T and S.  If S and T do
not intersect, nothing gets removed...

	E) Trim(S, S) is a non-shifting subset of S that contains any
non-shifting subset of S.
Suppose Trim(S, S) is *not* non-shifting.  Then there must be an element
x belonging to Trim(S, S) and a subtree mounted strictly inside x,
with some of its elements not belonging to Trim(S, S).  Let y be
one of the subtree elements that do not belong to Trim(S, S).
Let (x_0 = x, x_1, ..., x_k = y) be its chain of descent.
Without loss of generality we can assume that all of x_0,...,x_{k-1}
belong to Trim(S, S).
	y must belong to S, since otherwise that chain would've
excluded x from Trim(S, S).  Consider the chain excluding y from
Trim(S, S); its concatenation with (x_0, ..., x_{k-1}) yields
a chain that excludes x from Trim(S, S).  Contradiction.
The second part immediately follows from (B).

	F) Let U be a closed subset of S.  Then U will be a subset of
Trim(S, {x}) for any x.
Follows from (B), since any such U will be non-shifting.

	G) Let U be a closed subset of S.  Then Trim(S, {u}) = S for
any u belonging to U.
Follows directly from definition - if the next-to-last element of a
descent chain is equal to u, its last element belongs to U and thus to S.

Implementation:

	H1) Let S be an ordered set.  Then the following calculates
its maximal non-shifting subset:
Max_Non_Shifting(S)
	if S is empty
		return S
	x = first(S)
	V = S
	while true
		V = Trim(V, {x})
		if there are elements of V greater than x
			x = first of such elements
		else
			return V
Let M be the set of all values of x we have observed in the loop.
Consider the final value of V.  By (C) it's equal to Trim(S, M)
and by (B) it contains any non-shifting subset of S.  But for any
element that has remained in V until the end there must have been
an iteration with x equal to that element, so V is a subset of M.
In other words,
	Trim(V, V) = Trim(Trim(S, M), V)
		   = Trim(S, union of M and V)
		   = Trim(S, M)
		   = V
so by (E) V is non-shifting.
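As a sanity check, H1 can be run on the reproducer configuration; a Python
model (representation is mine - mounts are strings ordered lexically,
on_root[m] stands for mountpoint(m) == root(parent(m)), and a literal
brute-force Trim is inlined):

```python
def trim(S, T, parent, on_root):
    # brute-force Trim(S, T), straight from the definition
    children = {}
    for m, p in parent.items():
        children.setdefault(p, set()).add(m)

    def excluded(x):
        def dfs(node, first_step):
            for c in children.get(node, ()):
                if first_step and on_root.get(c):
                    continue        # x_1 must be strictly inside x_0
                if c not in S:
                    if node in T:
                        return True
                elif dfs(c, False):
                    return True
            return False
        return dfs(x, True)

    return {x for x in S if not excluded(x)}

def max_non_shifting(S, parent, on_root):
    # H1: repeatedly Trim(V, {x}), advancing x through the survivors
    V = set(S)
    if not V:
        return V
    x = min(V)
    while True:
        V = trim(V, {x}, parent, on_root)
        later = [e for e in V if e > x]
        if not later:
            return V
        x = min(later)

# the reproducer configuration again
parent = {"B1/2": "B1", "over": "B1/2", "x": "over"}
on_root = {"B1/2": False, "over": True, "x": False}
S = {"B1", "B1/2"}
```

On this configuration the loop trims B/1 out on the x == "B1/2" iteration
and leaves {"B1/2"}, i.e. the lower B/1/2 is unmounted and B/1 survives.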

	H2) The following is O(#S) equivalent of S = Max_Non_Shifting(S),
assuming we can mark the elements and originally no marks are set:

Trim_to_non_shifting
	for (x = first(S); x; x = Trim_one(x))
		;
where
Trim_one(m)
	remove_this = false;
	found = false;
	for each n in children(m) {
		if (n not in S) {
			found = true;
			if (mountpoint(n) != root(m)) {
				remove_this = true;
				break;
			}
		}
	}
	if (!found)
		return next(m);
	for (this = m, p = parent(m); !marked(this) && p in S; this = p, p = parent(p)) {
		mark(this);
		if (mountpoint(this) != root(p)) {
			unmark(p);
			remove p from S;
		}
	}
	res = next(m);
	if (remove_this) {
		unmark(m);
		remove m from S;
	}
	return res;

Trim_one(m) will
	* replace S with Trim(S, {m})
	* return NULL if no elements greater than m remain in S
	* return the smallest of such elements otherwise
	* maintain the following invariants:
		* only elements of S are marked
		* if p is marked and
		     (x_0, x_1, ..., x_k = p) is an ancestry chain and
		     all x_i belong to S
		  then
		     for any i > 0  x_i overmounts the root of x_{i-1}
Proof:
	Consider the chains excluding elements from Trim(S, {m}).
The last two elements in each are m and some child of m that does not
belong to S.  If no such children exist, Trim(S, {m}) is equal to S
and next(m) is the correct return value.
	m itself is removed if and only if the chain has exactly two
elements, i.e. when the last element does not overmount the root of
m.  In other words, if there is a child not in S that does not overmount
the root of m.  Once we are done looking at the children 'remove_this'
is true if and only if m itself needs to be removed.
	All other elements to remove will be ancestors of m, such that
the entire descent chain from them to m is contained in S.  Let
(x_0, x_1, ..., x_k = m) be the longest such chain.  x_i needs to
be removed if and only if x_{i+1} does not overmount its root.
	Note that due to the invariant for marks, finding x_{i+1}
marked means that none of its ancestors will qualify for removal.
What's more, after we have found and removed everything we needed,
all remaining elements of the chain can be marked - their nearest
ancestors that do not overmount their parent's root will be outside
of S.  Since ancestry chains can't loop, we can set the marks as
we go - we won't be looking at that element again until after all
removals.
	If Trim_one(m) needs to remove m, it does that after all
other removals.  Once those removals (if any) are done m is still
an element of our set, so the smallest remaining element greater
than m is equal to next(m).  Once it is found, we can finally remove
m itself.
	Time complexity:
* we get no more than O(#S) calls of Trim_one()
* the loop over children in Trim_one() never looks at the same child
twice through all the calls.
* iterations of that loop for children in S are no more than O(#S)
in the worst case
* at most two children that are not elements of S are considered per
call of Trim_one().
* the second loop in Trim_one() sets the mark once per iteration and
no element of S is marked more than once.
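A Python model of H2 (representation is mine: on_root[m] stands for
mountpoint(m) == root(parent(m)); note that nxt() here is a linear scan,
so the model itself is not O(#S) - the real thing keeps S ordered so that
finding the next element is cheap):

```python
def trim_to_non_shifting(S, parent, on_root, children):
    # linear trimming with marks; children lists *all* children,
    # both in and out of S
    S = set(S)
    marked = set()

    def nxt(m):
        later = [x for x in S if x > m]
        return min(later) if later else None

    def trim_one(m):
        remove_this = found = False
        for n in children.get(m, ()):
            if n not in S:
                found = True
                if not on_root.get(n):
                    remove_this = True
                    break
        if found:
            this, p = m, parent.get(m)
            while this not in marked and p in S:
                marked.add(this)
                if not on_root.get(this):    # this is strictly inside p
                    marked.discard(p)
                    S.discard(p)
                this, p = p, parent.get(p)
        res = nxt(m)
        if remove_this:
            marked.discard(m)
            S.discard(m)
        return res

    if S:
        x = min(S)
        while x is not None:
            x = trim_one(x)
    return S

# the reproducer: only B/1 and the lower B/1/2 are in the set
parent1 = {"B1/2": "B1", "over": "B1/2", "x": "over"}
on_root1 = {"B1/2": False, "over": True, "x": False}
children1 = {"B1": ["B1/2"], "B1/2": ["over"], "over": ["x"]}

# a chain a <- b <- c with a stray mount under c; everything gets trimmed
parent2 = {"b": "a", "c": "b", "d": "c"}
on_root2 = {"b": False, "c": False, "d": False}
children2 = {"a": ["b"], "b": ["c"], "c": ["d"]}
```

The first configuration trims down to {"B1/2"}; the second empties the
set, since the stray mount under c forbids the entire ancestry chain.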

	H3) In practice we have a large closed non-revealing subset in S.
It can be used to reduce the amount of work.  What's more, we can have
all elements of U removed from their parents' lists of children, which
considerably simplifies life.

// S is represented as a disjoint union of a non-revealing closed
// set U and a set C.  Everything in U has been removed from their parents'
// lists of children.

Trim_to_non_shifting2
	for (x = first(C); x; x = Trim_one(x))
		;
where
Trim_one(m)
	remove_this = false;
	found = false;
	umount_this = false;
	for each n in children(m) {
		if (n not in C) {
			found = true;
			if (mountpoint(n) != root(m)) {
				remove_this = true;
				break;
			}
		}
	}
	if (found) {
		this = m;
		for (p = parent(this);
		     !marked(this) && p in C;
	     this = p, p = parent(p)) {
			mark(this);
			if (mountpoint(this) != root(p)) {
				unmark(p);
				remove p from C;
			}
		}
	} else if (!locked(m) && children(m) is empty) {
		remove_this = true;
		umount_this = true;
	}
	res = next(m);
	if (remove_this) {
		unmark(m);
		remove m from C;
		if (umount_this) {
			remove m from children(parent(m))
			add m to U;
		}
	}
	return res;

Equivalent to Trim_to_non_shifting, modulo representation.  We know
that Trim_one() is a no-op when its argument belongs to U, we won't
see any elements of U in lists of children, so we only need to check
for belonging to C and we know that parent of anything not in U can't
be in U, so the ancestor-trimming loop won't run into anything in U -
test for belonging to S is equivalent to test for belonging to C there.

Simple candidates for adding to U (not locked and no children
not in U) get moved from C to U - U remains closed non-revealing
and union remains unchanged.

Time complexity is O(#C) here.
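A Python model of Trim_to_non_shifting2 (representation is mine; children
maps each mount to its still-attached children, so elements of U do not
appear in it, and the function detaches what it moves to U, like the
real thing):

```python
def trim_to_non_shifting2(C, U, parent, on_root, children, locked):
    # C: undecided candidates; U: committed non-revealing set, already
    # detached from parents.  children is mutated as mounts move to U.
    C, U = set(C), set(U)
    order = sorted(C)
    marked = set()

    def nxt(m):
        for x in order:
            if x > m and x in C:
                return x
        return None

    def trim_one(m):
        remove_this = found = umount_this = False
        for n in children.get(m, ()):
            if n not in C:
                found = True
                if not on_root.get(n):
                    remove_this = True
                    break
        if found:
            this, p = m, parent.get(m)
            while this not in marked and p in C:
                marked.add(this)
                if not on_root.get(this):
                    marked.discard(p)
                    C.discard(p)
                this, p = p, parent.get(p)
        elif not locked.get(m) and not children.get(m):
            # unlocked and childless - move straight to U
            remove_this = umount_this = True
        res = nxt(m)
        if remove_this:
            marked.discard(m)
            C.discard(m)
            if umount_this:
                children[parent[m]].discard(m)
                U.add(m)
        return res

    x = min(C) if C else None
    while x is not None:
        x = trim_one(x)
    return C, U

# reproducer plus an extra unlocked, childless candidate z under B/1
parent = {"B1/2": "B1", "over": "B1/2", "z": "B1"}
on_root = {"B1/2": False, "over": True, "z": False}
children = {"B1": {"B1/2", "z"}, "B1/2": {"over"}}
C2, U2 = trim_to_non_shifting2({"B1", "B1/2", "z"}, set(),
                               parent, on_root, children, {})
```

B/1 gets rejected via the ancestor loop, B/1/2 stays in C, and z is
moved to U and detached on the spot.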


		Maximal non-revealing subsets

If S is not a non-revealing subset, there is a locked element x
in S such that parent of x is not in S.

Obviously, no non-revealing subset of S may contain x.  Removing
such elements one by one will obviously end with the maximal
non-revealing subset (possibly empty one).  Note that removal
of an element will require removal of all its locked children,
etc.

If the set had been non-shifting, it will remain non-shifting
after such removals.
Proof: suppose S was non-shifting, x is a locked element of S,
parent of x is not in S and S - {x} is not non-shifting.  Then
there is an element m in S - {x} and a subtree mounted strictly
inside m, such that the subtree contains an element not in S - {x}.
Since S is non-shifting, everything in that subtree must belong
to S.  But that means that this subtree must contain x somewhere
*and* that the parent of x either belongs to that subtree or is
equal to m.  Either way it must belong to S.  Contradiction.
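The removal loop is trivial to model (Python, names mine; locked[m] is
the mount's locked flag):

```python
def max_non_revealing(S, parent, locked):
    # repeatedly drop locked mounts whose parent has left the set;
    # each removal can expose further locked children to removal
    S = set(S)
    while True:
        exposed = {m for m in S if locked.get(m) and parent.get(m) not in S}
        if not exposed:
            return S
        S -= exposed

# a chain a <- b <- c, with a never in the set
parent = {"b": "a", "c": "b"}
```

With both b and c locked everything goes; with b unlocked the set is
already non-revealing; with only b locked its removal does not take
the unlocked c with it.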

// same representation as for finding maximal non-shifting subsets:
// S is a disjoint union of non-revealing set U and a set C.
// Elements of U are removed from their parents' lists of children.
// in the end C becomes empty and maximal non-revealing non-shifting
// subset of S is now in U
Trim_locked
	while (C is non-empty)
		handle_locked(first(C))

handle_locked(m)
	cutoff = m
	for (p = m; p in C; p = parent(p)) {
		unmark(p);
		remove p from C;
		if (!locked(p))
			cutoff = parent(p)
	}
	if (p in U)
		cutoff = p
	while (m != cutoff) {
		remove m from children(parent(m));
		add m to U;
		m = parent(m);
	}

Let (x_0, ..., x_n = m) be the maximal chain of descent of
m within S.
* If it contains some elements of U, let x_k be the last one
of those.  Then union of U with {x_{k+1}, ..., x_n} is obviously
non-revealing.
* otherwise if all its elements are locked, then none of
{x_0, ..., x_n} may be elements of a non-revealing subset of S.
* otherwise let x_k be the first unlocked element of the chain.
Then none of {x_0, ..., x_{k-1}} may be an element of a non-revealing
subset of S and union of U and {x_k, ..., x_n} is non-revealing.

handle_locked(m) finds which of these cases applies and adjusts
C and U accordingly.  U remains non-revealing, union of C and U
still contains any non-revealing subset of S and after the
call of handle_locked(m) m is guaranteed to be not in C.
So having it called for each element of S would suffice to
empty C, leaving U the maximal non-revealing subset of S.

However, handle_locked(m) is a no-op when m belongs to U,
so it's enough to have it called for elements of C while there
are any.

Time complexity: number of calls of handle_locked() is
limited by #C, each iteration of the first loop in handle_locked()
removes an element from C, so their total number of executions is
also limited by #C; number of iterations in the second loop is
no greater than the number of iterations of the first loop.
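A toy model of handle_locked() over plain sets (Python, names mine; marks
are omitted since they only matter for the trimming phase, and min()
stands in for first(C)):

```python
def trim_locked(C, U, parent, locked):
    # empties C, growing the non-revealing set U
    C, U = set(C), set(U)

    def handle_locked(m):
        cutoff = m
        p = m
        while p in C:
            C.discard(p)
            if not locked.get(p):
                cutoff = parent.get(p)
            p = parent.get(p)
        if p in U:
            cutoff = p
        while m != cutoff:       # move the accepted tail into U
            U.add(m)
            m = parent.get(m)

    while C:
        handle_locked(min(C))
    return U

# a chain a <- b <- c; a is outside C
parent = {"b": "a", "c": "b"}
```

With both b and c locked and nothing in U, everything is rejected; an
unlocked b rescues the locked c above it; if a is already in U, the
locked chain hanging off it is accepted wholesale.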


		Reparenting

After we'd calculated the final set, we still need to deal with
reparenting - if an element of the final set has a child not
in it, we need to reparent such child.

Such children can only be root-overmounting (otherwise the
set wouldn't be non-shifting) and their parents can not belong
to the original set, since the original is guaranteed to
be closed.
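Modeled in Python (names mine; mountpoints are just strings naming the
attachment point):

```python
def reparent(m, parent, mountpoint, will_be_unmounted):
    # walk up through mounts that are going away; m ends up attached
    # where the bottom of that stack used to be
    p = m
    while True:
        mp = mountpoint[p]
        p = parent[p]
        if not will_be_unmounted(p):
            break
    parent[m] = p
    mountpoint[m] = mp

# the reproducer outcome: the lower B/1/2 is unmounted, so the overmount
# on top of it moves to B/1 at the old B/1/2 mountpoint
parent = {"B1/2": "B1", "over": "B1/2"}
mountpoint = {"B1/2": "1/2", "over": "/"}
doomed = {"B1/2"}
reparent("over", parent, mountpoint, lambda m: m in doomed)
```

After the call the overmount hangs directly off B/1 at "1/2" - the
single reparenting step the correct behaviour calls for.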

		Putting all of that together

The plan is to
	* find all candidates
	* trim down to maximal non-shifting subset
	* trim down to maximal non-revealing subset
	* reparent anything that needs to be reparented
	* return the resulting set to the caller

For the 2nd and 3rd steps we want to separate the set into
growing non-revealing subset, initially containing the
original set ("U" in terms of the pseudocode above) and
everything we are still not sure about ("C").  It means
that for the output of the 1st step we'd like the extra
candidates separated from the stuff already in the original
set.  For the 4th step we would like the additions to U
separate from the original set.

So let's go for
	* original set ("set").  Linkage via mnt_list
	* unclassified candidates ("C").  Linkage via mnt_umounting.
	* anything in U that hadn't been in the original set -
elements of C will gradually be either discarded or moved
there.  In other words, it's the candidates we have already
decided to unmount.  Its role is reasonably close to the
old "to_umount", so let's use that name.  Linkage via mnt_list.

For gather_candidates() we'll need to maintain both C (S - set) and
intersection of S with set, with the latter emptied once we are done.
Use mnt_umounting for both, that'll give a cheap way to check belonging
to their union (in gather_candidates()) and to C itself (at later stages).
Call that predicate is_candidate(); it would be !list_empty(&m->mnt_umounting)

All elements of the original set are marked with MNT_UMOUNT and we'll need
the same for elements added when joining the contents of to_umount to set in
the end.  Let's set MNT_UMOUNT at the time we add an element to to_umount;
that's close to what the old 'umount_one' is doing, so let's keep that name.
It also gives us another predicate we need - "belongs to union of set and
to_umount"; will_be_unmounted() for now.

Another helper - unmark and remove from C; remove_candidate(m)

propagate_unmount(set)
{
	// C <- candidates
	gather_candidates(set);

	// make set + C + to_umount non-shifting
	for (m = first(C); m; m = trim_one(m))
		;

	// ... and non-revealing
	while (!empty(C))
		handle_locked(first(C));

	// now to_umount consists of all non-rejected candidates
	// deal with reparenting of remaining overmounts on those
	list_for_each_entry(m, to_umount, mnt_list) {
		while (!empty(children(m))) // should be at most one
			reparent(first(children(m)))
	}

	list_splice_tail_init(set, &to_umount);
}

gather_candidates(set)
{
	LIST_HEAD(visited)	// intersection of S with set
	C = {}			// S - set, i.e. candidates

	list_for_each_entry(m, set, mnt_list) {
		if (is_candidate(m))
			continue;
		list_add(&m->mnt_umounting, &visited);
		p = parent(m);
		q = propagation_next(p, p);
		while (q) {
			child = look_up(q, mountpoint(m));
			if (child) {
				if (is_candidate(child)) {
					q = skip_them(q, p);
					continue;
				}
				if (will_be_unmounted(child))
					head = &visited;
				else
					head = &C;
				list_add(&child->mnt_umounting, head);
			}
			q = propagation_next(q, p);
		}
	}
	while (!list_empty(&visited))	// empty visited
		list_del_init(visited.next);
}

trim_one(m)
{
	bool remove_this = false, found = false, umount_this = false;
	struct mount *res, *n;

	for each n in children(m) {
		if (!is_candidate(n)) {
			found = true;
			if (mountpoint(n) != root(m)) {
				remove_this = true;
				break;
			}
		}
	}
	if (found) {
		struct mount *this = m;
		for (struct mount *p = parent(this);
		     !marked(this) && is_candidate(p);
		     this = p, p = parent(p)) {
			mark(this);
			if (mountpoint(this) != root(p))
				remove_candidate(p);
		}
	} else if (!locked(m) && children(m) is empty) {
		remove_this = true;
		umount_this = true;
	}
	res = next(m);
	if (remove_this) {
		remove_candidate(m);
		if (umount_this)
			umount_one(m);
	}
	return res;
}

handle_locked(m)
{
	struct mount *cutoff = m, *p;

	for (p = m; is_candidate(p); p = parent(p)) {
		remove_candidate(p);
		if (!locked(p))
			cutoff = parent(p);
	}
	if (will_be_unmounted(p))
		cutoff = p;
	while (m != cutoff) {
		umount_one(m);
		m = parent(m);
	}
}

reparent(m)
{
	struct mount *p = m;

	do {
		mp = p->mnt_mp;
		p = parent(p);
	} while (will_be_unmounted(p));
	mnt_change_mountpoint(p, mp, m);
}

is_candidate(m)
	return !list_empty(&m->mnt_umounting);

will_be_unmounted(m)
	return m->mnt.mnt_flags & MNT_UMOUNT;

umount_one(m)	// does *not* unmark or remove from C, unlike the old namesake
{
	m->mnt.mnt_flags |= MNT_UMOUNT;
	list_del_init(&m->mnt_child); // remove from parent's list of children
	move_from_ns(m, &to_umount);
}

remove_candidate(m)
{
	unmark(m);
	list_del_init(&m->mnt_umounting); // remove m from C
}


* [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-15 11:47       ` Al Viro
@ 2025-05-16  5:21         ` Al Viro
  2025-05-19 18:11           ` Linus Torvalds
  2025-05-20 11:10           ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Christian Brauner
  0 siblings, 2 replies; 16+ messages in thread
From: Al Viro @ 2025-05-16  5:21 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

> On Thu, May 15, 2025 at 12:41:50PM +0100, Al Viro wrote:
> > On Tue, May 13, 2025 at 04:56:22AM +0100, Al Viro wrote:
> > 
> > > And yes, it's going to be posted along with the proof of correctness -
> > > I've had it with the amount of subtle corner cases in that area ;-/
> > 
> > FWIW, it seems to survive testing; it's _not_ final - I'm still editing
> > the promised proof, but by this point it's stylistic work.  I hadn't
> > touched that kind of formal writing for quite a while, so the text is clumsy.
> > The code changes are pretty much in the final shape.  Current diff of
> > code (all in fs/pnode.*) relative to mainline:
> 
> ... and the current notes are below; they obviously need more editing ;-/

With some cleaning the notes up, see

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git propagate_umount

The only commit there (on top of -rc5) below; it seems to survive tests I'd
been able to find.  Please, review and hit that thing with any tests you might
have sitting around.

Note that in new variant skip_propagation_subtree() call is *not* followed
by propagation_next() - it does what combination would've done (and it
skips everything that gets propagation from that mount, peers included).

Nobody except propagate_umount() had been using it, so there's no other
callers to be broken by semantics change...

D/f/propagate_umount.txt is done as plain text; if anyone cares to turn that
into ReST - wonderful.  If anyone has editing suggestions (it's still pretty
raw as a text and I'm pretty sure that it can be improved stylistically) -
even better.

---
[PATCH] Rewrite of propagate_umount()

The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).

I tried to prove that it's the only bug there; I'm still not sure
whether it is.  If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation.  Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.

I hoped to do transformation as a massage series, but that turns out
to be too convoluted.  So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).

As far as I can tell, it is provably correct and provably linear in the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
new file mode 100644
index 000000000000..d8367ba3ea82
--- /dev/null
+++ b/Documentation/filesystems/propagate_umount.txt
@@ -0,0 +1,433 @@
+	Notes on propagate_umount()
+
+Umount propagation starts with a set of mounts we are already going to
+take out.  Ideally, we would like to add all downstream cognates to
+that set - anything with the same mountpoint as one of the removed
+mounts and with parent that would receive events from the parent of that
+mount.  However, there are some constraints the resulting set must
+satisfy.
+
+It is convenient to define several properties of sets of mounts:
+
+1) A set S of mounts is non-shifting if for any mount X belonging
+to S all subtrees mounted strictly inside of X (i.e. not overmounting
+the root of X) contain only elements of S.
+
+2) A set S is non-revealing if all locked mounts that belong to S have
+parents that also belong to S.
+
+3) A set S is closed if it contains all children of its elements.
+
+The set of mounts taken out by umount(2) must be non-shifting and
+non-revealing; the first constraint is what allows us to reparent
+any remaining mounts and the second is what prevents the exposure
+of any concealed mountpoints.
+
+propagate_umount() takes the original set as an argument and tries to
+extend that set.  The original set is a full subtree and its root is
+unlocked; what matters is that it's closed and non-revealing.
+Resulting set may not be closed; there might still be mounts outside
+of that set, but only on top of stacks of root-overmounting elements
+of set.  They can be reparented to the place where the bottom of
+stack is attached to a mount that will survive.  NOTE: doing that
+will violate a constraint on having no more than one mount with
+the same parent/mountpoint pair; however, the caller (umount_tree())
+will immediately remedy that - it may keep unmounted element attached
+to parent, but only if the parent itself is unmounted.  Since all
+conflicts created by reparenting have common parent *not* in the
+set and one side of the conflict (bottom of the stack of overmounts)
+is in the set, it will be resolved.  However, we rely upon umount_tree()
+doing that pretty much immediately after the call of propagate_umount().
+
+Algorithm is based on two statements:
+	1) for any set S, there is a maximal non-shifting subset of S
+and it can be calculated in O(#S) time.
+	2) for any non-shifting set S, there is a maximal non-revealing
+subset of S.  That subset is also non-shifting and it can be calculated
+in O(#S) time.
+
+		Finding candidates.
+
+We are given a closed set U and we want to find all mounts that have
+the same mountpoint as some mount m in U *and* whose parent receives
+propagation from the parent of the same mount m.  Naive implementation
+would be
+	S = {}
+	for each m in U
+		add m to S
+		p = parent(m)
+		for each q in Propagation(p) - {p}
+			child = look_up(q, mountpoint(m))
+			if child
+				add child to S
+but that can lead to excessive work - there might be propagation among the
+subtrees of U, in which case we'd end up examining the same candidates
+many times.  Since propagation is transitive, the same will happen to
+everything downstream of that candidate and it's not hard to construct
+cases where the approach above leads to time quadratic in the actual
+number of candidates.
+
+Note that if we run into a candidate we'd already seen, it must've been
+added on an earlier iteration of the outer loop - all additions made
+during one iteration of the outer loop have different parents.  So
+if we find a child already added to the set, we know that everything
+in Propagation(parent(child)) with the same mountpoint has been already
+added.
+	S = {}
+	for each m in U
+		if m in S
+			continue
+		add m to S
+		p = parent(m)
+		q = propagation_next(p, p)
+		while q
+			child = look_up(q, mountpoint(m))
+			if child
+				if child in S
+					q = skip_them(q, p)
+					continue;
+				add child to S
+			q = propagation_next(q, p)
+where
+skip_them(q, p)
+	keep walking Propagation(p) from q until we find something
+	not in Propagation(q)
+
+would get rid of that problem, but we need a sane implementation of
+skip_them().  That's not hard to do - split propagation_next() into
+"down into mnt_slave_list" and "forward-and-up" parts, with the
+skip_them() being "repeat the forward-and-up part until we get NULL
+or something that isn't a peer of the one we are skipping".
+
+Note that there can be no absolute roots among the extra candidates -
+they all come from mount lookups.  Absolute root among the original
+set is _currently_ impossible, but it might be worth protecting
+against.
+
+		Maximal non-shifting subsets.
+
+Let's call a mount m in a set S forbidden in that set if there is a
+subtree mounted strictly inside m and containing mounts that do not
+belong to S.
+
+The set is non-shifting when none of its elements are forbidden in it.
+
+If mount m is forbidden in a set S, it is forbidden in any subset S' it
+belongs to.  In other words, it can't belong to any of the non-shifting
+subsets of S.  If we had a way to find a forbidden mount or show that
+there's none, we could use it to find the maximal non-shifting subset
+simply by finding and removing them until none remain.
+
+Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
+must have been forbidden in S itself.  Indeed, since m has descendants
+that do not belong to S, any subtree that fits into S will fit into
+S - {m} as well.
+
+So in principle we could go through elements of S, checking if they
+are forbidden in S and removing the ones that are.  Removals will
+not invalidate the checks done for earlier mounts - if they were not
+forbidden at the time we checked, they won't become forbidden later.
+It's too costly to be practical, but there is a similar approach that
+is linear by size of S.
+
+Let's say that mount x in a set S is forbidden by mount y, if
+	* both x and y belong to S.
+	* there is a chain of mounts starting at x and leaving S
+	  immediately after passing through y, with the first
+	  mountpoint strictly inside x.
+Note 1: x may be equal to y - that's the case when something not
+belonging to S is mounted strictly inside x.
+Note 2: if y does not belong to S, it can't forbid anything in S.
+Note 3: if y has no children outside of S, it can't forbid anything in S.
+
+It's easy to show that mount x is forbidden in S if and only if x is
+forbidden in S by some mount y.  And it's easy to find all mounts in S
+forbidden by a given mount.
+
+Consider the following operation:
+	Trim(S, m) = S - {x : x is forbidden by m in S}
+
+Note that if m does not belong to S or has no children outside of S we
+are guaranteed that Trim(S, m) is equal to S.
+
+The following is true: if x is forbidden by y in Trim(S, m), it was
+already forbidden by y in S.
+
+Proof: Suppose x is forbidden by y in Trim(S, m).  Then there is a
+chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
+is the first element that doesn't belong to Trim(S, m) and the
+mountpoint of x_1 is strictly inside x.  If mount r belongs to S, it must
+have been removed by Trim(S, m), i.e. it was forbidden in S by m.
+Then there was a mount chain from r to some child of m that stayed in
+S all the way until m, but that's impossible since x belongs to Trim(S, m)
+and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
+forbidden in S by m, and thus can't belong to Trim(S, m).
+Therefore r cannot belong to S and our chain demonstrates that
+x is forbidden by y in S.  QED.
+
+Corollary: no mount is forbidden by m in Trim(S, m).  Indeed, any
+such mount would have been forbidden by m in S and thus would have been
+in the part of S removed in Trim(S, m).
+
+Corollary: no mount is forbidden by m in Trim(Trim(S, m), n).  Indeed,
+any such would have to have been forbidden by m in Trim(S, m), which
+is impossible.
+
+Corollary: after
+	S = Trim(S, x_1)
+	S = Trim(S, x_2)
+	...
+	S = Trim(S, x_k)
+no mount remaining in S will be forbidden by either of x_1,...,x_k.
+
+If the S is ordered, the following will reduce it to the maximal non-shifting
+subset:
+	if S is not empty
+		m = first(S)
+		while true
+			S = Trim(S, m)
+			if there are no elements of S greater than m
+				break
+			m = first of such elements
+
+It's easy to see that at the beginning of each iteration all elements still
+remaining in S and preceding the current value of m will have already been
+seen as values of m.  By the time we leave the loop all elements remaining
+in S will have been seen as values of m.
+
+We also know that no mount remaining in S will be forbidden by any
+of the values of m we have observed in the loop.  In other words,
+no mount remaining in S will be forbidden, i.e. final value of S will
+be non-shifting.  It will be the maximal non-shifting subset, since we
+were removing only forbidden elements.
+
+Implementation notes:
+
+	The following reduces S to the maximal non-shifting subset
+in O(#S), assuming that S is ordered, we can mark the elements and
+originally no marks are set:
+
+	for (x = first(S); x; x = Trim_one(x))
+		;
+where
+Trim_one(m)
+	remove_this = false
+	found = false
+	for each n in children(m)
+		if n not in S
+			found = true
+			if (mountpoint(n) != root(m))
+				remove_this = true
+				break
+	if found
+		Trim_ancestors(m)
+	res = next(m)
+	if remove_this
+		unmark m and remove it from S
+	return res
+
+Trim_ancestors(m)
+	for (p = parent(m); p in S; m = p, p = parent(p)) {
+		if m is marked // all elements beneath are overmounts
+			return
+		mark m
+		if (mountpoint(m) != root(p))
+			unmark p and remove it from S
+	}
+
+Trim_one(m) will
+	* replace S with Trim(S, m)
+	* return NULL if no elements greater than m remain in S
+	* return the smallest of such elements otherwise
+	* maintain the following invariants:
+		* only elements of S are marked
+		* if p is marked and
+		     (x_0, x_1, ..., x_k = p) is an ancestry chain and
+		     all x_i belong to S
+		  then
+		     for any i > 0  x_i overmounts the root of x_{i-1}
+Proof:
+	Consider the chains excluding elements from Trim(S, m).  The last
+two elements in each are m and some child of m that does not belong to S.
+If there are no such children, Trim(S, m) is equal to S and next(m) is the
+correct return value.
+
+	m itself is removed if and only if the chain has exactly two
+elements, i.e. when the last element does not overmount the root of m.
+In other words, if there is a child not in S that does not overmount
+the root of m.	Once we are done looking at the children 'remove_this'
+is true if and only if m itself needs to be removed.
+
+	All other elements to remove will be ancestors of m, such that
+the entire descent chain from them to m is contained in S.  Let
+(x_0, x_1, ..., x_k = m) be the longest such chain.  x_i needs to be
+removed if and only if x_{i+1} does not overmount its root.
+
+	Note that due to the invariant for marks, finding x_{i+1} marked
+means that none of its ancestors will qualify for removal.  What's more,
+after we have found and removed everything we needed, all remaining
+elements of the chain can be marked - their nearest ancestors that do
+not overmount their parent's root will be outside of S.  Since ancestry
+chains can't loop, we can set the marks as we go - we won't be looking
+at that element again until after all removals.
+
+	If Trim_one(m) needs to remove m, it does that after all other
+removals.  Once those removals (if any) are done m is still an element
+of our set, so the smallest remaining element greater than m is equal
+to next(m).  Once it is found, we can finally remove m itself.
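The Trim_one()/Trim_ancestors() pseudocode can be exercised on a toy model.  The
following Python sketch (Mount, mnt() and the sample tree are hypothetical
stand-ins for illustration, not the kernel structures) implements it over an
ordered candidate list; "n overmounts the root of m" is modeled as
n.mountpoint == m.root:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # identity-based equality, as for real mounts
class Mount:
    name: str
    parent: "Mount | None" = None
    mountpoint: str = "/"     # where it is attached in the parent
    root: str = "/"           # root of the mount's own filesystem
    children: list = field(default_factory=list)
    marked: bool = False

def mnt(name, parent=None, mountpoint="/"):
    m = Mount(name, parent, mountpoint)
    if parent is not None:
        parent.children.append(m)
    return m

def trim_ancestors(m, S):
    p = m.parent
    while p in S:
        if m.marked:                   # all candidates beneath are overmounts
            return
        m.marked = True
        if m.mountpoint != p.root:     # m does not overmount the root of p
            p.marked = False           # unmark p and remove it from S
            S.remove(p)
        m, p = p, p.parent

def trim_to_non_shifting(S):
    """Reduce the ordered list S to its maximal non-shifting subset, in place."""
    i = 0
    while i < len(S):
        m = S[i]
        remove_this = found = False
        for n in m.children:
            if n not in S:
                found = True
                if n.mountpoint != m.root:
                    remove_this = True
                    break
        if found:
            trim_ancestors(m, S)
        i = S.index(m)                 # removals above may have shifted m
        if remove_this:
            m.marked = False
            S.remove(m)                # the next element now sits at index i
        else:
            i += 1
    return S

# A <- B (at /b) <- C (overmounting B's root) <- D (at /d, not a candidate)
A = mnt("A"); B = mnt("B", A, "/b"); C = mnt("C", B, "/"); D = mnt("D", C, "/d")
result = trim_to_non_shifting([A, B, C])
```

On this sample A is dropped (B does not overmount its root), C is dropped (its
child D is not a candidate and does not overmount C's root), and {B} survives -
B's only child outside the set overmounts B's root, so the result is the
maximal non-shifting subset.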
+
+	Time complexity:
+* we get no more than O(#S) calls of Trim_one()
+* the loop over children in Trim_one() never looks at the same child
+twice through all the calls.
+* iterations of that loop for children in S are no more than O(#S)
+in the worst case
+* at most two children that are not elements of S are considered per
+call of Trim_one().
+* the second loop in Trim_one() sets the mark once per iteration and
+no element of S has its mark set more than once.
+
+	In practice we have a large closed non-revealing subset in S -
+the mounts we are already committed to unmounting.  It can be used to
+reduce the amount of work.  What's more, we can have all elements of that
+subset removed from their parents' lists of children, which considerably
+simplifies life.
+
+	Since the committed subset is closed, passing one of its elements
+to Trim_one() is a no-op - it doesn't have any children outside of S.
+No matter what we pass to Trim_one(), Trim_ancestors() will never look
+at any elements of the committed subset - it's not called by Trim_one()
+if its argument belongs to that subset and it can't walk into that subset
+unless it has started there.
+
+	So it's useful to keep the sets of committed and undecided
+candidates separately; Trim_one() needs to be called only for elements
+of the latter.
+
+	What's more, if we see that children(m) is empty and m is not
+locked, we can immediately move m into the committed subset (remove
+from the parent's list of children, etc.).  That's one fewer mount we'll
+have to look into when we check the list of children of its parent *and*
+when we get to building the non-revealing subset.
+
+		Maximal non-revealing subsets
+
+If S is not a non-revealing subset, there is a locked element x in S
+such that parent of x is not in S.
+
+Obviously, no non-revealing subset of S may contain x.  Removing such
+elements one by one will end with the maximal non-revealing subset
+(possibly an empty one).  Note that removal of an element will require
+removal of all its locked children, etc.
+
+If the set had been non-shifting, it will remain non-shifting after
+such removals.
+Proof: suppose S was non-shifting, x is a locked element of S, parent of x
+is not in S and S - {x} is not non-shifting.  Then there is an element m
+in S - {x} and a subtree mounted strictly inside m, such that the subtree
+contains an element not in S - {x}.  Since S is non-shifting, everything in
+that subtree must belong to S.  But that means that this subtree must
+contain x somewhere *and* that parent of x either belongs to that subtree
+or is equal to m.  Either way it must belong to S.  Contradiction.
+
+// same representation as for finding maximal non-shifting subsets:
+// S is a disjoint union of a non-revealing set U (the ones we are committed
+// to unmount) and a set of candidates, C.
+// Elements of U are removed from their parents' lists of children.
+// In the end the set of candidates becomes empty and the maximal
+// non-revealing non-shifting subset of S is now in U
+	while (candidates is non-empty)
+		handle_locked(first(candidates))
+
+handle_locked(m)
+	cutoff = m
+	for (p = m; p in candidates; p = parent(p)) {
+		unmark(p) and remove p from candidates;
+		if (!locked(p))
+			cutoff = parent(p)
+	}
+	if (p in U)
+		cutoff = p
+	while (m != cutoff) {
+		remove m from children(parent(m));
+		add m to U;
+		m = parent(m);
+	}
+
+Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
+* If it contains some elements of U, let x_k be the last one of those.
+Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
+* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
+may be elements of a non-revealing subset of S.
+* otherwise let x_k be the first unlocked element of the chain.  Then none
+of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
+S and union of U and {x_k, ..., x_n} is non-revealing.
+
+handle_locked(m) finds which of these cases applies and adjusts the
+candidate set C and U accordingly.  U remains non-revealing, the union
+of C and U still contains any non-revealing subset of S and after the
+call of handle_locked(m) m is guaranteed to be not in C.  So having it
+called for each element of S would suffice to empty C, leaving U the
+maximal non-revealing subset of S.
+
+However, handle_locked(m) is a no-op when m belongs to U, so it's enough
+to have it called for elements of C while there are any.
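The case analysis above can be checked on a toy model as well.  The following
Python sketch (again with hypothetical stand-ins for the mount structures, not
kernel code) follows the handle_locked() pseudocode on a chain
x0 <- x1 <- x2 <- x3, where x0 stays mounted:

```python
from dataclasses import dataclass

@dataclass(eq=False)          # identity-based equality
class Mount:
    name: str
    parent: "Mount | None" = None
    locked: bool = False

def max_non_revealing(candidates, U):
    """Move every acceptable candidate into U (assumed non-revealing),
       following the handle_locked() pseudocode."""
    def handle_locked(m):
        cutoff = m
        p = m
        while p in candidates:
            candidates.remove(p)   # "unmark and remove from candidates"
            if not p.locked:
                cutoff = p.parent
            p = p.parent
        if p in U:                 # the chain reaches the committed set
            cutoff = p
        while m is not cutoff:
            U.append(m)            # commit m to unmounting
            m = m.parent
    while candidates:
        handle_locked(candidates[0])
    return U

# x0 stays mounted; x1 is locked under it, x2 is unlocked, x3 is locked.
x0 = Mount("x0")
x1 = Mount("x1", x0, locked=True)
x2 = Mount("x2", x1, locked=False)
x3 = Mount("x3", x2, locked=True)
U = max_non_revealing([x1, x2, x3], [])
```

x1 is discarded (locked, with its parent staying mounted); x2 is the first
unlocked element of the chain, so it and everything above it end up committed
to unmounting - U comes out as {x2, x3}, regardless of the order in which the
candidates are handled.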
+
+Time complexity: the number of calls of handle_locked() is limited by #C,
+each iteration of the first loop in handle_locked() removes an element
+from C, so their total number of executions is also limited by #C;
+the number of iterations in the second loop is no greater than the number
+of iterations of the first loop.
+
+
+		Reparenting
+
+After we'd calculated the final set, we still need to deal with
+reparenting - if an element of the final set has a child not in it,
+we need to reparent such child.
+
+Such children can only be root-overmounting (otherwise the set wouldn't
+be non-shifting) and their parents can not belong to the original set,
+since the original is guaranteed to be closed.
+
+
+		Putting all of that together
+
+The plan is to
+	* find all candidates
+	* trim down to maximal non-shifting subset
+	* trim down to maximal non-revealing subset
+	* reparent anything that needs to be reparented
+	* return the resulting set to the caller
+
+For the 2nd and 3rd steps we want to separate the set into growing
+non-revealing subset, initially containing the original set ("U" in
+terms of the pseudocode above) and everything we are still not sure about
+("candidates").  It means that for the output of the 1st step we'd like
+the extra candidates separated from the stuff already in the original set.
+For the 4th step we would like the additions to U separate from the
+original set.
+
+So let's go for
+	* original set ("set").  Linkage via mnt_list
+	* undecided candidates ("candidates").  Linkage via mnt_umounting.
+	* anything in U that hadn't been in the original set - elements of
+candidates will gradually be either discarded or moved there.  In other
+words, it's the candidates we have already decided to unmount.	Its role
+is reasonably close to the old "to_umount", so let's use that name.
+Linkage via mnt_list.
+
+For gather_candidates() we'll need to maintain both candidates (S -
+set) and the intersection of S with set, with the latter emptied once we
+are done.  Use mnt_umounting for both; that'll give a cheap way to check
+membership in their union (in gather_candidates()) and in candidates
+itself (at later stages).  Call that predicate is_candidate(); it would
+be !list_empty(&m->mnt_umounting).
+
+All elements of the original set are marked with MNT_UMOUNT and we'll
+need the same for elements added when joining the contents of to_umount
+to set in the end.  Let's set MNT_UMOUNT at the time we add an element
+to to_umount; that's close to what the old 'umount_one' is doing, so
+let's keep that name.  It also gives us another predicate we need -
+"belongs to union of set and to_umount"; will_be_unmounted() for now.
+
+Another helper - unmark and remove from candidates: remove_candidate(m).
diff --git a/fs/pnode.c b/fs/pnode.c
index fb77427df39e..9b2f1ac80f25 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -24,11 +24,6 @@ static inline struct mount *first_slave(struct mount *p)
 	return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave);
 }
 
-static inline struct mount *last_slave(struct mount *p)
-{
-	return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave);
-}
-
 static inline struct mount *next_slave(struct mount *p)
 {
 	return list_entry(p->mnt_slave.next, struct mount, mnt_slave);
@@ -136,6 +131,23 @@ void change_mnt_propagation(struct mount *mnt, int type)
 	}
 }
 
+static struct mount *__propagation_next(struct mount *m,
+					 struct mount *origin)
+{
+	while (1) {
+		struct mount *master = m->mnt_master;
+
+		if (master == origin->mnt_master) {
+			struct mount *next = next_peer(m);
+			return (next == origin) ? NULL : next;
+		} else if (m->mnt_slave.next != &master->mnt_slave_list)
+			return next_slave(m);
+
+		/* back at master */
+		m = master;
+	}
+}
+
 /*
  * get the next mount in the propagation tree.
  * @m: the mount seen last
@@ -153,31 +165,26 @@ static struct mount *propagation_next(struct mount *m,
 	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
 		return first_slave(m);
 
-	while (1) {
-		struct mount *master = m->mnt_master;
-
-		if (master == origin->mnt_master) {
-			struct mount *next = next_peer(m);
-			return (next == origin) ? NULL : next;
-		} else if (m->mnt_slave.next != &master->mnt_slave_list)
-			return next_slave(m);
+	return __propagation_next(m, origin);
+}
 
-		/* back at master */
-		m = master;
-	}
+static inline bool peers(const struct mount *m1, const struct mount *m2)
+{
+	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
 
 static struct mount *skip_propagation_subtree(struct mount *m,
 						struct mount *origin)
 {
 	/*
-	 * Advance m such that propagation_next will not return
-	 * the slaves of m.
+	 * Advance m past everything that gets propagation from it.
 	 */
-	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
-		m = last_slave(m);
+	struct mount *p = __propagation_next(m, origin);
 
-	return m;
+	while (p && peers(m, p))
+		p = __propagation_next(p, origin);
+
+	return p;
 }
 
 static struct mount *next_group(struct mount *m, struct mount *origin)
@@ -216,11 +223,6 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct hlist_head *list;
 
-static inline bool peers(const struct mount *m1, const struct mount *m2)
-{
-	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
-}
-
 static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
 {
 	struct mount *child;
@@ -463,181 +465,219 @@ void propagate_mount_unlock(struct mount *mnt)
 	}
 }
 
-static void umount_one(struct mount *mnt, struct list_head *to_umount)
+static LIST_HEAD(to_umount);	// committed to unmounting those
+static LIST_HEAD(candidates);	// undecided unmount candidates
+
+static inline struct mount *first_candidate(void)
 {
-	CLEAR_MNT_MARK(mnt);
-	mnt->mnt.mnt_flags |= MNT_UMOUNT;
-	list_del_init(&mnt->mnt_child);
-	list_del_init(&mnt->mnt_umounting);
-	move_from_ns(mnt, to_umount);
+	if (list_empty(&candidates))
+		return NULL;
+	return list_first_entry(&candidates, struct mount, mnt_umounting);
 }
 
-/*
- * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
- * parent propagates to.
- */
-static bool __propagate_umount(struct mount *mnt,
-			       struct list_head *to_umount,
-			       struct list_head *to_restore)
+static inline bool is_candidate(struct mount *m)
 {
-	bool progress = false;
-	struct mount *child;
+	return !list_empty(&m->mnt_umounting);
+}
 
-	/*
-	 * The state of the parent won't change if this mount is
-	 * already unmounted or marked as without children.
-	 */
-	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
-		goto out;
+static inline bool will_be_unmounted(struct mount *m)
+{
+	return m->mnt.mnt_flags & MNT_UMOUNT;
+}
 
-	/* Verify topper is the only grandchild that has not been
-	 * speculatively unmounted.
-	 */
-	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
-			continue;
-		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
-			continue;
-		/* Found a mounted child */
-		goto children;
-	}
+static void umount_one(struct mount *m)
+{
+	m->mnt.mnt_flags |= MNT_UMOUNT;
+	list_del_init(&m->mnt_child);
+	move_from_ns(m, &to_umount);
+}
 
-	/* Mark mounts that can be unmounted if not locked */
-	SET_MNT_MARK(mnt);
-	progress = true;
+static void remove_candidate(struct mount *m)
+{
+	CLEAR_MNT_MARK(m);	// unmark on removal from candidates
+	list_del_init(&m->mnt_umounting);
+}
 
-	/* If a mount is without children and not locked umount it. */
-	if (!IS_MNT_LOCKED(mnt)) {
-		umount_one(mnt, to_umount);
-	} else {
-children:
-		list_move_tail(&mnt->mnt_umounting, to_restore);
+static void gather_candidates(struct list_head *set)
+{
+	LIST_HEAD(visited);
+	struct mount *m, *p, *q;
+
+	list_for_each_entry(m, set, mnt_list) {
+		if (is_candidate(m))
+			continue;
+		list_add(&m->mnt_umounting, &visited);
+		p = m->mnt_parent;
+		q = propagation_next(p, p);
+		while (q) {
+			struct mount *child = __lookup_mnt(&q->mnt,
+							   m->mnt_mountpoint);
+			if (child) {
+				struct list_head *head;
+
+				/*
+				 * We might've already run into this one.  That
+				 * must've happened on earlier iteration of the
+				 * outer loop; in that case we can skip those
+				 * parents that get propagation from q - there
+				 * will be nothing new on those as well.
+				 */
+				if (is_candidate(child)) {
+					q = skip_propagation_subtree(q, p);
+					continue;
+				}
+				if (will_be_unmounted(child))
+					head = &visited;
+				else
+					head = &candidates;
+				list_add(&child->mnt_umounting, head);
+			}
+			q = propagation_next(q, p);
+		}
 	}
-out:
-	return progress;
+	while (!list_empty(&visited))	// empty visited
+		list_del_init(visited.next);
 }
 
-static void umount_list(struct list_head *to_umount,
-			struct list_head *to_restore)
+/*
+ * We know that some child of @m can't be unmounted.  In all places where the
+ * chain of descent of @m has child not overmounting the root of parent,
+ * the parent can't be unmounted either.
+ */
+static void trim_ancestors(struct mount *m)
 {
-	struct mount *mnt, *child, *tmp;
-	list_for_each_entry(mnt, to_umount, mnt_list) {
-		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
-			/* topper? */
-			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
-				list_move_tail(&child->mnt_umounting, to_restore);
-			else
-				umount_one(child, to_umount);
-		}
+	struct mount *p;
+
+	for (p = m->mnt_parent; is_candidate(p); m = p, p = p->mnt_parent) {
+		if (IS_MNT_MARKED(m))	// all candidates beneath are overmounts
+			return;
+		SET_MNT_MARK(m);
+		if (m->mnt_mountpoint != p->mnt.mnt_root)
+			remove_candidate(p);
 	}
 }
 
-static void restore_mounts(struct list_head *to_restore)
+/*
+ * Find and exclude all umount candidates forbidden by @m
+ * (see Documentation/filesystems/propagate_umount.txt)
+ * If we can immediately tell that @m is OK to unmount (unlocked
+ * and all children are already committed to unmounting) commit
+ * to unmounting it.
+ * Returns the next candidate to be trimmed.
+ */
+static struct mount *trim_one(struct mount *m)
 {
-	/* Restore mounts to a clean working state */
-	while (!list_empty(to_restore)) {
-		struct mount *mnt, *parent;
-		struct mountpoint *mp;
-
-		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
-		CLEAR_MNT_MARK(mnt);
-		list_del_init(&mnt->mnt_umounting);
-
-		/* Should this mount be reparented? */
-		mp = mnt->mnt_mp;
-		parent = mnt->mnt_parent;
-		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
-			mp = parent->mnt_mp;
-			parent = parent->mnt_parent;
-		}
-		if (parent != mnt->mnt_parent) {
-			mnt_change_mountpoint(parent, mp, mnt);
-			mnt_notify_add(mnt);
+	bool remove_this = false, found = false, umount_this = false;
+	struct mount *n;
+	struct list_head *next;
+
+	list_for_each_entry(n, &m->mnt_mounts, mnt_child) {
+		if (!is_candidate(n)) {
+			found = true;
+			if (n->mnt_mountpoint != m->mnt.mnt_root) {
+				remove_this = true;
+				break;
+			}
 		}
 	}
+	if (found) {
+		trim_ancestors(m);
+	} else if (!IS_MNT_LOCKED(m) && list_empty(&m->mnt_mounts)) {
+		remove_this = true;
+		umount_this = true;
+	}
+	next = m->mnt_umounting.next;
+	if (remove_this) {
+		remove_candidate(m);
+		if (umount_this)
+			umount_one(m);
+	}
+	if (next == &candidates)
+		return NULL;
+
+	return list_entry(next, struct mount, mnt_umounting);
 }
 
-static void cleanup_umount_visitations(struct list_head *visited)
+static void handle_locked(struct mount *m)
 {
-	while (!list_empty(visited)) {
-		struct mount *mnt =
-			list_first_entry(visited, struct mount, mnt_umounting);
-		list_del_init(&mnt->mnt_umounting);
+	struct mount *cutoff = m, *p;
+
+	for (p = m; is_candidate(p); p = p->mnt_parent) {
+		remove_candidate(p);
+		if (!IS_MNT_LOCKED(p))
+			cutoff = p->mnt_parent;
+	}
+	if (will_be_unmounted(p))
+		cutoff = p;
+	while (m != cutoff) {
+		umount_one(m);
+		m = m->mnt_parent;
 	}
 }
 
 /*
- * collect all mounts that receive propagation from the mount in @list,
- * and return these additional mounts in the same list.
- * @list: the list of mounts to be unmounted.
+ * @m is not going away, and it overmounts the top of a stack of mounts
+ * that are going away.  We know that all of those are fully overmounted
+ * by the one above (@m being the topmost of the chain), so @m can be slid
+ * in place where the bottom of the stack is attached.
  *
- * vfsmount lock must be held for write
+ * NOTE: here we temporarily violate a constraint - two mounts end up with
+ * the same parent and mountpoint; that will be remedied as soon as we
+ * return from propagate_umount() - its caller (umount_tree()) will detach
+ * the stack from the parent it (and now @m) is attached to.  umount_tree()
+ * might choose to keep unmounted pieces stuck to each other, but it always
+ * detaches them from the mounts that remain in the tree.
  */
-int propagate_umount(struct list_head *list)
+static void reparent(struct mount *m)
 {
-	struct mount *mnt;
-	LIST_HEAD(to_restore);
-	LIST_HEAD(to_umount);
-	LIST_HEAD(visited);
+	struct mount *p = m;
+	struct mountpoint *mp;
 
-	/* Find candidates for unmounting */
-	list_for_each_entry_reverse(mnt, list, mnt_list) {
-		struct mount *parent = mnt->mnt_parent;
-		struct mount *m;
+	do {
+		mp = p->mnt_mp;
+		p = p->mnt_parent;
+	} while (will_be_unmounted(p));
 
-		/*
-		 * If this mount has already been visited it is known that it's
-		 * entire peer group and all of their slaves in the propagation
-		 * tree for the mountpoint has already been visited and there is
-		 * no need to visit them again.
-		 */
-		if (!list_empty(&mnt->mnt_umounting))
-			continue;
+	mnt_change_mountpoint(p, mp, m);
+	mnt_notify_add(m);
+}
 
-		list_add_tail(&mnt->mnt_umounting, &visited);
-		for (m = propagation_next(parent, parent); m;
-		     m = propagation_next(m, parent)) {
-			struct mount *child = __lookup_mnt(&m->mnt,
-							   mnt->mnt_mountpoint);
-			if (!child)
-				continue;
+/**
+ * propagate_umount - apply propagation rules to the set of mounts for umount()
+ * @set: the list of mounts to be unmounted.
+ *
+ * Collect all mounts that receive propagation from the mount in @set and have
+ * no obstacles to being unmounted.  Add these additional mounts to the set.
+ *
+ * See Documentation/filesystems/propagate_umount.txt if you do anything in
+ * this area.
+ *
+ * Locks held:
+ * mount_lock (write_seqlock), namespace_sem (exclusive).
+ */
+void propagate_umount(struct list_head *set)
+{
+	struct mount *m;
 
-			if (!list_empty(&child->mnt_umounting)) {
-				/*
-				 * If the child has already been visited it is
-				 * know that it's entire peer group and all of
-				 * their slaves in the propgation tree for the
-				 * mountpoint has already been visited and there
-				 * is no need to visit this subtree again.
-				 */
-				m = skip_propagation_subtree(m, parent);
-				continue;
-			} else if (child->mnt.mnt_flags & MNT_UMOUNT) {
-				/*
-				 * We have come across a partially unmounted
-				 * mount in a list that has not been visited
-				 * yet. Remember it has been visited and
-				 * continue about our merry way.
-				 */
-				list_add_tail(&child->mnt_umounting, &visited);
-				continue;
-			}
+	// collect all candidates
+	gather_candidates(set);
 
-			/* Check the child and parents while progress is made */
-			while (__propagate_umount(child,
-						  &to_umount, &to_restore)) {
-				/* Is the parent a umount candidate? */
-				child = child->mnt_parent;
-				if (list_empty(&child->mnt_umounting))
-					break;
-			}
-		}
-	}
+	// reduce the set until it's non-shifting
+	for (m = first_candidate(); m; m = trim_one(m))
+		;
 
-	umount_list(&to_umount, &to_restore);
-	restore_mounts(&to_restore);
-	cleanup_umount_visitations(&visited);
-	list_splice_tail(&to_umount, list);
+	// ... and non-revealing
+	while (!list_empty(&candidates))
+		handle_locked(first_candidate());
 
-	return 0;
+	// now to_umount consists of all acceptable candidates
+	// deal with reparenting of remaining overmounts on those
+	list_for_each_entry(m, &to_umount, mnt_list) {
+		while (!list_empty(&m->mnt_mounts)) // should be at most one
+			reparent(list_first_entry(&m->mnt_mounts,
+						  struct mount, mnt_child));
+	}
+
+	// and fold them into the set
+	list_splice_tail_init(&to_umount, set);
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 34b6247af01d..d84a397bfd43 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -39,7 +39,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct list_head *);
+void propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-16  5:21         ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Al Viro
@ 2025-05-19 18:11           ` Linus Torvalds
  2025-05-19 21:35             ` Al Viro
  2025-05-20 11:10           ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Christian Brauner
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-05-19 18:11 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel, Christian Brauner

On Thu, 15 May 2025 at 22:21, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> With some cleaning the notes up, see
>
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git propagate_umount

Delayed response because this isn't really my area, and it all
somewhat confuses me. Eric - mind taking a look?

But the new code certainly looks sensible, even if I can't really
claim I understand it.

The one reaction I had was almost entirely syntactic - I wish it
didn't use those global lists.

Yes, yes, they are protected by global locks, and that part is
unavoidable, but I'm looking at that code that gathers up the umount
lists, and I actually find the data flow a bit confusing just because
it ends up hiding the state "elsewhere" (that being the global lists).

It just makes the state propagation from gather_candidates() -> users a bit odd.

Not _new_ odd, mind you, but when I was looking at your patch with

    git show --new FETCH_HEAD fs/

I had to go back and forth in the patch just to see where the source
of some the lists came from.

So I think it would make things a bit more explicit if you did something like

    struct umount_state {
        struct list_head umount,candidates;
    };

and made that a local variable in propagate_umount() with

    struct umount_state state = {
            .umount = LIST_HEAD_INIT(state.umount),
            .candidates = LIST_HEAD_INIT(state.candidates),
    };

and then passed that state pointer along an argument.

Although now that I write out that initializer from hell, I am
certainly not impressed by the clarity of that.

So maybe my reaction is bogus. It's certainly not the important part
of the patch.

Also, I realize that I'm only listing your new state globals. The
above is equally true of the existing state that is also done that
way, and your two new lists are actually less confusing than some of
the old things that I think should *also* be part of that umount
propagation state.

(The one I look at in particular is the imaginatively named "list" variable:

    static struct hlist_head *list;

which really is the least descriptive name ever).

Another thing that is either purely syntactic, or shows that I
*really* don't understand your patch. Why do you do this odd thing:

        // reduce the set until it's non-shifting
        for (m = first_candidate(); m; m = trim_one(m))
                ;

which seems to just walk the candidates list in a very non-obvious
manner (this is one of those "I had to go back and forth to see what
first_candidate() did and what lists it used" cases).

It *seems* to be the same as

        list_for_each_entry_safe(m, tmp, &candidates, mnt_umounting)
                trim_one(m);

because if I read that code right, 'trim_one()' will just always
return the next entry in that candidate list.

But again - I might be missing something important and maybe that list
gets modified in other ways while trimming (again, with those globals
it's kind of hard to see the data flow).

So I may have just misread this *completely*.

                   Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-19 18:11           ` Linus Torvalds
@ 2025-05-19 21:35             ` Al Viro
  2025-05-19 22:08               ` Linus Torvalds
                                 ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Al Viro @ 2025-05-19 21:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric W. Biederman, linux-fsdevel, Christian Brauner

On Mon, May 19, 2025 at 11:11:10AM -0700, Linus Torvalds wrote:
> Another thing that is either purely syntactic, or shows that I
> *really* don't understand your patch. Why do you do this odd thing:
> 
>         // reduce the set until it's non-shifting
>         for (m = first_candidate(); m; m = trim_one(m))
>                 ;
> 
> which seems to just walk the candidates list in a very non-obvious
> manner (this is one of those "I had to go back and forth to see what
> first_candidate() did and what lists it used" cases).
> 
> It *seems* to be the same as
> 
>         list_for_each_entry_safe(m, tmp, &candidates, mnt_umounting)
>                 trim_one(m);
> 
> because if I read that code right, 'trim_one()' will just always
> return the next entry in that candidate list.

	list_for_each_entry_safe() is safe wrt removal of the _current_
element.  Suppose the list is <A, B, C, D> and trim_one(A) wants to
take out A, B and D.  What we want is to have it followed by trim_one(C),
but how could *anything* outside of trim_one() get to C?

	list_for_each_entry_safe() would pick B as the next entry
to consider, which will end up with looping in B ever after.

	What trim_one(A) does instead is
* note that A itself needs to be removed
* remove B and D
* remember who currently follows A (the list is down to <A, C>, so C it is)
* remove A
* return C.
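That traversal discipline - remove everything else first, read the successor
while the current element is still in place, then remove the current element -
can be modeled outside the kernel.  A hypothetical Python sketch (plain lists
instead of list_head, names are mine):

```python
def walk_with_removals(lst, victims_of):
    """Visit the elements of lst in order; each step may remove an
       arbitrary subset of lst (possibly including the current element)."""
    visited = []
    i = 0
    while i < len(lst):
        m = lst[i]
        visited.append(m)
        doomed = victims_of(m)
        # remove everything except m first ...
        for v in doomed:
            if v != m and v in lst:
                lst.remove(v)
        # ... m is still present, so its successor is well-defined
        i = lst.index(m)
        if m in doomed:
            lst.remove(m)          # the successor now sits at index i
        else:
            i += 1
    return visited

# Al's example: processing A takes out A, B and D; C must still be visited.
order = walk_with_removals(["A", "B", "C", "D"],
                           lambda m: {"A", "B", "D"} if m == "A" else set())
```

The walk visits A and then C, exactly as in the <A, B, C, D> example above; a
saved-successor iteration in the list_for_each_entry_safe() style would have
continued at the already-removed B instead.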

	We could have a separate list and have trim_one(m) move m there in
case it still looks like a valid candidate.  Then the loop would turn
into
	while !empty(candidates)
		trim_one(first candidate)
	move the second list over to candidates (or just use it on the
next stage instead of candidates)
What's slightly confusing about that variant is that m might survive
trim_one(m), only to be taken out by subsequent call of trim_one() for
another mount.  So remove_candidate(p) would become "remove p from
candidates or from the set of seen candidates, whichever p currently
belongs to"...

	Another variant would be to steal one more bit from mnt_flags,
set it for all candidates when collecting them, have is_candidate() check
that instead of list_empty(&m->mnt_umounting) and clean it where this
variant removes from the list; trim_one() would immediately return if
bit is not set.  Then we could really do list_for_each_entry_safe(),
with another loop doing list removals afterwards.  Extra work that way,
though, and I still think it's more confusing...
	OTOH, "steal one more bit" would allow to get rid of
->mnt_umounting - we could use ->mnt_list for all these sets.  IIRC,
the reason for ->mnt_umounting was that we used to maintain ->mnt_list
for everything in a namespace, and having to restore it if we decide
that candidate has to stay alive would be painful.  These days we
use rbtree for iterators, so ->mnt_list on those suckers is fair
game...

	Re globals - the other bunch is part of propagate_mnt(), not
propagate_umount()...  IIRC, at some point I got fed up by arseloads
of arguments passed from propagate_umount() down to the helpers and
went "fuck it, we have a hard dependency on global serialization anyway,
might as well make all those arguments global" (and I remember trying
to go with per-namespace locks; stuff of nightmares, that)...

	I'll look into gathering that stuff into a single structure
passed by the callers, but the whole thing (especially on the umount
side) really depends upon not running into another instance working
at the same time.  Trying to cope with duelling umount propagations
from two threads... the data structures would be the least of headache
sources there.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-19 21:35             ` Al Viro
@ 2025-05-19 22:08               ` Linus Torvalds
  2025-05-19 22:26                 ` Linus Torvalds
  2025-05-20 22:27               ` Eric W. Biederman
       [not found]               ` <20250520075317.GB2023217@ZenIV>
  2 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-05-19 22:08 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel, Christian Brauner

On Mon, 19 May 2025 at 14:35, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         What trim_one(A) does instead is
> * note that A itself needs to be removed
> * remove B and D

Yeah, I wondered, but it wasn't obvious from the flow that the same
list was getting changed...  But now that you point it out, it's
fairly obvious in trim_one -> trim_ancestors -> remove_candidate().

I did see the direct call to remove_candidate(), but that only
affected 'A' itself.

And I guess passing the list head around wouldn't have helped make
that flow more obvious, because the removal is through the list head,
it's just that list_del_init() on the list entry.

A comment may have helped, just because I found that list traversal
pattern so odd. But yes

>  [..]  Then the loop would turn into
>
>        while !empty(candidates)
>                trim_one(first candidate)

would have made me not think it's oddly written.

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-19 22:08               ` Linus Torvalds
@ 2025-05-19 22:26                 ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2025-05-19 22:26 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel, Christian Brauner

On Mon, 19 May 2025 at 15:08, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I guess passing the list head around wouldn't have helped make
> that flow more obvious, because the removal is through the list head,

     is -> isn't, obviously

           Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-16  5:21         ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Al Viro
  2025-05-19 18:11           ` Linus Torvalds
@ 2025-05-20 11:10           ` Christian Brauner
  2025-05-21  2:11             ` Al Viro
  1 sibling, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2025-05-20 11:10 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel, Linus Torvalds

On Fri, May 16, 2025 at 06:21:39AM +0100, Al Viro wrote:
> > On Thu, May 15, 2025 at 12:41:50PM +0100, Al Viro wrote:
> > > On Tue, May 13, 2025 at 04:56:22AM +0100, Al Viro wrote:
> > > 
> > > > And yes, it's going to be posted along with the proof of correctness -
> > > > I've had it with the amount of subtle corner cases in that area ;-/
> > > 
> > > FWIW, it seems to survive testing; it's _not_ final - I'm still editing
> > > the promised proof, but by this point it's stylistic work.  I hadn't
> > > touched that kind of formal writing for quite a while, so the text is clumsy.
> > > The code changes are pretty much in the final shape.  Current diff of
> > > code (all in fs/pnode.*) relative to mainline:
> > 
> > ... and the current notes are below; they obviously need more editing ;-/
> 
> With some cleaning the notes up, see
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git propagate_umount
> 
> The only commit there (on top of -rc5) below; it seems to survive tests I'd
> been able to find.  Please, review and hit that thing with any tests you might
> have sitting around.
> 
> Note that in the new variant the skip_propagation_subtree() call is *not*
> followed by propagation_next() - it does what the combination would've done
> (and it skips everything that gets propagation from that mount, peers
> included).
> 
> Nobody except propagate_umount() had been using it, so there are no other
> callers to be broken by the semantics change...
> 
> D/f/propagate_umount.txt is done as plain text; if anyone cares to turn that
> into ReST - wonderful.  If anyone has editing suggestions (it's still pretty
> raw as a text and I'm pretty sure that it can be improved stylistically) -
> even better.
> 
> ---
> [PATCH] Rewrite of propagate_umount()
> 
> The variant currently in the tree has problems; trying to prove
> correctness has caught at least one class of bugs (reparenting
> that ends up moving the visible location of the reparented mount, due
> to not excluding some of the counterparts on propagation that
> should've been excluded).
> 
> I tried to prove that it's the only bug there; I'm still not sure
> whether it is.  If anyone can reconstruct and write down an analysis
> of the mainline implementation, I'll gladly review it; as it is,
> I ended up doing a different implementation.  The candidate collection
> phase is similar, but trimming the set down until it satisfies the
> constraints turned out quite different.
> 
> I hoped to do the transformation as a massage series, but that turned out
> to be too convoluted.  So it's a single patch replacing propagate_umount()
> and friends in one go, with notes and analysis in D/f/propagate_umount.txt
> (in addition to inline comments).
> 
> As far as I can tell, it is provably correct and provably linear in the number
> of mounts we need to look at in order to decide what should be unmounted.
> It even builds and seems to survive testing...
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
> new file mode 100644
> index 000000000000..d8367ba3ea82
> --- /dev/null
> +++ b/Documentation/filesystems/propagate_umount.txt
> @@ -0,0 +1,433 @@
> +	Notes on propagate_umount()
> +
> +Umount propagation starts with a set of mounts we are already going to
> +take out.  Ideally, we would like to add all downstream cognates to
> +that set - anything with the same mountpoint as one of the removed
> +mounts and with a parent that would receive events from the parent of that
> +mount.  However, there are some constraints the resulting set must
> +satisfy.
> +
> +It is convenient to define several properties of sets of mounts:
> +
> +1) A set S of mounts is non-shifting if for any mount X belonging
> +to S all subtrees mounted strictly inside of X (i.e. not overmounting
> +the root of X) contain only elements of S.

I think "shifting" is misleading. I would suggest either "isolated" or
"contained" or ideally "closed" which would mean...

> +
> +2) A set S is non-revealing if all locked mounts that belong to S have
> +parents that also belong to S.
> +
> +3) A set S is closed if it contains all children of its elements.

... that should probably become "complete" or "saturated".

> +
> +The set of mounts taken out by umount(2) must be non-shifting and
> +non-revealing; the first constraint is what allows us to reparent
> +any remaining mounts and the second is what prevents the exposure
> +of any concealed mountpoints.
> +
> +propagate_umount() takes the original set as an argument and tries to
> +extend that set.  The original set is a full subtree and its root is
> +unlocked; what matters is that it's closed and non-revealing.
> +The resulting set may not be closed; there might still be mounts outside
> +of that set, but only on top of stacks of root-overmounting elements
> +of the set.  They can be reparented to the place where the bottom of
> +the stack is attached to a mount that will survive.  NOTE: doing that
> +will violate a constraint on having no more than one mount with
> +the same parent/mountpoint pair; however, the caller (umount_tree())

I would prefer if this would insert the term "shadow mounts" since
that's what we've traditionally used for that.

> +will immediately remedy that - it may keep an unmounted element attached
> +to its parent, but only if the parent itself is unmounted.  Since all
> +conflicts created by reparenting have common parent *not* in the
> +set and one side of the conflict (bottom of the stack of overmounts)
> +is in the set, it will be resolved.  However, we rely upon umount_tree()
> +doing that pretty much immediately after the call of propagate_umount().
> +
> +The algorithm is based on two statements:
> +	1) for any set S, there is a maximal non-shifting subset of S
> +and it can be calculated in O(#S) time.
> +	2) for any non-shifting set S, there is a maximal non-revealing
> +subset of S.  That subset is also non-shifting and it can be calculated
> +in O(#S) time.
> +
> +		Finding candidates.
> +
> +We are given a closed set U and we want to find all mounts that have
> +the same mountpoint as some mount m in U *and* whose parent receives
> +propagation from the parent of the same mount m.  Naive implementation
> +would be
> +	S = {}
> +	for each m in U
> +		add m to S
> +		p = parent(m)
> +		for each q in Propagation(p) - {p}
> +			child = look_up(q, mountpoint(m))
> +			if child
> +				add child to S
> +but that can lead to excessive work - there might be propagation among the
> +subtrees of U, in which case we'd end up examining the same candidates
> +many times.  Since propagation is transitive, the same will happen to
> +everything downstream of that candidate and it's not hard to construct
> +cases where the approach above leads to time quadratic in the actual
> +number of candidates.
> +
> +Note that if we run into a candidate we'd already seen, it must've been
> +added on an earlier iteration of the outer loop - all additions made
> +during one iteration of the outer loop have different parents.  So
> +if we find a child already added to the set, we know that everything
> +in Propagation(parent(child)) with the same mountpoint has been already
> +added.
> +	S = {}
> +	for each m in U
> +		if m in S
> +			continue
> +		add m to S
> +		p = parent(m)
> +		q = propagation_next(p, p)
> +		while q
> +			child = look_up(q, mountpoint(m))
> +			if child
> +				if child in S
> +					q = skip_them(q, p)
> +					continue;
> +				add child to S
> +			q = propagation_next(q, p)
> +where
> +skip_them(q, p)
> +	keep walking Propagation(p) from q until we find something
> +	not in Propagation(q)
> +
> +would get rid of that problem, but we need a sane implementation of
> +skip_them().  That's not hard to do - split propagation_next() into
> +"down into mnt_slave_list" and "forward-and-up" parts, with the
> +skip_them() being "repeat the forward-and-up part until we get NULL
> +or something that isn't a peer of the one we are skipping".
> +
> +Note that there can be no absolute roots among the extra candidates -
> +they all come from mount lookups.  An absolute root among the original
> +set is _currently_ impossible, but it might be worth protecting
> +against.
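To make the improved loop above concrete, here is a toy Python model of the candidate-gathering pseudocode.  All names are hypothetical stand-ins (Mount is not the kernel's struct mount); propagation is simplified to a pure master/slave tree with no peer groups, so skip_them() becomes an index jump over a precomputed DFS order.

```python
# Toy model of the candidate-gathering pseudocode; Mount is a hypothetical
# stand-in for struct mount, not the kernel structure.

class Mount:
    def __init__(self, name, parent=None, mountpoint=None):
        self.name = name
        self.parent = parent
        self.mountpoint = mountpoint    # name in the parent; "/" = parent's root
        self.children = {}              # mountpoint -> child mount
        self.prop_children = []         # mounts receiving propagation from us
        if parent is not None:
            parent.children[mountpoint] = self

    def __repr__(self):
        return self.name


def propagation_list(p):
    """Propagation(p) - {p} in DFS preorder, plus subtree sizes so that
    skip_them() is a constant-time index jump."""
    order, sizes = [], {}

    def walk(m):
        total = 0
        for s in m.prop_children:
            order.append(s)
            walk(s)
            total += 1 + sizes[s]
        sizes[m] = total

    walk(p)
    return order, sizes


def gather_candidates(U):
    S = set()
    for m in U:
        if m in S:                      # added on an earlier outer iteration
            continue
        S.add(m)
        order, sizes = propagation_list(m.parent)
        i = 0
        while i < len(order):
            q = order[i]
            child = q.children.get(m.mountpoint)
            if child is not None and child in S:
                i += 1 + sizes[q]       # skip everything propagating from q
                continue
            if child is not None:
                S.add(child)
            i += 1
    return S


# one-way propagation A -> B -> C, mounts at A/1, B/1, C/1
root = Mount("root")
A = Mount("A", root, "A")
B = Mount("B", root, "B")
C = Mount("C", root, "C")
A.prop_children = [B]
B.prop_children = [C]
a1 = Mount("A/1", A, "1")
b1 = Mount("B/1", B, "1")
c1 = Mount("C/1", C, "1")
```

Starting from U = {A/1} this picks up B/1 and C/1 as well; if B/1 has already been collected, the walk over Propagation(A) jumps straight past everything that gets propagation from B instead of re-examining it.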
> +
> +		Maximal non-shifting subsets.
> +
> +Let's call a mount m in a set S forbidden in that set if there is a
> +subtree mounted strictly inside m and containing mounts that do not
> +belong to S.
> +
> +The set is non-shifting when none of its elements are forbidden in it.
> +
> +If mount m is forbidden in a set S, it is forbidden in any subset S' it
> +belongs to.  In other words, it can't belong to any of the non-shifting
> +subsets of S.  If we had a way to find a forbidden mount or show that
> +there's none, we could use it to find the maximal non-shifting subset
> +simply by finding and removing them until none remain.
> +
> +Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
> +must have been forbidden in S itself.  Indeed, since m has descendants
> +that do not belong to S, any subtree that fits into S will fit into
> +S - {m} as well.
> +
> +So in principle we could go through elements of S, checking if they
> +are forbidden in S and removing the ones that are.  Removals will
> +not invalidate the checks done for earlier mounts - if they were not
> +forbidden at the time we checked, they won't become forbidden later.
> +It's too costly to be practical, but there is a similar approach that
> +is linear in the size of S.
> +
> +Let's say that mount x in a set S is forbidden by mount y, if
> +	* both x and y belong to S.
> +	* there is a chain of mounts starting at x and leaving S
> +	  immediately after passing through y, with the first
> +	  mountpoint strictly inside x.
> +Note 1: x may be equal to y - that's the case when something not
> +belonging to S is mounted strictly inside x.
> +Note 2: if y does not belong to S, it can't forbid anything in S.
> +Note 3: if y has no children outside of S, it can't forbid anything in S.
> +
> +It's easy to show that mount x is forbidden in S if and only if x is
> +forbidden in S by some mount y.  And it's easy to find all mounts in S
> +forbidden by a given mount.
> +
> +Consider the following operation:
> +	Trim(S, m) = S - {x : x is forbidden by m in S}
> +
> +Note that if m does not belong to S or has no children outside of S we
> +are guaranteed that Trim(S, m) is equal to S.
> +
> +The following is true: if x is forbidden by y in Trim(S, m), it was
> +already forbidden by y in S.
> +
> +Proof: Suppose x is forbidden by y in Trim(S, m).  Then there is a
> +chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
> +is the first element that doesn't belong to Trim(S, m) and the
> +mountpoint of x_1 is strictly inside x.  If mount r belongs to S, it must
> +have been removed by Trim(S, m), i.e. it was forbidden in S by m.
> +Then there was a mount chain from r to some child of m that stayed in
> +S all the way until m, but that's impossible since x belongs to Trim(S, m)
> +and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
> +forbidden in S by m, and thus can't belong to Trim(S, m).
> +Therefore r cannot belong to S and our chain demonstrates that
> +x is forbidden by y in S.  QED.
> +
> +Corollary: no mount is forbidden by m in Trim(S, m).  Indeed, any
> +such mount would have been forbidden by m in S and thus would have been
> +in the part of S removed in Trim(S, m).
> +
> +Corollary: no mount is forbidden by m in Trim(Trim(S, m), n).  Indeed,
> +any such would have to have been forbidden by m in Trim(S, m), which
> +is impossible.
> +
> +Corollary: after
> +	S = Trim(S, x_1)
> +	S = Trim(S, x_2)
> +	...
> +	S = Trim(S, x_k)
> +no mount remaining in S will be forbidden by either of x_1,...,x_k.
> +
> +If S is ordered, the following will reduce it to the maximal non-shifting
> +subset:
> +	if S is not empty
> +		m = first(S)
> +		while true
> +			S = Trim(S, m)
> +			if there are no elements of S greater than m
> +				break
> +			m = first of such elements
> +
> +It's easy to see that at the beginning of each iteration all elements still
> +remaining in S and preceding the current value of m will have already been
> +seen as values of m.  By the time we leave the loop all elements remaining
> +in S will have been seen as values of m.
> +
> +We also know that no mount remaining in S will be forbidden by any
> +of the values of m we have observed in the loop.  In other words,
> +no mount remaining in S will be forbidden, i.e. final value of S will
> +be non-shifting.  It will be the maximal non-shifting subset, since we
> +were removing only forbidden elements.
> +
> +Implementation notes:
> +
> +	The following reduces S to the maximal non-shifting subset
> +in O(#S), assuming that S is ordered, we can mark the elements and
> +originally no marks are set:
> +
> +	for (x = first(S); x; x = Trim_one(x))
> +		;
> +where
> +Trim_one(m)
> +	remove_this = false
> +	found = false
> +	for each n in children(m)
> +		if n not in S
> +			found = true
> +			if (mountpoint(n) != root(m))
> +				remove_this = true
> +				break
> +	if found
> +		Trim_ancestors(m)
> +	res = next(m)
> +	if remove_this
> +		unmark m and remove it from S
> +	return res
> +
> +Trim_ancestors(m)
> +	for (p = parent(m); p in S; m = p, p = parent(p)) {
> +		if m is marked // all elements beneath are overmounts
> +			return
> +		mark m
> +		if (mountpoint(m) != root(p))
> +			unmark p and remove it from S
> +	}
> +
> +Trim_one(m) will
> +	* replace S with Trim(S, m)
> +	* return NULL if no elements greater than m remain in S
> +	* return the smallest of such elements otherwise
> +	* maintain the following invariants:
> +		* only elements of S are marked
> +		* if p is marked and
> +		     (x_0, x_1, ..., x_k = p) is an ancestry chain and
> +		     all x_i belong to S
> +		  then
> +		     for any i > 0  x_i overmounts the root of x_{i-1}
> +Proof:
> +	Consider the chains excluding elements from Trim(S, m).  The last
> +two elements in each are m and some child of m that does not belong to S.
> +If there are no such children, Trim(S, m) is equal to S and next(m) is the
> +correct return value.
> +
> +	m itself is removed if and only if the chain has exactly two
> +elements, i.e. when the last element does not overmount the root of m.
> +In other words, if there is a child not in S that does not overmount
> +the root of m.	Once we are done looking at the children 'remove_this'
> +is true if and only if m itself needs to be removed.
> +
> +	All other elements to remove will be ancestors of m, such that
> +the entire descent chain from them to m is contained in S.  Let
> +(x_0, x_1, ..., x_k = m) be the longest such chain.  x_i needs to be
> +removed if and only if x_{i+1} does not overmount its root.
> +
> +	Note that due to the invariant for marks, finding x_{i+1} marked
> +means that none of its ancestors will qualify for removal.  What's more,
> +after we have found and removed everything we needed, all remaining
> +elements of the chain can be marked - their nearest ancestors that do
> +not overmount their parent's root will be outside of S.  Since ancestry
> +chains can't loop, we can set the marks as we go - we won't be looking
> +at that element again until after all removals.
> +
> +	If Trim_one(m) needs to remove m, it does that after all other
> +removals.  Once those removals (if any) are done m is still an element
> +of our set, so the smallest remaining element greater than m is equal
> +to next(m).  Once it is found, we can finally remove m itself.
> +
> +	Time complexity:
> +* we get no more than O(#S) calls of Trim_one()
> +* the loop over children in Trim_one() never looks at the same child
> +twice through all the calls.
> +* iterations of that loop for children in S are no more than O(#S)
> +in the worst case
> +* at most two children that are not elements of S are considered per
> +call of Trim_one().
> +* the loop in Trim_ancestors() sets the mark once per iteration and
> +no element of S has its mark set more than once.
> +
> +	In practice we have a large closed non-revealing subset in S -
> +the mounts we are already committed to unmounting.  It can be used to
> +reduce the amount of work.  What's more, we can have all elements of that
> +subset removed from their parents' lists of children, which considerably
> +simplifies life.
> +
> +	Since the committed subset is closed, passing one of its elements
> +to Trim_one() is a no-op - it doesn't have any children outside of S.
> +No matter what we pass to Trim_one(), Trim_ancestors() will never look
> +at any elements of the committed subset - it's not called by Trim_one()
> +if the argument belongs to that subset and it can't walk into that subset
> +unless it has started in there.
> +
> +	So it's useful to keep the sets of committed and undecided
> +candidates separately; Trim_one() needs to be called only for elements
> +of the latter.
> +
> +	What's more, if we see that children(m) is empty and m is not
> +locked, we can immediately move m into the committed subset (remove
> +from the parent's list of children, etc.).  That's one fewer mount we'll
> +have to look into when we check the list of children of its parent *and*
> +when we get to building the non-revealing subset.
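The Trim_one()/Trim_ancestors() pseudocode can be exercised on the reproducer shape from the start of the thread.  The sketch below is a toy Python model, not the kernel implementation; the names are hypothetical, "/" stands for a mountpoint on the parent's root, and the committed-subset optimization above is left out for brevity.

```python
# Toy model of Trim_one()/Trim_ancestors(); Mount is a hypothetical
# stand-in for struct mount, not the kernel structure.

class Mount:
    def __init__(self, name, parent=None, mountpoint=None):
        self.name = name
        self.parent = parent
        self.mountpoint = mountpoint    # "/" means overmounting parent's root
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def __repr__(self):
        return self.name


def max_nonshifting(order):
    """Reduce the ordered candidate list to its maximal non-shifting subset."""
    S = set(order)                      # current subset
    marked = set()

    def trim_ancestors(m):
        p = m.parent
        while p in S:
            if m in marked:             # all candidates beneath are overmounts
                return
            marked.add(m)
            if m.mountpoint != "/":     # m does not overmount the root of p
                S.discard(p)            # p is forbidden - remove it
                marked.discard(p)
            m, p = p, p.parent

    for m in order:                     # plays the role of first(S)/next(m)
        if m not in S:                  # removed by an earlier trim_ancestors()
            continue
        remove_this = found = False
        for n in m.children:
            if n not in S:
                found = True
                if n.mountpoint != "/":
                    remove_this = True
                    break
        if found:
            trim_ancestors(m)
        if remove_this:
            S.discard(m)
            marked.discard(m)
    return S


# the reproducer shape: B/1 and B/1/2 are candidates; the overmount on
# the root of B/1/2 (with the busy mount x inside it) survives
B = Mount("B")                          # survivor, not a candidate
b1 = Mount("B/1", B, "1")
b12 = Mount("B/1/2", b1, "2")
ov = Mount("overmount", b12, "/")       # overmounts the root of B/1/2
x = Mount("x", ov, "x")                 # keeps the overmount busy

# a survivor strictly inside a candidate forbids the candidate itself
P = Mount("P")
c = Mount("P/c", P, "c")
d = Mount("P/c/d", c, "d")              # survivor, not on the root of c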
> +
> +		Maximal non-revealing subsets
> +
> +If S is not non-revealing, there is a locked element x in S
> +such that the parent of x is not in S.
> +
> +Obviously, no non-revealing subset of S may contain x.  Removing such
> +elements one by one will obviously end with the maximal non-revealing
> +subset (possibly empty one).  Note that removal of an element will
> +require removal of all its locked children, etc.
> +
> +If the set had been non-shifting, it will remain non-shifting after
> +such removals.
> +Proof: suppose S was non-shifting, x is a locked element of S, parent of x
> +is not in S and S - {x} is not non-shifting.  Then there is an element m
> +in S - {x} and a subtree mounted strictly inside m, such that the subtree
> +contains an element not in S - {x}.  Since S is non-shifting, everything in
> +that subtree must belong to S.  But that means that this subtree must
> +contain x somewhere *and* that the parent of x either belongs to that subtree
> +or is equal to m.  Either way it must belong to S.  Contradiction.
> +
> +// same representation as for finding maximal non-shifting subsets:
> +// S is a disjoint union of a non-revealing set U (the ones we are committed
> +// to unmount) and a set of candidates.
> +// Elements of U are removed from their parents' lists of children.
> +// In the end candidates becomes empty and maximal non-revealing non-shifting
> +// subset of S is now in U
> +	while (candidates is non-empty)
> +		handle_locked(first(candidates))
> +
> +handle_locked(m)
> +	cutoff = m
> +	for (p = m; p in candidates; p = parent(p)) {
> +		unmark(p) and remove p from candidates;
> +		if (!locked(p))
> +			cutoff = parent(p)
> +	}
> +	if (p in U)
> +		cutoff = p
> +	while (m != cutoff) {
> +		remove m from children(parent(m));
> +		add m to U;
> +		m = parent(m);
> +	}
> +
> +Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
> +* If it contains some elements of U, let x_k be the last one of those.
> +Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
> +* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
> +may be elements of a non-revealing subset of S.
> +* otherwise let x_k be the first unlocked element of the chain.  Then none
> +of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
> +S and union of U and {x_k, ..., x_n} is non-revealing.
> +
> +handle_locked(m) finds which of these cases applies and adjusts the set
> +of candidates C and U accordingly.  U remains non-revealing, the union of
> +C and U still contains
> +any non-revealing subset of S and after the call of handle_locked(m) m
> +is guaranteed to be not in C.  So having it called for each element of S
> +would suffice to empty C, leaving U the maximal non-revealing subset of S.
> +
> +However, handle_locked(m) is a no-op when m belongs to U, so it's enough
> +to have it called for elements of C while there are any.
> +
> +Time complexity: number of calls of handle_locked() is limited by #C,
> +each iteration of the first loop in handle_locked() removes an element
> +from C, so their total number of executions is also limited by #C;
> +number of iterations in the second loop is no greater than the number
> +of iterations of the first loop.
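A toy Python model of handle_locked() (again with hypothetical stand-ins rather than the kernel's lists and flags) shows how a locked mount with a surviving parent is dropped, while the rest of its chain stays unmountable:

```python
# Toy model of handle_locked(); Mount is a hypothetical stand-in for
# struct mount, not the kernel structure.

class Mount:
    def __init__(self, name, parent=None, locked=False):
        self.name = name
        self.parent = parent
        self.locked = locked

    def __repr__(self):
        return self.name


def max_nonrevealing(candidates, U):
    """candidates: ordered undecided mounts; U: set already committed to
    unmounting (non-revealing).  Moves what can be unmounted into U."""
    C = set(candidates)
    for start in candidates:
        if start not in C:          # already handled via an ancestry chain
            continue
        cutoff = start
        p = start
        while p in C:               # walk up while still among the candidates
            C.discard(p)
            if not p.locked:        # unlocked: everything below it may go
                cutoff = p.parent
            p = p.parent
        if p in U:                  # chain continues into the committed set
            cutoff = p
        m = start
        while m is not cutoff:      # commit the unmountable tail to U
            U.add(m)
            m = m.parent
    return U


# P survives; a is locked under P, b is unlocked under a, c is locked under b
P = Mount("P")
a = Mount("a", P, locked=True)
b = Mount("b", a, locked=False)
c = Mount("c", b, locked=True)

# a fully locked chain under a survivor cannot be unmounted at all
P2 = Mount("P2")
d = Mount("d", P2, locked=True)
e = Mount("e", d, locked=True)
```

Here a stays mounted (it is locked and its parent survives), while b and c can go: b is unlocked, and c's locked parent b is itself being unmounted - matching the "first unlocked element x_k" analysis above, and independent of the order in which the candidates are visited.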
> +
> +
> +		Reparenting
> +
> +After we'd calculated the final set, we still need to deal with
> +reparenting - if an element of the final set has a child not in it,
> +we need to reparent such child.
> +
> +Such children can only be root-overmounting (otherwise the set wouldn't
> +be non-shifting) and their parents can not belong to the original set,
> +since the original is guaranteed to be closed.
> +
> +
> +		Putting all of that together
> +
> +The plan is to
> +	* find all candidates
> +	* trim down to maximal non-shifting subset
> +	* trim down to maximal non-revealing subset
> +	* reparent anything that needs to be reparented
> +	* return the resulting set to the caller
> +
> +For the 2nd and 3rd steps we want to separate the set into a growing
> +non-revealing subset, initially containing the original set ("U" in
> +terms of the pseudocode above) and everything we are still not sure about
> +("candidates").  It means that for the output of the 1st step we'd like
> +the extra candidates separated from the stuff already in the original set.
> +For the 4th step we would like the additions to U separate from the
> +original set.
> +
> +So let's go for
> +	* original set ("set").  Linkage via mnt_list
> +	* undecided candidates ("candidates").  Linkage via mnt_umounting.
> +	* anything in U that hadn't been in the original set - elements of
> +candidates will gradually be either discarded or moved there.  In other
> +words, it's the candidates we have already decided to unmount.	Its role
> +is reasonably close to the old "to_umount", so let's use that name.
> +Linkage via mnt_list.
> +
> +For gather_candidates() we'll need to maintain both candidates (S -
> +set) and the intersection of S with set, with the latter emptied once we
> +are done.  Use mnt_umounting for both, that'll give a cheap way to check
> +belonging to their union (in gather_candidates()) and to candidates
> +itself (at later stages).  Call that predicate is_candidate(); it would
> +be !list_empty(&m->mnt_umounting).
> +
> +All elements of the original set are marked with MNT_UMOUNT and we'll
> +need the same for elements added when joining the contents of to_umount
> +to set in the end.  Let's set MNT_UMOUNT at the time we add an element
> +to to_umount; that's close to what the old 'umount_one' is doing, so
> +let's keep that name.  It also gives us another predicate we need -
> +"belongs to union of set and to_umount"; will_be_unmounted() for now.
> +
> +Another helper - unmark and remove from candidates; remove_candidate(m)

Stray sentence?

> diff --git a/fs/pnode.c b/fs/pnode.c
> index fb77427df39e..9b2f1ac80f25 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -24,11 +24,6 @@ static inline struct mount *first_slave(struct mount *p)
>  	return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave);
>  }
>  
> -static inline struct mount *last_slave(struct mount *p)
> -{
> -	return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave);
> -}
> -
>  static inline struct mount *next_slave(struct mount *p)
>  {
>  	return list_entry(p->mnt_slave.next, struct mount, mnt_slave);
> @@ -136,6 +131,23 @@ void change_mnt_propagation(struct mount *mnt, int type)
>  	}
>  }
>  
> +static struct mount *__propagation_next(struct mount *m,
> +					 struct mount *origin)
> +{
> +	while (1) {
> +		struct mount *master = m->mnt_master;
> +
> +		if (master == origin->mnt_master) {
> +			struct mount *next = next_peer(m);
> +			return (next == origin) ? NULL : next;
> +		} else if (m->mnt_slave.next != &master->mnt_slave_list)
> +			return next_slave(m);

Please add a comment to that helper that explains how it walks the
propagation tree. I remember having to fix bugs in that code and the
> lack of comments was noticeable.

> +
> +		/* back at master */
> +		m = master;
> +	}
> +}
> +
>  /*
>   * get the next mount in the propagation tree.
>   * @m: the mount seen last
> @@ -153,31 +165,26 @@ static struct mount *propagation_next(struct mount *m,
>  	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
>  		return first_slave(m);
>  
> -	while (1) {
> -		struct mount *master = m->mnt_master;
> -
> -		if (master == origin->mnt_master) {
> -			struct mount *next = next_peer(m);
> -			return (next == origin) ? NULL : next;
> -		} else if (m->mnt_slave.next != &master->mnt_slave_list)
> -			return next_slave(m);
> +	return __propagation_next(m, origin);
> +}
>  
> -		/* back at master */
> -		m = master;
> -	}
> +static inline bool peers(const struct mount *m1, const struct mount *m2)
> +{
> +	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
>  }
>  
>  static struct mount *skip_propagation_subtree(struct mount *m,
>  						struct mount *origin)
>  {
>  	/*
> -	 * Advance m such that propagation_next will not return
> -	 * the slaves of m.
> +	 * Advance m past everything that gets propagation from it.
>  	 */
> -	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
> -		m = last_slave(m);
> +	struct mount *p = __propagation_next(m, origin);
>  
> -	return m;
> +	while (p && peers(m, p))
> +		p = __propagation_next(p, origin);
> +
> +	return p;
>  }
>  
>  static struct mount *next_group(struct mount *m, struct mount *origin)
> @@ -216,11 +223,6 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
>  static struct mount *last_dest, *first_source, *last_source, *dest_master;
>  static struct hlist_head *list;
>  
> -static inline bool peers(const struct mount *m1, const struct mount *m2)
> -{
> -	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
> -}
> -
>  static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
>  {
>  	struct mount *child;
> @@ -463,181 +465,219 @@ void propagate_mount_unlock(struct mount *mnt)
>  	}
>  }
>  
> -static void umount_one(struct mount *mnt, struct list_head *to_umount)
> +static LIST_HEAD(to_umount);	// committed to unmounting those
> +static LIST_HEAD(candidates);	// undecided unmount candidates
> +
> +static inline struct mount *first_candidate(void)
>  {
> -	CLEAR_MNT_MARK(mnt);
> -	mnt->mnt.mnt_flags |= MNT_UMOUNT;
> -	list_del_init(&mnt->mnt_child);
> -	list_del_init(&mnt->mnt_umounting);
> -	move_from_ns(mnt, to_umount);
> +	if (list_empty(&candidates))
> +		return NULL;
> +	return list_first_entry(&candidates, struct mount, mnt_umounting);
>  }
>  
> -/*
> - * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
> - * parent propagates to.
> - */
> -static bool __propagate_umount(struct mount *mnt,
> -			       struct list_head *to_umount,
> -			       struct list_head *to_restore)
> +static inline bool is_candidate(struct mount *m)
>  {
> -	bool progress = false;
> -	struct mount *child;
> +	return !list_empty(&m->mnt_umounting);
> +}
>  
> -	/*
> -	 * The state of the parent won't change if this mount is
> -	 * already unmounted or marked as without children.
> -	 */
> -	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
> -		goto out;
> +static inline bool will_be_unmounted(struct mount *m)
> +{
> +	return m->mnt.mnt_flags & MNT_UMOUNT;
> +}
>  
> -	/* Verify topper is the only grandchild that has not been
> -	 * speculatively unmounted.
> -	 */
> -	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
> -		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
> -			continue;
> -		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
> -			continue;
> -		/* Found a mounted child */
> -		goto children;
> -	}
> +static void umount_one(struct mount *m)
> +{
> +	m->mnt.mnt_flags |= MNT_UMOUNT;
> +	list_del_init(&m->mnt_child);
> +	move_from_ns(m, &to_umount);
> +}
>  
> -	/* Mark mounts that can be unmounted if not locked */
> -	SET_MNT_MARK(mnt);
> -	progress = true;
> +static void remove_candidate(struct mount *m)
> +{
> +	CLEAR_MNT_MARK(m);	// unmark on removal from candidates
> +	list_del_init(&m->mnt_umounting);
> +}
>  
> -	/* If a mount is without children and not locked umount it. */
> -	if (!IS_MNT_LOCKED(mnt)) {
> -		umount_one(mnt, to_umount);
> -	} else {
> -children:
> -		list_move_tail(&mnt->mnt_umounting, to_restore);
> +static void gather_candidates(struct list_head *set)
> +{
> +	LIST_HEAD(visited);
> +	struct mount *m, *p, *q;
> +
> +	list_for_each_entry(m, set, mnt_list) {
> +		if (is_candidate(m))
> +			continue;
> +		list_add(&m->mnt_umounting, &visited);
> +		p = m->mnt_parent;
> +		q = propagation_next(p, p);
> +		while (q) {
> +			struct mount *child = __lookup_mnt(&q->mnt,
> +							   m->mnt_mountpoint);
> +			if (child) {
> +				struct list_head *head;
> +
> +				/*
> +				 * We might've already run into this one.  That
> +				 * must've happened on earlier iteration of the
> +				 * outer loop; in that case we can skip those
> +				 * parents that get propagation from q - there
> +				 * will be nothing new on those as well.
> +				 */
> +				if (is_candidate(child)) {
> +					q = skip_propagation_subtree(q, p);
> +					continue;
> +				}
> +				if (will_be_unmounted(child))
> +					head = &visited;
> +				else
> +					head = &candidates;
> +				list_add(&child->mnt_umounting, head);
> +			}
> +			q = propagation_next(q, p);
> +		}
>  	}
> -out:
> -	return progress;
> +	while (!list_empty(&visited))	// empty visited
> +		list_del_init(visited.next);
>  }
>  
> -static void umount_list(struct list_head *to_umount,
> -			struct list_head *to_restore)
> +/*
> + * We know that some child of @m can't be unmounted.  In all places where the
> + * chain of descent of @m has child not overmounting the root of parent,
> + * the parent can't be unmounted either.
> + */
> +static void trim_ancestors(struct mount *m)
>  {
> -	struct mount *mnt, *child, *tmp;
> -	list_for_each_entry(mnt, to_umount, mnt_list) {
> -		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
> -			/* topper? */
> -			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
> -				list_move_tail(&child->mnt_umounting, to_restore);
> -			else
> -				umount_one(child, to_umount);
> -		}
> +	struct mount *p;
> +
> +	for (p = m->mnt_parent; is_candidate(p); m = p, p = p->mnt_parent) {
> +		if (IS_MNT_MARKED(m))	// all candidates beneath are overmounts
> +			return;
> +		SET_MNT_MARK(m);
> +		if (m->mnt_mountpoint != p->mnt.mnt_root)
> +			remove_candidate(p);
>  	}
>  }
>  
> -static void restore_mounts(struct list_head *to_restore)
> +/*
> + * Find and exclude all umount candidates forbidden by @m
> + * (see Documentation/filesystems/propagate_umount.txt)
> + * If we can immediately tell that @m is OK to unmount (unlocked
> + * and all children are already committed to unmounting) commit
> + * to unmounting it.
> + * Returns the next candidate to be trimmed.
> + */
> +static struct mount *trim_one(struct mount *m)
>  {
> -	/* Restore mounts to a clean working state */
> -	while (!list_empty(to_restore)) {
> -		struct mount *mnt, *parent;
> -		struct mountpoint *mp;
> -
> -		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
> -		CLEAR_MNT_MARK(mnt);
> -		list_del_init(&mnt->mnt_umounting);
> -
> -		/* Should this mount be reparented? */
> -		mp = mnt->mnt_mp;
> -		parent = mnt->mnt_parent;
> -		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
> -			mp = parent->mnt_mp;
> -			parent = parent->mnt_parent;
> -		}
> -		if (parent != mnt->mnt_parent) {
> -			mnt_change_mountpoint(parent, mp, mnt);
> -			mnt_notify_add(mnt);
> +	bool remove_this = false, found = false, umount_this = false;
> +	struct mount *n;
> +	struct list_head *next;
> +
> +	list_for_each_entry(n, &m->mnt_mounts, mnt_child) {
> +		if (!is_candidate(n)) {
> +			found = true;
> +			if (n->mnt_mountpoint != m->mnt.mnt_root) {
> +				remove_this = true;
> +				break;
> +			}
>  		}
>  	}
> +	if (found) {
> +		trim_ancestors(m);
> +	} else if (!IS_MNT_LOCKED(m) && list_empty(&m->mnt_mounts)) {
> +		remove_this = true;
> +		umount_this = true;
> +	}
> +	next = m->mnt_umounting.next;
> +	if (remove_this) {
> +		remove_candidate(m);
> +		if (umount_this)
> +			umount_one(m);
> +	}
> +	if (next == &candidates)
> +		return NULL;
> +
> +	return list_entry(next, struct mount, mnt_umounting);
>  }
>  
> -static void cleanup_umount_visitations(struct list_head *visited)
> +static void handle_locked(struct mount *m)
>  {
> -	while (!list_empty(visited)) {
> -		struct mount *mnt =
> -			list_first_entry(visited, struct mount, mnt_umounting);
> -		list_del_init(&mnt->mnt_umounting);
> +	struct mount *cutoff = m, *p;
> +
> +	for (p = m; is_candidate(p); p = p->mnt_parent) {
> +		remove_candidate(p);
> +		if (!IS_MNT_LOCKED(p))
> +			cutoff = p->mnt_parent;
> +	}
> +	if (will_be_unmounted(p))
> +		cutoff = p;
> +	while (m != cutoff) {
> +		umount_one(m);
> +		m = m->mnt_parent;
>  	}
>  }
>  
>  /*
> - * collect all mounts that receive propagation from the mount in @list,
> - * and return these additional mounts in the same list.
> - * @list: the list of mounts to be unmounted.
> + * @m is not going away, and it overmounts the top of a stack of mounts
> + * that are going away.  We know that all of those are fully overmounted
> + * by the one above (@m being the topmost of the chain), so @m can be slid
> + * in place where the bottom of the stack is attached.
>   *
> - * vfsmount lock must be held for write
> + * NOTE: here we temporarily violate a constraint - two mounts end up with
> + * the same parent and mountpoint; that will be remedied as soon as we
> + * return from propagate_umount() - its caller (umount_tree()) will detach
> + * the stack from the parent it (and now @m) is attached to.  umount_tree()
> + * might choose to keep unmounted pieces stuck to each other, but it always
> + * detaches them from the mounts that remain in the tree.
>   */
> -int propagate_umount(struct list_head *list)
> +static void reparent(struct mount *m)
>  {
> -	struct mount *mnt;
> -	LIST_HEAD(to_restore);
> -	LIST_HEAD(to_umount);
> -	LIST_HEAD(visited);
> +	struct mount *p = m;
> +	struct mountpoint *mp;
>  
> -	/* Find candidates for unmounting */
> -	list_for_each_entry_reverse(mnt, list, mnt_list) {
> -		struct mount *parent = mnt->mnt_parent;
> -		struct mount *m;
> +	do {
> +		mp = p->mnt_mp;
> +		p = p->mnt_parent;
> +	} while (will_be_unmounted(p));

That's nice and clear!

>  
> -		/*
> -		 * If this mount has already been visited it is known that it's
> -		 * entire peer group and all of their slaves in the propagation
> -		 * tree for the mountpoint has already been visited and there is
> -		 * no need to visit them again.
> -		 */
> -		if (!list_empty(&mnt->mnt_umounting))
> -			continue;
> +	mnt_change_mountpoint(p, mp, m);
> +	mnt_notify_add(m);
> +}
>  
> -		list_add_tail(&mnt->mnt_umounting, &visited);
> -		for (m = propagation_next(parent, parent); m;
> -		     m = propagation_next(m, parent)) {
> -			struct mount *child = __lookup_mnt(&m->mnt,
> -							   mnt->mnt_mountpoint);
> -			if (!child)
> -				continue;
> +/**
> + * propagate_umount - apply propagation rules to the set of mounts for umount()
> + * @set: the list of mounts to be unmounted.
> + *
> + * Collect all mounts that receive propagation from the mount in @set and have
> + * no obstacles to being unmounted.  Add these additional mounts to the set.
> + *
> + * See Documentation/filesystems/propagate_umount.txt if you do anything in
> + * this area.
> + *
> + * Locks held:
> + * mount_lock (write_seqlock), namespace_sem (exclusive).
> + */
> +void propagate_umount(struct list_head *set)
> +{
> +	struct mount *m;
>  
> -			if (!list_empty(&child->mnt_umounting)) {
> -				/*
> -				 * If the child has already been visited it is
> -				 * know that it's entire peer group and all of
> -				 * their slaves in the propgation tree for the
> -				 * mountpoint has already been visited and there
> -				 * is no need to visit this subtree again.
> -				 */
> -				m = skip_propagation_subtree(m, parent);
> -				continue;
> -			} else if (child->mnt.mnt_flags & MNT_UMOUNT) {
> -				/*
> -				 * We have come across a partially unmounted
> -				 * mount in a list that has not been visited
> -				 * yet. Remember it has been visited and
> -				 * continue about our merry way.
> -				 */
> -				list_add_tail(&child->mnt_umounting, &visited);
> -				continue;
> -			}
> +	// collect all candidates
> +	gather_candidates(set);
>  
> -			/* Check the child and parents while progress is made */
> -			while (__propagate_umount(child,
> -						  &to_umount, &to_restore)) {
> -				/* Is the parent a umount candidate? */
> -				child = child->mnt_parent;
> -				if (list_empty(&child->mnt_umounting))
> -					break;
> -			}
> -		}
> -	}
> +	// reduce the set until it's non-shifting

Re earlier comments: "Reduce the set until it's closed" seems a lot clearer to me.

> +	for (m = first_candidate(); m; m = trim_one(m))
> +		;
>  
> -	umount_list(&to_umount, &to_restore);
> -	restore_mounts(&to_restore);
> -	cleanup_umount_visitations(&visited);
> -	list_splice_tail(&to_umount, list);
> +	// ... and non-revealing
> +	while (!list_empty(&candidates))
> +		handle_locked(first_candidate());
>  
> -	return 0;
> +	// now to_umount consists of all acceptable candidates
> +	// deal with reparenting of remaining overmounts on those
> +	list_for_each_entry(m, &to_umount, mnt_list) {
> +		while (!list_empty(&m->mnt_mounts)) // should be at most one
> +			reparent(list_first_entry(&m->mnt_mounts,
> +						  struct mount, mnt_child));
> +	}
> +
> +	// and fold them into the set
> +	list_splice_tail_init(&to_umount, set);
>  }
> diff --git a/fs/pnode.h b/fs/pnode.h
> index 34b6247af01d..d84a397bfd43 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -39,7 +39,7 @@ static inline void set_mnt_shared(struct mount *mnt)
>  void change_mnt_propagation(struct mount *, int);
>  int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
>  		struct hlist_head *);
> -int propagate_umount(struct list_head *);
> +void propagate_umount(struct list_head *);
>  int propagate_mount_busy(struct mount *, int);
>  void propagate_mount_unlock(struct mount *);
>  void mnt_release_group_id(struct mount *);

Ok, this looks a lot cleaner to me than what we had before and the
argument seems sane as well. With the requested changes:

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-19 21:35             ` Al Viro
  2025-05-19 22:08               ` Linus Torvalds
@ 2025-05-20 22:27               ` Eric W. Biederman
  2025-05-20 23:08                 ` Al Viro
       [not found]               ` <20250520075317.GB2023217@ZenIV>
  2 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2025-05-20 22:27 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, May 19, 2025 at 11:11:10AM -0700, Linus Torvalds wrote:
>> Another thing that is either purely syntactic, or shows that I
>> *really* don't understand your patch. Why do you do this odd thing:
>> 
>>         // reduce the set until it's non-shifting
>>         for (m = first_candidate(); m; m = trim_one(m))
>>                 ;
>> 
>> which seems to just walk the candidates list in a very non-obvious
>> manner (this is one of those "I had to go back and forth to see what
>> first_candidate() did and what lists it used" cases).
>> 
>> It *seems* to be the same as
>> 
>>         list_for_each_entry_safe(m, tmp, &candidates, mnt_umounting)
>>                 trim_one(m);
>> 
>> because if I read that code right, 'trim_one()' will just always
>> return the next entry in that candidate list.
>
>
> 	Another variant would be to steal one more bit from mnt_flags,
> set it for all candidates when collecting them, have is_candidate() check
> that instead of list_empty(&m->mnt_umounting) and clean it where this
> variant removes from the list; trim_one() would immediately return if
> bit is not set.  Then we could really do list_for_each_entry_safe(),
> with another loop doing list removals afterwards.  Extra work that way,
> though, and I still think it's more confusing...

I have only skimmed this so far, and I am a bit confused what we
are using MNT_MARK for.   I would think we should be able to use
MNT_MARK instead of stealing another bit.

Regardless I believe you said the goal is to make the code as readable
as possible, so next time it needs to be audited a decade from now
it won't be hard to figure out what is going on.

To that end I think leaving everything on the candidate list, and
flipping a bit when we decide that that the mount should be kept
will be easier to understand.

That way we can have all of the mostly naive algorithms examine
a mount and see what we should do with it, in all of the various
cases, and we don't have to be clever.

The only way I can see to avoid difficult audits is to remove as
much cleverness from the code as the problem domain allows.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-20 22:27               ` Eric W. Biederman
@ 2025-05-20 23:08                 ` Al Viro
  2025-05-23  2:10                   ` [RFC][CFT][PATCH v2] Rewrite of propagate_umount() Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2025-05-20 23:08 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner

On Tue, May 20, 2025 at 05:27:55PM -0500, Eric W. Biederman wrote:

> I have only skimmed this so far, and I am a bit confused what we
> are using MNT_MARK for.   I would think we should be able to use
> MNT_MARK instead of stealing another bit.

MNT_MARK is used to avoid walking the same ancestry chain again and
again - once we have done that once and removed all non-overmounts,
we will no longer run into any when reaching that chain.  Otherwise
we can run into an unpleasant situation: a 1000-long chain of overmounts,
with a bush on top of it.  Basically, when we see something in
that bush with a strictly internal non-candidate, we need to take out
all non-overmounts in the chain of ancestor candidates.  Walking
the same 1000-long chunk for each of those would not only be pointless,
it would give O(N^2) worst-case time complexity.
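The marking trick can be sketched in a toy Python model (illustrative names only, not kernel code; the real trim_ancestors() also clears the candidate flag on non-overmount parents, which is elided here).  A long chain of candidate overmounts with a bush of mounts on top is climbed once in total, not once per bush element:

```python
class Mount:
    def __init__(self, parent=None):
        self.parent = parent
        self.candidate = True   # models MNT_UMOUNT_CANDIDATE
        self.marked = False     # models MNT_MARK

def trim_ancestors(m, visits):
    # walk the chain of candidate ancestors; stop as soon as we hit a
    # mount already marked by an earlier call
    p = m.parent
    while p is not None and p.candidate:
        if m.marked:
            return
        m.marked = True
        visits.append(m)
        m, p = p, p.parent

chain = [Mount()]
for _ in range(999):                  # 1000-long chain of overmounts
    chain.append(Mount(parent=chain[-1]))
leaves = [Mount(parent=chain[-1]) for _ in range(50)]   # the "bush"

visits = []
for leaf in leaves:
    trim_ancestors(leaf, visits)

# the chain is walked once in total: ~1000 + 50 visits, not 50 * 1000
assert len(visits) == len(set(map(id, visits)))   # nothing visited twice
```

Without the mark, each of the 50 calls would walk the full chain, giving the quadratic behaviour described above.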

Anyway, I have something I believe is a more readable approach -
have an explicit MNT_UMOUNT_CANDIDATE set by gather_candidates(),
use it for is_candidate(m) and instead of remove_candidate() in
trim_ancestors() just clean the MNT_UMOUNT_CANDIDATE there.

Then, in the beginning of trim_one() *and* handle_locked() have
	if (!is_candidate(m)) {
		remove_from_candidate_list(m);
		return;
	}

Ta-da - trim_one() can be used with list_for_each_entry_safe() now,
so it doesn't need to return anything.

gather_candidates() doesn't need the list of already-seen elements
of the original set - we just set MNT_UMOUNT_CANDIDATE on encountered
mounts, still put the ones not in the original set on candidates and
leave the original ones on the original list.  Instead of
dissolving linkage in 'visited' we loop through the original set
and clear MNT_UMOUNT_CANDIDATE there.

What's more, that allows us to kill the mnt_umounting linkage - we never
use both at the same time now, so 'candidates' can use mnt_list
linkage instead.
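A toy Python model of that flag-plus-safe-iteration scheme (illustrative names, not kernel code): an earlier pass only clears the flag and leaves elements linked, and the later pass unlinks at most the element it is currently visiting - exactly the discipline list_for_each_entry_safe() tolerates.

```python
class M:
    def __init__(self, name):
        self.name = name
        self.candidate = True   # models MNT_UMOUNT_CANDIDATE

candidates = [M("a"), M("b"), M("c"), M("d")]

# some earlier pass (trim_ancestors() in the kernel) decided "b" and "d"
# must stay mounted: flag cleared, elements left on the list
for m in candidates:
    if m.name in ("b", "d"):
        m.candidate = False

def trim_one(m, lst):
    # unlinks at most m itself, so iterating over a snapshot is safe
    if not m.candidate:
        lst.remove(m)           # models remove_from_candidate_list()
        return
    # ... the real code would examine m's children here ...

for m in list(candidates):      # snapshot ~ list_for_each_entry_safe()
    trim_one(m, candidates)

assert [m.name for m in candidates] == ["a", "c"]
```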

I've started tests right now, will update D/f/propagate_umount.txt,
convert away from global lists and push if it survives the tests.

In the meanwhile, the incremental I'm testing right now is this:

diff --git a/fs/mount.h b/fs/mount.h
index 7aecf2a60472..65275d031e61 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -84,7 +84,6 @@ struct mount {
 		struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
 		struct hlist_node mnt_umount;
 	};
-	struct list_head mnt_umounting; /* list entry for umount propagation */
 #ifdef CONFIG_FSNOTIFY
 	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 34b47bd82c38..028db59e2b26 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -382,7 +382,6 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
-		INIT_LIST_HEAD(&mnt->mnt_umounting);
 		INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
 		RB_CLEAR_NODE(&mnt->mnt_node);
 		mnt->mnt.mnt_idmap = &nop_mnt_idmap;
diff --git a/fs/pnode.c b/fs/pnode.c
index 9b2f1ac80f25..605bb22011e0 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -470,14 +470,12 @@ static LIST_HEAD(candidates);	// undecided unmount candidates
 
 static inline struct mount *first_candidate(void)
 {
-	if (list_empty(&candidates))
-		return NULL;
-	return list_first_entry(&candidates, struct mount, mnt_umounting);
+	return list_first_entry_or_null(&candidates, struct mount, mnt_list);
 }
 
 static inline bool is_candidate(struct mount *m)
 {
-	return !list_empty(&m->mnt_umounting);
+	return m->mnt.mnt_flags & MNT_UMOUNT_CANDIDATE;
 }
 
 static inline bool will_be_unmounted(struct mount *m)
@@ -492,29 +490,26 @@ static void umount_one(struct mount *m)
 	move_from_ns(m, &to_umount);
 }
 
-static void remove_candidate(struct mount *m)
+static void remove_from_candidate_list(struct mount *m)
 {
-	CLEAR_MNT_MARK(m);	// unmark on removal from candidates
-	list_del_init(&m->mnt_umounting);
+	m->mnt.mnt_flags &= ~(MNT_MARKED | MNT_UMOUNT_CANDIDATE);
+	list_del_init(&m->mnt_list);
 }
 
 static void gather_candidates(struct list_head *set)
 {
-	LIST_HEAD(visited);
 	struct mount *m, *p, *q;
 
 	list_for_each_entry(m, set, mnt_list) {
 		if (is_candidate(m))
 			continue;
-		list_add(&m->mnt_umounting, &visited);
+		m->mnt.mnt_flags |= MNT_UMOUNT_CANDIDATE;
 		p = m->mnt_parent;
 		q = propagation_next(p, p);
 		while (q) {
 			struct mount *child = __lookup_mnt(&q->mnt,
 							   m->mnt_mountpoint);
 			if (child) {
-				struct list_head *head;
-
 				/*
 				 * We might've already run into this one.  That
 				 * must've happened on earlier iteration of the
@@ -526,17 +521,15 @@ static void gather_candidates(struct list_head *set)
 					q = skip_propagation_subtree(q, p);
 					continue;
 				}
-				if (will_be_unmounted(child))
-					head = &visited;
-				else
-					head = &candidates;
-				list_add(&child->mnt_umounting, head);
+				child->mnt.mnt_flags |= MNT_UMOUNT_CANDIDATE;
+				if (!will_be_unmounted(child))
+					list_add(&child->mnt_list, &candidates);
 			}
 			q = propagation_next(q, p);
 		}
 	}
-	while (!list_empty(&visited))	// empty visited
-		list_del_init(visited.next);
+	list_for_each_entry(m, set, mnt_list)
+		m->mnt.mnt_flags &= ~MNT_UMOUNT_CANDIDATE;
 }
 
 /*
@@ -553,7 +546,7 @@ static void trim_ancestors(struct mount *m)
 			return;
 		SET_MNT_MARK(m);
 		if (m->mnt_mountpoint != p->mnt.mnt_root)
-			remove_candidate(p);
+			p->mnt.mnt_flags &= ~MNT_UMOUNT_CANDIDATE;
 	}
 }
 
@@ -563,13 +556,19 @@ static void trim_ancestors(struct mount *m)
  * If we can immediately tell that @m is OK to unmount (unlocked
  * and all children are already committed to unmounting) commit
  * to unmounting it.
- * Returns the next candidate to be trimmed.
+ * Only @m itself might be taken from the candidates list;
+ * anything found by trim_ancestors() is marked non-candidate
+ * and left on the list.
  */
-static struct mount *trim_one(struct mount *m)
+static void trim_one(struct mount *m)
 {
 	bool remove_this = false, found = false, umount_this = false;
 	struct mount *n;
-	struct list_head *next;
+
+	if (!is_candidate(m)) { // trim_ancestors() left it on list
+		remove_from_candidate_list(m);
+		return;
+	}
 
 	list_for_each_entry(n, &m->mnt_mounts, mnt_child) {
 		if (!is_candidate(n)) {
@@ -586,24 +585,23 @@ static struct mount *trim_one(struct mount *m)
 		remove_this = true;
 		umount_this = true;
 	}
-	next = m->mnt_umounting.next;
 	if (remove_this) {
-		remove_candidate(m);
+		remove_from_candidate_list(m);
 		if (umount_this)
 			umount_one(m);
 	}
-	if (next == &candidates)
-		return NULL;
-
-	return list_entry(next, struct mount, mnt_umounting);
 }
 
 static void handle_locked(struct mount *m)
 {
 	struct mount *cutoff = m, *p;
 
+	if (!is_candidate(m)) { // trim_ancestors() left it on list
+		remove_from_candidate_list(m);
+		return;
+	}
 	for (p = m; is_candidate(p); p = p->mnt_parent) {
-		remove_candidate(p);
+		remove_from_candidate_list(p);
 		if (!IS_MNT_LOCKED(p))
 			cutoff = p->mnt_parent;
 	}
@@ -657,14 +655,14 @@ static void reparent(struct mount *m)
  */
 void propagate_umount(struct list_head *set)
 {
-	struct mount *m;
+	struct mount *m, *p;
 
 	// collect all candidates
 	gather_candidates(set);
 
 	// reduce the set until it's non-shifting
-	for (m = first_candidate(); m; m = trim_one(m))
-		;
+	list_for_each_entry_safe(m, p, &candidates, mnt_list)
+		trim_one(m);
 
 	// ... and non-revealing
 	while (!list_empty(&candidates))
diff --git a/include/linux/mount.h b/include/linux/mount.h
index dcc17ce8a959..33af9c4058fa 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -50,10 +50,12 @@ struct path;
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
-			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
+			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED | \
+			    MNT_UMOUNT_CANDIDATE)
 
 #define MNT_INTERNAL	0x4000
 
+#define MNT_UMOUNT_CANDIDATE	0x020000
 #define MNT_LOCK_ATIME		0x040000
 #define MNT_LOCK_NOEXEC		0x080000
 #define MNT_LOCK_NOSUID		0x100000


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage)
  2025-05-20 11:10           ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Christian Brauner
@ 2025-05-21  2:11             ` Al Viro
  0 siblings, 0 replies; 16+ messages in thread
From: Al Viro @ 2025-05-21  2:11 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Eric W. Biederman, linux-fsdevel, Linus Torvalds

On Tue, May 20, 2025 at 01:10:24PM +0200, Christian Brauner wrote:
> > +It is convenient to define several properties of sets of mounts:
> > +
> > +1) A set S of mounts is non-shifting if for any mount X belonging
> > +to S all subtrees mounted strictly inside of X (i.e. not overmounting
> > +the root of X) contain only elements of S.
> 
> I think "shifting" is misleading. I would suggest either "isolated" or
> "contained" or ideally "closed" which would mean...

Umm...  I'm not sure.  "Shifting" in the sense that pulling that set out
and reparenting everything that remains to the nearest surviving ancestor
won't change the pathnames.  "Contained" or "isolated"... what would
that be about?
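That "pathnames do not shift" reading can be shown in a toy Python model (illustrative, not kernel code): a surviving mount sitting on top of a stack of doomed root-overmounts is reparented the way reparent() in the patch does it, and its visible path is unchanged.

```python
class Mnt:
    def __init__(self, parent, mp):
        self.parent, self.mp = parent, mp   # mp == "" models overmounting
                                            # the parent's root

def path(m):
    return "" if m.parent is None else path(m.parent) + m.mp

root = Mnt(None, "")
a = Mnt(root, "/A")     # doomed
b = Mnt(a, "")          # doomed, overmounts a's root
c = Mnt(b, "")          # survives, overmounts b's root
doomed = {a, b}

before = path(c)

def reparent(m):
    # mirrors reparent() in the patch: climb while the parent is doomed,
    # remembering the mountpoint of the bottom of the stack
    p = m
    while True:
        mp = p.mp
        p = p.parent
        if p not in doomed:
            break
    m.parent, m.mp = p, mp

reparent(c)
assert path(c) == before == "/A"    # the visible pathname did not shift
```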

> > +of that set, but only on top of stacks of root-overmounting elements
> > +of set.  They can be reparented to the place where the bottom of
> > +stack is attached to a mount that will survive.  NOTE: doing that
> > +will violate a constraint on having no more than one mount with
> > +the same parent/mountpoint pair; however, the caller (umount_tree())
> 
> I would prefer if this would insert the term "shadow mounts" since
> that's what we've traditionally used for that.

There's a bit of ambiguity - if we have done

mount -t tmpfs none /tmp/foo
touch /tmp/foo/A
mount -t tmpfs none /tmp/foo
touch /tmp/foo/B

we have two mounts, one overmounting the root of another.  Does "shadow"
apply to the lower (with A on it) or the upper (with B on it)?

> > +{
> > +	while (1) {
> > +		struct mount *master = m->mnt_master;
> > +
> > +		if (master == origin->mnt_master) {
> > +			struct mount *next = next_peer(m);
> > +			return (next == origin) ? NULL : next;
> > +		} else if (m->mnt_slave.next != &master->mnt_slave_list)
> > +			return next_slave(m);
> 
> Please add a comment to that helper that explains how it walks the
> propagation tree. I remember having to fix bugs in that code and the
> lack of comments was noticable.

Ugh...  Let's separate that - it's not specific to propagate_umount()
and the helper is the "we hadn't gone into ->mnt_slave_list" half of
propagation_next(), verbatim.

I agree that comments there would be a good thing, but it (and next_group())
belong to different layer - how do we walk the propagation graph.

FWIW, the current variant of that thing (which seems to survive the tests
so far) already has a plenty in it; let's try to keep at least some parts
in separate commits...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC][CFT][PATCH v2] Rewrite of propagate_umount()
  2025-05-20 23:08                 ` Al Viro
@ 2025-05-23  2:10                   ` Al Viro
  0 siblings, 0 replies; 16+ messages in thread
From: Al Viro @ 2025-05-23  2:10 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner

[
Updated and force-pushed into the same branch.  Seems to survive
the local testing.  Sending out now, in case I end up off-net -
bailing out the damn window well once every 15 minutes or so at the
moment...

Please, review and comment; more testing would be welcome, ditto
for suggestions on the documentation part.  Christian, so far
I've left the terms unchanged; esp. regarding the "shadow mounts" -
I could go with "shadowing" and "shadowed", but I'm not sure I've
seen those used anywhere.
]

Rewrite of propagate_umount()

The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).

I tried to prove that it's the only bug there; I'm still not sure
whether it is.  If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation.  Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.

I hoped to do the transformation as a massage series, but that turns out
to be too convoluted.  So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).

As far as I can tell, it is provably correct and provably linear in the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...

Another nice thing that fell out of that is that ->mnt_umounting is no longer
needed.

Compared to the first version:
    * explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
    * trim_ancestors() only clears that flag, leaving the suckers on list
    * trim_one() and handle_locked() take the stuff with flag cleared off
the list.  That allows iterating with list_for_each_entry_safe() when calling
trim_one() - it removes at most one element from the list now.
    * no globals - I didn't bother with any kind of context, not worth it.

    * Notes updated accordingly; I have not touched the terms yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
new file mode 100644
index 000000000000..5b48540b4059
--- /dev/null
+++ b/Documentation/filesystems/propagate_umount.txt
@@ -0,0 +1,484 @@
+	Notes on propagate_umount()
+
+Umount propagation starts with a set of mounts we are already going to
+take out.  Ideally, we would like to add all downstream cognates to
+that set - anything with the same mountpoint as one of the removed
+mounts and with parent that would receive events from the parent of that
+mount.  However, there are some constraints the resulting set must
+satisfy.
+
+It is convenient to define several properties of sets of mounts:
+
+1) A set S of mounts is non-shifting if for any mount X belonging
+to S all subtrees mounted strictly inside of X (i.e. not overmounting
+the root of X) contain only elements of S.
+
+2) A set S is non-revealing if all locked mounts that belong to S have
+parents that also belong to S.
+
+3) A set S is closed if it contains all children of its elements.
+
+The set of mounts taken out by umount(2) must be non-shifting and
+non-revealing; the first constraint is what allows us to reparent
+any remaining mounts and the second is what prevents the exposure
+of any concealed mountpoints.
+
+propagate_umount() takes the original set as an argument and tries to
+extend that set.  The original set is a full subtree and its root is
+unlocked; what matters is that it's closed and non-revealing.
+Resulting set may not be closed; there might still be mounts outside
+of that set, but only on top of stacks of root-overmounting elements
+of set.  They can be reparented to the place where the bottom of
+stack is attached to a mount that will survive.  NOTE: doing that
+will violate a constraint on having no more than one mount with
+the same parent/mountpoint pair; however, the caller (umount_tree())
+will immediately remedy that - it may keep unmounted element attached
+to parent, but only if the parent itself is unmounted.  Since all
+conflicts created by reparenting have common parent *not* in the
+set and one side of the conflict (bottom of the stack of overmounts)
+is in the set, it will be resolved.  However, we rely upon umount_tree()
+doing that pretty much immediately after the call of propagate_umount().
+
+Algorithm is based on two statements:
+	1) for any set S, there is a maximal non-shifting subset of S
+and it can be calculated in O(#S) time.
+	2) for any non-shifting set S, there is a maximal non-revealing
+subset of S.  That subset is also non-shifting and it can be calculated
+in O(#S) time.
+
+		Finding candidates.
+
+We are given a closed set U and we want to find all mounts that have
+the same mountpoint as some mount m in U *and* whose parent receives
+propagation from the parent of the same mount m.  Naive implementation
+would be
+	S = {}
+	for each m in U
+		add m to S
+		p = parent(m)
+		for each q in Propagation(p) - {p}
+			child = look_up(q, mountpoint(m))
+			if child
+				add child to S
+but that can lead to excessive work - there might be propagation among the
+subtrees of U, in which case we'd end up examining the same candidates
+many times.  Since propagation is transitive, the same will happen to
+everything downstream of that candidate and it's not hard to construct
+cases where the approach above leads to time quadratic in the actual
+number of candidates.
+
+Note that if we run into a candidate we'd already seen, it must've been
+added on an earlier iteration of the outer loop - all additions made
+during one iteration of the outer loop have different parents.  So
+if we find a child already added to the set, we know that everything
+in Propagation(parent(child)) with the same mountpoint has been already
+added.
+	S = {}
+	for each m in U
+		if m in S
+			continue
+		add m to S
+		p = parent(m)
+		q = propagation_next(p, p)
+		while q
+			child = look_up(q, mountpoint(m))
+			if child
+				if child in S
+					q = skip_them(q, p)
+					continue;
+				add child to S
+			q = propagation_next(q, p)
+where
+skip_them(q, p)
+	keep walking Propagation(p) from q until we find something
+	not in Propagation(q)
+
+would get rid of that problem, but we need a sane implementation of
+skip_them().  That's not hard to do - split propagation_next() into
+"down into mnt_slave_list" and "forward-and-up" parts, with the
+skip_them() being "repeat the forward-and-up part until we get NULL
+or something that isn't a peer of the one we are skipping".
+
+Note that there can be no absolute roots among the extra candidates -
+they all come from mount lookups.  Absolute root among the original
+set is _currently_ impossible, but it might be worth protecting
+against.
+
+		Maximal non-shifting subsets.
+
+Let's call a mount m in a set S forbidden in that set if there is a
+subtree mounted strictly inside m and containing mounts that do not
+belong to S.
+
+The set is non-shifting when none of its elements are forbidden in it.
+
+If mount m is forbidden in a set S, it is forbidden in any subset S' it
+belongs to.  In other words, it can't belong to any of the non-shifting
+subsets of S.  If we had a way to find a forbidden mount or show that
+there's none, we could use it to find the maximal non-shifting subset
+simply by finding and removing them until none remain.
+
+Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
+must have been forbidden in S itself.  Indeed, since m has descendants
+that do not belong to S, any subtree that fits into S will fit into
+S - {m} as well.
+
+So in principle we could go through elements of S, checking if they
+are forbidden in S and removing the ones that are.  Removals will
+not invalidate the checks done for earlier mounts - if they were not
+forbidden at the time we checked, they won't become forbidden later.
+It's too costly to be practical, but there is a similar approach that
+is linear in the size of S.
+
+Let's say that mount x in a set S is forbidden by mount y, if
+	* both x and y belong to S.
+	* there is a chain of mounts starting at x and leaving S
+	  immediately after passing through y, with the first
+	  mountpoint strictly inside x.
+Note 1: x may be equal to y - that's the case when something not
+belonging to S is mounted strictly inside x.
+Note 2: if y does not belong to S, it can't forbid anything in S.
+Note 3: if y has no children outside of S, it can't forbid anything in S.
+
+It's easy to show that mount x is forbidden in S if and only if x is
+forbidden in S by some mount y.  And it's easy to find all mounts in S
+forbidden by a given mount.
+
+Consider the following operation:
+	Trim(S, m) = S - {x : x is forbidden by m in S}
+
+Note that if m does not belong to S or has no children outside of S we
+are guaranteed that Trim(S, m) is equal to S.
+
+The following is true: if x is forbidden by y in Trim(S, m), it was
+already forbidden by y in S.
+
+Proof: Suppose x is forbidden by y in Trim(S, m).  Then there is a
+chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
+is the first element that doesn't belong to Trim(S, m) and the
+mountpoint of x_1 is strictly inside x.  If mount r belonged to S, it
+must have been removed by Trim(S, m), i.e. it was forbidden in S by m.
+Then there was a mount chain from r to some child of m that stayed in
+S all the way until m - but then prepending (x_0, ..., x_k) to that
+chain would demonstrate that x is also forbidden in S by m and thus
+can't belong to Trim(S, m) - contradiction.
+Therefore r can not belong to S and our chain demonstrates that
+x is forbidden by y in S.  QED.
+
+Corollary: no mount is forbidden by m in Trim(S, m).  Indeed, any
+such mount would have been forbidden by m in S and thus would have been
+in the part of S removed in Trim(S, m).
+
+Corollary: no mount is forbidden by m in Trim(Trim(S, m), n).  Indeed,
+any such would have to have been forbidden by m in Trim(S, m), which
+is impossible.
+
+Corollary: after
+	S = Trim(S, x_1)
+	S = Trim(S, x_2)
+	...
+	S = Trim(S, x_k)
+no mount remaining in S will be forbidden by any of x_1, ..., x_k.
+
+The following will reduce S to its maximal non-shifting subset:
+	visited = {}
+	while S contains elements not belonging to visited
+		let m be an arbitrary such element of S
+		S = Trim(S, m)
+		add m to visited
+
+S never grows, so the number of elements of S not belonging to visited
+decreases at least by one on each iteration.  When the loop terminates,
+all mounts remaining in S belong to visited.  It's easy to see that at
+the beginning of each iteration no mount remaining in S is forbidden
+by any element of visited.  In other words, no mount remaining in the
+final S is forbidden, i.e. the final value of S will be non-shifting.
+It will be the maximal non-shifting subset, since we removed only
+forbidden elements.
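That loop can be sketched on a toy model (Python for illustration only;
forbidden() below is the quadratic direct check, not what the kernel does,
and Mount/overmounts_root are invented stand-ins):

```python
class Mount:
    """Toy mount: a parent link plus whether it overmounts the parent's root."""
    def __init__(self, parent=None, overmounts_root=False):
        self.parent, self.children = parent, []
        self.overmounts_root = overmounts_root  # mountpoint == parent's root?
        if parent:
            parent.children.append(self)

def forbidden(x, S):
    """Is there a descent chain from x that leaves S, whose first
    mountpoint is strictly inside x?"""
    def leaves_S(m, first_step):
        for c in m.children:
            if first_step and c.overmounts_root:
                continue        # first mountpoint must be strictly inside x
            if c not in S or leaves_S(c, False):
                return True
        return False
    return leaves_S(x, True)

def max_nonshifting(S):
    """Quadratic version: drop forbidden mounts until none remain."""
    S = set(S)
    while True:
        bad = [m for m in S if forbidden(m, S)]
        if not bad:
            return S
        S -= set(bad)

# b strictly inside a; c overmounts the root of b; d is outside S
a = Mount(); b = Mount(a); c = Mount(b, overmounts_root=True); d = Mount(c)
S = {a, b, c}
# c is forbidden (d leaves S strictly inside c); so is a, via the chain
# (a, b, c, d) - b's mountpoint is strictly inside a.  b survives: its
# only child, c, overmounts its root.
```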
+
+	There are two difficulties in implementing the above in linear
+time, both due to the fact that Trim() might need to remove more than one
+element.  A naive implementation of Trim() is vulnerable to running into a
+long chain of mounts, each mounted on top of its parent's root.  Nothing in
+that chain is forbidden, so nothing gets removed from it.  We need to
+recognize such chains and avoid walking them again on subsequent calls of
+Trim(), otherwise the worst-case time ends up quadratic in
+the number of elements in S.  Another difficulty is in implementing the
+outer loop - we need to iterate through all elements of a shrinking set.
+That would be trivial if we never removed more than one element at a time
+(linked list, with list_for_each_entry_safe for iterator), but we may
+need to remove more than one entry, possibly including the ones we have
+already visited.
+
+	Let's start with naive algorithm for Trim():
+
+Trim_one(m)
+	found = false
+	for each n in children(m)
+		if n not in S
+			found = true
+			if (mountpoint(n) != root(m))
+				remove m from S
+				break
+	if found
+		Trim_ancestors(m)
+
+Trim_ancestors(m)
+	for (; parent(m) in S; m = parent(m)) {
+		if (mountpoint(m) != root(parent(m)))
+			remove parent(m) from S
+	}
+
+If m belongs to S, Trim_one(m) will replace S with Trim(S, m).
+Proof:
+	Consider the chains excluding elements from Trim(S, m).  The last
+two elements in such chain are m and some child of m that does not belong
+to S.  If m has no such children, Trim(S, m) is equal to S.
+	m itself is removed if and only if the chain has exactly two
+elements, i.e. when the last element does not overmount the root of m.
+In other words, that happens when m has a child not in S that does not
+overmount the root of m.
+	All other elements to remove will be ancestors of m, such that
+the entire descent chain from them to m is contained in S.  Let
+(x_0, x_1, ..., x_k = m) be the longest such chain.  x_i needs to be
+removed if and only if x_{i+1} does not overmount its root.  It's easy
+to see that Trim_ancestors(m) will iterate through that chain from
+x_k to x_1 and that it will remove exactly the elements that need to be
+removed.
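A minimal Python sketch of the naive Trim_one()/Trim_ancestors() above
(toy model, not kernel code; the guard for m already outside S corresponds
to Trim(S, m) being the identity in that case):

```python
class Mount:
    """Toy mount: a parent link plus whether it overmounts the parent's root."""
    def __init__(self, parent=None, overmounts_root=False):
        self.parent, self.children = parent, []
        self.overmounts_root = overmounts_root  # mountpoint == parent's root?
        if parent:
            parent.children.append(self)

def trim_one(m, S):
    """Replace S with Trim(S, m); Trim is the identity for m not in S."""
    if m not in S:
        return
    found = False
    for n in m.children:
        if n not in S:
            found = True
            if not n.overmounts_root:   # child strictly inside m
                S.discard(m)
                break
    if found:
        trim_ancestors(m, S)

def trim_ancestors(m, S):
    while m.parent in S:
        if not m.overmounts_root:       # m's mountpoint strictly inside parent
            S.discard(m.parent)
        m = m.parent

# c overmounts the root of b; d is strictly inside c and outside S
a = Mount(); b = Mount(a); c = Mount(b, overmounts_root=True); d = Mount(c)
S = {a, b, c}
for m in [a, b, c]:
    trim_one(m, S)
# Trim(S, c) removed both c and a (the chain (a, b, c, d) leaves S right
# after c, with b strictly inside a); b survives.
```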
+
+	Note that if the loop in Trim_ancestors() walks into an already
+visited element, we are guaranteed that remaining iterations will see
+only elements that had already been visited and remove none of them.
+That's the weakness that makes it vulnerable to long chains of full
+overmounts.
+
+	It's easy to deal with, if we can afford setting marks on
+elements of S; we would mark all elements already visited by
+Trim_ancestors() and have it bail out as soon as it sees an already
+marked element.
+
+	The problems with iterating through the set can be dealt with in
+several ways, depending upon the representation we choose for our set.
+One useful observation is that we are given a closed subset of S - the
+original set passed to propagate_umount().  Its elements can neither
+forbid anything nor be forbidden by anything - all their descendants
+belong to S, so they can not occur anywhere in any excluding chain.
+In other words, the elements of that subset will remain in S until
+the end and Trim_one(m) is a no-op for all m from that subset.
+
+	That suggests keeping S as a disjoint union of a closed set U
+('will be unmounted, no matter what') and the set of all elements of
+S that do not belong to U.  That set ('candidates') is all we need
+to iterate through.  Let's represent it as a subset of a cyclic list,
+consisting of all list elements that are marked as candidates (initially -
+all of them).  Then we could have Trim_ancestors() only remove the mark,
+leaving the elements on the list.  Then Trim_one() would never remove
+anything other than its argument from the containing list, allowing us
+to use list_for_each_entry_safe() as the iterator.
+
+	Assuming that representation we get the following:
+
+	list_for_each_entry_safe(m, ..., Candidates, ...)
+		Trim_one(m)
+where
+Trim_one(m)
+	if (m is not marked as a candidate)
+		strip the "seen by Trim_ancestors" mark from m
+		remove m from the Candidates list
+		return
+
+	remove_this = false
+	found = false
+	for each n in children(m)
+		if n not in S
+			found = true
+			if (mountpoint(n) != root(m))
+				remove_this = true
+				break
+	if found
+		Trim_ancestors(m)
+	if remove_this
+		strip the "seen by Trim_ancestors" mark from m
+		strip the "candidate" mark from m
+		remove m from the Candidates list
+
+Trim_ancestors(m)
+	for (p = parent(m); p is marked as candidate; m = p, p = parent(p)) {
+		if m is marked as seen by Trim_ancestors
+			return
+		mark m as seen by Trim_ancestors
+		if (mountpoint(m) != root(p))
+			strip the "candidate" mark from p
+	}
+
+	The terminating condition of the loop in Trim_ancestors() is correct,
+since that loop will never run into a p belonging to U - p is always
+an ancestor of the argument of Trim_one() and, since U is closed, p
+belonging to U would mean that the argument of Trim_one() belongs to U
+as well.  But Trim_one() is never called for elements of U.  In other
+words, p belongs to S if and only if it belongs to candidates.
+
+	Time complexity:
+* we get no more than O(#S) calls of Trim_one()
+* the loop over children in Trim_one() never looks at the same child
+twice through all the calls.
+* iterations of that loop for children in S are no more than O(#S)
+in the worst case
+* at most two children that are not elements of S are considered per
+call of Trim_one().
+* the loop in Trim_ancestors() sets its mark once per iteration and
+no element of S has that mark set more than once.
+
+	In the end we may have some elements excluded from S by
+Trim_ancestors() still stuck on the list.  We could do a separate
+loop removing them from the list (also no worse than O(#S) time),
+but it's easier to leave that until the next phase - there we will
+iterate through the candidates anyway.
+
+	The caller has already removed all elements of U from their parents'
+lists of children, which means that checking whether a child belongs to S
+is equivalent to checking whether it is marked as a candidate; we'll never
+see the elements of U in the loop over children in Trim_one().
+
+	What's more, if we see that children(m) is empty and m is not
+locked, we can immediately move m into the committed subset (remove
+from the parent's list of children, etc.).  That's one fewer mount we'll
+have to look into when we check the list of children of its parent *and*
+when we get to building the non-revealing subset.
+
+		Maximal non-revealing subsets
+
+If S is not a non-revealing subset, there is a locked element x in S
+such that the parent of x is not in S.
+
+Obviously, no non-revealing subset of S may contain x.  Removing such
+elements one by one will therefore end with the maximal non-revealing
+subset (possibly an empty one).  Note that removal of an element will
+require removal of all its locked children, etc.
+
+If the set had been non-shifting, it will remain non-shifting after
+such removals.
+Proof: suppose S was non-shifting, x is a locked element of S, the parent
+of x is not in S and S - {x} is not non-shifting.  Then there is an element
+m in S - {x} and a subtree mounted strictly inside m, such that the subtree
+contains an element not in S - {x}.  Since S is non-shifting, everything in
+that subtree must belong to S.  But that means that this subtree must
+contain x somewhere *and* that the parent of x either belongs to that
+subtree or is equal to m.  Either way it must belong to S.  Contradiction.
+
+// same representation as for finding maximal non-shifting subsets:
+// S is a disjoint union of a non-revealing set U (the ones we are committed
+// to unmount) and a set of candidates, represented as a subset of list
+// elements that have the "is a candidate" mark on them.
+// Elements of U are removed from their parents' lists of children.
+// In the end the set of candidates becomes empty and the maximal
+// non-revealing non-shifting subset of S is left in U.
+	while (Candidates list is non-empty)
+		handle_locked(first(Candidates))
+
+handle_locked(m)
+	if m is not marked as a candidate
+		strip the "seen by Trim_ancestors" mark from m
+		remove m from the list
+		return
+	cutoff = m
+	for (p = m; p in candidates; p = parent(p)) {
+		strip the "seen by Trim_ancestors" mark from p
+		strip the "candidate" mark from p
+		remove p from the Candidates list
+		if (!locked(p))
+			cutoff = parent(p)
+	}
+	if p in U
+		cutoff = p
+	while m != cutoff
+		remove m from children(parent(m))
+		add m to U
+		m = parent(m)
+
+Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
+* If it contains some elements of U, let x_k be the last one of those.
+Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
+* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
+may be elements of a non-revealing subset of S.
+* otherwise let x_k be the first unlocked element of the chain.  Then none
+of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
+S and union of U and {x_k, ..., x_n} is non-revealing.
+
+handle_locked(m) finds which of these cases applies and adjusts Candidates
+and U accordingly.  U remains non-revealing, union of Candidates and
+U still contains any non-revealing subset of S and after the call of
+handle_locked(m) m is guaranteed to be not in Candidates list.  So having
+it called for each element of S would suffice to empty Candidates,
+leaving U the maximal non-revealing subset of S.
+
+However, handle_locked(m) is a no-op when m belongs to U, so it's enough
+to have it called for elements of Candidates list until none remain.
+
+Time complexity: number of calls of handle_locked() is limited by
+#Candidates, each iteration of the first loop in handle_locked() removes
+an element from the list, so their total number of executions is also
+limited by #Candidates; number of iterations in the second loop is no
+greater than the number of iterations of the first loop.
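A toy Python rendition of handle_locked() (illustrative only; locked
stands in for IS_MNT_LOCKED, U for the committed set, and the Mount
class is an invented model, not struct mount):

```python
class Mount:
    """Toy mount: just a parent link and a locked flag."""
    def __init__(self, parent=None, locked=False):
        self.parent, self.locked = parent, locked

def handle_locked(m, candidates, U):
    """Commit the part of m's chain of descent that can be unmounted
    without revealing a locked mount; discard the rest."""
    if m not in candidates:
        return
    cutoff = m
    p = m
    while p in candidates:
        candidates.discard(p)
        if not p.locked:
            cutoff = p.parent   # everything from here down may go
        p = p.parent
    if p in U:                  # chain sits on top of committed mounts
        cutoff = p
    while m is not cutoff:
        U.add(m)
        m = m.parent

# chain sitting on a committed mount: the whole chain goes to U
P = Mount()
c1 = Mount(P, locked=True)
c2 = Mount(c1)                  # unlocked
c3 = Mount(c2, locked=True)
U, candidates = {P}, {c1, c2, c3}
handle_locked(c3, candidates, U)

# same chain on a surviving mount: the locked bottom must stay mounted
Q = Mount()
d1 = Mount(Q, locked=True)
d2 = Mount(d1)                  # unlocked - first unlocked from the top
d3 = Mount(d2, locked=True)
U2, candidates2 = set(), {d1, d2, d3}
handle_locked(d3, candidates2, U2)
```

In the second case d1 is discarded (unmounting it would reveal a locked
mountpoint) while d2 and d3 are committed, matching the case analysis above.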
+
+
+		Reparenting
+
+After we have calculated the final set, we still need to deal with
+reparenting - if an element of the final set has a child not in it,
+that child needs to be reparented.
+
+Such children can only be root-overmounting (otherwise the set wouldn't
+be non-shifting) and their parents can not belong to the original set,
+since the original is guaranteed to be closed.
+
+
+		Putting all of that together
+
+The plan is to
+	* find all candidates
+	* trim down to maximal non-shifting subset
+	* trim down to maximal non-revealing subset
+	* reparent anything that needs to be reparented
+	* return the resulting set to the caller
+
+For the 2nd and 3rd steps we want to separate the set into growing
+non-revealing subset, initially containing the original set ("U" in
+terms of the pseudocode above) and everything we are still not sure about
+("candidates").  It means that for the output of the 1st step we'd like
+the extra candidates separated from the stuff already in the original set.
+For the 4th step we would like the additions to U separate from the
+original set.
+
+So let's go for
+	* original set ("set").  Linkage via mnt_list
+	* undecided candidates ("candidates").  Subset of a list,
+consisting of all its elements marked with a new flag (MNT_UMOUNT_CANDIDATE).
+Initially all elements of the list will be marked that way; in the
+end the list will become empty and no mounts will remain marked with
+that flag.
+	* Reuse MNT_MARKED for "has been already seen by trim_ancestors()".
+	* anything in U that hadn't been in the original set - elements of
+candidates will gradually be either discarded or moved there.  In other
+words, it's the candidates we have already decided to unmount.  Its role
+is reasonably close to the old "to_umount", so let's use that name.
+Linkage via mnt_list.
+
+For gather_candidates() we'll need to maintain both candidates (S -
+set) and intersection of S with set.  Use MNT_UMOUNT_CANDIDATE for
+all elements we encounter, putting the ones not already in the original
+set into the list of candidates.  When we are done, strip that flag from
+all elements of the original set.  That gives a cheap way to check
+whether an element belongs to S (in gather_candidates()) and to candidates
+itself (at later stages).  Call that predicate is_candidate(); it would
+be m->mnt_flags & MNT_UMOUNT_CANDIDATE.
+
+All elements of the original set are marked with MNT_UMOUNT and we'll
+need the same for elements added when joining the contents of to_umount
+to set in the end.  Let's set MNT_UMOUNT at the time we add an element
+to to_umount; that's close to what the old 'umount_one' is doing, so
+let's keep that name.  It also gives us another predicate we need -
+"belongs to union of set and to_umount"; will_be_unmounted() for now.
+
+Removals from the candidates list should strip both MNT_MARKED and
+MNT_UMOUNT_CANDIDATE; call it remove_from_candidates_list().
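Putting the pieces together on the reproducer from earlier in this thread
(a toy Python model, not kernel code: b1/b2 stand for the mounts propagated
to B/1 and B/1/2, ov for the overmount on B/1/2, x for the mount keeping
it busy; nothing here is locked, so the non-revealing pass is trivial):

```python
class Mount:
    """Toy mount: parent link plus whether it overmounts the parent's root."""
    def __init__(self, parent=None, overmounts_root=False, name=""):
        self.parent, self.children, self.name = parent, [], name
        self.overmounts_root = overmounts_root
        if parent:
            parent.children.append(self)

def trim_one(m, S):
    """Naive Trim(S, m), as in the non-shifting discussion above."""
    if m not in S:
        return
    found = False
    for n in m.children:
        if n not in S:
            found = True
            if not n.overmounts_root:
                S.discard(m)
                break
    if found:
        while m.parent in S:
            if not m.overmounts_root:
                S.discard(m.parent)
            m = m.parent

def reparent(m, to_umount):
    """Slide a surviving overmount down to the nearest surviving ancestor."""
    p = m.parent
    while p in to_umount:
        p = p.parent
    m.parent.children.remove(m)
    m.parent = p
    p.children.append(m)

b1 = Mount(name="B/1")                          # propagated from A/1
b2 = Mount(b1, name="B/1/2")                    # propagated from A/1/2
ov = Mount(b2, overmounts_root=True, name="overmount")
x  = Mount(ov, name="x")                        # keeps the overmount busy

candidates = {b1, b2}                           # receive the umount propagation
for m in [b1, b2]:
    trim_one(m, candidates)
# b1 is forbidden by b2 (the chain (b1, b2, ov) leaves the set right after
# b2, and b2's mountpoint is strictly inside b1) - so b1 stays mounted.
to_umount = set(candidates)                     # nothing is locked here
for m in list(to_umount):
    for child in list(m.children):
        if child not in to_umount:
            reparent(child, to_umount)
```

The result matches the correct behaviour described in the bug report:
only b2 is unmounted, b1 stays in place, and the overmount is reparented
once, onto b1, so B/1/2/x remains reachable and B/1/x never appears.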
diff --git a/fs/mount.h b/fs/mount.h
index 7aecf2a60472..65275d031e61 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -84,7 +84,6 @@ struct mount {
 		struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
 		struct hlist_node mnt_umount;
 	};
-	struct list_head mnt_umounting; /* list entry for umount propagation */
 #ifdef CONFIG_FSNOTIFY
 	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 34b47bd82c38..028db59e2b26 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -382,7 +382,6 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
-		INIT_LIST_HEAD(&mnt->mnt_umounting);
 		INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
 		RB_CLEAR_NODE(&mnt->mnt_node);
 		mnt->mnt.mnt_idmap = &nop_mnt_idmap;
diff --git a/fs/pnode.c b/fs/pnode.c
index fb77427df39e..8d29d34e4bcd 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -24,11 +24,6 @@ static inline struct mount *first_slave(struct mount *p)
 	return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave);
 }
 
-static inline struct mount *last_slave(struct mount *p)
-{
-	return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave);
-}
-
 static inline struct mount *next_slave(struct mount *p)
 {
 	return list_entry(p->mnt_slave.next, struct mount, mnt_slave);
@@ -136,6 +131,23 @@ void change_mnt_propagation(struct mount *mnt, int type)
 	}
 }
 
+static struct mount *__propagation_next(struct mount *m,
+					 struct mount *origin)
+{
+	while (1) {
+		struct mount *master = m->mnt_master;
+
+		if (master == origin->mnt_master) {
+			struct mount *next = next_peer(m);
+			return (next == origin) ? NULL : next;
+		} else if (m->mnt_slave.next != &master->mnt_slave_list)
+			return next_slave(m);
+
+		/* back at master */
+		m = master;
+	}
+}
+
 /*
  * get the next mount in the propagation tree.
  * @m: the mount seen last
@@ -153,31 +165,26 @@ static struct mount *propagation_next(struct mount *m,
 	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
 		return first_slave(m);
 
-	while (1) {
-		struct mount *master = m->mnt_master;
-
-		if (master == origin->mnt_master) {
-			struct mount *next = next_peer(m);
-			return (next == origin) ? NULL : next;
-		} else if (m->mnt_slave.next != &master->mnt_slave_list)
-			return next_slave(m);
+	return __propagation_next(m, origin);
+}
 
-		/* back at master */
-		m = master;
-	}
+static inline bool peers(const struct mount *m1, const struct mount *m2)
+{
+	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
 
 static struct mount *skip_propagation_subtree(struct mount *m,
 						struct mount *origin)
 {
 	/*
-	 * Advance m such that propagation_next will not return
-	 * the slaves of m.
+	 * Advance m past everything that gets propagation from it.
 	 */
-	if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
-		m = last_slave(m);
+	struct mount *p = __propagation_next(m, origin);
+
+	while (p && peers(m, p))
+		p = __propagation_next(p, origin);
 
-	return m;
+	return p;
 }
 
 static struct mount *next_group(struct mount *m, struct mount *origin)
@@ -216,11 +223,6 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct hlist_head *list;
 
-static inline bool peers(const struct mount *m1, const struct mount *m2)
-{
-	return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
-}
-
 static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
 {
 	struct mount *child;
@@ -463,181 +465,214 @@ void propagate_mount_unlock(struct mount *mnt)
 	}
 }
 
-static void umount_one(struct mount *mnt, struct list_head *to_umount)
+static inline bool is_candidate(struct mount *m)
 {
-	CLEAR_MNT_MARK(mnt);
-	mnt->mnt.mnt_flags |= MNT_UMOUNT;
-	list_del_init(&mnt->mnt_child);
-	list_del_init(&mnt->mnt_umounting);
-	move_from_ns(mnt, to_umount);
+	return m->mnt.mnt_flags & MNT_UMOUNT_CANDIDATE;
 }
 
-/*
- * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
- * parent propagates to.
- */
-static bool __propagate_umount(struct mount *mnt,
-			       struct list_head *to_umount,
-			       struct list_head *to_restore)
+static inline bool will_be_unmounted(struct mount *m)
 {
-	bool progress = false;
-	struct mount *child;
+	return m->mnt.mnt_flags & MNT_UMOUNT;
+}
 
-	/*
-	 * The state of the parent won't change if this mount is
-	 * already unmounted or marked as without children.
-	 */
-	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
-		goto out;
+static void umount_one(struct mount *m, struct list_head *to_umount)
+{
+	m->mnt.mnt_flags |= MNT_UMOUNT;
+	list_del_init(&m->mnt_child);
+	move_from_ns(m, to_umount);
+}
 
-	/* Verify topper is the only grandchild that has not been
-	 * speculatively unmounted.
-	 */
-	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
-			continue;
-		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
-			continue;
-		/* Found a mounted child */
-		goto children;
-	}
+static void remove_from_candidate_list(struct mount *m)
+{
+	m->mnt.mnt_flags &= ~(MNT_MARKED | MNT_UMOUNT_CANDIDATE);
+	list_del_init(&m->mnt_list);
+}
 
-	/* Mark mounts that can be unmounted if not locked */
-	SET_MNT_MARK(mnt);
-	progress = true;
+static void gather_candidates(struct list_head *set,
+			      struct list_head *candidates)
+{
+	struct mount *m, *p, *q;
 
-	/* If a mount is without children and not locked umount it. */
-	if (!IS_MNT_LOCKED(mnt)) {
-		umount_one(mnt, to_umount);
-	} else {
-children:
-		list_move_tail(&mnt->mnt_umounting, to_restore);
+	list_for_each_entry(m, set, mnt_list) {
+		if (is_candidate(m))
+			continue;
+		m->mnt.mnt_flags |= MNT_UMOUNT_CANDIDATE;
+		p = m->mnt_parent;
+		q = propagation_next(p, p);
+		while (q) {
+			struct mount *child = __lookup_mnt(&q->mnt,
+							   m->mnt_mountpoint);
+			if (child) {
+				/*
+				 * We might've already run into this one.  That
+				 * must've happened on earlier iteration of the
+				 * outer loop; in that case we can skip those
+				 * parents that get propagation from q - there
+				 * will be nothing new on those as well.
+				 */
+				if (is_candidate(child)) {
+					q = skip_propagation_subtree(q, p);
+					continue;
+				}
+				child->mnt.mnt_flags |= MNT_UMOUNT_CANDIDATE;
+				if (!will_be_unmounted(child))
+					list_add(&child->mnt_list, candidates);
+			}
+			q = propagation_next(q, p);
+		}
 	}
-out:
-	return progress;
+	list_for_each_entry(m, set, mnt_list)
+		m->mnt.mnt_flags &= ~MNT_UMOUNT_CANDIDATE;
 }
 
-static void umount_list(struct list_head *to_umount,
-			struct list_head *to_restore)
+/*
+ * We know that some child of @m can't be unmounted.  In all places where the
+ * chain of descent of @m has child not overmounting the root of parent,
+ * the parent can't be unmounted either.
+ */
+static void trim_ancestors(struct mount *m)
 {
-	struct mount *mnt, *child, *tmp;
-	list_for_each_entry(mnt, to_umount, mnt_list) {
-		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
-			/* topper? */
-			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
-				list_move_tail(&child->mnt_umounting, to_restore);
-			else
-				umount_one(child, to_umount);
-		}
+	struct mount *p;
+
+	for (p = m->mnt_parent; is_candidate(p); m = p, p = p->mnt_parent) {
+		if (IS_MNT_MARKED(m))	// all candidates beneath are overmounts
+			return;
+		SET_MNT_MARK(m);
+		if (m->mnt_mountpoint != p->mnt.mnt_root)
+			p->mnt.mnt_flags &= ~MNT_UMOUNT_CANDIDATE;
 	}
 }
 
-static void restore_mounts(struct list_head *to_restore)
+/*
+ * Find and exclude all umount candidates forbidden by @m
+ * (see Documentation/filesystems/propagate_umount.txt)
+ * If we can immediately tell that @m is OK to unmount (unlocked
+ * and all children are already committed to unmounting) commit
+ * to unmounting it.
+ * Only @m itself might be taken from the candidates list;
+ * anything found by trim_ancestors() is marked non-candidate
+ * and left on the list.
+ */
+static void trim_one(struct mount *m, struct list_head *to_umount)
 {
-	/* Restore mounts to a clean working state */
-	while (!list_empty(to_restore)) {
-		struct mount *mnt, *parent;
-		struct mountpoint *mp;
-
-		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
-		CLEAR_MNT_MARK(mnt);
-		list_del_init(&mnt->mnt_umounting);
-
-		/* Should this mount be reparented? */
-		mp = mnt->mnt_mp;
-		parent = mnt->mnt_parent;
-		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
-			mp = parent->mnt_mp;
-			parent = parent->mnt_parent;
-		}
-		if (parent != mnt->mnt_parent) {
-			mnt_change_mountpoint(parent, mp, mnt);
-			mnt_notify_add(mnt);
+	bool remove_this = false, found = false, umount_this = false;
+	struct mount *n;
+
+	if (!is_candidate(m)) { // trim_ancestors() left it on list
+		remove_from_candidate_list(m);
+		return;
+	}
+
+	list_for_each_entry(n, &m->mnt_mounts, mnt_child) {
+		if (!is_candidate(n)) {
+			found = true;
+			if (n->mnt_mountpoint != m->mnt.mnt_root) {
+				remove_this = true;
+				break;
+			}
 		}
 	}
+	if (found) {
+		trim_ancestors(m);
+	} else if (!IS_MNT_LOCKED(m) && list_empty(&m->mnt_mounts)) {
+		remove_this = true;
+		umount_this = true;
+	}
+	if (remove_this) {
+		remove_from_candidate_list(m);
+		if (umount_this)
+			umount_one(m, to_umount);
+	}
 }
 
-static void cleanup_umount_visitations(struct list_head *visited)
+static void handle_locked(struct mount *m, struct list_head *to_umount)
 {
-	while (!list_empty(visited)) {
-		struct mount *mnt =
-			list_first_entry(visited, struct mount, mnt_umounting);
-		list_del_init(&mnt->mnt_umounting);
+	struct mount *cutoff = m, *p;
+
+	if (!is_candidate(m)) { // trim_ancestors() left it on list
+		remove_from_candidate_list(m);
+		return;
+	}
+	for (p = m; is_candidate(p); p = p->mnt_parent) {
+		remove_from_candidate_list(p);
+		if (!IS_MNT_LOCKED(p))
+			cutoff = p->mnt_parent;
+	}
+	if (will_be_unmounted(p))
+		cutoff = p;
+	while (m != cutoff) {
+		umount_one(m, to_umount);
+		m = m->mnt_parent;
 	}
 }
 
 /*
- * collect all mounts that receive propagation from the mount in @list,
- * and return these additional mounts in the same list.
- * @list: the list of mounts to be unmounted.
+ * @m is not going away, and it overmounts the top of a stack of mounts
+ * that are going away.  We know that all of those are fully overmounted
+ * by the one above (@m being the topmost of the chain), so @m can be slid
+ * in place where the bottom of the stack is attached.
  *
- * vfsmount lock must be held for write
+ * NOTE: here we temporarily violate a constraint - two mounts end up with
+ * the same parent and mountpoint; that will be remedied as soon as we
+ * return from propagate_umount() - its caller (umount_tree()) will detach
+ * the stack from the parent it (and now @m) is attached to.  umount_tree()
+ * might choose to keep unmounted pieces stuck to each other, but it always
+ * detaches them from the mounts that remain in the tree.
  */
-int propagate_umount(struct list_head *list)
+static void reparent(struct mount *m)
 {
-	struct mount *mnt;
-	LIST_HEAD(to_restore);
-	LIST_HEAD(to_umount);
-	LIST_HEAD(visited);
-
-	/* Find candidates for unmounting */
-	list_for_each_entry_reverse(mnt, list, mnt_list) {
-		struct mount *parent = mnt->mnt_parent;
-		struct mount *m;
+	struct mount *p = m;
+	struct mountpoint *mp;
 
-		/*
-		 * If this mount has already been visited it is known that it's
-		 * entire peer group and all of their slaves in the propagation
-		 * tree for the mountpoint has already been visited and there is
-		 * no need to visit them again.
-		 */
-		if (!list_empty(&mnt->mnt_umounting))
-			continue;
+	do {
+		mp = p->mnt_mp;
+		p = p->mnt_parent;
+	} while (will_be_unmounted(p));
 
-		list_add_tail(&mnt->mnt_umounting, &visited);
-		for (m = propagation_next(parent, parent); m;
-		     m = propagation_next(m, parent)) {
-			struct mount *child = __lookup_mnt(&m->mnt,
-							   mnt->mnt_mountpoint);
-			if (!child)
-				continue;
+	mnt_change_mountpoint(p, mp, m);
+	mnt_notify_add(m);
+}
 
-			if (!list_empty(&child->mnt_umounting)) {
-				/*
-				 * If the child has already been visited it is
-				 * know that it's entire peer group and all of
-				 * their slaves in the propgation tree for the
-				 * mountpoint has already been visited and there
-				 * is no need to visit this subtree again.
-				 */
-				m = skip_propagation_subtree(m, parent);
-				continue;
-			} else if (child->mnt.mnt_flags & MNT_UMOUNT) {
-				/*
-				 * We have come across a partially unmounted
-				 * mount in a list that has not been visited
-				 * yet. Remember it has been visited and
-				 * continue about our merry way.
-				 */
-				list_add_tail(&child->mnt_umounting, &visited);
-				continue;
-			}
+/**
+ * propagate_umount - apply propagation rules to the set of mounts for umount()
+ * @set: the list of mounts to be unmounted.
+ *
+ * Collect all mounts that receive propagation from the mount in @set and have
+ * no obstacles to being unmounted.  Add these additional mounts to the set.
+ *
+ * See Documentation/filesystems/propagate_umount.txt if you do anything in
+ * this area.
+ *
+ * Locks held:
+ * mount_lock (write_seqlock), namespace_sem (exclusive).
+ */
+void propagate_umount(struct list_head *set)
+{
+	struct mount *m, *p;
+	LIST_HEAD(to_umount);	// committed to unmounting
+	LIST_HEAD(candidates);	// undecided umount candidates
 
-			/* Check the child and parents while progress is made */
-			while (__propagate_umount(child,
-						  &to_umount, &to_restore)) {
-				/* Is the parent a umount candidate? */
-				child = child->mnt_parent;
-				if (list_empty(&child->mnt_umounting))
-					break;
-			}
-		}
+	// collect all candidates
+	gather_candidates(set, &candidates);
+
+	// reduce the set until it's non-shifting
+	list_for_each_entry_safe(m, p, &candidates, mnt_list)
+		trim_one(m, &to_umount);
+
+	// ... and non-revealing
+	while (!list_empty(&candidates)) {
+		m = list_first_entry(&candidates, struct mount, mnt_list);
+		handle_locked(m, &to_umount);
 	}
 
-	umount_list(&to_umount, &to_restore);
-	restore_mounts(&to_restore);
-	cleanup_umount_visitations(&visited);
-	list_splice_tail(&to_umount, list);
+	// now to_umount consists of all acceptable candidates
+	// deal with reparenting of remaining overmounts on those
+	list_for_each_entry(m, &to_umount, mnt_list) {
+		while (!list_empty(&m->mnt_mounts)) // should be at most one
+			reparent(list_first_entry(&m->mnt_mounts,
+						  struct mount, mnt_child));
+	}
 
-	return 0;
+	// and fold them into the set
+	list_splice_tail_init(&to_umount, set);
 }
diff --git a/fs/pnode.h b/fs/pnode.h
index 34b6247af01d..d84a397bfd43 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -39,7 +39,7 @@ static inline void set_mnt_shared(struct mount *mnt)
 void change_mnt_propagation(struct mount *, int);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
-int propagate_umount(struct list_head *);
+void propagate_umount(struct list_head *);
 int propagate_mount_busy(struct mount *, int);
 void propagate_mount_unlock(struct mount *);
 void mnt_release_group_id(struct mount *);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index dcc17ce8a959..33af9c4058fa 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -50,10 +50,12 @@ struct path;
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
-			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
+			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED | \
+			    MNT_UMOUNT_CANDIDATE)
 
 #define MNT_INTERNAL	0x4000
 
+#define MNT_UMOUNT_CANDIDATE	0x020000
 #define MNT_LOCK_ATIME		0x040000
 #define MNT_LOCK_NOEXEC		0x080000
 #define MNT_LOCK_NOSUID		0x100000

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH][RFC] replace collect_mounts()/drop_collected_mounts() with safer variant
       [not found]                       ` <20250617041754.GA1591763@ZenIV>
@ 2025-06-17 21:18                         ` Al Viro
  0 siblings, 0 replies; 16+ messages in thread
From: Al Viro @ 2025-06-17 21:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Miklos Szeredi,
	Paul Moore, Eric W. Biederman

collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with a callback passed to iterate_mounts();
it also has an oopsable race with d_invalidate(); it also creates temporary
clones of mounts invisibly for sync umount (IOW, you can have umount return
with the filesystem not mounted in any other location and yet have it still
busy as umount(2) returns).

A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.

	* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
	* collect_paths(where, preallocated, size) returns either ERR_PTR(-E...) or
a pointer to an array of struct path, one for each chunk of the tree visible
under 'where' (i.e. the first element is a copy of 'where', followed by
(mount, root) for everything mounted under it - the same set collect_mounts()
would give).  Unlike collect_mounts(), the mounts are *not* cloned - we just
get (pinning) references to the roots of the subtrees in the caller's
namespace.
	The array is terminated by a {NULL, NULL} struct path.  If it fits into
the preallocated array (normally on-stack), that's where it goes; otherwise
it's allocated by kmalloc_array().  Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
	* drop_collected_paths(paths, preallocated) is given the array returned
by collect_paths() and the preallocated array passed to the same call.  All
mount/dentry references are dropped and the array is kfree'd if it's not
equal to 'preallocated'.
	* instead of iterate_mounts(), users should just iterate over the array
of struct path - nothing exotic is needed for that.  Existing users (all in
audit_tree.c) are converted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index bb95e5102916..aafbcd0a1c8f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2280,21 +2280,62 @@ struct mount *copy_tree(struct mount *src_root, struct dentry *dentry,
 	return dst_mnt;
 }
 
-/* Caller should check returned pointer for errors */
+static inline bool extend_array(struct path **res, struct path **to_free,
+				unsigned n, unsigned *count, unsigned new_count)
+{
+	struct path *p;
+
+	if (likely(n < *count))
+		return true;
+	p = kmalloc_array(new_count, sizeof(struct path), GFP_KERNEL);
+	if (p && *count)
+		memcpy(p, *res, *count * sizeof(struct path));
+	*count = new_count;
+	kfree(*to_free);
+	*to_free = *res = p;
+	return p;
+}
 
-struct vfsmount *collect_mounts(const struct path *path)
+struct path *collect_paths(const struct path *path,
+			      struct path *prealloc, unsigned count)
 {
-	struct mount *tree;
-	namespace_lock();
-	if (!check_mnt(real_mount(path->mnt)))
-		tree = ERR_PTR(-EINVAL);
-	else
-		tree = copy_tree(real_mount(path->mnt), path->dentry,
-				 CL_COPY_ALL | CL_PRIVATE);
-	namespace_unlock();
-	if (IS_ERR(tree))
-		return ERR_CAST(tree);
-	return &tree->mnt;
+	struct mount *root = real_mount(path->mnt);
+	struct mount *child;
+	struct path *res = prealloc, *to_free = NULL;
+	unsigned n = 0;
+
+	guard(rwsem_read)(&namespace_sem);
+
+	if (!check_mnt(root))
+		return ERR_PTR(-EINVAL);
+	if (!extend_array(&res, &to_free, 0, &count, 32))
+		return ERR_PTR(-ENOMEM);
+	res[n++] = *path;
+	list_for_each_entry(child, &root->mnt_mounts, mnt_child) {
+		if (!is_subdir(child->mnt_mountpoint, path->dentry))
+			continue;
+		for (struct mount *m = child; m; m = next_mnt(m, child)) {
+			if (!extend_array(&res, &to_free, n, &count, 2 * count))
+				return ERR_PTR(-ENOMEM);
+			res[n].mnt = &m->mnt;
+			res[n].dentry = m->mnt.mnt_root;
+			n++;
+		}
+	}
+	if (!extend_array(&res, &to_free, n, &count, count + 1))
+		return ERR_PTR(-ENOMEM);
+	memset(res + n, 0, (count - n) * sizeof(struct path));
+	for (struct path *p = res; p->mnt; p++)
+		path_get(p);
+	return res;
+}
+
+void drop_collected_paths(struct path *paths, struct path *prealloc)
+{
+	for (struct path *p = paths; p->mnt; p++)
+		path_put(p);
+	if (paths != prealloc)
+		kfree(paths);
 }
 
 static void free_mnt_ns(struct mnt_namespace *);
@@ -2335,15 +2376,6 @@ void dissolve_on_fput(struct vfsmount *mnt)
 	free_mnt_ns(ns);
 }
 
-void drop_collected_mounts(struct vfsmount *mnt)
-{
-	namespace_lock();
-	lock_mount_hash();
-	umount_tree(real_mount(mnt), 0);
-	unlock_mount_hash();
-	namespace_unlock();
-}
-
 static bool __has_locked_children(struct mount *mnt, struct dentry *dentry)
 {
 	struct mount *child;
@@ -2443,21 +2475,6 @@ struct vfsmount *clone_private_mount(const struct path *path)
 }
 EXPORT_SYMBOL_GPL(clone_private_mount);
 
-int iterate_mounts(int (*f)(struct vfsmount *, void *), void *arg,
-		   struct vfsmount *root)
-{
-	struct mount *mnt;
-	int res = f(root, arg);
-	if (res)
-		return res;
-	list_for_each_entry(mnt, &real_mount(root)->mnt_list, mnt_list) {
-		res = f(&mnt->mnt, arg);
-		if (res)
-			return res;
-	}
-	return 0;
-}
-
 static void lock_mnt_tree(struct mount *mnt)
 {
 	struct mount *p;
@@ -6138,7 +6155,11 @@ void put_mnt_ns(struct mnt_namespace *ns)
 {
 	if (!refcount_dec_and_test(&ns->ns.count))
 		return;
-	drop_collected_mounts(&ns->root->mnt);
+	namespace_lock();
+	lock_mount_hash();
+	umount_tree(ns->root, 0);
+	unlock_mount_hash();
+	namespace_unlock();
 	free_mnt_ns(ns);
 }
 
diff --git a/fs/pnode.h b/fs/pnode.h
index bfc10c095cbf..04f1ac53aa49 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -28,8 +28,6 @@
 #define CL_SHARED_TO_SLAVE	0x20
 #define CL_COPY_MNT_NS_FILE	0x40
 
-#define CL_COPY_ALL		(CL_COPY_UNBINDABLE | CL_COPY_MNT_NS_FILE)
-
 static inline void set_mnt_shared(struct mount *mnt)
 {
 	mnt->mnt.mnt_flags &= ~MNT_SHARED_MASK;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cae7324650b6..65fa8442c00a 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -118,10 +118,8 @@ extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
 int do_mount(const char *, const char __user *,
 		     const char *, unsigned long, void *);
-extern struct vfsmount *collect_mounts(const struct path *);
-extern void drop_collected_mounts(struct vfsmount *);
-extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
-			  struct vfsmount *);
+extern struct path *collect_paths(const struct path *, struct path *, unsigned);
+extern void drop_collected_paths(struct path *, struct path *);
 extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num);
 
 extern int cifs_root_data(char **dev, char **opts);
diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index f2f38903b2fe..68e042ae93c7 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -668,12 +668,6 @@ int audit_remove_tree_rule(struct audit_krule *rule)
 	return 0;
 }
 
-static int compare_root(struct vfsmount *mnt, void *arg)
-{
-	return inode_to_key(d_backing_inode(mnt->mnt_root)) ==
-	       (unsigned long)arg;
-}
-
 void audit_trim_trees(void)
 {
 	struct list_head cursor;
@@ -683,8 +677,9 @@ void audit_trim_trees(void)
 	while (cursor.next != &tree_list) {
 		struct audit_tree *tree;
 		struct path path;
-		struct vfsmount *root_mnt;
 		struct audit_node *node;
+		struct path *paths;
+		struct path array[16];
 		int err;
 
 		tree = container_of(cursor.next, struct audit_tree, list);
@@ -696,9 +691,9 @@ void audit_trim_trees(void)
 		if (err)
 			goto skip_it;
 
-		root_mnt = collect_mounts(&path);
+		paths = collect_paths(&path, array, 16);
 		path_put(&path);
-		if (IS_ERR(root_mnt))
+		if (IS_ERR(paths))
 			goto skip_it;
 
 		spin_lock(&hash_lock);
@@ -706,14 +701,17 @@ void audit_trim_trees(void)
 			struct audit_chunk *chunk = find_chunk(node);
 			/* this could be NULL if the watch is dying else where... */
 			node->index |= 1U<<31;
-			if (iterate_mounts(compare_root,
-					   (void *)(chunk->key),
-					   root_mnt))
-				node->index &= ~(1U<<31);
+			for (struct path *p = paths; p->dentry; p++) {
+				struct inode *inode = p->dentry->d_inode;
+				if (inode_to_key(inode) == chunk->key) {
+					node->index &= ~(1U<<31);
+					break;
+				}
+			}
 		}
 		spin_unlock(&hash_lock);
 		trim_marked(tree);
-		drop_collected_mounts(root_mnt);
+		drop_collected_paths(paths, array);
 skip_it:
 		put_tree(tree);
 		mutex_lock(&audit_filter_mutex);
@@ -742,9 +740,14 @@ void audit_put_tree(struct audit_tree *tree)
 	put_tree(tree);
 }
 
-static int tag_mount(struct vfsmount *mnt, void *arg)
+static int tag_mounts(struct path *paths, struct audit_tree *tree)
 {
-	return tag_chunk(d_backing_inode(mnt->mnt_root), arg);
+	for (struct path *p = paths; p->dentry; p++) {
+		int err = tag_chunk(p->dentry->d_inode, tree);
+		if (err)
+			return err;
+	}
+	return 0;
 }
 
 /*
@@ -801,7 +804,8 @@ int audit_add_tree_rule(struct audit_krule *rule)
 {
 	struct audit_tree *seed = rule->tree, *tree;
 	struct path path;
-	struct vfsmount *mnt;
+	struct path array[16];
+	struct path *paths;
 	int err;
 
 	rule->tree = NULL;
@@ -828,16 +832,16 @@ int audit_add_tree_rule(struct audit_krule *rule)
 	err = kern_path(tree->pathname, 0, &path);
 	if (err)
 		goto Err;
-	mnt = collect_mounts(&path);
+	paths = collect_paths(&path, array, 16);
 	path_put(&path);
-	if (IS_ERR(mnt)) {
-		err = PTR_ERR(mnt);
+	if (IS_ERR(paths)) {
+		err = PTR_ERR(paths);
 		goto Err;
 	}
 
 	get_tree(tree);
-	err = iterate_mounts(tag_mount, tree, mnt);
-	drop_collected_mounts(mnt);
+	err = tag_mounts(paths, tree);
+	drop_collected_paths(paths, array);
 
 	if (!err) {
 		struct audit_node *node;
@@ -872,20 +876,21 @@ int audit_tag_tree(char *old, char *new)
 	struct list_head cursor, barrier;
 	int failed = 0;
 	struct path path1, path2;
-	struct vfsmount *tagged;
+	struct path array[16];
+	struct path *paths;
 	int err;
 
 	err = kern_path(new, 0, &path2);
 	if (err)
 		return err;
-	tagged = collect_mounts(&path2);
+	paths = collect_paths(&path2, array, 16);
 	path_put(&path2);
-	if (IS_ERR(tagged))
-		return PTR_ERR(tagged);
+	if (IS_ERR(paths))
+		return PTR_ERR(paths);
 
 	err = kern_path(old, 0, &path1);
 	if (err) {
-		drop_collected_mounts(tagged);
+		drop_collected_paths(paths, array);
 		return err;
 	}
 
@@ -914,7 +919,7 @@ int audit_tag_tree(char *old, char *new)
 			continue;
 		}
 
-		failed = iterate_mounts(tag_mount, tree, tagged);
+		failed = tag_mounts(paths, tree);
 		if (failed) {
 			put_tree(tree);
 			mutex_lock(&audit_filter_mutex);
@@ -955,7 +960,7 @@ int audit_tag_tree(char *old, char *new)
 	list_del(&cursor);
 	mutex_unlock(&audit_filter_mutex);
 	path_put(&path1);
-	drop_collected_mounts(tagged);
+	drop_collected_paths(paths, array);
 	return failed;
 }
 


end of thread [~2025-06-17 21:18 UTC | newest]

Thread overview: 16+ messages
2025-05-11 23:27 [BUG] propagate_umount() breakage Al Viro
2025-05-12  4:50 ` Eric W. Biederman
2025-05-13  3:56   ` Al Viro
2025-05-15 11:41     ` Al Viro
2025-05-15 11:47       ` Al Viro
2025-05-16  5:21         ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Al Viro
2025-05-19 18:11           ` Linus Torvalds
2025-05-19 21:35             ` Al Viro
2025-05-19 22:08               ` Linus Torvalds
2025-05-19 22:26                 ` Linus Torvalds
2025-05-20 22:27               ` Eric W. Biederman
2025-05-20 23:08                 ` Al Viro
2025-05-23  2:10                   ` [RFC][CFT][PATCH v2] Rewrite of propagate_umount() Al Viro
     [not found]               ` <20250520075317.GB2023217@ZenIV>
     [not found]                 ` <87y0uqlvxs.fsf@email.froward.int.ebiederm.org>
     [not found]                   ` <20250520231854.GF2023217@ZenIV>
     [not found]                     ` <20250521023219.GA1309405@ZenIV>
     [not found]                       ` <20250617041754.GA1591763@ZenIV>
2025-06-17 21:18                         ` [PATCH][RFC] replace collect_mounts()/drop_collected_mounts() with safer variant Al Viro
2025-05-20 11:10           ` [RFC][CFT][PATCH] Rewrite of propagate_umount() (was Re: [BUG] propagate_umount() breakage) Christian Brauner
2025-05-21  2:11             ` Al Viro
