linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [git pull] mount fixes
@ 2025-06-08 17:24 Al Viro
  2025-06-08 18:53 ` pr-tracker-bot
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2025-06-08 17:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel

The following changes since commit a82ba839915926f8713183fd023c6d9357bae26c:

  Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (2025-05-30 15:04:11 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-fixes

for you to fetch changes up to 12f147ddd6de7382dad54812e65f3f08d05809fc:

  do_change_type(): refuse to operate on unmounted/not ours mounts (2025-06-07 01:37:56 -0400)

----------------------------------------------------------------
mount-related bugfixes

this cycle regression (well, bugfix for this cycle bugfix for v6.15-rc1 regression)
	do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
	selftests/mount_setattr: adapt detached mount propagation test
v6.15	fs: allow clone_private_mount() for a path on real rootfs
v6.11	fs/fhandle.c: fix a race in call of has_locked_children()
v5.15	fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
v5.15	clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
v5.7	path_overmount(): avoid false negatives
v3.12	finish_automount(): don't leak MNT_LOCKED from parent to child
v2.6.15	do_change_type(): refuse to operate on unmounted/not ours mounts

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

----------------------------------------------------------------
Al Viro (7):
      fs/fhandle.c: fix a race in call of has_locked_children()
      path_overmount(): avoid false negatives
      finish_automount(): don't leak MNT_LOCKED from parent to child
      fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
      do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
      clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
      do_change_type(): refuse to operate on unmounted/not ours mounts

Christian Brauner (1):
      selftests/mount_setattr: adapt detached mount propagation test

KONDO KAZUMA(近藤 和真) (1):
      fs: allow clone_private_mount() for a path on real rootfs

 fs/namespace.c                                     | 113 +++++++++++++--------
 include/linux/mount.h                              |   3 +-
 .../selftests/mount_setattr/mount_setattr_test.c   |  17 +---
 3 files changed, 74 insertions(+), 59 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [git pull] mount fixes
  2025-06-08 17:24 Al Viro
@ 2025-06-08 18:53 ` pr-tracker-bot
  0 siblings, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2025-06-08 18:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel

The pull request you sent on Sun, 8 Jun 2025 18:24:09 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-fixes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/35b574a6c2279fe47d13ffafb8389f1adc87a1d1

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCHES][RFC][CFT] mount fixes
@ 2025-08-15 23:33 Al Viro
  2025-08-15 23:34 ` [PATCH 1/4] fix the softlockups in attach_recursive_mnt() Al Viro
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Al Viro @ 2025-08-15 23:33 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov

	Several regression fixes of varying severity; live in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git, individual
patches in followups.  Help with testing and reviewing would be very
welcome.

1) attach_recursive_mnt() slowdown - severe in some pathological cases.
This cycle regression, fortunately trivial to fix.  Underlying mechanism
of slowdown (bypassed by that fix) is due to bad semantics of skip_mnt_tree(),
but that's a separate story.
The case with severe slowdown is mounting on top of a long chain of overmounts,
all peers with each other.  Ends up being time-quardatic by the depth of chain,
easily triggers softlockup warnings.  Since the slow part is under mount_lock,
all pathname resolution pretty much stops for the duration.  Kudos for catching
that to "Lai, Yi" <yi1.lai@linux.intel.com>...

2) reparenting in propagate_umount() should *NOT* be done to mounts that are
themselves going to be taken out.  I'd even put "<argument> is not to going
away, and it overmounts the top of a stack of mounts that are going away"
right on top of reparent()...  and lost the logics in its caller that used
to check that at some point during reshuffling the queue ;-/  This cycle's
regression and that one is not just a slowdown - in some cases it might end
up as data corruptor.  Trivial to fix...

3) permission checks for --make-{shared,slave,private,unlinkable} (i.e.
MS_{SHARED,...} in mount(2) flags) ended up being too restrictive.
The checks used to be too weak until the last cycle.  Then they were
made stricter - basically to "mount is a part of our namespace and we have
admin permissions there".  Turns out that CRIU userland depended upon the
weak^Wabsent checks we used to have there before.  Sane replacement that
would suffice for them is "mount is a part of *SOME* namespace that we have
admin permissions for".  Last cycle userland regression...

4) change_mnt_propagation() slowdown in some cases.  On umount we want all
victims out of propagation graph and propagation between the surviving mounts
to be unchanged.  So if victim used to have slaves, they need to be transfered
to its peer (if any) or master.  In case when victim had many peers, all
taken out by that umount(), that ended up with all its slaves being gradually
transferred between all peers until we finally ran out of those.  It can
easily lead to quadratic time.  The patch in -rc1 switched that to "just
find where they'll end up upfront, and move them once", which eliminated
that... except that I hadn't noticed that on massage of change_mnt_propagation()
we ended up calculating the place where they'd be transferred in cases
when there had been nothing to transfer.  With obvious effects when there
had been a large peer group entirely taken out, with not a single slave between
them.  The minimal fix ("call propagation_source() only if we are going to
use its return value") is enough to recover in all cases.
Longer term we should kick all victims out of propagation graph at once
and I have that plotted out, but that's for the next merge window; for
now the minimal obvious fix is good enough.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/4] fix the softlockups in attach_recursive_mnt()
  2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
@ 2025-08-15 23:34 ` Al Viro
  2025-08-19 10:18   ` Christian Brauner
  2025-08-15 23:34 ` [PATCH 2/4] propagate_umount(): only surviving overmounts should be remounted Al Viro
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2025-08-15 23:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov

In case when we mounting something on top of a large stack of overmounts,
all of them being peers of each other, we get quadratic time by the
depth of overmount stack.  Easily fixed by doing commit_tree() before
reparenting the overmount; simplifies commit_tree() as well - it doesn't
need to skip the already mounted stuff that had been reparented on top
of the new mounts.

Since we are holding mount_lock through both reparenting and call of
commit_tree(), the order does not matter from the mount hash point
of view.

Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Fixes: 663206854f02 "copy_tree(): don't link the mounts via mnt_list"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ddfd4457d338..1c97f93d1865 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1197,10 +1197,7 @@ static void commit_tree(struct mount *mnt)
 
 	if (!mnt_ns_attached(mnt)) {
 		for (struct mount *m = mnt; m; m = next_mnt(m, mnt))
-			if (unlikely(mnt_ns_attached(m)))
-				m = skip_mnt_tree(m);
-			else
-				mnt_add_to_ns(n, m);
+			mnt_add_to_ns(n, m);
 		n->nr_mounts += n->pending_mounts;
 		n->pending_mounts = 0;
 	}
@@ -2704,6 +2701,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 			lock_mnt_tree(child);
 		q = __lookup_mnt(&child->mnt_parent->mnt,
 				 child->mnt_mountpoint);
+		commit_tree(child);
 		if (q) {
 			struct mountpoint *mp = root.mp;
 			struct mount *r = child;
@@ -2713,7 +2711,6 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 				mp = shorter;
 			mnt_change_mountpoint(r, mp, q);
 		}
-		commit_tree(child);
 	}
 	unpin_mountpoint(&root);
 	unlock_mount_hash();
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/4] propagate_umount(): only surviving overmounts should be remounted
  2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
  2025-08-15 23:34 ` [PATCH 1/4] fix the softlockups in attach_recursive_mnt() Al Viro
@ 2025-08-15 23:34 ` Al Viro
  2025-08-19 10:19   ` Christian Brauner
  2025-08-15 23:35 ` [PATCH 3/4] use uniform permission checks for all mount propagation changes Al Viro
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2025-08-15 23:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov

... as the comments in reparent() clearly say.  As it is, we reparent
*all* overmounts of the mounts being taken out, including those that
are taken out themselves.  It's not only a potentially massive slowdown
(on a pathological setup we might end up with O(N^2) time for N mounts
being kicked out), it can end up with incorrect ->overmount in the
surviving mounts.

Fixes: f0d0ba19985d "Rewrite of propagate_umount()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/pnode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 81f7599bdac4..1c789f88b3d2 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -637,10 +637,11 @@ void propagate_umount(struct list_head *set)
 	}
 
 	// now to_umount consists of all acceptable candidates
-	// deal with reparenting of remaining overmounts on those
+	// deal with reparenting of surviving overmounts on those
 	list_for_each_entry(m, &to_umount, mnt_list) {
-		if (m->overmount)
-			reparent(m->overmount);
+		struct mount *over = m->overmount;
+		if (over && !will_be_unmounted(over))
+			reparent(over);
 	}
 
 	// and fold them into the set
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/4] use uniform permission checks for all mount propagation changes
  2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
  2025-08-15 23:34 ` [PATCH 1/4] fix the softlockups in attach_recursive_mnt() Al Viro
  2025-08-15 23:34 ` [PATCH 2/4] propagate_umount(): only surviving overmounts should be remounted Al Viro
@ 2025-08-15 23:35 ` Al Viro
  2025-08-16 18:28   ` Andrei Vagin
                     ` (2 more replies)
  2025-08-15 23:36 ` [PATCH 4/4] change_mnt_propagation(): calculate propagation source only if we'll need it Al Viro
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 17+ messages in thread
From: Al Viro @ 2025-08-15 23:35 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov

do_change_type() and do_set_group() are operating on different
aspects of the same thing - propagation graph.  The latter
asks for mounts involved to be mounted in namespace(s) the caller
has CAP_SYS_ADMIN for.  The former is a mess - originally it
didn't even check that mount *is* mounted.  That got fixed,
but the resulting check turns out to be too strict for userland -
in effect, we check that mount is in our namespace, having already
checked that we have CAP_SYS_ADMIN there.

What we really need (in both cases) is
	* we only touch mounts that are mounted.  Hard requirement,
data corruption if that's get violated.
	* we don't allow to mess with a namespace unless you already
have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).

That's an equivalent of what do_set_group() does; let's extract that
into a helper (may_change_propagation()) and use it in both
do_set_group() and do_change_type().

Fixes: 12f147ddd6de "do_change_type(): refuse to operate on unmounted/not ours mounts"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1c97f93d1865..88db58061919 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2859,6 +2859,19 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
 	return attach_recursive_mnt(mnt, p, mp);
 }
 
+static int may_change_propagation(const struct mount *m)
+{
+        struct mnt_namespace *ns = m->mnt_ns;
+
+	 // it must be mounted in some namespace
+	 if (IS_ERR_OR_NULL(ns))         // is_mounted()
+		 return -EINVAL;
+	 // and the caller must be admin in userns of that namespace
+	 if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+		 return -EPERM;
+	 return 0;
+}
+
 /*
  * Sanity check the flags to change_mnt_propagation.
  */
@@ -2895,10 +2908,10 @@ static int do_change_type(struct path *path, int ms_flags)
 		return -EINVAL;
 
 	namespace_lock();
-	if (!check_mnt(mnt)) {
-		err = -EINVAL;
+	err = may_change_propagation(mnt);
+	if (err)
 		goto out_unlock;
-	}
+
 	if (type == MS_SHARED) {
 		err = invent_group_ids(mnt, recurse);
 		if (err)
@@ -3344,18 +3357,11 @@ static int do_set_group(struct path *from_path, struct path *to_path)
 
 	namespace_lock();
 
-	err = -EINVAL;
-	/* To and From must be mounted */
-	if (!is_mounted(&from->mnt))
-		goto out;
-	if (!is_mounted(&to->mnt))
-		goto out;
-
-	err = -EPERM;
-	/* We should be allowed to modify mount namespaces of both mounts */
-	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+	err = may_change_propagation(from);
+	if (err)
 		goto out;
-	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+	err = may_change_propagation(to);
+	if (err)
 		goto out;
 
 	err = -EINVAL;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] change_mnt_propagation(): calculate propagation source only if we'll need it
  2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
                   ` (2 preceding siblings ...)
  2025-08-15 23:35 ` [PATCH 3/4] use uniform permission checks for all mount propagation changes Al Viro
@ 2025-08-15 23:36 ` Al Viro
  2025-08-19 10:20   ` Christian Brauner
  2025-08-16 15:58 ` [PATCHES][RFC][CFT] mount fixes Al Viro
  2025-08-19 16:12 ` [git pull] " Al Viro
  5 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2025-08-15 23:36 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov

We only need it when mount in question was sending events downstream (then
recepients need to switch to new master) or the mount is being turned into
slave (then we need a new master for it).

That wouldn't be a big deal, except that it causes quite a bit of work
when umount_tree() is taking a large peer group out.  Adding a trivial
"don't bother calling propagation_source() unless we are going to use
its results" logics improves the things quite a bit.

We are still doing unnecessary work on bulk removals from propagation graph,
but the full solution for that will have to wait for the next merge window.

Fixes: 955336e204ab "do_make_slave(): choose new master sanely"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/pnode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 1c789f88b3d2..6f7d02f3fa98 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -111,7 +111,8 @@ void change_mnt_propagation(struct mount *mnt, int type)
 		return;
 	}
 	if (IS_MNT_SHARED(mnt)) {
-		m = propagation_source(mnt);
+		if (type == MS_SLAVE || !hlist_empty(&mnt->mnt_slave_list))
+			m = propagation_source(mnt);
 		if (list_empty(&mnt->mnt_share)) {
 			mnt_release_group_id(mnt);
 		} else {
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCHES][RFC][CFT] mount fixes
  2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
                   ` (3 preceding siblings ...)
  2025-08-15 23:36 ` [PATCH 4/4] change_mnt_propagation(): calculate propagation source only if we'll need it Al Viro
@ 2025-08-16 15:58 ` Al Viro
  2025-08-19 16:12 ` [git pull] " Al Viro
  5 siblings, 0 replies; 17+ messages in thread
From: Al Viro @ 2025-08-16 15:58 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov

On Sat, Aug 16, 2025 at 12:33:16AM +0100, Al Viro wrote:

> 4) change_mnt_propagation() slowdown in some cases.  On umount we want all
> victims out of propagation graph and propagation between the surviving mounts
> to be unchanged.  So if victim used to have slaves, they need to be transfered
> to its peer (if any) or master.  In case when victim had many peers, all
> taken out by that umount(), that ended up with all its slaves being gradually
> transferred between all peers until we finally ran out of those.  It can
> easily lead to quadratic time.  The patch in -rc1 switched that to "just
> find where they'll end up upfront, and move them once", which eliminated
> that... except that I hadn't noticed that on massage of change_mnt_propagation()
> we ended up calculating the place where they'd be transferred in cases
> when there had been nothing to transfer.  With obvious effects when there
> had been a large peer group entirely taken out, with not a single slave between
> them.  The minimal fix ("call propagation_source() only if we are going to
> use its return value") is enough to recover in all cases.
> Longer term we should kick all victims out of propagation graph at once
> and I have that plotted out, but that's for the next merge window; for
> now the minimal obvious fix is good enough.

FWIW, proposed longer term fix (on top of this series, completely
untested) would be the patch below.  Basically, calculate where the slaves
end up for all mounts to be removed, taking the mounts themselves out
of propagation graph, then do all transfers; duplicate work on finding
destinations is avoided that way, since if we run into a mount that
already had destination found, we don't need to trace the rest of the way.
That's guaranteed O(removed mounts) for finding destinations and removing
from propagation graph and O(surviving mounts that have master removed)
for transfers.

diff --git a/fs/namespace.c b/fs/namespace.c
index 88db58061919..5c68a05f9679 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1842,6 +1842,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
+	bulk_make_private(&tmp_list);
+
 	while (!list_empty(&tmp_list)) {
 		struct mnt_namespace *ns;
 		bool disconnect;
@@ -1866,7 +1868,6 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 				umount_mnt(p);
 			}
 		}
-		change_mnt_propagation(p, MS_PRIVATE);
 		if (disconnect)
 			hlist_add_head(&p->mnt_umount, &unmounted);
 
diff --git a/fs/pnode.c b/fs/pnode.c
index 6f7d02f3fa98..9fe2ddaf52db 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -70,19 +70,6 @@ static inline bool will_be_unmounted(struct mount *m)
 	return m->mnt.mnt_flags & MNT_UMOUNT;
 }
 
-static struct mount *propagation_source(struct mount *mnt)
-{
-	do {
-		struct mount *m;
-		for (m = next_peer(mnt); m != mnt; m = next_peer(m)) {
-			if (!will_be_unmounted(m))
-				return m;
-		}
-		mnt = mnt->mnt_master;
-	} while (mnt && will_be_unmounted(mnt));
-	return mnt;
-}
-
 static void transfer_propagation(struct mount *mnt, struct mount *to)
 {
 	struct hlist_node *p = NULL, *n;
@@ -111,11 +98,10 @@ void change_mnt_propagation(struct mount *mnt, int type)
 		return;
 	}
 	if (IS_MNT_SHARED(mnt)) {
-		if (type == MS_SLAVE || !hlist_empty(&mnt->mnt_slave_list))
-			m = propagation_source(mnt);
 		if (list_empty(&mnt->mnt_share)) {
 			mnt_release_group_id(mnt);
 		} else {
+			m = next_peer(mnt);
 			list_del_init(&mnt->mnt_share);
 			mnt->mnt_group_id = 0;
 		}
@@ -136,6 +122,57 @@ void change_mnt_propagation(struct mount *mnt, int type)
 	}
 }
 
+static struct mount *trace_transfers(struct mount *m)
+{
+	while (1) {
+		struct mount *next = next_peer(m);
+
+		if (next != m) {
+			list_del_init(&m->mnt_share);
+			m->mnt_group_id = 0;
+			m->mnt_master = next;
+		} else {
+			if (IS_MNT_SHARED(m))
+				mnt_release_group_id(m);
+			next = m->mnt_master;
+		}
+		hlist_del_init(&m->mnt_slave);
+		CLEAR_MNT_SHARED(m);
+		SET_MNT_MARK(m);
+
+		if (!next || !will_be_unmounted(next))
+			return next;
+		if (IS_MNT_MARKED(next))
+			return next->mnt_master;
+		m = next;
+	}
+}
+
+static void set_destinations(struct mount *m, struct mount *master)
+{
+	struct mount *next;
+
+	while ((next = m->mnt_master) != master) {
+		m->mnt_master = master;
+		m = next;
+	}
+}
+
+void bulk_make_private(struct list_head *set)
+{
+	struct mount *m;
+
+	list_for_each_entry(m, set, mnt_list)
+		if (!IS_MNT_MARKED(m))
+			set_destinations(m, trace_transfers(m));
+
+	list_for_each_entry(m, set, mnt_list) {
+		transfer_propagation(m, m->mnt_master);
+		m->mnt_master = NULL;
+		CLEAR_MNT_MARK(m);
+	}
+}
+
 static struct mount *__propagation_next(struct mount *m,
 					 struct mount *origin)
 {
diff --git a/fs/pnode.h b/fs/pnode.h
index 00ab153e3e9d..b029db225f33 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -42,6 +42,7 @@ static inline bool peers(const struct mount *m1, const struct mount *m2)
 }
 
 void change_mnt_propagation(struct mount *, int);
+void bulk_make_private(struct list_head *);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
 void propagate_umount(struct list_head *);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] use uniform permission checks for all mount propagation changes
  2025-08-15 23:35 ` [PATCH 3/4] use uniform permission checks for all mount propagation changes Al Viro
@ 2025-08-16 18:28   ` Andrei Vagin
  2025-08-19  4:44   ` Pavel Tikhomirov
  2025-08-19 10:20   ` Christian Brauner
  2 siblings, 0 replies; 17+ messages in thread
From: Andrei Vagin @ 2025-08-16 18:28 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Christian Brauner, Jan Kara,
	Lai, Yi, Tycho Andersen, Pavel Tikhomirov

On Fri, Aug 15, 2025 at 4:35 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> do_change_type() and do_set_group() are operating on different
> aspects of the same thing - propagation graph.  The latter
> asks for mounts involved to be mounted in namespace(s) the caller
> has CAP_SYS_ADMIN for.  The former is a mess - originally it
> didn't even check that mount *is* mounted.  That got fixed,
> but the resulting check turns out to be too strict for userland -
> in effect, we check that mount is in our namespace, having already
> checked that we have CAP_SYS_ADMIN there.
>
> What we really need (in both cases) is
>         * we only touch mounts that are mounted.  Hard requirement,
> data corruption if that's get violated.
>         * we don't allow to mess with a namespace unless you already
> have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
>
> That's an equivalent of what do_set_group() does; let's extract that
> into a helper (may_change_propagation()) and use it in both
> do_set_group() and do_change_type().
>

Al, thank you for the fix.

Acked-by: Andrei Vagin <avagin@gmail.com>

> Fixes: 12f147ddd6de "do_change_type(): refuse to operate on unmounted/not ours mounts"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/namespace.c | 34 ++++++++++++++++++++--------------
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 1c97f93d1865..88db58061919 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2859,6 +2859,19 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
>         return attach_recursive_mnt(mnt, p, mp);
>  }
>
> +static int may_change_propagation(const struct mount *m)
> +{
> +        struct mnt_namespace *ns = m->mnt_ns;
> +
> +        // it must be mounted in some namespace
> +        if (IS_ERR_OR_NULL(ns))         // is_mounted()
> +                return -EINVAL;
> +        // and the caller must be admin in userns of that namespace
> +        if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
> +                return -EPERM;
> +        return 0;
> +}
> +
>  /*
>   * Sanity check the flags to change_mnt_propagation.
>   */
> @@ -2895,10 +2908,10 @@ static int do_change_type(struct path *path, int ms_flags)
>                 return -EINVAL;
>
>         namespace_lock();
> -       if (!check_mnt(mnt)) {
> -               err = -EINVAL;
> +       err = may_change_propagation(mnt);
> +       if (err)
>                 goto out_unlock;
> -       }
> +
>         if (type == MS_SHARED) {
>                 err = invent_group_ids(mnt, recurse);
>                 if (err)
> @@ -3344,18 +3357,11 @@ static int do_set_group(struct path *from_path, struct path *to_path)
>
>         namespace_lock();
>
> -       err = -EINVAL;
> -       /* To and From must be mounted */
> -       if (!is_mounted(&from->mnt))
> -               goto out;
> -       if (!is_mounted(&to->mnt))
> -               goto out;
> -
> -       err = -EPERM;
> -       /* We should be allowed to modify mount namespaces of both mounts */
> -       if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +       err = may_change_propagation(from);
> +       if (err)
>                 goto out;
> -       if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +       err = may_change_propagation(to);
> +       if (err)
>                 goto out;
>
>         err = -EINVAL;
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] use uniform permission checks for all mount propagation changes
  2025-08-15 23:35 ` [PATCH 3/4] use uniform permission checks for all mount propagation changes Al Viro
  2025-08-16 18:28   ` Andrei Vagin
@ 2025-08-19  4:44   ` Pavel Tikhomirov
  2025-08-19 10:20   ` Christian Brauner
  2 siblings, 0 replies; 17+ messages in thread
From: Pavel Tikhomirov @ 2025-08-19  4:44 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Christian Brauner, Jan Kara,
	Lai, Yi, Tycho Andersen, Andrei Vagin

Looks good! Thanks!

Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Tested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
(CRIU tests PASS for the whole patchset)

On Sat, Aug 16, 2025 at 7:35 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> do_change_type() and do_set_group() are operating on different
> aspects of the same thing - propagation graph.  The latter
> asks for mounts involved to be mounted in namespace(s) the caller
> has CAP_SYS_ADMIN for.  The former is a mess - originally it
> didn't even check that mount *is* mounted.  That got fixed,
> but the resulting check turns out to be too strict for userland -
> in effect, we check that mount is in our namespace, having already
> checked that we have CAP_SYS_ADMIN there.
>
> What we really need (in both cases) is
>         * we only touch mounts that are mounted.  Hard requirement,
> data corruption if that's get violated.
>         * we don't allow to mess with a namespace unless you already
> have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
>
> That's an equivalent of what do_set_group() does; let's extract that
> into a helper (may_change_propagation()) and use it in both
> do_set_group() and do_change_type().
>
> Fixes: 12f147ddd6de "do_change_type(): refuse to operate on unmounted/not ours mounts"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/namespace.c | 34 ++++++++++++++++++++--------------
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 1c97f93d1865..88db58061919 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2859,6 +2859,19 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
>         return attach_recursive_mnt(mnt, p, mp);
>  }
>
> +static int may_change_propagation(const struct mount *m)
> +{
> +        struct mnt_namespace *ns = m->mnt_ns;
> +
> +        // it must be mounted in some namespace
> +        if (IS_ERR_OR_NULL(ns))         // is_mounted()
> +                return -EINVAL;
> +        // and the caller must be admin in userns of that namespace
> +        if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
> +                return -EPERM;
> +        return 0;
> +}
> +
>  /*
>   * Sanity check the flags to change_mnt_propagation.
>   */
> @@ -2895,10 +2908,10 @@ static int do_change_type(struct path *path, int ms_flags)
>                 return -EINVAL;
>
>         namespace_lock();
> -       if (!check_mnt(mnt)) {
> -               err = -EINVAL;
> +       err = may_change_propagation(mnt);
> +       if (err)
>                 goto out_unlock;
> -       }
> +
>         if (type == MS_SHARED) {
>                 err = invent_group_ids(mnt, recurse);
>                 if (err)
> @@ -3344,18 +3357,11 @@ static int do_set_group(struct path *from_path, struct path *to_path)
>
>         namespace_lock();
>
> -       err = -EINVAL;
> -       /* To and From must be mounted */
> -       if (!is_mounted(&from->mnt))
> -               goto out;
> -       if (!is_mounted(&to->mnt))
> -               goto out;
> -
> -       err = -EPERM;
> -       /* We should be allowed to modify mount namespaces of both mounts */
> -       if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +       err = may_change_propagation(from);
> +       if (err)
>                 goto out;
> -       if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
> +       err = may_change_propagation(to);
> +       if (err)
>                 goto out;
>
>         err = -EINVAL;
> --
> 2.47.2
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] fix the softlockups in attach_recursive_mnt()
  2025-08-15 23:34 ` [PATCH 1/4] fix the softlockups in attach_recursive_mnt() Al Viro
@ 2025-08-19 10:18   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-08-19 10:18 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jan Kara, Lai, Yi, Tycho Andersen,
	Andrei Vagin, Pavel Tikhomirov

On Sat, Aug 16, 2025 at 12:34:14AM +0100, Al Viro wrote:
> In case when we mounting something on top of a large stack of overmounts,
> all of them being peers of each other, we get quadratic time by the
> depth of overmount stack.  Easily fixed by doing commit_tree() before
> reparenting the overmount; simplifies commit_tree() as well - it doesn't
> need to skip the already mounted stuff that had been reparented on top
> of the new mounts.
> 
> Since we are holding mount_lock through both reparenting and call of
> commit_tree(), the order does not matter from the mount hash point
> of view.
> 
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Fixes: 663206854f02 "copy_tree(): don't link the mounts via mnt_list"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] propagate_umount(): only surviving overmounts should be remounted
  2025-08-15 23:34 ` [PATCH 2/4] propagate_umount(): only surviving overmounts should be remounted Al Viro
@ 2025-08-19 10:19   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-08-19 10:19 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jan Kara, Lai, Yi, Tycho Andersen,
	Andrei Vagin, Pavel Tikhomirov

On Sat, Aug 16, 2025 at 12:34:50AM +0100, Al Viro wrote:
> ... as the comments in reparent() clearly say.  As it is, we reparent
> *all* overmounts of the mounts being taken out, including those that
> are taken out themselves.  It's not only a potentially massive slowdown
> (on a pathological setup we might end up with O(N^2) time for N mounts
> being kicked out), it can end up with incorrect ->overmount in the
> surviving mounts.
> 
> Fixes: f0d0ba19985d "Rewrite of propagate_umount()"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] use uniform permission checks for all mount propagation changes
  2025-08-15 23:35 ` [PATCH 3/4] use uniform permission checks for all mount propagation changes Al Viro
  2025-08-16 18:28   ` Andrei Vagin
  2025-08-19  4:44   ` Pavel Tikhomirov
@ 2025-08-19 10:20   ` Christian Brauner
  2 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-08-19 10:20 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jan Kara, Lai, Yi, Tycho Andersen,
	Andrei Vagin, Pavel Tikhomirov

On Sat, Aug 16, 2025 at 12:35:24AM +0100, Al Viro wrote:
> do_change_type() and do_set_group() are operating on different
> aspects of the same thing - propagation graph.  The latter
> asks for mounts involved to be mounted in namespace(s) the caller
> has CAP_SYS_ADMIN for.  The former is a mess - originally it
> didn't even check that mount *is* mounted.  That got fixed,
> but the resulting check turns out to be too strict for userland -
> in effect, we check that mount is in our namespace, having already
> checked that we have CAP_SYS_ADMIN there.
> 
> What we really need (in both cases) is
> 	* we only touch mounts that are mounted.  Hard requirement,
> data corruption if that's get violated.
> 	* we don't allow to mess with a namespace unless you already
> have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
> 
> That's an equivalent of what do_set_group() does; let's extract that
> into a helper (may_change_propagation()) and use it in both
> do_set_group() and do_change_type().
> 
> Fixes: 12f147ddd6de "do_change_type(): refuse to operate on unmounted/not ours mounts"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] change_mnt_propagation(): calculate propagation source only if we'll need it
  2025-08-15 23:36 ` [PATCH 4/4] change_mnt_propagation(): calculate propagation source only if we'll need it Al Viro
@ 2025-08-19 10:20   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2025-08-19 10:20 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jan Kara, Lai, Yi, Tycho Andersen,
	Andrei Vagin, Pavel Tikhomirov

On Sat, Aug 16, 2025 at 12:36:45AM +0100, Al Viro wrote:
> We only need it when mount in question was sending events downstream (then
> recepients need to switch to new master) or the mount is being turned into
> slave (then we need a new master for it).
> 
> That wouldn't be a big deal, except that it causes quite a bit of work
> when umount_tree() is taking a large peer group out.  Adding a trivial
> "don't bother calling propagation_source() unless we are going to use
> its results" logics improves the things quite a bit.
> 
> We are still doing unnecessary work on bulk removals from propagation graph,
> but the full solution for that will have to wait for the next merge window.
> 
> Fixes: 955336e204ab "do_make_slave(): choose new master sanely"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [git pull] mount fixes
  2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
                   ` (4 preceding siblings ...)
  2025-08-16 15:58 ` [PATCHES][RFC][CFT] mount fixes Al Viro
@ 2025-08-19 16:12 ` Al Viro
  2025-08-19 17:31   ` Linus Torvalds
  2025-08-19 17:33   ` pr-tracker-bot
  5 siblings, 2 replies; 17+ messages in thread
From: Al Viro @ 2025-08-19 16:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Jan Kara, Lai, Yi, Tycho Andersen,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel

(collected *-by and slightly cleaned the text in commit message of [3/4]; otherwise
identical to what had been posted and sat in #fixes)

The following changes since commit 8742b2d8935f476449ef37e263bc4da3295c7b58:

  Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (2025-08-12 12:10:33 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-fixes

for you to fetch changes up to fb924b7b8669503582e003dd7b7340ee49029801:

  change_mnt_propagation(): calculate propagation source only if we'll need it (2025-08-19 12:05:59 -0400)

----------------------------------------------------------------
fixes for several recent mount-related regressions

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

----------------------------------------------------------------
Al Viro (4):
      fix the softlockups in attach_recursive_mnt()
      propagate_umount(): only surviving overmounts should be reparented
      use uniform permission checks for all mount propagation changes
      change_mnt_propagation(): calculate propagation source only if we'll need it

 fs/namespace.c | 41 ++++++++++++++++++++++-------------------
 fs/pnode.c     | 10 ++++++----
 2 files changed, 28 insertions(+), 23 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [git pull] mount fixes
  2025-08-19 16:12 ` [git pull] " Al Viro
@ 2025-08-19 17:31   ` Linus Torvalds
  2025-08-19 17:33   ` pr-tracker-bot
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2025-08-19 17:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Lai, Yi, Tycho Andersen,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel

On Tue, 19 Aug 2025 at 09:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> (collected *-by and slightly cleaned the text in commit message of [3/4]; otherwise
> identical to what had been posted and sat in #fixes)

Minor note relating to my workflow: I almost overlooked this pull
request, because it was a continuation of a thread with the subject
"[PATCHES][RFC][CFT]", and then shows up in my mailbox as part of that
thread.

So even though you had removed those RFC/CFT notes from the subject
line, that ended up being not entirely obvious in my MUA, and I was
about to archive the email before I started looking closer.

Now, this time I obviously caught it, but in general it might be safer
to make it more obvious that it has gone from a RFC to a "this is a
real pull".  Not being threaded is obviously one way, or just a bigger
note to make it more obvious.

And hey, this is not generally a huge problem. I could miss emails for
other random reasons like flaky spam detectors or just plain
(hopefully rare) incompetence on my part.

So obviously the general fix to "why didn't Linus pull" is to send me
a follow-up email reminding me about missed pull requests, but I
thought I'd mention this as a "avoid potential delays and confusion"
thing.

              Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [git pull] mount fixes
  2025-08-19 16:12 ` [git pull] " Al Viro
  2025-08-19 17:31   ` Linus Torvalds
@ 2025-08-19 17:33   ` pr-tracker-bot
  1 sibling, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2025-08-19 17:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, Lai, Yi,
	Tycho Andersen, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel

The pull request you sent on Tue, 19 Aug 2025 17:12:28 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-fixes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b19a97d57c15643494ac8bfaaa35e3ee472d41da

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-08-19 17:33 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-15 23:33 [PATCHES][RFC][CFT] mount fixes Al Viro
2025-08-15 23:34 ` [PATCH 1/4] fix the softlockups in attach_recursive_mnt() Al Viro
2025-08-19 10:18   ` Christian Brauner
2025-08-15 23:34 ` [PATCH 2/4] propagate_umount(): only surviving overmounts should be remounted Al Viro
2025-08-19 10:19   ` Christian Brauner
2025-08-15 23:35 ` [PATCH 3/4] use uniform permission checks for all mount propagation changes Al Viro
2025-08-16 18:28   ` Andrei Vagin
2025-08-19  4:44   ` Pavel Tikhomirov
2025-08-19 10:20   ` Christian Brauner
2025-08-15 23:36 ` [PATCH 4/4] change_mnt_propagation(): calculate propagation source only if we'll need it Al Viro
2025-08-19 10:20   ` Christian Brauner
2025-08-16 15:58 ` [PATCHES][RFC][CFT] mount fixes Al Viro
2025-08-19 16:12 ` [git pull] " Al Viro
2025-08-19 17:31   ` Linus Torvalds
2025-08-19 17:33   ` pr-tracker-bot
  -- strict thread matches above, loose matches on Subject: below --
2025-06-08 17:24 Al Viro
2025-06-08 18:53 ` pr-tracker-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).