[RFC] MNT_LOCKED vs. finish

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC] MNT_LOCKED vs. finish_automount()
@ 2025-05-01 20:15 Al Viro
  2025-05-03  3:46 ` Eric W. Biederman
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2025-05-01 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Eric W. Biederman, Linus Torvalds, Christian Brauner

	Back in 2011, when ->d_automount() had been introduced,
we went with "stepping on NFS referral, etc., has the submount
inherit the flags of parent one" (with the obvious exceptions
for internal-only flags).  Back then MNT_LOCKED didn't exist.

	Two years later, when MNT_LOCKED had been added, an explicit
"don't set MNT_LOCKED on expirable mounts when propagating across
the userns boundary; their underlying mountpoints can be exposed
whenever the original expires anyway".  Same went for root of
subtree attached by explicit mount --[r]bind - the mountpoint
had been exposed before the call, after all and for roots of
any propagation copies created by such (same reason).  Normal mount
(created by do_new_mount()) could never get MNT_LOCKED to start with.

	However, mounts created by finish_automount() bloody well
could - if the parent mount had MNT_LOCKED on it, submounts would
inherited it.  Even if they had been expirable.  Moreover, all their
propagation copies would have MNT_LOCKED stripped out.

	IMO this inconsistency is a bug; MNT_LOCKED should not
be inherited in finish_automount().

	Eric, is there something subtle I'm missing here?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] MNT_LOCKED vs. finish_automount()
  2025-05-01 20:15 [RFC] MNT_LOCKED vs. finish_automount() Al Viro
@ 2025-05-03  3:46 ` Eric W. Biederman
  2025-05-04 23:24   ` Al Viro
  0 siblings, 1 reply; 7+ messages in thread
From: Eric W. Biederman @ 2025-05-03  3:46 UTC (permalink / raw)
  To: Al Airy; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

Al Viro <viro@zeniv.linux.org.uk> writes:

> 	Back in 2011, when ->d_automount() had been introduced,
> we went with "stepping on NFS referral, etc., has the submount
> inherit the flags of parent one" (with the obvious exceptions
> for internal-only flags).  Back then MNT_LOCKED didn't exist.
>
> 	Two years later, when MNT_LOCKED had been added, an explicit
> "don't set MNT_LOCKED on expirable mounts when propagating across
> the userns boundary; their underlying mountpoints can be exposed
> whenever the original expires anyway".  Same went for root of
> subtree attached by explicit mount --[r]bind - the mountpoint
> had been exposed before the call, after all and for roots of
> any propagation copies created by such (same reason).  Normal mount
> (created by do_new_mount()) could never get MNT_LOCKED to start with.
>
> 	However, mounts created by finish_automount() bloody well
> could - if the parent mount had MNT_LOCKED on it, submounts would
> inherited it.  Even if they had been expirable.  Moreover, all their
> propagation copies would have MNT_LOCKED stripped out.
>
> 	IMO this inconsistency is a bug; MNT_LOCKED should not
> be inherited in finish_automount().
>
> 	Eric, is there something subtle I'm missing here?

I don't think you are missing anything.  This looks like a pretty clear
cut case of simply not realizing finish_automount was special in a way
that could result in MNT_LOCKED getting set.

I skimmed through the code just a minute ago and my reading of it
matches your reading of it above.

The intended semantics of MNT_LOCKED are to not let an unprivileged user
see under mounts they would never be able to see under without creating
a mount namespace.

The mount point of an automount is pretty clearly something that is safe
to see under.  Doubly so if this is a directory that will always be
empty on a pseudo filesystem (aka autofs).

Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] MNT_LOCKED vs. finish_automount()
  2025-05-03  3:46 ` Eric W. Biederman
@ 2025-05-04 23:24   ` Al Viro
  2025-05-05 13:52     ` Eric W. Biederman
  2025-05-05 13:54     ` Christian Brauner
  0 siblings, 2 replies; 7+ messages in thread
From: Al Viro @ 2025-05-04 23:24 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

On Fri, May 02, 2025 at 10:46:26PM -0500, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > 	Back in 2011, when ->d_automount() had been introduced,
> > we went with "stepping on NFS referral, etc., has the submount
> > inherit the flags of parent one" (with the obvious exceptions
> > for internal-only flags).  Back then MNT_LOCKED didn't exist.
> >
> > 	Two years later, when MNT_LOCKED had been added, an explicit
> > "don't set MNT_LOCKED on expirable mounts when propagating across
> > the userns boundary; their underlying mountpoints can be exposed
> > whenever the original expires anyway".  Same went for root of
> > subtree attached by explicit mount --[r]bind - the mountpoint
> > had been exposed before the call, after all and for roots of
> > any propagation copies created by such (same reason).  Normal mount
> > (created by do_new_mount()) could never get MNT_LOCKED to start with.
> >
> > 	However, mounts created by finish_automount() bloody well
> > could - if the parent mount had MNT_LOCKED on it, submounts would
> > inherited it.  Even if they had been expirable.  Moreover, all their
> > propagation copies would have MNT_LOCKED stripped out.
> >
> > 	IMO this inconsistency is a bug; MNT_LOCKED should not
> > be inherited in finish_automount().
> >
> > 	Eric, is there something subtle I'm missing here?
> 
> I don't think you are missing anything.  This looks like a pretty clear
> cut case of simply not realizing finish_automount was special in a way
> that could result in MNT_LOCKED getting set.
> 
> I skimmed through the code just a minute ago and my reading of it
> matches your reading of it above.
> 
> The intended semantics of MNT_LOCKED are to not let an unprivileged user
> see under mounts they would never be able to see under without creating
> a mount namespace.
> 
> The mount point of an automount is pretty clearly something that is safe
> to see under.  Doubly so if this is a directory that will always be
> empty on a pseudo filesystem (aka autofs).

Does anybody have objections to the following?

[PATCH] finish_automount(): don't leak MNT_LOCKED from parent to child

Intention for MNT_LOCKED had always been to protect the internal
mountpoints within a subtree that got copied across the userns boundary,
not the mountpoint that tree got attached to - after all, it _was_
exposed before the copying.

For roots of secondary copies that is enforced in attach_recursive_mnt() -
MNT_LOCKED is explicitly stripped for those.  For the root of primary
copy we are almost always guaranteed that MNT_LOCKED won't be there,
so attach_recursive_mnt() doesn't bother.  Unfortunately, one call
chain got overlooked - triggering e.g. NFS referral will have the
submount inherit the public flags from parent; that's fine for such
things as read-only, nosuid, etc., but not for MNT_LOCKED.

This is particularly pointless since the mount attached by finish_automount()
is usually expirable, which makes any protection granted by MNT_LOCKED
null and void; just wait for a while and that mount will go away on its own.

The minimal fix is to have do_add_mount() treat MNT_LOCKED the same
way as other internal-only flags.  Longer term it would be cleaner to
deal with that in attach_recursive_mnt(), but that takes a bit more
massage, so let's go with the one-liner fix for now.

Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 04a9bb9f31fa..352b4ccf1aaa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3761,7 +3761,7 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
 {
 	struct mount *parent = real_mount(path->mnt);
 
-	mnt_flags &= ~MNT_INTERNAL_FLAGS;
+	mnt_flags &= ~(MNT_INTERNAL_FLAGS | MNT_LOCKED);
 
 	if (unlikely(!check_mnt(parent))) {
 		/* that's acceptable only for automounts done in private ns */

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC] MNT_LOCKED vs. finish_automount()
  2025-05-04 23:24   ` Al Viro
@ 2025-05-05 13:52     ` Eric W. Biederman
  2025-05-05 19:02       ` Al Viro
  2025-05-05 13:54     ` Christian Brauner
  1 sibling, 1 reply; 7+ messages in thread
From: Eric W. Biederman @ 2025-05-05 13:52 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Fri, May 02, 2025 at 10:46:26PM -0500, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>> > 	Back in 2011, when ->d_automount() had been introduced,
>> > we went with "stepping on NFS referral, etc., has the submount
>> > inherit the flags of parent one" (with the obvious exceptions
>> > for internal-only flags).  Back then MNT_LOCKED didn't exist.
>> >
>> > 	Two years later, when MNT_LOCKED had been added, an explicit
>> > "don't set MNT_LOCKED on expirable mounts when propagating across
>> > the userns boundary; their underlying mountpoints can be exposed
>> > whenever the original expires anyway".  Same went for root of
>> > subtree attached by explicit mount --[r]bind - the mountpoint
>> > had been exposed before the call, after all and for roots of
>> > any propagation copies created by such (same reason).  Normal mount
>> > (created by do_new_mount()) could never get MNT_LOCKED to start with.
>> >
>> > 	However, mounts created by finish_automount() bloody well
>> > could - if the parent mount had MNT_LOCKED on it, submounts would
>> > inherited it.  Even if they had been expirable.  Moreover, all their
>> > propagation copies would have MNT_LOCKED stripped out.
>> >
>> > 	IMO this inconsistency is a bug; MNT_LOCKED should not
>> > be inherited in finish_automount().
>> >
>> > 	Eric, is there something subtle I'm missing here?
>> 
>> I don't think you are missing anything.  This looks like a pretty clear
>> cut case of simply not realizing finish_automount was special in a way
>> that could result in MNT_LOCKED getting set.
>> 
>> I skimmed through the code just a minute ago and my reading of it
>> matches your reading of it above.
>> 
>> The intended semantics of MNT_LOCKED are to not let an unprivileged user
>> see under mounts they would never be able to see under without creating
>> a mount namespace.
>> 
>> The mount point of an automount is pretty clearly something that is safe
>> to see under.  Doubly so if this is a directory that will always be
>> empty on a pseudo filesystem (aka autofs).
>
> Does anybody have objections to the following?
>
> [PATCH] finish_automount(): don't leak MNT_LOCKED from parent to child
>
> Intention for MNT_LOCKED had always been to protect the internal
> mountpoints within a subtree that got copied across the userns boundary,
> not the mountpoint that tree got attached to - after all, it _was_
> exposed before the copying.
>
> For roots of secondary copies that is enforced in attach_recursive_mnt() -
> MNT_LOCKED is explicitly stripped for those.  For the root of primary
> copy we are almost always guaranteed that MNT_LOCKED won't be there,
> so attach_recursive_mnt() doesn't bother.  Unfortunately, one call
> chain got overlooked - triggering e.g. NFS referral will have the
> submount inherit the public flags from parent; that's fine for such
> things as read-only, nosuid, etc., but not for MNT_LOCKED.
>
> This is particularly pointless since the mount attached by finish_automount()
> is usually expirable, which makes any protection granted by MNT_LOCKED
> null and void; just wait for a while and that mount will go away on its own.
>
> The minimal fix is to have do_add_mount() treat MNT_LOCKED the same
> way as other internal-only flags.  Longer term it would be cleaner to
> deal with that in attach_recursive_mnt(), but that takes a bit more
> massage, so let's go with the one-liner fix for now.

How would you deal with this in attach_recursive_mnt?  The problem case
appears to be an automount on top of a recursive mount.  So I don't
see the recursive mount code path being involved at all.



Given that only new mounts explicitly specified by a user and
finish_automount call do_add_mount this looks safe.  Having
MNT_LOCKED set will always be inappropriate in these two
cases.

Eric


> Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 04a9bb9f31fa..352b4ccf1aaa 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3761,7 +3761,7 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
>  {
>  	struct mount *parent = real_mount(path->mnt);
>  
> -	mnt_flags &= ~MNT_INTERNAL_FLAGS;
> +	mnt_flags &= ~(MNT_INTERNAL_FLAGS | MNT_LOCKED);
>  
>  	if (unlikely(!check_mnt(parent))) {
>  		/* that's acceptable only for automounts done in private ns */

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] MNT_LOCKED vs. finish_automount()
  2025-05-05 13:52     ` Eric W. Biederman
@ 2025-05-05 19:02       ` Al Viro
  2025-05-06  2:25         ` Al Viro
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2025-05-05 19:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

On Mon, May 05, 2025 at 08:52:16AM -0500, Eric W. Biederman wrote:

> How would you deal with this in attach_recursive_mnt?  The problem case
> appears to be an automount on top of a recursive mount.  So I don't
> see the recursive mount code path being involved at all.
> 
> 
> 
> Given that only new mounts explicitly specified by a user and
> finish_automount call do_add_mount this looks safe.  Having
> MNT_LOCKED set will always be inappropriate in these two
> cases.

do_add_mount() -> graft_tree() -> attach_recursive_mnt()

and the last one is where we actually attach them to mount tree.
And that's really where all subtrees are attached, except for the very
local surgery in pivot_root().

Said that, from digging I'd done last night there might be a different
long-term approach; the thing is, there are very few places where we
ever look at MNT_LOCKED for mounts with !mnt_has_parent():
	do_umount()
	can_umount()
	do_move_mount()
	pivot_root()
and two of those four are of dubious use - can_umount() will be followed
by rechecking in do_umount(), so there's not much point in bailing out
early in a very pathological case.  And do_move_mount() might check
MNT_LOCKED on such, but only in move-from-anon case, where we do *NOT*
set MNT_LOCKED on namespace root.

IOW, looking at the way things are now, I'm no longer sure that the trick
you've done in da362b09e42 "umount: Do not allow unmounting rootfs"
had been a good idea long-term - it certainly made for smaller fix,
but it overloaded semantics of MNT_LOCKED, making it not just about
protecting the mountpoint against exposure.

What if we add explicit checks for mnt_has_parent() in do_umount() and
pivot_root() next to existing checks for MNT_LOCKED?  Then we can
	* have clone_mnt() treat that flag as "internal, ignore" (matter
of fact, I would simply add MNT_LOCKED to MNT_INTERNAL_FLAGS and have
clone_mnt() strip that, same as we do in do_add_mount()).
	* have copy_tree() set it right next to attach_mnt(), if the source
had it.  Yes, leaving it clear for root of copied tree.
	* no need to clear it in the end of __do_loopback() (racy at the
moment, BTW - no mount_lock held there, so race with mount -o remount,ro
for the original is possible, so something is needed there)
	* have lock_mnt_tree() skip the MNT_LOCKED not just for the expirable
but for the root of subtree as well.
	* don't bother stripping MNT_LOCKED for roots of propagation copies
(or anyone, for that matter) in attach_recursive_mnt()
	* don't bother setting it on absolute root of initial namespace

Looks like we end up with overall simpler code that way; it's certainly
conceptually simpler - MNT_LOCKED is set only on the mounts that do
have parent we care to protect, with that being done atomically wrt
mount becoming reachable (lock_mnt_tree() is in the same lock_mount_hash()
scope where we call commit_tree() that makes the entire subtree reachable).

Your argument in "mnt: Move the clear of MNT_LOCKED from copy_tree to
it's callers" about wanting to keep it in collect_mounts() for audit
purposes is wrong, AFAICS - audit does not see or care about MNT_LOCKED;
the only thing we use the result of collect_mounts() for is passing
it to iterate_mounts(), which literally "apply this callback to each
vfsmount in the set", completely ignoring anything else.  What's more,
all callbacks audit is passing to it actually look only at the inode of
that mount's root...

Anyway, that's longer-term stuff; I'll put together a patch along those
lines on top of this one.  Do you have any objections to the minimal
fix as posted?  Or, for that matter, to a variant that simply adds
MNT_LOCKED to MNT_INTERNAL_FLAGS (not used anywhere outside of
do_add_mount()):

[PATCH] finish_automount(): don't leak MNT_LOCKED from parent to child

Intention for MNT_LOCKED had always been to protect the internal
mountpoints within a subtree that got copied across the userns boundary,
not the mountpoint that tree got attached to - after all, it _was_
exposed before the copying.

For roots of secondary copies that is enforced in attach_recursive_mnt() -
MNT_LOCKED is explicitly stripped for those.  For the root of primary
copy we are almost always guaranteed that MNT_LOCKED won't be there,
so attach_recursive_mnt() doesn't bother.  Unfortunately, one call
chain got overlooked - triggering e.g. NFS referral will have the
submount inherit the public flags from parent; that's fine for such
things as read-only, nosuid, etc., but not for MNT_LOCKED.

This is particularly pointless since the mount attached by finish_automount()
is usually expirable, which makes any protection granted by MNT_LOCKED
null and void; just wait for a while and that mount will go away on its own.

Include MNT_LOCKED into the set of flags to be ignored by do_add_mount() - it
really is an internal flag.

Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
---
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 5b69f8ede215..c1bb278b475c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -49,8 +49,8 @@ struct path;
 				 | MNT_READONLY | MNT_NOSYMFOLLOW)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )

-#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
-			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
+#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_DOOMED | MNT_MARKED | MNT_LOCKED \
+			    | MNT_WRITE_HOLD | MNT_SYNC_UMOUNT | MNT_INTERNAL)

 #define MNT_INTERNAL	0x4000

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC] MNT_LOCKED vs. finish_automount()
  2025-05-05 19:02       ` Al Viro
@ 2025-05-06  2:25         ` Al Viro
  0 siblings, 0 replies; 7+ messages in thread
From: Al Viro @ 2025-05-06  2:25 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, Christian Brauner

On Mon, May 05, 2025 at 08:02:15PM +0100, Al Viro wrote:

> Said that, from digging I'd done last night there might be a different
> long-term approach; the thing is, there are very few places where we
> ever look at MNT_LOCKED for mounts with !mnt_has_parent():
> 	do_umount()
> 	can_umount()
> 	do_move_mount()
> 	pivot_root()
> and two of those four are of dubious use - can_umount() will be followed
> by rechecking in do_umount(), so there's not much point in bailing out
> early in a very pathological case.  And do_move_mount() might check
> MNT_LOCKED on such, but only in move-from-anon case, where we do *NOT*
> set MNT_LOCKED on namespace root.

Even better, we have a check for !mnt_has_parent() downstream of that in
pivot_root() and we reject such there.  So having that set on absolute
root of non-anon namespaces buys us very little - do_umount() is the only
place that would need to explicitly check for that.

> IOW, looking at the way things are now, I'm no longer sure that the trick
> you've done in da362b09e42 "umount: Do not allow unmounting rootfs"
> had been a good idea long-term - it certainly made for smaller fix,
> but it overloaded semantics of MNT_LOCKED, making it not just about
> protecting the mountpoint against exposure.
> 
> What if we add explicit checks for mnt_has_parent() in do_umount() and
> pivot_root() next to existing checks for MNT_LOCKED?  Then we can
> 	* have clone_mnt() treat that flag as "internal, ignore" (matter
> of fact, I would simply add MNT_LOCKED to MNT_INTERNAL_FLAGS and have
> clone_mnt() strip that, same as we do in do_add_mount()).
> 	* have copy_tree() set it right next to attach_mnt(), if the source
> had it.  Yes, leaving it clear for root of copied tree.
> 	* no need to clear it in the end of __do_loopback() (racy at the
> moment, BTW - no mount_lock held there, so race with mount -o remount,ro
> for the original is possible, so something is needed there)
> 	* have lock_mnt_tree() skip the MNT_LOCKED not just for the expirable
> but for the root of subtree as well.
> 	* don't bother stripping MNT_LOCKED for roots of propagation copies
> (or anyone, for that matter) in attach_recursive_mnt()
> 	* don't bother setting it on absolute root of initial namespace
> 
> Looks like we end up with overall simpler code that way; it's certainly
> conceptually simpler - MNT_LOCKED is set only on the mounts that do
> have parent we care to protect, with that being done atomically wrt
> mount becoming reachable (lock_mnt_tree() is in the same lock_mount_hash()
> scope where we call commit_tree() that makes the entire subtree reachable).
> 
> Your argument in "mnt: Move the clear of MNT_LOCKED from copy_tree to
> it's callers" about wanting to keep it in collect_mounts() for audit
> purposes is wrong, AFAICS - audit does not see or care about MNT_LOCKED;
> the only thing we use the result of collect_mounts() for is passing
> it to iterate_mounts(), which literally "apply this callback to each
> vfsmount in the set", completely ignoring anything else.  What's more,
> all callbacks audit is passing to it actually look only at the inode of
> that mount's root...
> 
> Anyway, that's longer-term stuff; I'll put together a patch along those
> lines on top of this one.

Here it is, on top of the previous.  Objections, comments?

[PATCH] don't set MNT_LOCKED on parentless mounts

Originally MNT_LOCKED meant only one thing - "don't let this mount to
be peeled off its parent, we don't want to have mountpoint exposed".
Accordingly, it had only been set on mounts that *do* have a parent.
Later it had been overloaded with another use - setting it on the
absolute root had given free protection against umount(2) on absolute
root (possible to trigger, oopsed).  Not a bad trick, but it ended
up costing more than it bought us.  Unfortunately, the cost included
both hard-to-reason-about logics and a subtle race between
mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in
the end of __do_loopback() could race with sb_prepare_remount_readonly()
setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should
be).

Turns out that nobody except umount(2) had ever made use of having
MNT_LOCKED set on absolute root.  So let's give up on that trick,
clever as it had been, add an explicit check in do_umount() and
return to using MNT_LOCKED only for mounts that have a parent.

It means that
	* clone_mnt() no longer copies MNT_LOCKED
	* copy_tree() sets it on submounts if their counterparts had
been marked such, and do that right next to attach_mnt() in there,
in the same mount_lock scope.
	* __do_loopback() no longer needs to strip MNT_LOCKED off the
root of subtree it's about to return; no store, no race.
	* init_mount_tree() doesn't bother setting MNT_LOCKED on absolute
root.
	* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root;
accordingly, its caller (loop in attach_recursive_mnt()) does not need to
bother stripping that MNT_LOCKED.  Note that lock_mnt_tree() setting
MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt()
(from commit_tree()) that makes them reachable.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index 04a9bb9f31fa..165b3bd26857 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1363,7 +1363,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	}
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
-	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL|MNT_LOCKED);
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt));
@@ -2038,6 +2038,9 @@ static int do_umount(struct mount *mnt, int flags)
 	if (mnt->mnt.mnt_flags & MNT_LOCKED)
 		goto out;
 
+	if (!mnt_has_parent(mnt))
+		goto out;
+
 	event++;
 	if (flags & MNT_DETACH) {
 		if (mnt_ns_attached(mnt) || !list_empty(&mnt->mnt_list))
@@ -2308,6 +2311,8 @@ struct mount *copy_tree(struct mount *src_root, struct dentry *dentry,
 			if (IS_ERR(dst_mnt))
 				goto out;
 			lock_mount_hash();
+			if (src_mnt->mnt.mnt_flags & MNT_LOCKED)
+				dst_mnt->mnt.mnt_flags |= MNT_LOCKED;
 			list_add_tail(&dst_mnt->mnt_list, &res->mnt_list);
 			attach_mnt(dst_mnt, dst_parent, src_parent->mnt_mp, false);
 			unlock_mount_hash();
@@ -2547,7 +2552,7 @@ static void lock_mnt_tree(struct mount *mnt)
 		if (flags & MNT_NOEXEC)
 			flags |= MNT_LOCK_NOEXEC;
 		/* Don't allow unprivileged users to reveal what is under a mount */
-		if (list_empty(&p->mnt_expire))
+		if (list_empty(&p->mnt_expire) && p != mnt)
 			flags |= MNT_LOCKED;
 		p->mnt.mnt_flags = flags;
 	}
@@ -2758,7 +2763,6 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		/* Notice when we are propagating across user namespaces */
 		if (child->mnt_parent->mnt_ns->user_ns != user_ns)
 			lock_mnt_tree(child);
-		child->mnt.mnt_flags &= ~MNT_LOCKED;
 		commit_tree(child);
 	}
 	put_mountpoint(smp);
@@ -3027,26 +3031,21 @@ static inline bool may_copy_tree(struct path *path)
 
 static struct mount *__do_loopback(struct path *old_path, int recurse)
 {
-	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+	struct mount *old = real_mount(old_path->mnt);
 
 	if (IS_MNT_UNBINDABLE(old))
-		return mnt;
+		return ERR_PTR(-EINVAL);
 
 	if (!may_copy_tree(old_path))
-		return mnt;
+		return ERR_PTR(-EINVAL);
 
 	if (!recurse && has_locked_children(old, old_path->dentry))
-		return mnt;
+		return ERR_PTR(-EINVAL);
 
 	if (recurse)
-		mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+		return copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
 	else
-		mnt = clone_mnt(old, old_path->dentry, 0);
-
-	if (!IS_ERR(mnt))
-		mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
-	return mnt;
+		return clone_mnt(old, old_path->dentry, 0);
 }
 
 /*
@@ -4789,11 +4788,11 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	if (!path_mounted(&root))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(root_mnt))
-		goto out4; /* not attached */
+		goto out4; /* absolute root */
 	if (!path_mounted(&new))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(new_mnt))
-		goto out4; /* not attached */
+		goto out4; /* absolute root */
 	/* make sure we can reach put_old from new_root */
 	if (!is_path_reachable(old_mnt, old.dentry, &new))
 		goto out4;
@@ -6191,7 +6190,6 @@ static void __init init_mount_tree(void)
 
 	root.mnt = mnt;
 	root.dentry = mnt->mnt_root;
-	mnt->mnt_flags |= MNT_LOCKED;
 
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC] MNT_LOCKED vs. finish_automount()
  2025-05-04 23:24   ` Al Viro
  2025-05-05 13:52     ` Eric W. Biederman
@ 2025-05-05 13:54     ` Christian Brauner
  1 sibling, 0 replies; 7+ messages in thread
From: Christian Brauner @ 2025-05-05 13:54 UTC (permalink / raw)
  To: Al Viro; +Cc: Eric W. Biederman, linux-fsdevel, Linus Torvalds

On Mon, May 05, 2025 at 12:24:41AM +0100, Al Viro wrote:
> On Fri, May 02, 2025 at 10:46:26PM -0500, Eric W. Biederman wrote:
> > Al Viro <viro@zeniv.linux.org.uk> writes:
> > 
> > > 	Back in 2011, when ->d_automount() had been introduced,
> > > we went with "stepping on NFS referral, etc., has the submount
> > > inherit the flags of parent one" (with the obvious exceptions
> > > for internal-only flags).  Back then MNT_LOCKED didn't exist.
> > >
> > > 	Two years later, when MNT_LOCKED had been added, an explicit
> > > "don't set MNT_LOCKED on expirable mounts when propagating across
> > > the userns boundary; their underlying mountpoints can be exposed
> > > whenever the original expires anyway".  Same went for root of
> > > subtree attached by explicit mount --[r]bind - the mountpoint
> > > had been exposed before the call, after all and for roots of
> > > any propagation copies created by such (same reason).  Normal mount
> > > (created by do_new_mount()) could never get MNT_LOCKED to start with.
> > >
> > > 	However, mounts created by finish_automount() bloody well
> > > could - if the parent mount had MNT_LOCKED on it, submounts would
> > > inherited it.  Even if they had been expirable.  Moreover, all their
> > > propagation copies would have MNT_LOCKED stripped out.
> > >
> > > 	IMO this inconsistency is a bug; MNT_LOCKED should not
> > > be inherited in finish_automount().
> > >
> > > 	Eric, is there something subtle I'm missing here?
> > 
> > I don't think you are missing anything.  This looks like a pretty clear
> > cut case of simply not realizing finish_automount was special in a way
> > that could result in MNT_LOCKED getting set.
> > 
> > I skimmed through the code just a minute ago and my reading of it
> > matches your reading of it above.
> > 
> > The intended semantics of MNT_LOCKED are to not let an unprivileged user
> > see under mounts they would never be able to see under without creating
> > a mount namespace.
> > 
> > The mount point of an automount is pretty clearly something that is safe
> > to see under.  Doubly so if this is a directory that will always be
> > empty on a pseudo filesystem (aka autofs).
> 
> Does anybody have objections to the following?
> 
> [PATCH] finish_automount(): don't leak MNT_LOCKED from parent to child
> 
> Intention for MNT_LOCKED had always been to protect the internal
> mountpoints within a subtree that got copied across the userns boundary,
> not the mountpoint that tree got attached to - after all, it _was_
> exposed before the copying.
> 
> For roots of secondary copies that is enforced in attach_recursive_mnt() -
> MNT_LOCKED is explicitly stripped for those.  For the root of primary
> copy we are almost always guaranteed that MNT_LOCKED won't be there,
> so attach_recursive_mnt() doesn't bother.  Unfortunately, one call
> chain got overlooked - triggering e.g. NFS referral will have the

/me rolls eyes. Why is it always NFS.

> submount inherit the public flags from parent; that's fine for such
> things as read-only, nosuid, etc., but not for MNT_LOCKED.
> 
> This is particularly pointless since the mount attached by finish_automount()
> is usually expirable, which makes any protection granted by MNT_LOCKED
> null and void; just wait for a while and that mount will go away on its own.
> 
> The minimal fix is to have do_add_mount() treat MNT_LOCKED the same
> way as other internal-only flags.  Longer term it would be cleaner to
> deal with that in attach_recursive_mnt(), but that takes a bit more
> massage, so let's go with the one-liner fix for now.
> 
> Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

> diff --git a/fs/namespace.c b/fs/namespace.c
> index 04a9bb9f31fa..352b4ccf1aaa 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3761,7 +3761,7 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
>  {
>  	struct mount *parent = real_mount(path->mnt);
>  
> -	mnt_flags &= ~MNT_INTERNAL_FLAGS;
> +	mnt_flags &= ~(MNT_INTERNAL_FLAGS | MNT_LOCKED);
>  
>  	if (unlikely(!check_mnt(parent))) {
>  		/* that's acceptable only for automounts done in private ns */

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-05-06  2:25 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-01 20:15 [RFC] MNT_LOCKED vs. finish_automount() Al Viro
2025-05-03  3:46 ` Eric W. Biederman
2025-05-04 23:24   ` Al Viro
2025-05-05 13:52     ` Eric W. Biederman
2025-05-05 19:02       ` Al Viro
2025-05-06  2:25         ` Al Viro
2025-05-05 13:54     ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.