[RFC] umount/__detach

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC] umount/__detach_mounts() race
@ 2022-02-21  3:02 Al Viro
  2022-02-21  5:04 ` Al Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Al Viro @ 2022-02-21  3:02 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Eric Biederman, Linus Torvalds

	umount_tree() is very definitely not supposed to be called
on MNT_UMOUNT subtrees (== stuck-together fragments that got
unmounted, but not split into individual mount nodes).  Refcounting
rules are different there and umount_tree() assumes that we start with
the normal ones.

	do_umount() appears to be checking for that:

	if (flags & MNT_DETACH) {
		if (!list_empty(&mnt->mnt_list))
			umount_tree(mnt, UMOUNT_PROPAGATE);
		retval = 0;
	} else {
		shrink_submounts(mnt);
		retval = -EBUSY;
		if (!propagate_mount_busy(mnt, 2)) {
			if (!list_empty(&mnt->mnt_list))
				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
			retval = 0;
		}
	}

which would prevent umount_tree() on those - mnt_list eviction happens
for the same nodes that get MNT_UMOUNT.  However, shrink_submounts()
will call umount_tree() for e.g. nfs automounts it finds on victim
mount, and if ours happens to be already unmounted, with automounts
stuck to it, we have trouble.

It looks like something that should be impossible to hit, but...

A: umount(2) looks the sucker up
B: rmdir(2) in another namespace (where nothing is mounted on that mountpoint)
does __detach_mounts(), which grabs namespace_sem, sees the mount A is about
to try and kill and calls umount_tree(mnt, UMOUNT_CONNECTED).  Which detaches
our mount (and its children, automounts included) from the namespace it's in,
modifies their refcounts accordingly and keeps the entire thing in one
piece.
A: in do_umount() blocks on namespace_sem
B: drops namespace_sem
A: gets to the quoted code.  mnt is already MNT_UMOUNT (and has empty
->mnt_list), but it does have (equally MNT_UMOUNT) automounts under it,
etc.  So shrink_submounts() finds something to umount and calls umount_tree().
Buggered refcounts happen.

Does anybody see a problem with the following patch?

diff --git a/fs/namespace.c b/fs/namespace.c
index 42d4fc21263b2..1604b9d7a69d9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1654,21 +1654,20 @@ static int do_umount(struct mount *mnt, int flags)
 	lock_mount_hash();
 
 	/* Recheck MNT_LOCKED with the locks held */
+	/* ... and don't go there if we'd raced and it's already unmounted */
 	retval = -EINVAL;
-	if (mnt->mnt.mnt_flags & MNT_LOCKED)
+	if (mnt->mnt.mnt_flags & (MNT_LOCKED|MNT_UMOUNT))
 		goto out;
 
 	event++;
 	if (flags & MNT_DETACH) {
-		if (!list_empty(&mnt->mnt_list))
-			umount_tree(mnt, UMOUNT_PROPAGATE);
+		umount_tree(mnt, UMOUNT_PROPAGATE);
 		retval = 0;
 	} else {
 		shrink_submounts(mnt);
 		retval = -EBUSY;
 		if (!propagate_mount_busy(mnt, 2)) {
-			if (!list_empty(&mnt->mnt_list))
-				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
+			umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
 			retval = 0;
 		}
 	}

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC] umount/__detach_mounts() race
  2022-02-21  3:02 [RFC] umount/__detach_mounts() race Al Viro
@ 2022-02-21  5:04 ` Al Viro
  2022-02-21  5:19   ` Al Viro
  2022-02-21 16:46   ` Eric W. Biederman
  2022-02-21 17:04 ` Eric W. Biederman
  2022-02-21 19:53 ` Al Viro
  2 siblings, 2 replies; 7+ messages in thread
From: Al Viro @ 2022-02-21  5:04 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Eric Biederman, Linus Torvalds

	BTW, while looking through the vicinity - I think this 
        if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
			return mnt;
in __do_loopback() is too permissive.  I'd prefer to replace it with
        if (!check_mnt(old)) {
		// allow binding objects from internal instance of nsfs
		if (old->mnt_ns != MNT_NS_INTERAL ||
		    old_path->dentry->d_op != &ns_dentry_operations)
			return mnt;
	}

Suppose we'd bound an nsfs object e.g. on /tmp/foo.  Consider a race
between mount --bind /tmp/foo /tmp/bar and umount -l /tmp/foo.

do_loopback() resolves /tmp/foo.  We have dentry from nsfs and mount
that sits on /tmp/foo.  We'd resolved /tmp/bar.

In the meanwhile, umount has resolved /tmp/foo and unmounted it.
struct mount is still alive due to the reference held by do_loopback().

do_loopback() finally grabs namespace_sem.  It verifies that mountpoint
to be (/tmp/bar) is still OK (it is) and calls __do_loopback().  The
check in __do_loopback() passes - old is unmounted, but dentry is
nsfs one, so we proceed to clone old.

And that's where the things go wrong - we copy the flags, including
MNT_UMOUNT, to the new instance.  And proceed to attach it on /tmp/bar.

It's really not a good thing.  E.g. __detach_mnt() will assume that
reference to the parent mount of /tmp/bar is *not* held.  As the
matter of fact, it is, so we'll get a leak if the mountpoint (/tmp/bar,
that is) gets unlinked in another namespace.  That's not the only way
to get trouble - we are really not supposed to have MNT_UMOUNT mounts
attached to !MNT_UMOUNT ones.

Eric, do you see any problems with the change above?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] umount/__detach_mounts() race
  2022-02-21  5:04 ` Al Viro
@ 2022-02-21  5:19   ` Al Viro
  2022-02-21 16:46   ` Eric W. Biederman
  1 sibling, 0 replies; 7+ messages in thread
From: Al Viro @ 2022-02-21  5:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Eric Biederman, Linus Torvalds

On Mon, Feb 21, 2022 at 05:04:21AM +0000, Al Viro wrote:

> It's really not a good thing.  E.g. __detach_mnt() will assume that
				      __detach_mounts(), that is

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] umount/__detach_mounts() race
  2022-02-21  5:04 ` Al Viro
  2022-02-21  5:19   ` Al Viro
@ 2022-02-21 16:46   ` Eric W. Biederman
  2022-02-21 18:31     ` Al Viro
  1 sibling, 1 reply; 7+ messages in thread
From: Eric W. Biederman @ 2022-02-21 16:46 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

Al Viro <viro@zeniv.linux.org.uk> writes:

> 	BTW, while looking through the vicinity - I think this 
>         if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
> 			return mnt;
> in __do_loopback() is too permissive.  I'd prefer to replace it with
>         if (!check_mnt(old)) {
> 		// allow binding objects from internal instance of nsfs
> 		if (old->mnt_ns != MNT_NS_INTERAL ||
> 		    old_path->dentry->d_op != &ns_dentry_operations)
> 			return mnt;
> 	}
>
> Suppose we'd bound an nsfs object e.g. on /tmp/foo.  Consider a race
> between mount --bind /tmp/foo /tmp/bar and umount -l /tmp/foo.
>
> do_loopback() resolves /tmp/foo.  We have dentry from nsfs and mount
> that sits on /tmp/foo.  We'd resolved /tmp/bar.
>
> In the meanwhile, umount has resolved /tmp/foo and unmounted it.
> struct mount is still alive due to the reference held by do_loopback().
>
> do_loopback() finally grabs namespace_sem.  It verifies that mountpoint
> to be (/tmp/bar) is still OK (it is) and calls __do_loopback().  The
> check in __do_loopback() passes - old is unmounted, but dentry is
> nsfs one, so we proceed to clone old.
>
> And that's where the things go wrong - we copy the flags, including
> MNT_UMOUNT, to the new instance.  And proceed to attach it on /tmp/bar.
>
> It's really not a good thing.  E.g. __detach_mnt() will assume that
> reference to the parent mount of /tmp/bar is *not* held.  As the
> matter of fact, it is, so we'll get a leak if the mountpoint (/tmp/bar,
> that is) gets unlinked in another namespace.  That's not the only way
> to get trouble - we are really not supposed to have MNT_UMOUNT mounts
> attached to !MNT_UMOUNT ones.
>
> Eric, do you see any problems with the change above?

Such as breaking userspace code? Maybe.

Currently we exempt nsfs dentries from the same namespace restriction
when cloning them.

If I read your proposal correctly you are proposing only exempting nsfs
dentries that are internally mounted from the same namespace
restriction.

We need to keep the ordinary case of bind mounts from one nsfs dentry to
another dentry working even after it is mounted.

If my foggy brain is reasoning correctly you are proposing not allowing
bind mounts before the mount has been attached or after the mount has
been detached by umount_tree.  Not allowing already umounted mounts to
be bind mounted seems semantically fine.  Userspace should not care.

I wonder if perhaps we should have an explicit test for MNT_UMOUNT in
__do_loopback.

Eric


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] umount/__detach_mounts() race
  2022-02-21  3:02 [RFC] umount/__detach_mounts() race Al Viro
  2022-02-21  5:04 ` Al Viro
@ 2022-02-21 17:04 ` Eric W. Biederman
  2022-02-21 19:53 ` Al Viro
  2 siblings, 0 replies; 7+ messages in thread
From: Eric W. Biederman @ 2022-02-21 17:04 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

Al Viro <viro@zeniv.linux.org.uk> writes:

> 	umount_tree() is very definitely not supposed to be called
> on MNT_UMOUNT subtrees (== stuck-together fragments that got
> unmounted, but not split into individual mount nodes).  Refcounting
> rules are different there and umount_tree() assumes that we start with
> the normal ones.
>
> 	do_umount() appears to be checking for that:
>
> 	if (flags & MNT_DETACH) {
> 		if (!list_empty(&mnt->mnt_list))
> 			umount_tree(mnt, UMOUNT_PROPAGATE);
> 		retval = 0;
> 	} else {
> 		shrink_submounts(mnt);
> 		retval = -EBUSY;
> 		if (!propagate_mount_busy(mnt, 2)) {
> 			if (!list_empty(&mnt->mnt_list))
> 				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> 			retval = 0;
> 		}
> 	}
>
> which would prevent umount_tree() on those - mnt_list eviction happens
> for the same nodes that get MNT_UMOUNT.  However, shrink_submounts()
> will call umount_tree() for e.g. nfs automounts it finds on victim
> mount, and if ours happens to be already unmounted, with automounts
> stuck to it, we have trouble.
>
> It looks like something that should be impossible to hit, but...
>
> A: umount(2) looks the sucker up
> B: rmdir(2) in another namespace (where nothing is mounted on that mountpoint)
> does __detach_mounts(), which grabs namespace_sem, sees the mount A is about
> to try and kill and calls umount_tree(mnt, UMOUNT_CONNECTED).  Which detaches
> our mount (and its children, automounts included) from the namespace it's in,
> modifies their refcounts accordingly and keeps the entire thing in one
> piece.
> A: in do_umount() blocks on namespace_sem
> B: drops namespace_sem
> A: gets to the quoted code.  mnt is already MNT_UMOUNT (and has empty
> ->mnt_list), but it does have (equally MNT_UMOUNT) automounts under it,
> etc.  So shrink_submounts() finds something to umount and calls umount_tree().
> Buggered refcounts happen.
>
> Does anybody see a problem with the following patch?

I think it looks like 2 patches.
One patch to remove some list_empty checks.
> -		if (!list_empty(&mnt->mnt_list))
> -			umount_tree(mnt, UMOUNT_PROPAGATE);
> +		umount_tree(mnt, UMOUNT_PROPAGATE);

Another patch to add a MNT_UMOUNT condition.


The MNT_UMOUNT condition should probably be checked optimistically along
with MNT_FORCE in can_umount as well to be explicit.  I think it is
redundant with check_mnt but it would add clarity to what we are looking
for, and keep people thinking about the MNT_UMOUNT races.

Testing MNT_UMOUNT after the locks have been grabbed seem very
reasonable.  Alternately the code could call check_mnt again and lean on
that test.  But testing MNT_UMOUNT seems sufficient.


> diff --git a/fs/namespace.c b/fs/namespace.c
> index 42d4fc21263b2..1604b9d7a69d9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1654,21 +1654,20 @@ static int do_umount(struct mount *mnt, int flags)
>  	lock_mount_hash();
>  
>  	/* Recheck MNT_LOCKED with the locks held */
> +	/* ... and don't go there if we'd raced and it's already unmounted */
>  	retval = -EINVAL;
> -	if (mnt->mnt.mnt_flags & MNT_LOCKED)
> +	if (mnt->mnt.mnt_flags & (MNT_LOCKED|MNT_UMOUNT))
>  		goto out;
>  
>  	event++;
>  	if (flags & MNT_DETACH) {
> -		if (!list_empty(&mnt->mnt_list))
> -			umount_tree(mnt, UMOUNT_PROPAGATE);
> +		umount_tree(mnt, UMOUNT_PROPAGATE);
>  		retval = 0;
>  	} else {
>  		shrink_submounts(mnt);
>  		retval = -EBUSY;
>  		if (!propagate_mount_busy(mnt, 2)) {
> -			if (!list_empty(&mnt->mnt_list))
> -				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> +			umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
>  			retval = 0;
>  		}
>  	}

Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] umount/__detach_mounts() race
  2022-02-21 16:46   ` Eric W. Biederman
@ 2022-02-21 18:31     ` Al Viro
  0 siblings, 0 replies; 7+ messages in thread
From: Al Viro @ 2022-02-21 18:31 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds

On Mon, Feb 21, 2022 at 10:46:26AM -0600, Eric W. Biederman wrote:

> Such as breaking userspace code? Maybe.
> 
> Currently we exempt nsfs dentries from the same namespace restriction
> when cloning them.
> 
> If I read your proposal correctly you are proposing only exempting nsfs
> dentries that are internally mounted from the same namespace
> restriction.
> 
> We need to keep the ordinary case of bind mounts from one nsfs dentry to
> another dentry working even after it is mounted.

Sure - all of that is only checked if old_path.mnt is not already in our
namespace.  If you bind it in one place and then bind that to another,
the usual logics will trigger.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] umount/__detach_mounts() race
  2022-02-21  3:02 [RFC] umount/__detach_mounts() race Al Viro
  2022-02-21  5:04 ` Al Viro
  2022-02-21 17:04 ` Eric W. Biederman
@ 2022-02-21 19:53 ` Al Viro
  2 siblings, 0 replies; 7+ messages in thread
From: Al Viro @ 2022-02-21 19:53 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Eric Biederman, Linus Torvalds

	BTW, for the folks not familiar with the area -
refcounting rules for mounts are

	(1) being a child of some mount contributes 1.  That
applies to all mounts.
	(2) being a part of a tree of some namespace contributes
number of children; ditto for a tree that had never been in any
namespace.
	(3) being a part of decaying fragment (something detached
by umount_tree()) does not contribute anything.  (1) still applies,
but (2) does not.
	(4) being a root of some namespace contributes 1.
	(5) being a part of a list passing through ->mnt_umount
contributes 1.

Everything else is due to explicit references stored in some objects
other than mnt_namespace and mount instances, including local variables,
etc.

Decaying fragments are distinguishable by having MNT_UMOUNT on all
mounts in them (namespace_lock() stabilizes that, just as it does to
everything else tree-related).  There should never be a mix of
MNT_UMOUNT and non-MNT_UMOUNT mounts in any tree.

Everything in decaying fragment also has
	NULL ->mnt_ns
	empty ->mnt_list
	NULL ->mnt_master
	empty ->mnt_shared/->mnt_slave/->mnt_slave_list
	empty ->mnt_expire

Decaying fragments can fall apart into smaller subtrees.
If nothing else, that happens when there's no more external
references to the root of fragment - its children are torn away,
moved into a list that passes through their ->mnt_umount
(see mntput_no_expire(), circa line 1210) and later dropped
(cleanup_mnt(), circa line 1138).  It also happens when
mountpoint of something in a fragment gets invalidated
(see __detach_mounts(), MNT_UMOUNT case).

However, that's the only thing that can happen to those fragments -
they can't get anything mounted/unmounted/moved.  Stepping on
something like NFS referral point fails, instead of automounting
anything, etc.

	umount_tree() should never be called on those - it's about
tearing some part of live trees (whether they are in a namespace
or not) into decaying fragments and collecting the roots of those
fragments into a list ('unmounted', passing through ->mnt_umount)
for the subsequent namespace_unlock() to drop.

Another thing: namespace root may go into decaying fragment (which can
happen only if it's explicitly passed as argument to umount_tree())
only when the namespace is about to be freed.  I think we should be
OK there, but I'm not happy with the proof - it's too convoluted and
smells like it might be brittle.  Might be worth making that more
straightforward...

Another fun area is handling of internal mounts (which is what got
me started on that round of code review).  Both in terms of
RCU delays between ->mnt_ns going NULL and in terms of sync
behaviour needed in final mntput() for those.  Final mntput()
also involves rather delicate interactions with legitimizing
references acquired in RCU mode - that's what __legitimize_mnt()
and MNT_DOOMED is about and documentation is... lacking, to
put it mildly.  It's scattered over a bunch of commit messages,
and incomplete even there.

I'll post when I get the documentation of that stuff into more or
less usable shape.  IMO it should go into Documentation/filesystems/*,
but it'll obviously need a review first...

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-02-21 19:53 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-21  3:02 [RFC] umount/__detach_mounts() race Al Viro
2022-02-21  5:04 ` Al Viro
2022-02-21  5:19   ` Al Viro
2022-02-21 16:46   ` Eric W. Biederman
2022-02-21 18:31     ` Al Viro
2022-02-21 17:04 ` Eric W. Biederman
2022-02-21 19:53 ` Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.