* [BUG][RFC] broken use of __lookup_mnt() in path_overmounted()
From: Al Viro @ 2025-05-31 20:57 UTC
To: Christian Brauner; +Cc: linux-fsdevel
More audit fallout.
Rules for using mount hash (will go into the docs I'm putting together):
* all insertions are under EXCL(namespace_sem) and write_seqlock(mount_lock)
* all removals are under write_seqlock(mount_lock)
* all removals are either under EXCL(namespace_sem) or have parent's
refcount equal to zero.
* since all changes to hash chains are under write_seqlock(mount_lock),
hash search is safe if you have at least read_seqlock_excl(mount_lock). In
that case the result is guaranteed to be accurate.
* removals are done with hlist_del_init_rcu(); freeing of the removed
object is always RCU-delayed, so holding rcu_read_lock() over traversal is
memory safe. HOWEVER, it is entirely possible to step into an object being
removed from hash chain at the time of search and get a false negative.
If you are not holding at least read_seqlock_excl(mount_lock), you *must*
sample mount_lock seqcount component before the search and check it afterwards.
* acquiring a reference to object found as the result of traversal needs
to be done carefully - it is safe under mount_lock, but when used under rcu_read_lock()
alone we do need __legitimize_mnt(); otherwise we are risking a race with
final mntput() and/or busy mount checks in sync umount (see the sketch below).
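For illustration, the lockless variant looks roughly like this - essentially
the shape of lookup_mnt(); a sketch of the pattern, not a verbatim quote:

	struct vfsmount *m;
	struct mount *child;
	unsigned seq;

	rcu_read_lock();
	do {
		seq = read_seqbegin(&mount_lock);	/* sample before the search */
		child = __lookup_mnt(path->mnt, path->dentry);
		m = child ? &child->mnt : NULL;
		/* legitimize_mnt() tries to grab a reference and rechecks the
		 * seqcount; on mismatch it backs off and we go around again */
	} while (!legitimize_mnt(m, seq));
	rcu_read_unlock();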
Search is done by __lookup_mnt().
Callers under write_seqlock(mount_lock):
* all callers in fs/pnode.c and one in attach_recursive_mnt().
Callers under rcu_read_lock():
* lookup_mnt() - seqcount component of mount_lock used to deal
with races. Check is done by legitimize_mnt().
* path_overmounted() - the callers are under EXCL(namespace_sem)
and they are holding a reference to parent, so removal of the object we
are searching for is excluded. Reference to child is not acquired, so
the questions about validity of that do not arise. Unfortunately, removals
of objects in the same hash chain are *not* excluded, so a false negative
is possible there.
Bug in path_overmounted() appears to have come from the corresponding
chunk of finish_automount(), which had the same race (and got turned
into a call of path_overmounted()).
One possibility is to wrap the use of __lookup_mnt() into a sample-and-recheck
loop there; for the call of path_overmounted() in finish_automount() it'll
give the right behaviour.
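For what it's worth, a sketch of what that could look like (an illustration
of the idea only; a hit can be trusted since false positives are not possible
here, so only a miss with a changed seqcount needs to be redone under the lock):

	static bool path_overmounted(const struct path *path)
	{
		unsigned seq = read_seqbegin(&mount_lock);
		bool no_child;

		rcu_read_lock();
		no_child = !__lookup_mnt(path->mnt, path->dentry);
		rcu_read_unlock();
		if (unlikely(no_child && read_seqretry(&mount_lock, seq))) {
			/* possible false negative - redo under mount_lock */
			read_seqlock_excl(&mount_lock);
			no_child = !__lookup_mnt(path->mnt, path->dentry);
			read_sequnlock_excl(&mount_lock);
		}
		return !no_child;
	}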
The other one, though... The thing is, it's not really specific to
"move beneath" case - the secondary copies always have to deal with the
possibility of "slip under existing mount" situation. And yes, it can
lead to problems - for all attach_recursive_mnt() callers.
Note that do_loopback() can race with mount(2) on top of the old_path
- it might've happened between the pathname resolution for source and
lock_mount() for mountpoint. We *can't* hold namespace_sem over the
pathname resolution - it'd be very easy to deadlock.
One possibility would be to have all of them traverse the chain of
overmounts on top of the thing we are going to copy/move, basically
pretending that we'd lost the race to whatever had done the overmount.
The problem with that is there might have been no race at all - it
might've come from "." or a descriptor + empty path. In this case
we currently copy/move the entire thing and it just might be used
deliberately.
Another possibility is to teach attach_recursive_mnt() to cope with the
possibility of overmounts in source - it's not that hard, all we need is
to find the top of overmount chain there in the very beginning and in all
"slip under existing mount" cases have the existing mount reparented to
the top of that chain.
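The chain walk itself is cheap - something along the lines of the hypothetical
helper below, assuming we are under namespace_sem and mount_lock by that point,
so the hash is stable:

	static struct mount *topmost_overmount(struct mount *m)
	{
		struct mount *top;

		/* an overmount of m is a child hashed on m's own root */
		while ((top = __lookup_mnt(&m->mnt, m->mnt.mnt_root)) != NULL)
			m = top;
		return m;
	}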
I'm not sure which one gives better semantics, TBH. I'd be more inclined
towards the second option - on the theory that we can combine it with
the first one in cases where that's applicable.
Comments?
* Re: [BUG][RFC] broken use of __lookup_mnt() in path_overmounted()
From: Al Viro @ 2025-06-01 17:40 UTC
To: Christian Brauner; +Cc: linux-fsdevel
On Sat, May 31, 2025 at 09:57:00PM +0100, Al Viro wrote:
> One possibility is to wrap the use of __lookup_mnt() into a sample-and-recheck
> loop there; for the call of path_overmounted() in finish_automount() it'll
> give the right behaviour.
OK, that's definitely the right thing to do, whatever we end up doing
with checks in do_move_mount().
So the rules become:
Mount hash lookup (__lookup_mnt()) requires mount_lock - either
holding its spinlock component, or seqretry on its seqcount component.
If we are not holding the spinlock side of mount_lock, we must
be under rcu_read_lock() at least for the duration of lookup.
Result is safe to dereference as long as
1) mount_lock is still held or
2) rcu_read_lock() is still held or
3) namespace_sem had been held since before the lookup *AND*
parent's refcount remains positive. This covers only the continued safety
of access to the result of lookup; we still must've satisfied the rules
above for the lookup itself.
Acquiring a reference to result in cases (1) and (3) is safe; in case (2)
it must be done with __legitimize_mnt(result, seq), with seq being a value of
mount_lock seqcount component sampled *BEFORE* the lookup.
That's pretty close to the rules for the rest of mount tree walking...
Complications wrt namespace_sem come from dissolving of lazy-umounted
trees; stuck children get detached when parent's refcount drops to zero.
That happens outside of namespace_sem and I don't see any sane way to
change that.
* Re: [BUG][RFC] broken use of __lookup_mnt() in path_overmounted()
From: Christian Brauner @ 2025-06-02 9:17 UTC
To: Al Viro; +Cc: linux-fsdevel
On Sat, May 31, 2025 at 09:57:00PM +0100, Al Viro wrote:
> More audit fallout.
>
> Rules for using mount hash (will go into the docs I'm putting together):
> * all insertions are under EXCL(namespace_sem) and write_seqlock(mount_lock)
> * all removals are under write_seqlock(mount_lock)
> * all removals are either under EXCL(namespace_sem) or have parent's
> refcount equal to zero.
> * since all changes to hash chains are under write_seqlock(mount_lock),
> hash search is safe if you have at least read_seqlock_excl(mount_lock). In
> that case the result is guaranteed to be accurate.
> * removals are done with hlist_del_init_rcu(); freeing of the removed
> object is always RCU-delayed, so holding rcu_read_lock() over traversal is
> memory safe. HOWEVER, it is entirely possible to step into an object being
> removed from hash chain at the time of search and get a false negative.
> If you are not holding at least read_seqlock_excl(mount_lock), you *must*
> sample mount_lock seqcount component before the search and check it afterwards.
> * acquiring a reference to object found as the result of traversal needs
> to be done carefully - it is safe under mount_lock, but when used under rcu_read_lock()
> alone we do need __legitimize_mnt(); otherwise we are risking a race with
> final mntput() and/or busy mount checks in sync umount.
>
> Search is done by __lookup_mnt().
> Callers under write_seqlock(mount_lock):
> * all callers in fs/pnode.c and one in attach_recursive_mnt().
> Callers under rcu_read_lock():
> * lookup_mnt() - seqcount component of mount_lock used to deal
> with races. Check is done by legitimize_mnt().
> * path_overmounted() - the callers are under EXCL(namespace_sem)
> and they are holding a reference to parent, so removal of the object we
> are searching for is excluded. Reference to child is not acquired, so
> the questions about validity of that do not arise. Unfortunately, removals
> of objects in the same hash chain are *not* excluded, so a false negative
> is possible there.
>
> Bug in path_overmounted() appears to have come from the corresponding
> chunk of finish_automount(), which had the same race (and got turned
> into a call of path_overmounted()).
>
> One possibility is to wrap the use of __lookup_mnt() into a sample-and-recheck
> loop there; for the call of path_overmounted() in finish_automount() it'll
> give the right behaviour.
>
> The other one, though... The thing is, it's not really specific to
> "move beneath" case - the secondary copies always have to deal with the
> possibility of "slip under existing mount" situation. And yes, it can
> lead to problems - for all attach_recursive_mnt() callers.
Fwiw, I have pointed this out in one of my really early submissions of
this work and had asked whether we generally want the same check. That's
also why I added the shadow-mount check into the automount path, because
that could be used for that sort of issue too, afair, but my memory is
fuzzy here.
>
> Note that do_loopback() can race with mount(2) on top of the old_path
> - it might've happened between the pathname resolution for source and
> lock_mount() for mountpoint. We *can't* hold namespace_sem over the
> pathname resolution - it'd be very easy to deadlock.
Yes.
>
> One possibility would be to have all of them traverse the chain of
> overmounts on top of the thing we are going to copy/move, basically
> pretending that we'd lost the race to whatever had done the overmount.
> The problem with that is there might have been no race at all - it
> might've come from "." or a descriptor + empty path. In this case
> we currently copy/move the entire thing and it just might be used
> deliberately.
>
> Another possibility is to teach attach_recursive_mnt() to cope with the
> possibility of overmounts in source - it's not that hard, all we need is
> to find the top of overmount chain there in the very beginning and in all
> "slip under existing mount" cases have the existing mount reparented to
> the top of that chain.
I see you have already sent a second mail, so I take it you've made a
decision already, but the second option seems sanest to me.
* Re: [BUG][RFC] broken use of __lookup_mnt() in path_overmounted()
From: Al Viro @ 2025-06-19 2:35 UTC
To: Christian Brauner; +Cc: linux-fsdevel
On Mon, Jun 02, 2025 at 11:17:25AM +0200, Christian Brauner wrote:
> Fwiw, I have pointed this out in one of my really early submission of
> this work and had asked whether we generally want the same check. That's
> also why I added the shadow-mount check into the automount path because
> that could be used for that sort of issue to afair but my memory is
> fuzzy here.
Actually, the check in do_move_mount() is too early. Look:
(after having made sure . is on a private mount)
    mount -t tmpfs none A
    mkdir A/x
    mount --make-shared A
    mount --bind A/x B
    mount --make-slave B
    mount -t tmpfs other A/x
    umount B
... and now move_mount() B beneath A/x. See what happens? We get one
secondary copy, attached on top of the root of primary. _After_ we'd
entered attach_recursive_mnt(), so all checks in do_move_mount() have
nothing to catch - yet. So we end up with that secondary being side-by-side
with the "other" tmpfs...
The unpleasant part is that we'll need to backport that stuff, so it
has to be done _before_ the do_move_mount()/attach_recursive_mnt()
cleanups ;-/
Once the side-by-side thing is eliminated, we can (and IMO should) add
mnt->mnt_overmount pointing to whatever's mounted on top of root of
mnt (NULL if nothing is). That simplifies quite a few things, including
the prevention of side-by-side shite, but we can't do it first, more's
the pity...
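Purely as an illustration (the field doesn't exist yet), with such a pointer
the overmount-chain walk stops touching the hash entirely:

	/* sketch; assumes ->mnt_overmount is kept in sync under mount_lock */
	static struct mount *topmost_overmount(struct mount *m)
	{
		while (m->mnt_overmount)
			m = m->mnt_overmount;
		return m;
	}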
Hell knows, maybe MNT_OVERMOUNTED as the first step would be doable -
it would allow for simpler (and lower-overhead) intermediate step
before the introduction of ->mnt_overmount and further simplifications...