From: Al Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH][RFC] ->mnt_devname is never NULL
Date: Wed, 23 Apr 2025 02:30:34 +0100 [thread overview]
Message-ID: <20250423013034.GB2023217@ZenIV> (raw)
In-Reply-To: <20250422122514.GZ2023217@ZenIV>
On Tue, Apr 22, 2025 at 01:25:14PM +0100, Al Viro wrote:
> Consider the following setup:
>
> mkdir foo
> mkdir bar
> mkdir splat
> mount -t tmpfs none foo # mount 1
> mount -t tmpfs none bar # mount 2
> mkdir bar/baz
> mount -t tmpfs none bar/baz # mount 3
>
> then
>
> A: move_mount(AT_FDCWD, "foo", AT_FDCWD, "bar/baz", MOVE_MOUNT_BENEATH)
> gets to do_move_mount() and into do_lock_mount() called by it.
>
> path->mnt points to mount 3, path->dentry - to its root. Both are pinned.
> do_lock_mount() goes into the first iteration of loop. beneath is true,
> so it picks dentry - that of #3 mountpoint, i.e. "/baz" on #2 tmpfs instance.
>
> At that point refcount of that dentry is 3 - one from being a positive on
> tmpfs, one from being a mountpoint and one more just grabbed by do_lock_mount().
>
> Now we enter inode_lock(dentry->d_inode). Note that at that point A is not
> holding any locks. Suppose it gets preempted at this moment for whatever reason.
>
> B: mount --move bar/baz splat
> Proceeds without any problems, mount #3 gets moved to "splat". Now refcount
> of mount #2 is not pinned by anything and refcount of "/baz" on it is 2, since
> it's no longer a mountpoint.
>
> B: umount bar
BTW, another way to hit the same would be umount("bar", MNT_DETACH) instead of
mount --move + umount - just one syscall to get in; as long as it hits
namespace_lock() first before the move_mount() does, that's it.
And yes, on mainline with
if (unlikely(strcmp(current->comm, "wanker") == 0))
msleep(60000);
in front of that inode_lock() the above does trigger the spew about busy dentries
and inodes on umount -l and an oops from do_move_mount() -> do_lock_mount() ->
dput() -> iput() once the damn thing notices that mnt got unmounted and tries
to clean up. Commit below + same debugging patch => no problem.
I'd prefer a reproducer with decent hit rate without an artificial delay inserted
just before that inode_lock() - that's fine to demonstrate that bug exists in
mainline (and fixed by the patch), but for CI it's not a good idea...
Any suggestions (as well as more testing the fix itself) would be very welcome...
commit e8dbfe567bc5362ecb5ebadb4457a2110163a901
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue Apr 22 15:27:03 2025 -0400
fix a couple of races in MNT_TREE_BENEATH handling by do_move_mount()
Normally do_lock_mount(path, _) is locking a mountpoint pinned by
*path and at the time when matching unlock_mount() unlocks that
location it is still pinned by the same thing.
Unfortunately, for 'beneath' case it's no longer that simple -
the object being locked is not the one *path points to. It's the
mountpoint of path->mnt. The thing is, without sufficient locking
->mnt_parent may change under us and none of the locks are held
at that point. The rules are
* mount_lock stabilizes m->mnt_parent for any mount m.
* namespace_sem stabilizes m->mnt_parent, provided that
m is mounted.
* if either of the above holds and refcount of m is positive,
we are guaranteed the same for refcount of m->mnt_parent.
namespace_sem nests inside inode_lock(), so do_lock_mount() has
to take inode_lock() before grabbing namespace_sem. It does
recheck that path->mnt is still mounted in the same place after
getting namespace_sem, and it does take care to pin the dentry.
It is needed, since otherwise we might end up with racing mount --move
(or umount) happening while we were getting locks; in that case
dentry would no longer be a mountpoint and could've been evicted
on memory pressure along with its inode - not something you want
when grabbing lock on that inode.
However, pinning a dentry is not enough - the matching mount is
also pinned only by the fact that path->mnt is mounted on top it
and at that point we are not holding any locks whatsoever, so
the same kind of races could end up with all references to
that mount gone just as we are about to enter inode_lock().
If that happens, we are left with filesystem being shut down while
we are holding a dentry reference on it; results are not pretty.
What we need to do is grab both dentry and mount at the same time;
that makes inode_lock() safe *and* avoids the problem with fs getting
shut down under us. After taking namespace_sem we verify that
path->mnt is still mounted (which stabilizes its ->mnt_parent) and
check that it's still mounted at the same place. From that point
on to the matching namespace_unlock() we are guaranteed that
mount/dentry pair we'd grabbed are also pinned by being the mountpoint
of path->mnt, so we can quietly drop both the dentry reference (as
the current code does) and mnt one - it's OK to do under namespace_sem,
since we are not dropping the final refs.
That solves the problem on do_lock_mount() side; unlock_mount()
also has one, since dentry is guaranteed to stay pinned only until
the namespace_unlock(). That's easy to fix - just have inode_unlock()
done earlier, while it's still pinned by mp->m_dentry.
Fixes: 6ac392815628 "fs: allow to mount beneath top mount" # v6.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
index fa17762268f5..bbda516444ff 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2827,56 +2827,62 @@ static struct mountpoint *do_lock_mount(struct path *path, bool beneath)
struct vfsmount *mnt = path->mnt;
struct dentry *dentry;
struct mountpoint *mp = ERR_PTR(-ENOENT);
+ struct path under = {};
for (;;) {
- struct mount *m;
+ struct mount *m = real_mount(mnt);
if (beneath) {
- m = real_mount(mnt);
+ path_put(&under);
read_seqlock_excl(&mount_lock);
- dentry = dget(m->mnt_mountpoint);
+ under.mnt = mntget(&m->mnt_parent->mnt);
+ under.dentry = dget(m->mnt_mountpoint);
read_sequnlock_excl(&mount_lock);
+ dentry = under.dentry;
} else {
dentry = path->dentry;
}
inode_lock(dentry->d_inode);
- if (unlikely(cant_mount(dentry))) {
- inode_unlock(dentry->d_inode);
- goto out;
- }
-
namespace_lock();
- if (beneath && (!is_mounted(mnt) || m->mnt_mountpoint != dentry)) {
+ if (unlikely(cant_mount(dentry) || !is_mounted(mnt)))
+ break; // not to be mounted on
+
+ if (beneath && unlikely(m->mnt_mountpoint != dentry ||
+ &m->mnt_parent->mnt != under.mnt)) {
namespace_unlock();
inode_unlock(dentry->d_inode);
- goto out;
+ continue; // got moved
}
mnt = lookup_mnt(path);
- if (likely(!mnt))
+ if (unlikely(mnt)) {
+ namespace_unlock();
+ inode_unlock(dentry->d_inode);
+ path_put(path);
+ path->mnt = mnt;
+ path->dentry = dget(mnt->mnt_root);
+ continue; // got overmounted
+ }
+ mp = get_mountpoint(dentry);
+ if (IS_ERR(mp))
break;
-
- namespace_unlock();
- inode_unlock(dentry->d_inode);
- if (beneath)
- dput(dentry);
- path_put(path);
- path->mnt = mnt;
- path->dentry = dget(mnt->mnt_root);
- }
-
- mp = get_mountpoint(dentry);
- if (IS_ERR(mp)) {
- namespace_unlock();
- inode_unlock(dentry->d_inode);
+ if (beneath) {
+ /*
+ * @under duplicates the references that will stay
+ * at least until namespace_unlock(), so the path_put()
+ * below is safe (and OK to do under namespace_lock -
+ * we are not dropping the final references here).
+ */
+ path_put(&under);
+ }
+ return mp;
}
-
-out:
+ namespace_unlock();
+ inode_unlock(dentry->d_inode);
if (beneath)
- dput(dentry);
-
+ path_put(&under);
return mp;
}
@@ -2887,14 +2893,11 @@ static inline struct mountpoint *lock_mount(struct path *path)
static void unlock_mount(struct mountpoint *where)
{
- struct dentry *dentry = where->m_dentry;
-
+ inode_unlock(where->m_dentry->d_inode);
read_seqlock_excl(&mount_lock);
put_mountpoint(where);
read_sequnlock_excl(&mount_lock);
-
namespace_unlock();
- inode_unlock(dentry->d_inode);
}
static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
next prev parent reply other threads:[~2025-04-23 1:30 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-21 3:35 [PATCH][RFC] ->mnt_devname is never NULL Al Viro
2025-04-21 7:56 ` Christian Brauner
2025-04-21 16:29 ` Al Viro
2025-04-21 17:03 ` Al Viro
2025-04-22 3:14 ` [PATCH][RFC] do_lock_mount() races in 'beneath' case Al Viro
2025-04-22 7:47 ` Christian Brauner
2025-04-22 7:43 ` [PATCH][RFC] ->mnt_devname is never NULL Christian Brauner
2025-04-22 7:31 ` Christian Brauner
2025-04-22 12:25 ` Al Viro
2025-04-22 13:40 ` Christian Brauner
2025-04-23 1:30 ` Al Viro [this message]
2025-04-23 22:20 ` Al Viro
2025-04-24 8:56 ` Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250423013034.GB2023217@ZenIV \
--to=viro@zeniv.linux.org.uk \
--cc=brauner@kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).