From: "Darrick J. Wong" <djwong@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>,
Linux Filesystem Development List <linux-fsdevel@vger.kernel.org>,
linux-block@vger.kernel.org, fstests@vger.kernel.org
Subject: Re: Flaky test: generic/085
Date: Thu, 13 Jun 2024 10:17:09 -0700
Message-ID: <20240613171709.GA3855044@frogsfrogsfrogs>
In-Reply-To: <20240613-lackmantel-einsehen-90f0d727358d@brauner>
On Thu, Jun 13, 2024 at 11:55:37AM +0200, Christian Brauner wrote:
> On Wed, Jun 12, 2024 at 03:47:16PM GMT, Theodore Ts'o wrote:
> > On Wed, Jun 12, 2024 at 01:25:07PM +0200, Christian Brauner wrote:
> > > I tried to reproduce this with pmem yesterday and wasn't able to.
> > >
> > > What's the kernel config and test config that's used?
> > >
> >
> > The kernel config can be found here:
> >
> > https://github.com/tytso/xfstests-bld/blob/master/kernel-build/kernel-configs/config-6.1
> >
> > Drop it into .config in the build directory of any kernel sources
> > newer than 6.1, and then run "make olddefconfig". This is all
> > automated in the install-kconfig script which I use:
> >
> > https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig
> >
> > The VM has 4 CPUs and 26GiB of memory, and the kernel is booted with the
> > boot command line options "memmap=4G!9G memmap=9G!14G", which set up
> > fake /dev/pmem0 and /dev/pmem1 devices backed by RAM. This is my poor
> > engineer's way of testing DAX without needing access to expensive VMs
> > with pmem. :-)
> >
> > I'm assuming this is a timing-dependent bug which is easiest to
> > trigger on fast devices, so a ramdisk might also work. FWIW, I also
> > can see failures relatively frequently using the ext4/nojournal
> > configuration on an SSD-backed cloud block device (GCE's Persistent
> > Disk SSD product).
> >
> > As a result, if you grab my xfstests-bld repo from github and then
> > run "qemu-xfstests -c ext4/nojournal -C 20 generic/085", it should
> > also reproduce. See Documentation/kvm-quickstart.md for more details.
>
> Thanks, Ted! Ok, I think I figured it out.
>
> P1
> dm_suspend()
> -> bdev_freeze()
>    mutex_lock(&bdev->bd_fsfreeze_mutex);
>    atomic_inc_return(&bdev->bd_fsfreeze_count); // == 1
>    mutex_unlock(&bdev->bd_fsfreeze_mutex);
>
> P2                                               P3
> setup_bdev_super()
> bdev_file_open_by_dev();
> atomic_read(&bdev->bd_fsfreeze_count); // != 0
>
>                                                  bdev_thaw()
>                                                  mutex_lock(&bdev->bd_fsfreeze_mutex);
>                                                  atomic_dec_return(&bdev->bd_fsfreeze_count); // == 0
>                                                  mutex_unlock(&bdev->bd_fsfreeze_mutex);
>                                                  bd_holder_lock();
>                                                  // grab passive reference on sb via sb->s_count
>                                                  bd_holder_unlock();
>                                                  // Sleep to be woken when superblock ready or dead
> bdev_fput()
> bd_holder_lock()
> // yield bdev
> bd_holder_unlock()
>
> deactivate_locked_super()
> // notify that superblock is dead
>
>                                                  // get woken and see that superblock is dead; fail
>
> In words this basically means that P1 calls dm_suspend() which calls
> into bdev_freeze() before the block device has been claimed by the
> filesystem. This brings bdev->bd_fsfreeze_count to 1 and no call into
> fs_bdev_freeze() is required.
>
> Now P2 tries to mount that frozen block device. It claims it and checks
> bdev->bd_fsfreeze_count. As it's elevated, it aborts mounting while
> holding sb->s_umount the whole time, of course.
>
> In the meantime P3 calls dm_resume(). It sees that the block device is
> already claimed by a filesystem and calls into fs_bdev_thaw().
>
> It takes a passive reference and realizes that the filesystem isn't
> ready yet. So P3 puts itself to sleep to wait for the filesystem to
> become ready.
>
> P2 puts the last active reference to the filesystem and marks it as
> dying.
>
> Now P3 gets woken, sees that the filesystem is dying and
> get_bdev_super() fails.
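
Just to convince myself of the ordering, I sketched the interleaving as a
little userspace toy -- emphatically not kernel code, just pthreads with
made-up names (toy_thaw_race.c, freeze_count, sb_born/sb_dying) standing
in for bd_fsfreeze_count and the SB_BORN/SB_DYING wait -- and it ends up
in the "superblock already dead" leg exactly as you describe:

/* toy_thaw_race.c -- build with: cc -pthread toy_thaw_race.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static int freeze_count;        /* stands in for bdev->bd_fsfreeze_count */
static bool sb_born, sb_dying;  /* stand in for SB_BORN / SB_DYING */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* P3: dm_resume() -> bdev_thaw() -> fs_bdev_thaw() */
static void *resume_thaw(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	freeze_count--;                 /* bdev_thaw() drops the freeze count */
	/* "grab passive reference", then sleep until superblock ready or dead */
	while (!sb_born && !sb_dying)
		pthread_cond_wait(&cond, &lock);
	if (sb_dying)                   /* get_bdev_super() would return NULL here */
		printf("thaw: superblock already dead, nothing to thaw\n");
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t p3;
	bool abort_mount;

	/* P1: dm_suspend() -> bdev_freeze() before any fs has claimed the bdev */
	pthread_mutex_lock(&lock);
	freeze_count++;
	pthread_mutex_unlock(&lock);

	/* P2: setup_bdev_super() claims the bdev and sees the elevated count */
	pthread_mutex_lock(&lock);
	abort_mount = (freeze_count != 0);
	pthread_mutex_unlock(&lock);

	/* P3: starts thawing and goes to sleep waiting for the superblock */
	pthread_create(&p3, NULL, resume_thaw, NULL);

	/* P2: aborts the mount; deactivate_locked_super() marks the sb dead */
	if (abort_mount) {
		pthread_mutex_lock(&lock);
		sb_dying = true;
		pthread_cond_broadcast(&cond);
		pthread_mutex_unlock(&lock);
	}

	pthread_join(p3, NULL);
	return 0;
}

So the thaw path finding a dead superblock really is a legitimate outcome
of this interleaving rather than a kernel bug, which is why dropping the
WARN_ON_ONCE() looks right to me.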
>
> So Darrick is correct about the fix but the reasoning is a bit
> different. :)
>
> Patch appended and on #vfs.fixes.
> From 35224b919d6778ca5dd11f76659ae849594bd2bf Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Thu, 13 Jun 2024 11:38:14 +0200
> Subject: [PATCH] fs: don't misleadingly warn during thaw operations
>
> The block device may have been frozen before it was claimed by a
> filesystem. Concurrently another process might try to mount that
> frozen block device and has temporarily claimed the block device for
> that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> mounter is already about to abort mounting because they still saw an
> elevated bdev->bd_fsfreeze_count so get_bdev_super() will return
> NULL in that case.
>
> For example, P1 calls dm_suspend() which calls into bdev_freeze() before
> the block device has been claimed by the filesystem. This brings
> bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is
> required.
>
> Now P2 tries to mount that frozen block device. It claims it and checks
> bdev->bd_fsfreeze_count. As it's elevated it aborts mounting.
>
> In the meantime P3 calls dm_resume(). It sees that the block device is
> already claimed by a filesystem and calls into fs_bdev_thaw().
>
> It takes a passive reference and realizes that the filesystem isn't
> ready yet. So P3 puts itself to sleep to wait for the filesystem to
> become ready.
>
> P2 puts the last active reference to the filesystem and marks it as
> dying. Now P3 gets woken, sees that the filesystem is dying and
> get_bdev_super() fails.
Wow that's twisty. But it makes sense to me, so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> Fixes: 49ef8832fb1a ("bdev: implement freeze and thaw holder operations")
> Cc: <stable@vger.kernel.org>
> Reported-by: Theodore Ts'o <tytso@mit.edu>
> Link: https://lore.kernel.org/r/20240611085210.GA1838544@mit.edu
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/super.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index b72f1d288e95..095ba793e10c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1502,8 +1502,17 @@ static int fs_bdev_thaw(struct block_device *bdev)
>
> lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>
> + /*
> + * The block device may have been frozen before it was claimed by a
> + * filesystem. Concurrently another process might try to mount that
> + * frozen block device and has temporarily claimed the block device for
> + * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> + * mounter is already about to abort mounting because they still saw an
> + * elevated bdev->bd_fsfreeze_count so get_bdev_super() will return
> + * NULL in that case.
> + */
> sb = get_bdev_super(bdev);
> - if (WARN_ON_ONCE(!sb))
> + if (!sb)
> return -EINVAL;
>
> if (sb->s_op->thaw_super)
> --
> 2.43.0
>