* Flaky test: generic/085
@ 2024-06-11 8:52 Theodore Ts'o
2024-06-11 16:37 ` Darrick J. Wong
0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2024-06-11 8:52 UTC (permalink / raw)
To: Linux Filesystem Development List, linux-block, fstests
Hi, I've recently found a flaky test, generic/085, on 6.10-rc2 and
fs-next. It's failing on both ext4 and xfs, and it reproduces more
easily with the dax config:
xfs/4k: 20 tests, 1 failures, 137 seconds
Flaky: generic/085: 5% (1/20)
xfs/dax: 20 tests, 11 failures, 71 seconds
Flaky: generic/085: 55% (11/20)
ext4/4k: 20 tests, 111 seconds
ext4/dax: 20 tests, 8 failures, 69 seconds
Flaky: generic/085: 40% (8/20)
Totals: 80 tests, 0 skipped, 20 failures, 0 errors, 388s
The failure is caused by a WARN_ON in fs_bdev_thaw() in fs/super.c:
static int fs_bdev_thaw(struct block_device *bdev)
{
...
sb = get_bdev_super(bdev);
if (WARN_ON_ONCE(!sb))
return -EINVAL;
The generic/085 test exercises races between the fs
freeze/unfreeze and mount/umount code paths, so this appears to be
either a VFS-level or block device layer bug. Modulo the warning, it
looks relatively harmless, so I'll just exclude generic/085 from my
test appliance, at least for now. Hopefully someone will have a
chance to take a look at it?
Thanks,
- Ted
^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Flaky test: generic/085
  2024-06-11  8:52 Flaky test: generic/085 Theodore Ts'o
@ 2024-06-11 16:37 ` Darrick J. Wong
  2024-06-12 11:25   ` Christian Brauner
  0 siblings, 1 reply; 6+ messages in thread

From: Darrick J. Wong @ 2024-06-11 16:37 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Linux Filesystem Development List, linux-block, fstests

On Tue, Jun 11, 2024 at 09:52:10AM +0100, Theodore Ts'o wrote:
> Hi, I've recently found a flaky test, generic/085, on 6.10-rc2 and
> fs-next. It's failing on both ext4 and xfs, and it reproduces more
> easily with the dax config:
>
> [test results snipped]
>
> The failure is caused by a WARN_ON in fs_bdev_thaw() in fs/super.c:
>
> static int fs_bdev_thaw(struct block_device *bdev)
> {
> ...
>	sb = get_bdev_super(bdev);
>	if (WARN_ON_ONCE(!sb))
>		return -EINVAL;
>
> The generic/085 test exercises races between the fs
> freeze/unfreeze and mount/umount code paths, so this appears to be
> either a VFS-level or block device layer bug. Modulo the warning, it
> looks relatively harmless, so I'll just exclude generic/085 from my
> test appliance, at least for now. Hopefully someone will have a
> chance to take a look at it?

I think this can happen if fs_bdev_thaw races with unmount?

Let's say that the _umount $lvdev in the second loop in generic/085
starts the unmount process, which clears SB_ACTIVE from the super_block.
Then the first loop tries to freeze the bdev (and fails), and
immediately tries to thaw the bdev. The thaw code calls fs_bdev_thaw
because the unmount process is still running and so the fs is still
holding the bdev. But get_bdev_super sees that SB_ACTIVE has been
cleared from the super_block, so it returns NULL, which trips the
warning.

If that's correct, then I think the WARN_ON_ONCE should go away.

--D

> Thanks,
>
>						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Flaky test: generic/085
  2024-06-11 16:37 ` Darrick J. Wong
@ 2024-06-12 11:25   ` Christian Brauner
  2024-06-12 14:47     ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread

From: Christian Brauner @ 2024-06-12 11:25 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Theodore Ts'o, Linux Filesystem Development List, linux-block, fstests

On Tue, Jun 11, 2024 at 09:37:01AM -0700, Darrick J. Wong wrote:
> [quote of Ted's report snipped]
>
> I think this can happen if fs_bdev_thaw races with unmount?
>
> Let's say that the _umount $lvdev in the second loop in generic/085
> starts the unmount process, which clears SB_ACTIVE from the super_block.
> Then the first loop tries to freeze the bdev (and fails), and
> immediately tries to thaw the bdev. The thaw code calls fs_bdev_thaw
> because the unmount process is still running and so the fs is still
> holding the bdev. But get_bdev_super sees that SB_ACTIVE has been
> cleared from the super_block, so it returns NULL, which trips the
> warning.
>
> If that's correct, then I think the WARN_ON_ONCE should go away.

I've been trying to reproduce this with pmem yesterday and wasn't able to.

SB_ACTIVE is cleared in generic_shutdown_super(). If we're in there we
know that there are no active references to the superblock anymore. That
includes freeze requests:

* Freezes are nestable from kernel and userspace, but all nested
  freezers share a single active reference in sb->s_active.
* The nested freeze requests are counted in
  sb->s_writers.freeze_{kcount,ucount}.
* The last thaw request (sb->s_writers.freeze_kcount +
  sb->s_writers.freeze_ucount == 0) releases the sb->s_active reference.
* Nested freezes from the block layer via bdev_freeze() are additionally
  counted in bdev->bd_fsfreeze_count, protected by
  bdev->bd_fsfreeze_mutex.

The device mapper suspend logic that generic/085 uses relies on
bdev_freeze() and bdev_thaw() from the block layer. So all those dm
freezes should be counted in bdev->bd_fsfreeze_count. And device mapper
has logic to ensure that only a single freeze request is ever made. So
bdev->bd_fsfreeze_count in that test should be 1.

So when a bdev_thaw() request comes via dm_resume():

* bdev_thaw() is called and encounters bdev->bd_fsfreeze_count == 1.
* As there aren't any fs-initiated freezes, we know that
  sb->s_writers.freeze_kcount == 0 and
  sb->s_writers.freeze_ucount == 1 == bdev->bd_fsfreeze_count.
* When fs_bdev_thaw() runs, the superblock is still valid and we've got
  at least one active reference taken during the bdev_freeze() request.
* get_bdev_super() tries to grab an active reference to the superblock
  but fails. That can indeed happen because SB_ACTIVE is cleared.

But for that to be the case we must've dropped the last active
reference, forgotten to take it during the original freeze, miscounted
bdev->bd_fsfreeze_count, or missed a nested
sb->s_writers.freeze_{kcount,ucount}.

What's the kernel config and test config that's used?

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Flaky test: generic/085
  2024-06-12 11:25 ` Christian Brauner
@ 2024-06-12 14:47   ` Theodore Ts'o
  2024-06-13  9:55     ` Christian Brauner
  0 siblings, 1 reply; 6+ messages in thread

From: Theodore Ts'o @ 2024-06-12 14:47 UTC (permalink / raw)
To: Christian Brauner
Cc: Darrick J. Wong, Linux Filesystem Development List, linux-block, fstests

On Wed, Jun 12, 2024 at 01:25:07PM +0200, Christian Brauner wrote:
> I've been trying to reproduce this with pmem yesterday and wasn't able to.
>
> What's the kernel config and test config that's used?

The kernel config can be found here:

https://github.com/tytso/xfstests-bld/blob/master/kernel-build/kernel-configs/config-6.1

Drop it into .config in the build directory of any kernel sources
newer than 6.1, and then run "make olddefconfig". This is all
automated in the install-kconfig script which I use:

https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig

The VM has 4 CPUs and 26GiB of memory, and the kernel is booted with
the command line options "memmap=4G!9G memmap=9G!14G", which set up
fake /dev/pmem0 and /dev/pmem1 devices backed by RAM. This is my poor
engineer's way of testing DAX without needing to get access to
expensive VMs with pmem. :-)

I'm assuming this is a timing-dependent bug which is easiest to
trigger on fast devices, so a ramdisk might also work. FWIW, I also
see failures relatively frequently using the ext4/nojournal
configuration on an SSD-backed cloud block device (GCE's Persistent
Disk SSD product).

As a result, if you grab my xfstests-bld repo from github and then
run "qemu-xfstests -c ext4/nojournal -C 20 generic/085", it should
also reproduce. See Documentation/kvm-quickstart.md for more details.

						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Flaky test: generic/085
  2024-06-12 14:47 ` Theodore Ts'o
@ 2024-06-13  9:55   ` Christian Brauner
  2024-06-13 17:17     ` Darrick J. Wong
  0 siblings, 1 reply; 6+ messages in thread

From: Christian Brauner @ 2024-06-13 9:55 UTC (permalink / raw)
To: Theodore Ts'o, Darrick J. Wong
Cc: Linux Filesystem Development List, linux-block, fstests

[-- Attachment #1: Type: text/plain, Size: 3315 bytes --]

On Wed, Jun 12, 2024 at 03:47:16PM GMT, Theodore Ts'o wrote:
> [kernel config and VM setup details snipped]
>
> As a result, if you grab my xfstests-bld repo from github and then
> run "qemu-xfstests -c ext4/nojournal -C 20 generic/085", it should
> also reproduce. See Documentation/kvm-quickstart.md for more details.

Thanks, Ted! Ok, I think I figured it out.

P1: dm_suspend()
    -> bdev_freeze()
       mutex_lock(&bdev->bd_fsfreeze_mutex);
       atomic_inc_return(&bdev->bd_fsfreeze_count); // == 1
       mutex_unlock(&bdev->bd_fsfreeze_mutex);

P2: setup_bdev_super()
      bdev_file_open_by_dev();
      atomic_read(&bdev->bd_fsfreeze_count); // != 0 -> abort mount

P3: dm_resume()
    -> bdev_thaw()
       mutex_lock(&bdev->bd_fsfreeze_mutex);
       atomic_dec_return(&bdev->bd_fsfreeze_count); // == 0
       mutex_unlock(&bdev->bd_fsfreeze_mutex);
       bd_holder_lock();
       // grab passive reference on sb via sb->s_count
       bd_holder_unlock();
       // sleep to be woken when superblock ready or dead

P2: bdev_fput()
      bd_holder_lock();
      // yield bdev
      bd_holder_unlock();
    deactivate_locked_super()
      // notify that superblock is dead

P3: // get woken and see that superblock is dead; fail

In words this basically means that P1 calls dm_suspend() which calls
into bdev_freeze() before the block device has been claimed by the
filesystem. This brings bdev->bd_fsfreeze_count to 1 and no call into
fs_bdev_freeze() is required.

Now P2 tries to mount that frozen block device. It claims it and checks
bdev->bd_fsfreeze_count. As it's elevated, it aborts mounting, holding
sb->s_umount all the time of course.

In the meantime P3 calls dm_resume(). It sees that the block device is
already claimed by a filesystem and calls into fs_bdev_thaw().

It takes a passive reference and realizes that the filesystem isn't
ready yet. So P3 puts itself to sleep to wait for the filesystem to
become ready.

P2 puts the last active reference to the filesystem and marks it as
dying.

Now P3 gets woken, sees that the filesystem is dying, and
get_bdev_super() fails.

So Darrick is correct about the fix but the reasoning is a bit
different. :)

Patch appended and on #vfs.fixes.
[-- Attachment #2: 0001-fs-don-t-misleadingly-warn-during-thaw-operations.patch --]
[-- Type: text/x-diff, Size: 2618 bytes --]

From 35224b919d6778ca5dd11f76659ae849594bd2bf Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 13 Jun 2024 11:38:14 +0200
Subject: [PATCH] fs: don't misleadingly warn during thaw operations

The block device may have been frozen before it was claimed by a
filesystem. Concurrently another process might try to mount that
frozen block device and has temporarily claimed the block device for
that purpose, causing a concurrent fs_bdev_thaw() to end up here. The
mounter is already about to abort mounting because they still saw an
elevated bdev->bd_fsfreeze_count, so get_bdev_super() will return
NULL in that case.

For example, P1 calls dm_suspend() which calls into bdev_freeze() before
the block device has been claimed by the filesystem. This brings
bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is
required.

Now P2 tries to mount that frozen block device. It claims it and checks
bdev->bd_fsfreeze_count. As it's elevated, it aborts mounting.

In the meantime P3 calls dm_resume(). It sees that the block device is
already claimed by a filesystem and calls into fs_bdev_thaw().

It takes a passive reference and realizes that the filesystem isn't
ready yet. So P3 puts itself to sleep to wait for the filesystem to
become ready.

P2 puts the last active reference to the filesystem and marks it as
dying.

Now P3 gets woken, sees that the filesystem is dying, and
get_bdev_super() fails.

Fixes: 49ef8832fb1a ("bdev: implement freeze and thaw holder operations")
Cc: <stable@vger.kernel.org>
Reported-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20240611085210.GA1838544@mit.edu
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index b72f1d288e95..095ba793e10c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1502,8 +1502,17 @@ static int fs_bdev_thaw(struct block_device *bdev)
 
 	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
 
+	/*
+	 * The block device may have been frozen before it was claimed by a
+	 * filesystem. Concurrently another process might try to mount that
+	 * frozen block device and has temporarily claimed the block device for
+	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
+	 * mounter is already about to abort mounting because they still saw an
+	 * elevated bdev->bd_fsfreeze_count so get_bdev_super() will return
+	 * NULL in that case.
+	 */
 	sb = get_bdev_super(bdev);
-	if (WARN_ON_ONCE(!sb))
+	if (!sb)
 		return -EINVAL;
 
 	if (sb->s_op->thaw_super)
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: Flaky test: generic/085
  2024-06-13  9:55 ` Christian Brauner
@ 2024-06-13 17:17   ` Darrick J. Wong
  0 siblings, 0 replies; 6+ messages in thread

From: Darrick J. Wong @ 2024-06-13 17:17 UTC (permalink / raw)
To: Christian Brauner
Cc: Theodore Ts'o, Linux Filesystem Development List, linux-block, fstests

On Thu, Jun 13, 2024 at 11:55:37AM +0200, Christian Brauner wrote:
> [P1/P2/P3 race analysis and appended patch snipped]
>
> P2 puts the last active reference to the filesystem and marks it as
> dying.
>
> Now P3 gets woken, sees that the filesystem is dying, and
> get_bdev_super() fails.

Wow, that's twisty. But it makes sense to me, so

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads: [~2024-06-13 17:17 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2024-06-11  8:52 Flaky test: generic/085 Theodore Ts'o
2024-06-11 16:37 ` Darrick J. Wong
2024-06-12 11:25   ` Christian Brauner
2024-06-12 14:47     ` Theodore Ts'o
2024-06-13  9:55       ` Christian Brauner
2024-06-13 17:17         ` Darrick J. Wong