From: "Theodore Ts'o" <tytso@mit.edu>
To: Zorro Lang <zlang@kernel.org>
Cc: linux-ext4@vger.kernel.org, fstests@vger.kernel.org,
regressions@lists.linux.dev
Subject: Re: [fstests generic/388, 455, 475, 482 ...] Ext4 journal recovery test fails
Date: Sun, 3 Sep 2023 16:40:01 -0400
Message-ID: <ZPTvIb6hwIjY7T2M@mit.edu>
In-Reply-To: <20230903120001.qjv5uva2zaqthgk2@zlang-mailbox>
On Sun, Sep 03, 2023 at 08:00:01PM +0800, Zorro Lang wrote:
> Hi ext4 folks,
>
> Recently I found that lots of fstests cases which belong to "recoveryloop" (e.g.
> g/388 [1], g/455 [2], g/475 [3] and g/482 [4]) or do fs shutdown/resize tests
> (e.g. ext4/059 [5], g/530 [6]) failed on ext4 with 1k blocksize; the kernel is
> linux v6.6-rc0+ (HEAD=b84acc11b1c9).
>
> I tested with MKFS_OPTIONS="-b 1024", no specific MOUNT_OPTIONS. I hit these
> failures several times, and I didn't hit them on my last regression test on
> v6.5-rc7+. So I think this might be a regression. And I didn't hit
> these failures on xfs. If this is a known issue that will be fixed soon, feel free
> to tell me.
TL;DR: there definitely seems to be something going on with g/455 and
g/482 with the ext4/1k blocksize case in Linus's latest upstream tree,
although it wasn't there in the ext4 branch which I sent to Linus to
pull.
Unfortunately, generic/475 is a known failure, especially in the 1k
block size case. The failure rate seems to change a bit over time. For
example, from 6.2:
ext4/1k: 522 tests, 2 failures, 45 skipped, 6153 seconds
Flaky: generic/051: 40% (2/5) generic/475: 60% (3/5)
and from 6.1.0-rc4:
ext4/1k: 522 tests, 2 failures, 45 skipped, 5660 seconds
Flaky: generic/051: 60% (3/5) generic/475: 40% (2/5)
In 6.5-rc3, it looks like the rate has gotten worse:
ext4/1k: 30 tests, 29 failures, 2402 seconds
Flaky: generic/475: 97% (29/30)
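The percentages in the "Flaky:" lines are simply failures divided by runs. As a minimal sketch (the helper names `parse_flaky` and `failure_rate` are hypothetical, not part of any test runner), the three summary lines quoted above can be compared like this:

```python
import re

# "Flaky:" summary lines copied from the results above, keyed by kernel version.
FLAKY = {
    "6.1.0-rc4": "Flaky: generic/051: 60% (3/5) generic/475: 40% (2/5)",
    "6.2": "Flaky: generic/051: 40% (2/5) generic/475: 60% (3/5)",
    "6.5-rc3": "Flaky: generic/475: 97% (29/30)",
}

# One entry looks like "generic/475: 97% (29/30)".
ENTRY = re.compile(r"(\S+/\d+):\s+\d+%\s+\((\d+)/(\d+)\)")


def parse_flaky(line):
    """Return {test: (failures, runs)} for one 'Flaky:' summary line."""
    return {test: (int(f), int(n)) for test, f, n in ENTRY.findall(line)}


def failure_rate(failures, runs):
    """Fraction of runs that failed."""
    return failures / runs


for kernel, line in FLAKY.items():
    f, n = parse_flaky(line)["generic/475"]
    print(f"{kernel}: generic/475 failed {f}/{n} = {failure_rate(f, n):.0%}")
```

Note that the 6.1.0-rc4 and 6.2 figures each come from only 5 runs, so 40% versus 60% is a difference of a single failure; the 6.5-rc3 figure is based on 30 runs.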
Alas, finding a root cause for generic/475 has been challenging. I
suspect that it happens when we crash while doing a large truncate on
a highly fragmented file system, such that the truncate has to span
multiple transactions, with the inode on the orphan list so the kernel
can complete the truncate when we clean up the orphan list if we crash
mid-truncate. However, that's just a theory, and I don't yet have
hard evidence.
The generic/388 test is very different. It uses the shutdown ioctl,
and that's something that ext4 has never completely handled correctly.
Doing it right would require adding some locks in hot paths, so it's
one which I've suppressed for all of my ext4 tests[1].
[1] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/exclude
The generic/455 and generic/482 tests work by using dm-log-writes, and
they were *not* failing on the branch (v6.5.0-rc3-60-g768d612f7982) for
which I sent a pull request to Linus:
ext4/1k: 10 tests, 63 seconds
generic/455 Pass 4s
generic/482 Pass 8s
generic/455 Pass 5s
generic/482 Pass 8s
generic/455 Pass 5s
generic/482 Pass 7s
generic/455 Pass 5s
generic/482 Pass 8s
generic/455 Pass 5s
generic/482 Pass 8s
Totals: 10 tests, 0 skipped, 0 failures, 0 errors, 63s
... but I can confirm that it's failing on Linus's upstream (I tested
commit 708283abf896):
ext4/1k: 2 tests, 2 failures, 31 seconds
generic/455 Failed 4s
generic/455 Failed 2s
generic/455 Pass 5s
generic/455 Failed 3s
generic/455 Failed 2s
generic/482 Failed 2s
generic/482 Failed 3s
generic/482 Failed 1s
generic/482 Failed 3s
generic/482 Failed 4s
Totals: 10 tests, 0 skipped, 9 failures, 0 errors, 29s
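To make the per-test breakdown explicit, the per-run lines above can be tallied. This is only an illustrative sketch; the `tally` helper is hypothetical and the input is the ten result lines copied verbatim from the upstream run:

```python
from collections import Counter

# Per-run result lines copied from the upstream (708283abf896) run above.
RESULTS = """\
generic/455 Failed 4s
generic/455 Failed 2s
generic/455 Pass 5s
generic/455 Failed 3s
generic/455 Failed 2s
generic/482 Failed 2s
generic/482 Failed 3s
generic/482 Failed 1s
generic/482 Failed 3s
generic/482 Failed 4s
"""


def tally(text):
    """Return {test: (failures, runs)} from 'test status time' lines."""
    runs, fails = Counter(), Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not a per-run result line
        test, status, _elapsed = parts
        runs[test] += 1
        if status == "Failed":
            fails[test] += 1
    return {test: (fails[test], runs[test]) for test in runs}


for test, (f, n) in tally(RESULTS).items():
    print(f"{test}: {f}/{n} failed")
```

The tally gives 4 of 5 failures for generic/455 and 5 of 5 for generic/482, which adds up to the 9 failures reported in the Totals line.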
- Ted
P.S. After doing some digging, it appears generic/455 does have some
failures on 6.4 (20%) and 6.5-rc3 (5%) on the ext4/1k blocksize test
config. But *something* is definitely causing a much greater failure
rate in Linus's upstream. The good news is that this should make it
relatively easy to bisect. I'll look into it. Thanks for flagging
this.
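Because generic/455 already fails occasionally on the older kernels mentioned above, a bisection step cannot treat a single failure as "bad". One way to handle that, sketched here under stated assumptions, is to run the test several times per commit and compare the failure rate against a threshold. The exit codes follow the `git bisect run` convention (0 for good, 125 for skip, other codes from 1 to 127 for bad). The 50% threshold, the run count, and the `run_test` callable (which would have to build, boot, and run the test) are all hypothetical:

```python
# Exit codes understood by `git bisect run`.
GOOD_EXIT, BAD_EXIT, SKIP_EXIT = 0, 1, 125


def verdict(failures, runs, threshold=0.5):
    """Classify a commit from repeated runs of a flaky test.

    The threshold is an assumption: it should sit well above the
    baseline failure rate and well below the regressed one.
    """
    if runs == 0:
        return SKIP_EXIT  # nothing ran, so this commit cannot be judged
    return BAD_EXIT if failures / runs >= threshold else GOOD_EXIT


def bisect_step(run_test, runs=5, threshold=0.5):
    """Run a hypothetical test callable `runs` times and classify the commit.

    `run_test` must return True when the test passes.
    """
    failures = sum(0 if run_test() else 1 for _ in range(runs))
    return verdict(failures, runs, threshold)


# A 1-in-5 failure rate (the 20% baseline) counts as good; 9 of 10 counts as bad.
print(verdict(1, 5), verdict(9, 10))
```

A wrapper script would end with `sys.exit(bisect_step(my_runner))`, where `my_runner` stands in for whatever actually runs the test on the commit being checked.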