From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Lionel Bouton <lionel-subscription@bouton.name>,
linux-btrfs@vger.kernel.org
Subject: Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
Date: Fri, 5 Jul 2024 08:19:31 +0930
Message-ID: <52ea9f1f-ff91-402c-b997-ec08200ff049@gmx.com>
In-Reply-To: <2650d27a-5127-4ec9-b62f-ec1683d0cecf@gmx.com>
On 2024/7/5 08:08, Qu Wenruo wrote:
>
>
> On 2024/7/4 21:51, Lionel Bouton wrote:
>> On 30/06/2024 at 12:59, Lionel Bouton wrote:
>>> On 22/06/2024 at 11:41, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/6/22 18:21, Lionel Bouton wrote:
>>>> [...]
>>>>>>
>>>>>> I'll mount the filesystem and run a scrub again to see if I can
>>>>>> reproduce the problem. It should be noticeably quicker: we made
>>>>>> updates to the Ceph cluster and should get approximately 2x the I/O
>>>>>> bandwidth.
>>>>>> I plan to keep the disk snapshot for at least several weeks so if you
>>>>>> want to test something else just say so.
>>>>>
>>>>>
>>>>> The scrub is finished, here are the results:
>>>>>
>>>>> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
>>>>> Scrub started: Wed Jun 19 00:01:59 2024
>>>>> Status: finished
>>>>> Duration: 81:04:21
>>>>> Total to scrub: 18.83TiB
>>>>> Rate: 67.67MiB/s
>>>>> Error summary: no errors found
>>>>>
>>>>> So the scrub error isn't deterministic. I'll shut down the test VM for
>>>>> now and keep the disk snapshot it uses for at least a couple of weeks
>>>>> in case it is needed for further tests.
>>>>> The original filesystem is scrubbed monthly; I'll reply to this
>>>>> message if another error shows up.
>>>>
>>>> I just remembered that there was a scrub-related bug that can
>>>> report false alerts:
>>>>
>>>> f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
>>>> not 64K aligned")
>>>>
>>>> But that fix should have been backported automatically, and in that
>>>> case there should be error messages like "unable to find chunk map" in
>>>> the kernel log.
>>>>
>>>> Otherwise, I have no extra clues.
>>>>
>>>> Have you tried kernels like v6.8/6.9 and can you reproduce the bug in
>>>> those newer kernels?
>>>
>>> I've just upgraded the kernel to 6.9.7 (and btrfs-progs to 6.9.2) and
>>> monthly scrubs with it will start next week. That said the last
>>> filesystem scrub with 6.6.30 ran without errors so it might be hard to
>>> reproduce.
>>> One difference between the last scrub and the previous one, which
>>> reported checksum errors, is the underlying device speed: it is getting
>>> faster as we replace HDDs with SSDs on the Ceph cluster (which might
>>> matter if there's a race condition somewhere). Other than that there's
>>> nothing I can think of.
>>>
>>> In fact the only 2 major changes before the scrub checksum errors
>>> were:
>>> - a noticeable increase in constant I/O load,
>>> - an upgrade to the 6.6 kernel.
>>>
>>> As nobody else has reported the same behavior, I'm not ruling out a
>>> hardware glitch either.
>>> I'll reply to this thread if a future scrub reports a non-reproducible
>>> checksum error again.
>>
>> I didn't expect to have something to report so soon...
>> Another virtual machine running on another physical server but using the
>> same Ceph cluster just reported csum errors that aren't reproducible.
>> This was with kernel 6.6.13 and btrfs-progs 6.8.2.
>> Fortunately this filesystem is small and can be scrubbed in 2 minutes:
>> I just ran the scrub again (less than 5 hours after the one that
>> reported errors) and no errors are reported this time.
>>
>> I'll upgrade this VM to 6.9.7+ too. If 6.6 does indeed have a scrub bug
>> and 6.9 does not, it might be easier to verify than I anticipated: most
>> of our VMs have migrated or are in the process of migrating to 6.6,
>> which is the latest LTS. If the problem manifests itself on a small
>> filesystem too, I expect other systems to fail scrubs sooner or later if
>> 6.6 is affected by a scrub bug.
>
> So far it looks like it's the commit f546c4282673 ("btrfs: scrub: avoid
> use-after-free when chunk length is not 64K aligned") fixing the error.
>
> In that case, it looks like 6.6 was EOL at that time and thus didn't get
> the backport.
Nope. As you mentioned, 6.6 is LTS, and I have just checked the stable
tree: the fix is already merged into 6.6.15, so that is not the case.
Let me dig deeper to find out why.
Thanks,
Qu