From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Lionel Bouton <lionel-subscription@bouton.name>,
linux-btrfs@vger.kernel.org
Subject: Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
Date: Sat, 22 Jun 2024 19:11:14 +0930
Message-ID: <08774378-624a-4586-9f24-c108f1ffeebb@gmx.com>
In-Reply-To: <ae6aab7d-029d-4e33-ace7-e8f0b0a2fa36@bouton.name>
On 2024/6/22 18:21, Lionel Bouton wrote:
[...]
>>
>> I'll mount the filesystem and run a scrub again to see if I can
>> reproduce the problem. It should be noticeably quicker: we made
>> updates to the Ceph cluster and should now get approximately 2x the
>> I/O bandwidth.
>> I plan to keep the disk snapshot for at least several weeks so if you
>> want to test something else just say so.
>
>
> The scrub is finished, here are the results:
>
> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
> Scrub started: Wed Jun 19 00:01:59 2024
> Status: finished
> Duration: 81:04:21
> Total to scrub: 18.83TiB
> Rate: 67.67MiB/s
> Error summary: no errors found
>
> So the scrub error isn't deterministic. I'll shut down the test VM for
> now and keep the disk snapshot it uses for at least a couple of weeks
> in case it is needed for further tests.
> The original filesystem is scrubbed monthly; I'll reply to this message
> if another error shows up.
I vaguely remember there was a bug related to scrub that could report
false alerts:
f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
not 64K aligned")
But that fix should have been automatically backported, and when the
bug triggers it leaves error messages like "unable to find chunk map"
in the kernel log.
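(Something like the following would confirm it either way; the second
command assumes access to the source tree the running kernel was built
from:)

  # the bug leaves this message in the kernel log when it triggers
  dmesg | grep "unable to find chunk map"

  # look for the backported fix by its commit subject
  git log --oneline --grep="avoid use-after-free when chunk length"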
Otherwise, I have no extra clues.
Have you tried newer kernels such as v6.8/v6.9, and can you reproduce
the bug there?
Thanks,
Qu
>
>>
>> Best regards,
>> Lionel
>>
>>>
>>> If btrfs check reports an error and logical-resolve fails to locate
>>> the file and its position, it means the corruption is in bookend extents.
>>>
>>> If btrfs check reports an error and logical-resolve can locate the file
>>> and position, then it's a different problem.
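>>>
>>> (For reference, a minimal sketch of those two steps, with the logical
>>> address taken from the scrub error line in dmesg; the device name,
>>> address and mount point are placeholders:)
>>>
>>>   # read-only check, run against the unmounted device
>>>   btrfs check --readonly /dev/vdb
>>>
>>>   # map a logical address back to file(s) and offset(s)
>>>   btrfs inspect-internal logical-resolve 123456789 /mnt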
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> The stats output shows that this file has not been modified since its
>>>> creation more than 6 years ago. This is what led me to report a bug
>>>> in scrub.
>>>>
>>>>
>>>>>
>>>>>> - Some, including the original subvolume, logged only 2 warnings,
>>>>>> and one (root 151335) logged only one warning.
>>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>>> - When several offsets are logged their order seems random.
>>>>>
>>>>> One problem with scrub is that it may ratelimit its output, thus we
>>>>> cannot rely on dmesg alone to know the full extent of the damage.
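>>>>>
>>>>> (For example, the persistent per-device counters are not subject to
>>>>> the ratelimit, so something along these lines gives a fuller picture
>>>>> than dmesg; the mount point is a placeholder:)
>>>>>
>>>>>   btrfs scrub status -d /mnt   # per-device scrub statistics
>>>>>   btrfs device stats /mnt      # persistent error counters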
>>>>
>>>> I wondered about this: I've read other threads where ratelimiting is
>>>> mentioned but I was not sure if it could apply here. Thanks for
>>>> clarifying.
>>>>
>>>>>
>>>>>>
>>>>>> I'm attaching kernel logs beginning with the scrub start and
>>>>>> ending with
>>>>>> the last error. As of now there are no new errors/warnings even
>>>>>> though
>>>>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>>>>> remove the balance-related log entries for the same filesystem.
>>>>>>
>>>>>> Side-note for the curious or in the unlikely case it could be
>>>>>> linked to
>>>>>> the problem:
>>>>>> Historically this filesystem was mounted with ssd_spread without any
>>>>>> issue (several years ago it seemed to make sense to me given the
>>>>>> documentation and the IO latencies I saw on the Ceph cluster). A
>>>>>> recent kernel filled the whole available space with nearly empty
>>>>>> block groups, which seemed to re-appear each time we mounted with
>>>>>> ssd_spread.
>>>>>> We have switched to "ssd" since then, and a mostly daily 90-minute
>>>>>> balance progressively brings us back to the previous state (this is
>>>>>> the reason behind the balance-related logs). Having some space
>>>>>> available for new block groups seems to be a good idea, but
>>>>>> additionally, as we use Ceph RBD, we want it to be able to
>>>>>> deallocate unused space: many nearly empty block groups could waste
>>>>>> resources (especially if the used space in these groups sits in
>>>>>> contiguous chunks smaller than 4MB, the default RBD object size).
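>>>>>>
>>>>>> (For illustration, the daily balance is something along these lines;
>>>>>> the usage threshold and mount point are assumptions on my side:)
>>>>>>
>>>>>>   # repack block groups that are less than 20% used, then let the
>>>>>>   # thinly provisioned RBD volume reclaim the freed space
>>>>>>   btrfs balance start -dusage=20 /mnt
>>>>>>   fstrim /mnt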
>>>>>>
>>>>>>
>>>>>> More information:
>>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>>> reported. The previous scrub was run successfully at the beginning of
>>>>>> May with a 6.6.13 kernel.
>>>>>>
>>>>>> There is a continuous defragmentation process running (it slowly
>>>>>> walks the whole filesystem, ignoring snapshots, and defragments
>>>>>> files one by one based on a fragmentation estimate obtained with
>>>>>> filefrag -v). All defragmentations are logged, and I can confirm
>>>>>> the file for which this error was reported has not been
>>>>>> defragmented for at least a year (I checked because I wanted to
>>>>>> rule out a possible race condition between defragmentation and
>>>>>> scrub).
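>>>>>>
>>>>>> (A minimal sketch of that estimation, with a hypothetical extent
>>>>>> threshold; filefrag prints "<file>: N extents found":)
>>>>>>
>>>>>>   ext=$(filefrag "$f" | awk '{print $(NF-2)}')
>>>>>>   [ "$ext" -gt 32 ] && btrfs filesystem defragment "$f"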
>>>>>
>>>>> I'm wondering if there is any direct IO conflicting with
>>>>> buffered/defrag IO.
>>>>>
>>>>> It's known that conflicting buffered/direct IO can cause contents to
>>>>> change halfway through a write, leading to a btrfs csum mismatch.
>>>>> So far that's the only known cause of btrfs csum mismatches I can
>>>>> think of.
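>>>>>
>>>>> (A contrived illustration only, not a guaranteed reproducer: two
>>>>> writers racing on the same range, one O_DIRECT and one buffered, is
>>>>> the kind of conflict meant here; the file name is made up:)
>>>>>
>>>>>   dd if=/dev/urandom of=victim bs=4k count=256 oflag=direct conv=notrunc &
>>>>>   dd if=/dev/zero of=victim bs=4k count=256 conv=notrunc &
>>>>>   wait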
>>>>
>>>> But here it seems there isn't an actual mismatch, as reading the file
>>>> is possible and returns the data that was written to it 6 years ago.
>>>> This looks like an error in scrub (or a neutrino flipping a bit
>>>> somewhere during the scrub). The VM runs on a server with ECC RAM, so
>>>> a simple bitflip is unlikely, but when everything likely has been
>>>> ruled out...
>>>>
>>>> Thanks for your feedback,
>>>> Lionel
>>>>
>>>>>
>>>>> But 6.x kernels should all log a warning message for it.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>>
>>>>>> In addition to the above, the notable IO includes:
>>>>>> - a moderately IO-intensive PostgreSQL replication subscriber,
>>>>>> - occasional large synchronisations using Syncthing,
>>>>>> - snapshot creations/removals approximately every 80 minutes. The
>>>>>> last snapshot operation was logged 31 minutes before the errors
>>>>>> (removals occur asynchronously, but there was no unusual load at
>>>>>> that time that could point to a particularly long snapshot cleanup
>>>>>> process).
>>>>>>
>>>>>> To sum up, there are many processes accessing the filesystem, but
>>>>>> historically it has seen far higher IO load during scrubs than what
>>>>>> was occurring here.
>>>>>>
>>>>>>
>>>>>> Reproducing this might be difficult: the whole scrub can take a week
>>>>>> depending on load. That said, I can easily prepare a kernel and/or
>>>>>> new btrfs-progs binaries to launch scrubs or other non-destructive
>>>>>> tasks on weekends or at the end of the day (UTC+2 timezone).
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Lionel Bouton
>>>>>
>>>>
>>
>
>