From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Lionel Bouton <lionel-subscription@bouton.name>,
linux-btrfs@vger.kernel.org
Subject: Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
Date: Fri, 7 Jun 2024 08:35:42 +0930
Message-ID: <4525e502-c209-4672-ae32-68296436d204@gmx.com>
In-Reply-To: <c2fa6c37-d2a8-4d77-b095-e04e3db35ef0@bouton.name>
On 2024/6/4 23:42, Lionel Bouton wrote:
> Hi,
>
> It seems one of our systems hit a (harmless?) bug while scrubbing.
>
> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
> ebuild).
> btrfs-progs: 6.8.1
> The system is a VM using QEMU/KVM.
> The affected filesystem is directly on top of a single virtio-scsi block
> device (no partition, no LVM, ...) with a Ceph RBD backend.
> Profiles for metadata: dup, data: single.
>
> The scrub process is ongoing but it logged uncorrectable errors:
>
> # btrfs scrub status /mnt/store
> UUID: <our_uuid>
> Scrub started: Mon Jun 3 09:35:28 2024
> Status: running
> Duration: 26:12:07
> Time left: 114:45:38
> ETA: Sun Jun 9 06:33:13 2024
> Total to scrub: 18.84TiB
> Bytes scrubbed: 3.50TiB (18.59%)
> Rate: 38.92MiB/s
> Error summary: csum=61
> Corrected: 0
> Uncorrectable: 61
> Unverified: 0
>
> According to the logs all errors are linked to a single file (renamed to
> <file> in the attached log) and happened within a couple of seconds.
> Most errors are duplicates as there are many snapshots of the subvolume
> it is in. This system is mainly used as a backup server, storing copies
> of data for other servers and creating snapshots of them.
>
> I was about to overwrite the file to correct the problem by fetching it
> from its source, but I didn't get an error trying to read it (cat <file>
> > /dev/null). I used diff and md5sum to double-check: the file is fine.
>
> Result of stat <file> on the subvolume used as base for the snapshots:
> File: <file>
> Size: 1799195 Blocks: 3520 IO Block: 4096 regular file
> Device: 0,36 Inode: 6676106 Links: 1
> Access: (0644/-rw-r--r--) Uid: ( 1000/<uid>) Gid: ( 1000/<gid>)
> Access: 2018-02-15 05:22:38.506204046 +0100
> Modify: 2018-02-15 05:21:35.612914799 +0100
> Change: 2018-02-15 05:22:38.506204046 +0100
> Birth: 2018-02-15 05:22:38.486203934 +0100
>
> AFAIK the kernel logged a warning for nearly every snapshot of the same
> subvolume (they were all created after the file), the subvolume from
> which snapshots are created is root 257 in the logs. What seems
> interesting to me and might help diagnose this is the pattern for
> offsets in the warnings.
> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
> between them).
> - Most snapshots seem to report a warning for each of them.
This is how btrfs reports a corrupted data sector: it goes through all
the inodes referencing it (including ones in different subvolumes) and
reports the error for each of them.
So it really means that if you want to get rid of the bad data by
deleting the file, you need to remove the corresponding inode from
every snapshot as well.
And if you have a correct copy whose contents match the expected csum,
there is a more dangerous but more straightforward fix: directly
overwriting the contents on disk.
> - Some, including the original subvolume, have only logged 2 warnings and
> one (root 151335) logged only one warning.
> - All of the snapshots reported a warning for offset 20480.
> - When several offsets are logged their order seems random.
One problem with scrub is that its kernel messages may be ratelimited,
so we cannot rely on dmesg alone to know the full extent of the damage.
>
> I'm attaching kernel logs beginning with the scrub start and ending with
> the last error. As of now there are no new errors/warnings even though
> the scrub is still ongoing, 15 hours after the last error. I didn't
> clean the balance logs linked to the same filesystem out of the log.
>
> Side-note for the curious or in the unlikely case it could be linked to
> the problem:
> Historically this filesystem was mounted with ssd_spread without any
> issue (I guess several years ago it made sense to me reading the
> documentation and given the IO latencies I saw on the Ceph cluster). A
> recent kernel filled the whole available space with nearly empty block
> groups which seemed to re-appear each time we mounted with ssd_spread.
> We switched to "ssd" since then and there is a mostly daily 90-minute
> balance to progressively get back to the previous state (this is the
> reason behind the balance-related logs). Having some space available for
> new block groups seems to be a good idea, but additionally, as we use
> Ceph RBD, we want it to be able to deallocate unused space: having many
> nearly empty block groups could waste resources (especially if the used
> space in these groups is in contiguous chunks smaller than 4MB, the
> default RBD object size).
>
>
> More information:
> The scrub is run monthly. This is the first time an error was ever
> reported. The previous scrub was run successfully at the beginning of
> May with a 6.6.13 kernel.
>
> There is a continuous defragmentation process running (it processes the
> whole filesystem slowly, ignoring snapshots, and defragments file by file
> based on a fragmentation estimation using filefrag -v). All
> defragmentations are logged and I can confirm the file for which this
> error was reported was not defragmented for at least a year (I checked
> because I wanted to rule out a possible race condition between
> defragmentation and scrub).
I'm wondering if there is any direct IO conflicting with buffered/defrag IO.
It's a known problem that conflicting buffered and direct IO can change
the contents halfway through a write and lead to a btrfs csum mismatch.
So far that's the only known cause of btrfs csum mismatches I can think
of here.
But 6.x kernels should all log a warning message when that happens.
Thanks,
Qu
>
> In addition to the above, among the notable IO are :
> - a moderately IO-intensive PostgreSQL replication subscriber,
> - occasional large synchronisations using Syncthing,
> - snapshot creations/removals occurring approximately every 80 minutes. The
> last snapshot operation was logged 31 minutes before the errors (removals
> occur asynchronously but there was no unusual load at this time that
> could point to a particularly long snapshot cleanup process).
>
> To sum up, there are many processes accessing the filesystem but
> historically it saw far higher IO load during scrubs than what was
> occurring here.
>
>
> Reproducing this might be difficult: the whole scrub can take a week
> depending on load. That said, I can easily prepare a kernel and/or new
> btrfs-progs binaries to launch scrubs or other non-destructive tasks over
> the weekend or at the end of the day (UTC+2 timezone).
>
> Best regards,
>
> Lionel Bouton