Linux Btrfs filesystem development
* BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
@ 2024-06-04 14:12 Lionel Bouton
  2024-06-06 22:51 ` Lionel Bouton
  2024-06-06 23:05 ` Qu Wenruo
  0 siblings, 2 replies; 17+ messages in thread
From: Lionel Bouton @ 2024-06-04 14:12 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5429 bytes --]

Hi,

It seems one of our systems hit a (harmless?) bug while scrubbing.

Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30 
ebuild).
btrfs-progs: 6.8.1
The system is a VM using QEMU/KVM.
The affected filesystem is directly on top of a single virtio-scsi block 
device (no partition, no LVM, ...) with a Ceph RBD backend.
Profiles for metadata: dup, data: single.

The scrub process is ongoing but it has logged uncorrectable errors:

# btrfs scrub status /mnt/store
UUID:             <our_uuid>
Scrub started:    Mon Jun  3 09:35:28 2024
Status:           running
Duration:         26:12:07
Time left:        114:45:38
ETA:              Sun Jun  9 06:33:13 2024
Total to scrub:   18.84TiB
Bytes scrubbed:   3.50TiB  (18.59%)
Rate:             38.92MiB/s
Error summary:    csum=61
   Corrected:      0
   Uncorrectable:  61
   Unverified:     0

According to the logs all errors are linked to a single file (renamed to 
<file> in the attached log) and happened within a couple of seconds. 
Most errors are duplicates as there are many snapshots of the subvolume 
it is in. This system is mainly used as a backup server, storing copies 
of data for other servers and creating snapshots of them.

I was about to overwrite the file to correct the problem by fetching it
from its source, but I didn't get an error trying to read it
(cat <file> > /dev/null). I used diff and md5sum to double-check: the
file is fine.
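
For reference, the verification was essentially of this form (illustrative
commands only; <file> and the path of the source copy are placeholders):

$ cat <file> > /dev/null                 # reads the whole file, no IO error
$ md5sum <file> /path/to/source/copy     # identical checksums
$ diff <file> /path/to/source/copy       # no differences reported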

Result of stat <file> on the subvolume used as base for the snapshots:
   File: <file>
   Size: 1799195         Blocks: 3520       IO Block: 4096 regular file
Device: 0,36    Inode: 6676106     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
Access: 2018-02-15 05:22:38.506204046 +0100
Modify: 2018-02-15 05:21:35.612914799 +0100
Change: 2018-02-15 05:22:38.506204046 +0100
  Birth: 2018-02-15 05:22:38.486203934 +0100

AFAIK the kernel logged a warning for nearly every snapshot of the same
subvolume (they were all created after the file); the subvolume from
which the snapshots are created is root 257 in the logs. What seems
interesting to me and might help diagnose this is the pattern of
offsets in the warnings.
- There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
between them).
- Most snapshots seem to report a warning for each of them.
- Some, including the original subvolume, logged only 2 warnings, and
one (root 151335) logged a single warning.
- All of the snapshots reported a warning for offset 20480.
- When several offsets are logged, their order seems random.

I'm attaching kernel logs beginning with the scrub start and ending with
the last error. As of now there are no new errors/warnings even though
the scrub is still ongoing, 15 hours after the last error. I didn't
remove the balance logs linked to the same filesystem from the log.

Side-note for the curious or in the unlikely case it could be linked to
the problem:
Historically this filesystem was mounted with ssd_spread without any
issue (I guess several years ago it made sense to me after reading the
documentation and given the IO latencies I saw on the Ceph cluster). A
recent kernel filled the whole available space with nearly empty block
groups which seemed to re-appear each time we mounted with ssd_spread.
We have since switched to "ssd" and there is a mostly daily 90-minute
balance to progressively get back to the previous state (this is the
reason behind the balance related logs). Having some space available
for new block groups seems to be a good idea, but additionally, as we
use Ceph RBD, we want it to be able to deallocate unused space: having
many nearly empty block groups could waste resources (especially if the
used space in these groups is spread in contiguous chunks smaller than
4MB, which is the default RBD object size).
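
For reference, the daily balance amounts to something like the following
sketch (not the exact script we run; the mount point, the 90-minute
budget and the usage range are illustrative, but the attached kernel log
shows the resulting "balance: start -dusage=N ..." lines):

mnt=/mnt/store
( sleep 5400; btrfs balance cancel "$mnt" ) &    # stop after ~90 minutes
for u in $(seq 1 100); do
    # relocate only block groups used at N% or less, for all chunk types
    btrfs balance start -dusage=$u -musage=$u -susage=$u "$mnt" || break
done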


More information:
The scrub is run monthly. This is the first time an error has ever been
reported. The previous scrub completed successfully at the beginning of
May with a 6.6.13 kernel.

There is a continuous defragmentation process running (it slowly
processes the whole filesystem, ignoring snapshots, and defragments
file by file based on a fragmentation estimate obtained with
filefrag -v). All defragmentations are logged and I can confirm that
the file for which this error was reported has not been defragmented
for at least a year (I checked because I wanted to rule out a possible
race condition between defragmentation and scrub).
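
The per-file logic is roughly equivalent to this sketch (a hypothetical
simplification: the real process parses the filefrag -v extent layout,
and the threshold, target extent size and path below are placeholders):

f=/mnt/store/some/file                         # placeholder path
extents=$(filefrag "$f" | awk '{print $2}')    # "<file>: N extents found"
if [ "${extents:-0}" -gt 32 ]; then            # 32 is an arbitrary cut-off
    btrfs filesystem defragment -t 32M "$f"
fi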

In addition to the above, the notable IO includes:
- a moderately IO intensive PostgreSQL replication subscriber,
- occasional large synchronisations using Syncthing,
- snapshot creations/removals approximately every 80 minutes. The last
snapshot operation was logged 31 minutes before the errors (removals
occur asynchronously but there was no unusual load at this time that
could point to a particularly long snapshot cleanup process).

To sum up, there are many processes accessing the filesystem but 
historically it saw far higher IO load during scrubs than what was 
occurring here.


Reproducing this might be difficult: the whole scrub can take a week
depending on load. That said, I can easily prepare a kernel and/or new
btrfs-progs binaries to launch scrubs or other non-destructive tasks
over the weekend or at the end of the day (UTC+2 timezone).

Best regards,

Lionel Bouton

[-- Attachment #2: kernel.log --]
[-- Type: text/x-log, Size: 22210 bytes --]

Jun  3 09:35:28 bacchus kernel: BTRFS info (device sdb): scrub: started on devid 1
Jun  3 21:25:10 bacchus kernel: BTRFS info (device sdb): balance: start -dusage=1 -musage=1 -susage=1
Jun  3 21:25:12 bacchus kernel: BTRFS info (device sdb): relocating block group 46838433447936 flags system|dup
Jun  3 21:25:12 bacchus kernel: BTRFS info (device sdb): found 3 extents, stage: move data extents
Jun  3 21:25:12 bacchus kernel: BTRFS info (device sdb): balance: ended with status: 0
Jun  3 21:25:13 bacchus kernel: BTRFS info (device sdb): balance: start -dusage=2 -musage=2 -susage=2
Jun  3 21:25:13 bacchus kernel: BTRFS info (device sdb): balance: ended with status: 0
Jun  3 21:25:13 bacchus kernel: BTRFS info (device sdb): balance: start -dusage=3 -musage=3 -susage=3
Jun  3 21:25:13 bacchus kernel: BTRFS info (device sdb): balance: ended with status: 0
Jun  3 21:25:14 bacchus kernel: BTRFS info (device sdb): balance: start -dusage=4 -musage=4 -susage=4
Jun  3 21:25:14 bacchus kernel: BTRFS info (device sdb): balance: ended with status: 0
Jun  3 21:25:14 bacchus kernel: BTRFS info (device sdb): balance: start -dusage=5 -musage=5 -susage=5
Jun  3 21:25:14 bacchus kernel: BTRFS info (device sdb): balance: ended with status: 0
Jun  3 21:25:15 bacchus kernel: BTRFS info (device sdb): balance: start -dusage=6 -musage=6 -susage=6
Jun  3 21:25:15 bacchus kernel: BTRFS info (device sdb): relocating block group 28159613337600 flags metadata|dup
Jun  3 21:46:48 bacchus kernel: BTRFS info (device sdb): found 3862 extents, stage: move data extents
Jun  3 21:48:17 bacchus kernel: BTRFS info (device sdb): relocating block group 28154244628480 flags metadata|dup
Jun  3 21:59:19 bacchus kernel: BTRFS info (device sdb): found 3855 extents, stage: move data extents
Jun  3 22:01:01 bacchus kernel: BTRFS info (device sdb): relocating block group 27057954226176 flags metadata|dup
Jun  3 22:08:23 bacchus kernel: BTRFS info (device sdb): found 3751 extents, stage: move data extents
Jun  3 22:09:27 bacchus kernel: BTRFS info (device sdb): relocating block group 25452710199296 flags metadata|dup
Jun  3 22:15:33 bacchus kernel: BTRFS info (device sdb): found 3826 extents, stage: move data extents
Jun  3 22:17:12 bacchus kernel: BTRFS info (device sdb): relocating block group 25348557242368 flags metadata|dup
Jun  3 22:22:40 bacchus kernel: BTRFS info (device sdb): found 3808 extents, stage: move data extents
Jun  3 22:23:37 bacchus kernel: BTRFS info (device sdb): relocating block group 23787269521408 flags metadata|dup
Jun  3 22:27:34 bacchus kernel: BTRFS info (device sdb): found 3932 extents, stage: move data extents
Jun  3 22:28:36 bacchus kernel: BTRFS info (device sdb): relocating block group 23088230039552 flags metadata|dup
Jun  3 22:33:47 bacchus kernel: BTRFS info (device sdb): found 3690 extents, stage: move data extents
Jun  3 22:34:18 bacchus kernel: BTRFS info (device sdb): relocating block group 22920726315008 flags metadata|dup
Jun  3 22:39:00 bacchus kernel: BTRFS info (device sdb): found 3674 extents, stage: move data extents
Jun  3 22:39:37 bacchus kernel: BTRFS info (device sdb): relocating block group 22911062638592 flags metadata|dup
Jun  3 22:44:37 bacchus kernel: BTRFS info (device sdb): found 3699 extents, stage: move data extents
Jun  3 22:45:13 bacchus kernel: BTRFS info (device sdb): relocating block group 22889587802112 flags metadata|dup
Jun  3 22:48:48 bacchus kernel: BTRFS info (device sdb): found 3616 extents, stage: move data extents
Jun  3 22:49:08 bacchus kernel: BTRFS info (device sdb): relocating block group 22869186707456 flags metadata|dup
Jun  3 22:53:20 bacchus kernel: BTRFS info (device sdb): found 3735 extents, stage: move data extents
Jun  3 22:53:42 bacchus kernel: BTRFS info (device sdb): relocating block group 22809023610880 flags metadata|dup
Jun  3 22:57:20 bacchus kernel: BTRFS info (device sdb): found 3632 extents, stage: move data extents
Jun  3 22:57:41 bacchus kernel: BTRFS info (device sdb): balance: canceled
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985335296 on dev /dev/sdb physical 3836152905728
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985335296 on dev /dev/sdb physical 3836152905728
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985335296 on dev /dev/sdb physical 3836152905728
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985335296 on dev /dev/sdb physical 3836152905728
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985335296 on dev /dev/sdb physical 3836152905728
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985531904 on dev /dev/sdb physical 3836153102336
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985400832 on dev /dev/sdb physical 3836152971264
Jun  4 00:46:03 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897985466368 on dev /dev/sdb physical 3836153036800
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151505, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151505, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151498, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151498, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151492, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151492, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151492, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151426, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151426, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151426, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151444, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151444, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151523, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151523, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151517, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151517, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151517, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151511, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151511, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151511, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897989201920 on dev /dev/sdb physical 3836156772352
Jun  4 00:46:04 bacchus kernel: BTRFS error (device sdb): unable to fixup (regular) error at logical 3897989201920 on dev /dev/sdb physical 3836156772352
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151456, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151456, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151456, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151486, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151486, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151486, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151480, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151480, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151480, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151468, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151468, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151468, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151462, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151462, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151462, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151474, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151474, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151474, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151450, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151450, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151450, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151547, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151547, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151547, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 257, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 257, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151541, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151541, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151538, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151538, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151538, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151535, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151535, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151535, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151532, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151532, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151532, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151529, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151529, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151529, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151544, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151544, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151544, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151207, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151207, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151207, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151335, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 151079, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 151079, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 151079, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 150952, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 150952, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 150952, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985400832 on dev /dev/sdb, physical 3836152971264, root 150825, inode 6676106, offset 20480, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985531904 on dev /dev/sdb, physical 3836153102336, root 150825, inode 6676106, offset 151552, length 4096, links 1 (path: <file>)
Jun  4 00:46:04 bacchus kernel: BTRFS warning (device sdb): checksum error at logical 3897985466368 on dev /dev/sdb, physical 3836153036800, root 150825, inode 6676106, offset 86016, length 4096, links 1 (path: <file>)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-04 14:12 BUG: scrub reports uncorrectable csum errors linked to readable file (data: single) Lionel Bouton
@ 2024-06-06 22:51 ` Lionel Bouton
  2024-06-06 23:05 ` Qu Wenruo
  1 sibling, 0 replies; 17+ messages in thread
From: Lionel Bouton @ 2024-06-06 22:51 UTC (permalink / raw)
  To: linux-btrfs

Hi,

can anyone tell if scrub csum errors for a readable file are a known bug
on 6.6.30? If so, is there a patched version already? (I checked the
6.6.31 and 6.6.32 changelogs when I posted the message below and didn't
spot anything at the time.)

Best regards,

Lionel Bouton

On 04/06/2024 at 16:12, Lionel Bouton wrote:
> Hi,
>
> It seems one of our systems hit a (harmless?) bug while scrubbing.
>
> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30 
> ebuild).
> btrfs-progs: 6.8.1
> The system is a VM using QEMU/KVM.
> The affected filesystem is directly on top of a single virtio-scsi 
> block device (no partition, no LVM, ...) with a Ceph RBD backend.
> Profiles for metadata: dup, data: single.
>
> The scrub process is ongoing but it logged uncorrectable errors :
>
> # btrfs scrub status /mnt/store
> UUID:             <our_uuid>
> Scrub started:    Mon Jun  3 09:35:28 2024
> Status:           running
> Duration:         26:12:07
> Time left:        114:45:38
> ETA:              Sun Jun  9 06:33:13 2024
> Total to scrub:   18.84TiB
> Bytes scrubbed:   3.50TiB  (18.59%)
> Rate:             38.92MiB/s
> Error summary:    csum=61
>   Corrected:      0
>   Uncorrectable:  61
>   Unverified:     0
>
> According to the logs all errors are linked to a single file (renamed 
> to <file> in the attached log) and happened within a couple of 
> seconds. Most errors are duplicates as there are many snapshots of the 
> subvolume it is in. This system is mainly used as a backup server, 
> storing copies of data for other servers and creating snapshots of them.
>
> I was about to overwrite the file to correct the problem by fetching 
> it from its source but I didn't get an error trying to read it (cat 
> <file> > /dev/null). I used diff and md5sum to double-check: the file 
> is fine.
>
> Result of stat <file> on the subvolume used as base for the snapshots:
>   File: <file>
>   Size: 1799195         Blocks: 3520       IO Block: 4096 regular file
> Device: 0,36    Inode: 6676106     Links: 1
> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
> Access: 2018-02-15 05:22:38.506204046 +0100
> Modify: 2018-02-15 05:21:35.612914799 +0100
> Change: 2018-02-15 05:22:38.506204046 +0100
>  Birth: 2018-02-15 05:22:38.486203934 +0100
>
> AFAIK the kernel logged a warning for nearly each snapshot of the same 
> subvolume (they were all created after the file), the subvolume from 
> which snapshots are created is root 257 in the logs. What seems 
> interesting to me and might help diagnose this is the pattern for 
> offsets in the warnings.
> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k 
> between them).
> - Most snapshots seem to report a warning for each of them.
> - Some including the original subvolume have only logged 2 warnings 
> and one (root 151335) logged only one warning.
> - All of the snapshots reported a warning for offset 20480.
> - When several offsets are logged their order seems random.
>
> I'm attaching kernel logs beginning with the scrub start and ending 
> with the last error. As of now there are no new errors/warnings even 
> though the scrub is still ongoing, 15 hours after the last error. I 
> didn't clean the log from balance logs linked to the same filesystem.
>
> Side-note for the curious or in the unlikely case it could be linked 
> to the problem:
> Historically this filesystem was mounted with ssd_spread without any 
> issue (I guess several years ago it made sense to me reading the 
> documentation and given the IO latencies I saw on the Ceph cluster). A 
> recent kernel filled the whole available space with nearly empty block 
> groups which seemed to re-appear each time we mounted with ssd_spread. 
> We switched to "ssd" since then and there is a mostly daily 90mn 
> balance to reach back the previous state progressively (this is the 
> reason behind the balance related logs). Having some space available 
> for new block groups seems to be a good idea but additionally as we 
> use Ceph RBD, we want it to be able to deallocate unused space: having 
> many nearly empty block groups could waste resources (especially if 
> the used space in these groups is in <4MB continuous chunks which is 
> the default RBD object size).
>
>
> More information :
> The scrub is run monthly. This is the first time an error was ever 
> reported. The previous scrub was run successfully at the beginning of 
> May with a 6.6.13 kernel.
>
> There is a continuous defragmentation process running (it processes 
> the whole filesystem slowly ignoring snapshots and defragments file by 
> file based on a fragmentation estimation using filefrag -v). All 
> defragmentations are logged and I can confirm the file for which this 
> error was reported was not defragmented for at least a year (I checked 
> because I wanted to rule out a possible race condition between 
> defragmentation and scrub).
>
> In addition to the above, among the notable IO are :
> - a moderately IO intensive PostgreSQL replication subscriber,
> - occasional large synchronisations using Syncthing,
> - snapshot creations/removals occur approximately every 80 minutes. 
> The last snapshot operation was logged 31 minutes before the errors 
> (removals occur asynchronously but there was no unusual load at this 
> time that could point to a particularly long snapshot cleanup process).
>
> To sum up, there are many processes accessing the filesystem but 
> historically it saw far higher IO load during scrubs than what was 
> occurring here.
>
>
> Reproducing this might be difficult: the whole scrub can take a week 
> depending on load. That said I can easily prepare a kernel and/or new 
> btrfs-progs binaries to launch scrubs or other non-destructive tasks 
> the week-end or at the end of the day (UTC+2 timezone).
>
> Best regards,
>
> Lionel Bouton


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-04 14:12 BUG: scrub reports uncorrectable csum errors linked to readable file (data: single) Lionel Bouton
  2024-06-06 22:51 ` Lionel Bouton
@ 2024-06-06 23:05 ` Qu Wenruo
  2024-06-06 23:21   ` Lionel Bouton
  1 sibling, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2024-06-06 23:05 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs



On 2024/6/4 23:42, Lionel Bouton wrote:
> Hi,
>
> It seems one of our systems hit a (harmless?) bug while scrubbing.
>
> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
> ebuild).
> btrfs-progs: 6.8.1
> The system is a VM using QEMU/KVM.
> The affected filesystem is directly on top of a single virtio-scsi block
> device (no partition, no LVM, ...) with a Ceph RBD backend.
> Profiles for metadata: dup, data: single.
>
> The scrub process is ongoing but it logged uncorrectable errors :
>
> # btrfs scrub status /mnt/store
> UUID:             <our_uuid>
> Scrub started:    Mon Jun  3 09:35:28 2024
> Status:           running
> Duration:         26:12:07
> Time left:        114:45:38
> ETA:              Sun Jun  9 06:33:13 2024
> Total to scrub:   18.84TiB
> Bytes scrubbed:   3.50TiB  (18.59%)
> Rate:             38.92MiB/s
> Error summary:    csum=61
>    Corrected:      0
>    Uncorrectable:  61
>    Unverified:     0
>
> According to the logs all errors are linked to a single file (renamed to
> <file> in the attached log) and happened within a couple of seconds.
> Most errors are duplicates as there are many snapshots of the subvolume
> it is in. This system is mainly used as a backup server, storing copies
> of data for other servers and creating snapshots of them.
>
> I was about to overwrite the file to correct the problem by fetching it
> from its source but I didn't get an error trying to read it (cat <file>
>  > /dev/null). I used diff and md5sum to double-check: the file is fine.
>
> Result of stat <file> on the subvolume used as base for the snapshots:
>    File: <file>
>    Size: 1799195         Blocks: 3520       IO Block: 4096 regular file
> Device: 0,36    Inode: 6676106     Links: 1
> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
> Access: 2018-02-15 05:22:38.506204046 +0100
> Modify: 2018-02-15 05:21:35.612914799 +0100
> Change: 2018-02-15 05:22:38.506204046 +0100
>   Birth: 2018-02-15 05:22:38.486203934 +0100
>
> AFAIK the kernel logged a warning for nearly each snapshot of the same
> subvolume (they were all created after the file), the subvolume from
> which snapshots are created is root 257 in the logs. What seems
> interesting to me and might help diagnose this is the pattern for
> offsets in the warnings.
> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
> between them).
> - Most snapshots seem to report a warning for each of them.

This is how btrfs reports a corrupted data sector: it goes through all
the inodes referencing it (including ones in different subvolumes) and
reports the error for each of them.

So it really means that if you want to delete the file, you need to
remove the inodes from all the different snapshots.

Alternatively, if you have a correct copy matching the csum, there is a
more dangerous but more straightforward way to fix it: directly
overwriting the contents on disk.
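
For illustration only, such a direct overwrite could look like the
sketch below (not a tested recipe: good-4k.bin stands for the known-good
4 KiB of data, the physical offset is the one from the first scrub
warning in your report, and the filesystem must be unmounted first;
double-check the offset before writing anything):

umount /mnt/store
dd if=good-4k.bin of=/dev/sdb bs=4096 count=1 \
   seek=$((3836152971264 / 4096)) conv=notrunc,fsync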

> - Some including the original subvolume have only logged 2 warnings and
> one (root 151335) logged only one warning.
> - All of the snapshots reported a warning for offset 20480.
> - When several offsets are logged their order seems random.

One problem with scrub is that it may ratelimit the output, thus we
cannot just rely on dmesg to know the full extent of the damage.
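
The counters kept by the kernel are not ratelimited though, so (using
the mount point from your report) something like the following still
gives the full error counts even when dmesg dropped some lines:

# btrfs scrub status -d /mnt/store    (per-device scrub error summary)
# btrfs device stats /mnt/store       (persistent corruption_errs counters)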

>
> I'm attaching kernel logs beginning with the scrub start and ending with
> the last error. As of now there are no new errors/warnings even though
> the scrub is still ongoing, 15 hours after the last error. I didn't
> clean the log from balance logs linked to the same filesystem.
>
> Side-note for the curious or in the unlikely case it could be linked to
> the problem:
> Historically this filesystem was mounted with ssd_spread without any
> issue (I guess several years ago it made sense to me reading the
> documentation and given the IO latencies I saw on the Ceph cluster). A
> recent kernel filled the whole available space with nearly empty block
> groups which seemed to re-appear each time we mounted with ssd_spread.
> We switched to "ssd" since then and there is a mostly daily 90mn balance
> to reach back the previous state progressively (this is the reason behind
> the balance related logs). Having some space available for new block
> groups seems to be a good idea but additionally as we use Ceph RBD, we
> want it to be able to deallocate unused space: having many nearly empty
> block groups could waste resources (especially if the used space in
> these groups is in <4MB continuous chunks which is the default RBD
> object size).
>
>
> More information :
> The scrub is run monthly. This is the first time an error was ever
> reported. The previous scrub was run successfully at the beginning of
> May with a 6.6.13 kernel.
>
> There is a continuous defragmentation process running (it processes the
> whole filesystem slowly ignoring snapshots and defragments file by file
> based on a fragmentation estimation using filefrag -v). All
> defragmentations are logged and I can confirm the file for which this
> error was reported was not defragmented for at least a year (I checked
> because I wanted to rule out a possible race condition between
> defragmentation and scrub).

I'm wondering if there is any direct IO conflicting with buffered/defrag IO.

It's known that conflicting buffered/direct IO can change the contents
halfway through a write and lead to a btrfs csum mismatch.
So far that's the only known cause of btrfs csum mismatches I can
think of.

But 6.x kernels should all log a warning message for it.

Thanks,
Qu
>
> In addition to the above, among the notable IO are :
> - a moderately IO intensive PostgreSQL replication subscriber,
> - occasional large synchronisations using Syncthing,
> - snapshot creations/removals occur approximately every 80 minutes. The
> last snapshot operation was logged 31 minutes before the errors (removals
> occur asynchronously but there was no unusual load at this time that
> could point to a particularly long snapshot cleanup process).
>
> To sum up, there are many processes accessing the filesystem but
> historically it saw far higher IO load during scrubs than what was
> occurring here.
>
>
> Reproducing this might be difficult: the whole scrub can take a week
> depending on load. That said I can easily prepare a kernel and/or new
> btrfs-progs binaries to launch scrubs or other non-destructive tasks the
> week-end or at the end of the day (UTC+2 timezone).
>
> Best regards,
>
> Lionel Bouton

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-06 23:05 ` Qu Wenruo
@ 2024-06-06 23:21   ` Lionel Bouton
  2024-06-06 23:30     ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-06 23:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi,

On 07/06/2024 at 01:05, Qu Wenruo wrote:
>
>
> On 2024/6/4 23:42, Lionel Bouton wrote:
>> Hi,
>>
>> It seems one of our systems hit a (harmless?) bug while scrubbing.
>>
>> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
>> ebuild).
>> btrfs-progs: 6.8.1
>> The system is a VM using QEMU/KVM.
>> The affected filesystem is directly on top of a single virtio-scsi block
>> device (no partition, no LVM, ...) with a Ceph RBD backend.
>> Profiles for metadata: dup, data: single.
>>
>> The scrub process is ongoing but it logged uncorrectable errors :
>>
>> # btrfs scrub status /mnt/store
>> UUID:             <our_uuid>
>> Scrub started:    Mon Jun  3 09:35:28 2024
>> Status:           running
>> Duration:         26:12:07
>> Time left:        114:45:38
>> ETA:              Sun Jun  9 06:33:13 2024
>> Total to scrub:   18.84TiB
>> Bytes scrubbed:   3.50TiB  (18.59%)
>> Rate:             38.92MiB/s
>> Error summary:    csum=61
>>    Corrected:      0
>>    Uncorrectable:  61
>>    Unverified:     0
>>
>> According to the logs all errors are linked to a single file (renamed to
>> <file> in the attached log) and happened within a couple of seconds.
>> Most errors are duplicates as there are many snapshots of the subvolume
>> it is in. This system is mainly used as a backup server, storing copies
>> of data for other servers and creating snapshots of them.
>>
>> I was about to overwrite the file to correct the problem by fetching it
>> from its source but I didn't get an error trying to read it (cat <file>
>>  > /dev/null). I used diff and md5sum to double-check: the file is 
>> fine.
>>
>> Result of stat <file> on the subvolume used as base for the snapshots:
>>    File: <file>
>>    Size: 1799195         Blocks: 3520       IO Block: 4096 regular file
>> Device: 0,36    Inode: 6676106     Links: 1
>> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
>> Access: 2018-02-15 05:22:38.506204046 +0100
>> Modify: 2018-02-15 05:21:35.612914799 +0100
>> Change: 2018-02-15 05:22:38.506204046 +0100
>>   Birth: 2018-02-15 05:22:38.486203934 +0100
>>
>> AFAIK the kernel logged a warning for nearly each snapshot of the same
>> subvolume (they were all created after the file), the subvolume from
>> which snapshots are created is root 257 in the logs. What seems
>> interesting to me and might help diagnose this is the pattern for
>> offsets in the warnings.
>> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
>> between them).
>> - Most snapshots seem to report a warning for each of them.
>
> This is how btrfs reports a corrupted data sector, it would go through
> all inodes (including ones in different subvolumes) and report the error.
>
> So it really means if you want to delete the file, you need to remove
> all the inodes from all different snapshots.
>
> So if you got a correct copy matching the csum, you can do a more
> dangerous but more straight forward way to fix, by directly overwriting
> the contents on disk.

I briefly considered doing just that... but then I found out that the
scrub errors were themselves in error and the on-disk data matches the
checksums. When I tried to read the file, not only did the filesystem
not report an IO error (if I'm not mistaken it should if the csum
doesn't match) but the file content matched the original file fetched
from its source.

The stat output shows that this file has not been modified since its
creation more than 6 years ago. This is what led me to report a bug in
scrub.


>
>> - Some including the original subvolume have only logged 2 warnings and
>> one (root 151335) logged only one warning.
>> - All of the snapshots reported a warning for offset 20480.
>> - When several offsets are logged their order seems random.
>
> One problem of scrub is, it may ratelimit the output, thus we can not
> just rely on dmesg to know the damage.

I wondered about this: I've read other threads where ratelimiting is 
mentioned but I was not sure if it could apply here. Thanks for clarifying.

>
>>
>> I'm attaching kernel logs beginning with the scrub start and ending with
>> the last error. As of now there are no new errors/warnings even though
>> the scrub is still ongoing, 15 hours after the last error. I didn't
>> clean the log from balance logs linked to the same filesystem.
>>
>> Side-note for the curious or in the unlikely case it could be linked to
>> the problem:
>> Historically this filesystem was mounted with ssd_spread without any
>> issue (I guess several years ago it made sense to me reading the
>> documentation and given the IO latencies I saw on the Ceph cluster). A
>> recent kernel filled the whole available space with nearly empty block
>> groups which seemed to re-appear each time we mounted with ssd_spread.
>> We switched to "ssd" since then and there is a mostly daily 90mn balance
>> to reach back the previous state progressively (this is the reason behind
>> the balance related logs). Having some space available for new block
>> groups seems to be a good idea but additionally as we use Ceph RBD, we
>> want it to be able to deallocate unused space: having many nearly empty
>> block groups could waste resources (especially if the used space in
>> these groups is in <4MB continuous chunks which is the default RBD
>> object size).
>>
>>
>> More information :
>> The scrub is run monthly. This is the first time an error was ever
>> reported. The previous scrub was run successfully at the beginning of
>> May with a 6.6.13 kernel.
>>
>> There is a continuous defragmentation process running (it processes the
>> whole filesystem slowly ignoring snapshots and defragments file by file
>> based on a fragmentation estimation using filefrag -v). All
>> defragmentations are logged and I can confirm the file for which this
>> error was reported was not defragmented for at least a year (I checked
>> because I wanted to rule out a possible race condition between
>> defragmentation and scrub).
>
> I'm wondering if there is any direct IO conflicting with 
> buffered/defrag IO.
>
> It's known that conflicting buffered/direct IO can lead to contents
> change halfway, and lead to btrfs csum mismatch.
> So far that's the only thing leading to known btrfs csum mismatch I can
> think of.

But here it seems there isn't an actual mismatch, as reading the file
is possible and gives the data which was written to it 6 years ago.
This looks like an error in scrub (or a neutrino flipping a bit
somewhere during the scrub). The VM runs on a server with ECC RAM so it
is unlikely to be a simple bitflip, but when everything likely has been
ruled out...

Thanks for your feedback,
Lionel

>
> But 6.x kernel should all log a warning message for it.
>
> Thanks,
> Qu
>>
>> In addition to the above, among the notable IO are :
>> - a moderately IO intensive PostgreSQL replication subscriber,
>> - occasional large synchronisations using Syncthing,
>> - snapshot creations/removals occur approximately every 80 minutes. The
>> last snapshot operation was logged 31 minutes before the errors (removals
>> occur asynchronously but there was no unusual load at this time that
>> could point to a particularly long snapshot cleanup process).
>>
>> To sum up, there are many processes accessing the filesystem but
>> historically it saw far higher IO load during scrubs than what was
>> occurring here.
>>
>>
>> Reproducing this might be difficult: the whole scrub can take a week
>> depending on load. That said I can easily prepare a kernel and/or new
>> btrfs-progs binaries to launch scrubs or other non-destructive tasks the
>> week-end or at the end of the day (UTC+2 timezone).
>>
>> Best regards,
>>
>> Lionel Bouton
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-06 23:21   ` Lionel Bouton
@ 2024-06-06 23:30     ` Qu Wenruo
  2024-06-06 23:46       ` Lionel Bouton
  2024-06-18 21:45       ` Lionel Bouton
  0 siblings, 2 replies; 17+ messages in thread
From: Qu Wenruo @ 2024-06-06 23:30 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs



On 2024/6/7 08:51, Lionel Bouton wrote:
> Hi,
>
> On 07/06/2024 at 01:05, Qu Wenruo wrote:
>>
>>
>> On 2024/6/4 23:42, Lionel Bouton wrote:
>>> Hi,
>>>
>>> It seems one of our systems hit a (harmless?) bug while scrubbing.
>>>
>>> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
>>> ebuild).
>>> btrfs-progs: 6.8.1
>>> The system is a VM using QEMU/KVM.
>>> The affected filesystem is directly on top of a single virtio-scsi block
>>> device (no partition, no LVM, ...) with a Ceph RBD backend.
>>> Profiles for metadata: dup, data: single.
>>>
>>> The scrub process is ongoing but it logged uncorrectable errors :
>>>
>>> # btrfs scrub status /mnt/store
>>> UUID:             <our_uuid>
>>> Scrub started:    Mon Jun  3 09:35:28 2024
>>> Status:           running
>>> Duration:         26:12:07
>>> Time left:        114:45:38
>>> ETA:              Sun Jun  9 06:33:13 2024
>>> Total to scrub:   18.84TiB
>>> Bytes scrubbed:   3.50TiB  (18.59%)
>>> Rate:             38.92MiB/s
>>> Error summary:    csum=61
>>>    Corrected:      0
>>>    Uncorrectable:  61
>>>    Unverified:     0
>>>
>>> According to the logs all errors are linked to a single file (renamed to
>>> <file> in the attached log) and happened within a couple of seconds.
>>> Most errors are duplicates as there are many snapshots of the subvolume
>>> it is in. This system is mainly used as a backup server, storing copies
>>> of data for other servers and creating snapshots of them.
>>>
>>> I was about to overwrite the file to correct the problem by fetching it
>>> from its source but I didn't get an error trying to read it (cat <file>
>>>  > /dev/null). I used diff and md5sum to double-check: the file is
>>> fine.
>>>
>>> Result of stat <file> on the subvolume used as base for the snapshots:
>>>    File: <file>
>>>    Size: 1799195         Blocks: 3520       IO Block: 4096 regular file
>>> Device: 0,36    Inode: 6676106     Links: 1
>>> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
>>> Access: 2018-02-15 05:22:38.506204046 +0100
>>> Modify: 2018-02-15 05:21:35.612914799 +0100
>>> Change: 2018-02-15 05:22:38.506204046 +0100
>>>   Birth: 2018-02-15 05:22:38.486203934 +0100
>>>
>>> AFAIK the kernel logged a warning for nearly each snapshot of the same
>>> subvolume (they were all created after the file), the subvolume from
>>> which snapshots are created is root 257 in the logs. What seems
>>> interesting to me and might help diagnose this is the pattern for
>>> offsets in the warnings.
>>> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
>>> between them).
>>> - Most snapshots seem to report a warning for each of them.
>>
>> This is how btrfs reports a corrupted data sector, it would go through
>> all inodes (including ones in different subvolumes) and report the error.
>>
>> So it really means if you want to delete the file, you need to remove
>> all the inodes from all different snapshots.
>>
>> So if you got a correct copy matching the csum, you can do a more
>> dangerous but more straight forward way to fix, by directly overwriting
>> the contents on disk.
>
> I briefly considered doing just that... but then I found out that the
> scrub errors were themselves in error and the on disk data was matching
> the checksums. When I tried to read the file not only didn't the
> filesystem report an IO error (if I'm not mistaken it should if the csum
> doesn't match) but the file content matched the original file fetched
> from its source.

Got it, this is really weird now.

What scrub does is read the data from disk (without going through the
page cache) and verify it against the checksums.

Would it be possible to run "btrfs check --check-data-csum" on the
unmounted/RO-mounted fs?

That would output an error for each corrupted sector (without
ratelimiting), so that you can record them all.
And then try logical-resolve to find each corrupted location?

If btrfs check reports no error, it's 100% sure scrub is to blame.

If btrfs check reports errors and logical-resolve fails to locate the
file and its position, it means the corruption is in bookend extents.

If btrfs check reports errors and logical-resolve can locate the file
and position, it's a different problem then.
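
A sketch of the two steps, using the device and mount point from your
report and the logical address of the first scrub warning:

# btrfs check --check-data-csum /dev/sdb
# btrfs inspect-internal logical-resolve 3897985400832 /mnt/store

The check runs against the unmounted device; logical-resolve needs the
filesystem mounted again to print the paths referencing that logical
address.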

Thanks,
Qu

>
> The stats output show that this file was not modified since its creation
> more than 6 years ago. This is what led me to report a bug in scrub.
>
>
>>
>>> - Some including the original subvolume have only logged 2 warnings and
>>> one (root 151335) logged only one warning.
>>> - All of the snapshots reported a warning for offset 20480.
>>> - When several offsets are logged their order seems random.
>>
>> One problem of scrub is, it may ratelimit the output, thus we can not
>> just rely on dmesg to know the damage.
>
> I wondered about this: I've read other threads where ratelimiting is
> mentioned but I was not sure if it could apply here. Thanks for clarifying.
>
>>
>>>
>>> I'm attaching kernel logs beginning with the scrub start and ending with
>>> the last error. As of now there are no new errors/warnings even though
>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>> clean the log from balance logs linked to the same filesystem.
>>>
>>> Side-note for the curious or in the unlikely case it could be linked to
>>> the problem:
>>> Historically this filesystem was mounted with ssd_spread without any
>>> issue (I guess several years ago it made sense to me reading the
>>> documentation and given the IO latencies I saw on the Ceph cluster). A
>>> recent kernel filled the whole available space with nearly empty block
>>> groups which seemed to re-appear each time we mounted with ssd_spread.
>>> We switched to "ssd" since then and there is a mostly daily 90mn balance
>>> to reach back the previous state progressively (this is the reason behind
>>> the balance related logs). Having some space available for new block
>>> groups seems to be a good idea but additionally as we use Ceph RBD, we
>>> want it to be able to deallocate unused space: having many nearly empty
>>> block groups could waste resources (especially if the used space in
>>> these groups is in <4MB continuous chunks which is the default RBD
>>> object size).
>>>
>>>
>>> More information :
>>> The scrub is run monthly. This is the first time an error was ever
>>> reported. The previous scrub was run successfully at the beginning of
>>> May with a 6.6.13 kernel.
>>>
>>> There is a continuous defragmentation process running (it processes the
>>> whole filesystem slowly ignoring snapshots and defragments file by file
>>> based on a fragmentation estimation using filefrag -v). All
>>> defragmentations are logged and I can confirm the file for which this
>>> error was reported was not defragmented for at least a year (I checked
>>> because I wanted to rule out a possible race condition between
>>> defragmentation and scrub).
>>
>> I'm wondering if there is any direct IO conflicting with
>> buffered/defrag IO.
>>
>> It's known that conflicting buffered/direct IO can lead to contents
>> change halfway, and lead to btrfs csum mismatch.
>> So far that's the only thing leading to known btrfs csum mismatch I can
>> think of.
>
> But here it seems there isn't an actual mismatch as reading the file is
> possible and gives the data which was written in it 6 years ago. Tthis
> seems an error in scrub (or a neutrino flipping a bit somewhere during
> the scrub). The VM runs on a server with ECC RAM so it is unlikely to be
> a simple bitflip but when everything likely has been ruled out...
>
> Thanks for your feedback,
> Lionel
>
>>
>> But 6.x kernel should all log a warning message for it.
>>
>> Thanks,
>> Qu
>>>
>>> In addition to the above, among the notable IO are :
>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>> - ponctual large synchronisations using Syncthing,
>>> - snapshot creations/removals occur approximately every 80 minutes. The
>>> last snapshot operation was logged 31 minutes before the errors (removal
>>> occur asynchronously but where was no unusual load at this time that
>>> could point to a particularly long snapshot cleanup process).
>>>
>>> To sum up, there are many processes accessing the filesystem but
>>> historically it saw far higher IO load during scrubs than what was
>>> occurring here.
>>>
>>>
>>> Reproducing this might be difficult: the whole scrub can take a week
>>> depending on load. That said I can easily prepare a kernel and/or new
>>> btrfs-progs binaries to launch scrubs or other non-destructive tasks the
>>> week-end or at the end of the day (UTC+2 timezone).
>>>
>>> Best regards,
>>>
>>> Lionel Bouton
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-06 23:30     ` Qu Wenruo
@ 2024-06-06 23:46       ` Lionel Bouton
  2024-06-08 16:15         ` Lionel Bouton
  2024-06-18 21:45       ` Lionel Bouton
  1 sibling, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-06 23:46 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi,

On 07/06/2024 at 01:30, Qu Wenruo wrote:
>
>
> 在 2024/6/7 08:51, Lionel Bouton 写道:
>> Hi,
>>
>> Le 07/06/2024 à 01:05, Qu Wenruo a écrit :
>>>
>>>
>>> 在 2024/6/4 23:42, Lionel Bouton 写道:
>>>> Hi,
>>>>
>>>> It seems one of our system hit an (harmless ?) bug while scrubbing.
>>>>
>>>> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
>>>> ebuild).
>>>> btrfs-progs: 6.8.1
>>>> The system is a VM using QEMU/KVM.
>>>> The affected filesystem is directly on top of a single virtio-scsi 
>>>> block
>>>> device (no partition, no LVM, ...) with a Ceph RBD backend.
>>>> Profiles for metadata: dup, data: single.
>>>>
>>>> The scrub process is ongoing but it logged uncorrectable errors :
>>>>
>>>> # btrfs scrub status /mnt/store
>>>> UUID:             <our_uuid>
>>>> Scrub started:    Mon Jun  3 09:35:28 2024
>>>> Status:           running
>>>> Duration:         26:12:07
>>>> Time left:        114:45:38
>>>> ETA:              Sun Jun  9 06:33:13 2024
>>>> Total to scrub:   18.84TiB
>>>> Bytes scrubbed:   3.50TiB  (18.59%)
>>>> Rate:             38.92MiB/s
>>>> Error summary:    csum=61
>>>>    Corrected:      0
>>>>    Uncorrectable:  61
>>>>    Unverified:     0
>>>>
>>>> According to the logs all errors are linked to a single file 
>>>> (renamed to
>>>> <file> in the attached log) and happened within a couple of seconds.
>>>> Most errors are duplicates as there are many snapshots of the 
>>>> subvolume
>>>> it is in. This system is mainly used as a backup server, storing 
>>>> copies
>>>> of data for other servers and creating snapshots of them.
>>>>
>>>> I was about to overwrite the file to correct the problem by 
>>>> fetching it
>>>> from its source but I didn't get an error trying to read it (cat 
>>>> <file>
>>>>  > /dev/null)). I used diff and md5sum to double check: the file is
>>>> fine.
>>>>
>>>> Result of stat <file> on the subvolume used as base for the snapshots:
>>>>    File: <file>
>>>>    Size: 1799195         Blocks: 3520       IO Block: 4096 regular 
>>>> file
>>>> Device: 0,36    Inode: 6676106     Links: 1
>>>> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
>>>> Access: 2018-02-15 05:22:38.506204046 +0100
>>>> Modify: 2018-02-15 05:21:35.612914799 +0100
>>>> Change: 2018-02-15 05:22:38.506204046 +0100
>>>>   Birth: 2018-02-15 05:22:38.486203934 +0100
>>>>
>>>> AFAIK the kernel logged a warning for nearly each snapshot of the same
>>>> subvolume (they were all created after the file), the subvolume from
>>>> which snapshots are created is root 257 in the logs. What seems
>>>> interesting to me and might help diagnose this is the pattern for
>>>> offsets in the warnings.
>>>> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
>>>> between them).
>>>> - Most snapshots seem to report a warning for each of them.
>>>
>>> This is how btrfs reports a corrupted data sector, it would go through
>>> all inodes (including ones in different subvolumes) and report the 
>>> error.
>>>
>>> So it really means if you want to delete the file, you need to remove
>>> all the inodes from all different snapshots.
>>>
>>> So if you got a correct copy matching the csum, you can do a more
>>> dangerous but more straight forward way to fix, by directly overwriting
>>> the contents on disk.
>>
>> I briefly considered doing just that... but then I found out that the
>> scrub errors were themselves in error and the on disk data was matching
>> the checksums. When I tried to read the file not only didn't the
>> filesystem report an IO error (if I'm not mistaken it should if the csum
>> doesn't match) but the file content matched the original file fetched
>> from its source.
>
> Got it, this is really weird now.
>
> What scrub doing is read the data from disk (without bothering page
> cache), and verify against checksums.
>
> Would it be possible to run "btrfs check --check-data-csum" on the
> unmounted/RO mounted fs?

Yes, with some caveats: the scrub takes approximately a week to
complete and I can't easily stop the services on this system for a week.
The block device is an RBD image on Ceph, so what I can do is take a
block-level snapshot, map this snapshot to another system and run btrfs
check --check-data-csum there. If the IO pattern is the same as for btrfs
scrub, this will probably take between 3 and 5 days to complete. I'll
have to run this on another VM with the same kernel and btrfs-progs
versions, as BTRFS doesn't like having 2 devices showing up with the same
internal identifiers...
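
For anyone following along who is not familiar with Ceph, the rough
sequence would be something like this (pool, image and snapshot names
below are placeholders for our actual ones):

# rbd snap create rbd-pool/backup-image@btrfs-check
(then, on the second VM)
# rbd map rbd-pool/backup-image@btrfs-check
# btrfs check --check-data-csum /dev/rbd0

Mapping a snapshot gives a read-only block device (it shows up as
/dev/rbd0 or similar), so the check cannot modify anything on it.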

>
> That would output the error for each corrupted sector (without
> ratelimit), so that you can record them all.
> And try to do logical-resolve to find each corrupted location?
>
> If btrfs check reports no error, it's 100% sure scrub is to blamce.
>
> If btrfs check reports error, and logical-resolve failed to locate the
> file and its position, it means the corruption is in bookend exntets.
>
> If btrfs check reports error and logical-resolve can locate the file and
> position, it's a different problem then.

OK, I understand. It is time for me to go to sleep, but I'll work on 
this tomorrow. I'll report as soon as --check-data-csum finds something, 
or at the end in several days if it doesn't.

Thanks,
Lionel

>
> Thanks,
> Qu
>
>>
>> The stats output show that this file was not modified since its creation
>> more than 6 years ago. This is what led me to report a bug in scrub.
>>
>>
>>>
>>>> - Some including the original subvolume have only logged 2 warnings 
>>>> and
>>>> one (root 151335) logged only one warning.
>>>> - All of the snapshots reported a warning for offset 20480.
>>>> - When several offsets are logged their order seems random.
>>>
>>> One problem of scrub is, it may ratelimit the output, thus we can not
>>> just rely on dmesg to know the damage.
>>
>> I wondered about this: I've read other threads where ratelimiting is
>> mentioned but I was not sure if it could apply here. Thanks for 
>> clarifying.
>>
>>>
>>>>
>>>> I'm attaching kernel logs beginning with the scrub start and ending 
>>>> with
>>>> the last error. As of now there are no new errors/warnings even though
>>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>>> clean the log frombalance logs linked to the same filesystem.
>>>>
>>>> Side-note for the curious or in the unlikely case it could be 
>>>> linked to
>>>> the problem:
>>>> Historically this filesystem was mounted with ssd_spread without any
>>>> issue (I guess several years ago it made sense to me reading the
>>>> documentation and given the IO latencies I saw on the Ceph cluster). A
>>>> recent kernel filled the whole available space with nearly empty block
>>>> groups which seemed to re-appear each time we mounted with ssd_spread.
>>>> We switched to "ssd" since then and there is a mostly daily 90mn 
>>>> balance
>>>> to reach back the previous stateprogressively (this is the reason 
>>>> behind
>>>> the balance related logs). Having some space available for new block
>>>> groups seems to be a good idea but additionally as we use Ceph RBD, we
>>>> want it to be able to deallocate unused space: having many nearly 
>>>> empty
>>>> block groups could waste resources (especially if the used space in
>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>> object size).
>>>>
>>>>
>>>> More information :
>>>> The scrub is run monthly. This is the first time an error was ever
>>>> reported. The previous scrub was run successfully at the beginning of
>>>> May with a 6.6.13 kernel.
>>>>
>>>> There is a continuous defragmentation process running (it processes 
>>>> the
>>>> whole filesystem slowly ignoring snapshots and defragments file by 
>>>> file
>>>> based on a fragmentation estimation using filefrag -v). All
>>>> defragmentations are logged and I can confirm the file for which this
>>>> error was reported was not defragmented for at least a year (I checked
>>>> because I wanted to rule out a possible race condition between
>>>> defragmentation and scrub).
>>>
>>> I'm wondering if there is any direct IO conflicting with
>>> buffered/defrag IO.
>>>
>>> It's known that conflicting buffered/direct IO can lead to contents
>>> change halfway, and lead to btrfs csum mismatch.
>>> So far that's the only thing leading to known btrfs csum mismatch I can
>>> think of.
>>
>> But here it seems there isn't an actual mismatch as reading the file is
>> possible and gives the data which was written in it 6 years ago. Tthis
>> seems an error in scrub (or a neutrino flipping a bit somewhere during
>> the scrub). The VM runs on a server with ECC RAM so it is unlikely to be
>> a simple bitflip but when everything likely has been ruled out...
>>
>> Thanks for your feedback,
>> Lionel
>>
>>>
>>> But 6.x kernel should all log a warning message for it.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>> In addition to the above, among the notable IO are :
>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>> - ponctual large synchronisations using Syncthing,
>>>> - snapshot creations/removals occur approximately every 80 minutes. 
>>>> The
>>>> last snapshot operation was logged 31 minutes before the errors 
>>>> (removal
>>>> occur asynchronously but where was no unusual load at this time that
>>>> could point to a particularly long snapshot cleanup process).
>>>>
>>>> To sum up, there are many processes accessing the filesystem but
>>>> historically it saw far higher IO load during scrubs than what was
>>>> occurring here.
>>>>
>>>>
>>>> Reproducing this might be difficult: the whole scrub can take a week
>>>> depending on load. That said I can easily prepare a kernel and/or new
>>>> btrfs-progs binaries to launch scrubs or other non-destructive 
>>>> tasks the
>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>
>>>> Best regards,
>>>>
>>>> Lionel Bouton
>>>
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-06 23:46       ` Lionel Bouton
@ 2024-06-08 16:15         ` Lionel Bouton
  2024-06-08 22:48           ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-08 16:15 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi,

To keep this short I've removed most of the past exchanges, as this is 
just to keep people following this thread informed of my progress.

On 07/06/2024 at 01:46, Lionel Bouton wrote:
>
> [...]
>>> I briefly considered doing just that... but then I found out that the
>>> scrub errors were themselves in error and the on disk data was matching
>>> the checksums. When I tried to read the file not only didn't the
>>> filesystem report an IO error (if I'm not mistaken it should if the 
>>> csum
>>> doesn't match) but the file content matched the original file fetched
>>> from its source.
>>
>> Got it, this is really weird now.
>>
>> What scrub doing is read the data from disk (without bothering page
>> cache), and verify against checksums.
>>
>> Would it be possible to run "btrfs check --check-data-csum" on the
>> unmounted/RO mounted fs?
>
> Yes with some caveats as the scrub takes approximately a week to 
> complete and I can't easily stop the services on this system for a week.
> The block device is RBD on Ceph, so what I can do is take a block 
> level snapshot, map this snapshot to another system and run btrfs 
> check --check-data-csum there. If the IO is the same than btrfs scrub 
> this will probably take between 3 to 5 days to complete. I'll have to 
> run this on another VM with the same kernel and btrfs-progs versions, 
> BTRFS doesn't like having 2 devices showing up with the same internal 
> identifiers...
>
>>
>> That would output the error for each corrupted sector (without
>> ratelimit), so that you can record them all.
>> And try to do logical-resolve to find each corrupted location?
>>
>> If btrfs check reports no error, it's 100% sure scrub is to blamce.
>>
>> If btrfs check reports error, and logical-resolve failed to locate the
>> file and its position, it means the corruption is in bookend exntets.
>>
>> If btrfs check reports error and logical-resolve can locate the file and
>> position, it's a different problem then.
>
> OK. I understand. This is time for me to go to sleep, but I'll work on 
> this tomorrow. I'll report as soon as check-data-sum finds something 
> or at the end in several days if it didn't.

There is a bit of a slowdown. btrfs check was killed a couple of hours 
ago (after running for more than a day) by the OOM killer. I anticipated 
that it would need large amounts of memory (see below for the number of 
files/dirs/subvolumes) and started it on a VM with 32GB, but that wasn't 
enough. It stopped after printing: "[4/7] checking fs roots".

I restarted btrfs check --check-data-csum after giving 64GB of RAM to 
the VM, hoping this will be enough.
If it doesn't work and the OOM killer is still triggered, I'll have to 
move other VMs around; the realistic maximum I can give the VM used 
for running the btrfs check is ~200GB.

If someone familiar with btrfs check can estimate how much RAM is 
needed, here is some information that might be relevant:
- according to the latest estimations there should be a total of around 
50M files and 2.5M directories in the 3 main subvolumes on this filesystem.
- for each of these 3 subvolumes there should be approximately 30 snapshots.

Here is the filesystem usage output:
Overall:
     Device size:                  40.00TiB
     Device allocated:             31.62TiB
     Device unallocated:            8.38TiB
     Device missing:                  0.00B
     Device slack:                 12.00TiB
     Used:                         18.72TiB
     Free (estimated):             20.32TiB      (min: 16.13TiB)
     Free (statfs, df):            20.32TiB
     Data ratio:                       1.00
     Metadata ratio:                   2.00
     Global reserve:              512.00MiB      (used: 0.00B)
     Multiple profiles:                  no

Data,single: Size:30.51TiB, Used:18.57TiB (60.87%)
    /dev/sdb       30.51TiB

Metadata,DUP: Size:565.50GiB, Used:75.40GiB (13.33%)
    /dev/sdb        1.10TiB

System,DUP: Size:8.00MiB, Used:3.53MiB (44.14%)
    /dev/sdb       16.00MiB

Unallocated:
    /dev/sdb        8.38TiB


Best regards,
Lionel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-08 16:15         ` Lionel Bouton
@ 2024-06-08 22:48           ` Qu Wenruo
  2024-06-09  0:16             ` Lionel Bouton
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2024-06-08 22:48 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs



On 2024/6/9 01:45, Lionel Bouton wrote:
> Hi,
>
> To keep this short I've removed most of past exchanges as this is just
> to keep people following this thread informed on my progress.
>
> Le 07/06/2024 à 01:46, Lionel Bouton a écrit :
>>
>> [...]
>>>> I briefly considered doing just that... but then I found out that the
>>>> scrub errors were themselves in error and the on disk data was matching
>>>> the checksums. When I tried to read the file not only didn't the
>>>> filesystem report an IO error (if I'm not mistaken it should if the
>>>> csum
>>>> doesn't match) but the file content matched the original file fetched
>>>> from its source.
>>>
>>> Got it, this is really weird now.
>>>
>>> What scrub doing is read the data from disk (without bothering page
>>> cache), and verify against checksums.
>>>
>>> Would it be possible to run "btrfs check --check-data-csum" on the
>>> unmounted/RO mounted fs?
>>
>> Yes with some caveats as the scrub takes approximately a week to
>> complete and I can't easily stop the services on this system for a week.
>> The block device is RBD on Ceph, so what I can do is take a block
>> level snapshot, map this snapshot to another system and run btrfs
>> check --check-data-csum there. If the IO is the same than btrfs scrub
>> this will probably take between 3 to 5 days to complete. I'll have to
>> run this on another VM with the same kernel and btrfs-progs versions,
>> BTRFS doesn't like having 2 devices showing up with the same internal
>> identifiers...
>>
>>>
>>> That would output the error for each corrupted sector (without
>>> ratelimit), so that you can record them all.
>>> And try to do logical-resolve to find each corrupted location?
>>>
>>> If btrfs check reports no error, it's 100% sure scrub is to blamce.
>>>
>>> If btrfs check reports error, and logical-resolve failed to locate the
>>> file and its position, it means the corruption is in bookend exntets.
>>>
>>> If btrfs check reports error and logical-resolve can locate the file and
>>> position, it's a different problem then.
>>
>> OK. I understand. This is time for me to go to sleep, but I'll work on
>> this tomorrow. I'll report as soon as check-data-sum finds something
>> or at the end in several days if it didn't.
>
> There is a bit of a slowdown. btrfs check was killed a couple hours ago
> (after running more than a day) by the OOM killer. I anticipated that it
> would need large amounts of memory (see below for the number of
> files/dirs/subvolumes) and started it on a VM with 32GB but it wasn't
> enough. It stopped after printing: "[4/7] checking fs roots".
>
> I restarted btrfs check --check-data-csum after giving 64GB of RAM to
> the VM hoping this will be enough.
> If it doesn't work and the oom-killer still is triggered I'll have to
> move other VMs around and the realistic maximum I can give the VM used
> for runing the btrfs check is ~200GB.

That's exactly why we have --mode=lowmem.

But please be aware that the low memory usage is traded off against
doing a lot more IO.
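
The invocation would be along these lines (assuming the btrfs-progs
build accepts combining the two options):

# btrfs check --mode=lowmem --check-data-csum /dev/sdX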

Thanks,
Qu
>
> If someone familiar with btrfs check can estimate how much RAM is
> needed, here is some information that might be relevant:
> - according to the latest estimations there should be a total of around
> 50M files and 2.5M directories in the 3 main subvolumes on this filesystem.
> - for each of these 3 subvolumes there should be approximately 30
> snapshots.
>
> Here is the filesystem usage output:
> Overall:
>      Device size:                  40.00TiB
>      Device allocated:             31.62TiB
>      Device unallocated:            8.38TiB
>      Device missing:                  0.00B
>      Device slack:                 12.00TiB
>      Used:                         18.72TiB
>      Free (estimated):             20.32TiB      (min: 16.13TiB)
>      Free (statfs, df):            20.32TiB
>      Data ratio:                       1.00
>      Metadata ratio:                   2.00
>      Global reserve:              512.00MiB      (used: 0.00B)
>      Multiple profiles:                  no
>
> Data,single: Size:30.51TiB, Used:18.57TiB (60.87%)
>     /dev/sdb       30.51TiB
>
> Metadata,DUP: Size:565.50GiB, Used:75.40GiB (13.33%)
>     /dev/sdb        1.10TiB
>
> System,DUP: Size:8.00MiB, Used:3.53MiB (44.14%)
>     /dev/sdb       16.00MiB
>
> Unallocated:
>     /dev/sdb        8.38TiB
>
>
> Best regards,
> Lionel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-08 22:48           ` Qu Wenruo
@ 2024-06-09  0:16             ` Lionel Bouton
  2024-06-10 12:52               ` Lionel Bouton
  0 siblings, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-09  0:16 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 09/06/2024 at 00:48, Qu Wenruo wrote:
>
>
> 在 2024/6/9 01:45, Lionel Bouton 写道:
>> Hi,
>>
>> To keep this short I've removed most of past exchanges as this is just
>> to keep people following this thread informed on my progress.
>>
>> Le 07/06/2024 à 01:46, Lionel Bouton a écrit :
>>>
>>> [...]
>>>>> I briefly considered doing just that... but then I found out that the
>>>>> scrub errors were themselves in error and the on disk data was 
>>>>> matching
>>>>> the checksums. When I tried to read the file not only didn't the
>>>>> filesystem report an IO error (if I'm not mistaken it should if the
>>>>> csum
>>>>> doesn't match) but the file content matched the original file fetched
>>>>> from its source.
>>>>
>>>> Got it, this is really weird now.
>>>>
>>>> What scrub doing is read the data from disk (without bothering page
>>>> cache), and verify against checksums.
>>>>
>>>> Would it be possible to run "btrfs check --check-data-csum" on the
>>>> unmounted/RO mounted fs?
>>>
>>> Yes with some caveats as the scrub takes approximately a week to
>>> complete and I can't easily stop the services on this system for a 
>>> week.
>>> The block device is RBD on Ceph, so what I can do is take a block
>>> level snapshot, map this snapshot to another system and run btrfs
>>> check --check-data-csum there. If the IO is the same than btrfs scrub
>>> this will probably take between 3 to 5 days to complete. I'll have to
>>> run this on another VM with the same kernel and btrfs-progs versions,
>>> BTRFS doesn't like having 2 devices showing up with the same internal
>>> identifiers...
>>>
>>>>
>>>> That would output the error for each corrupted sector (without
>>>> ratelimit), so that you can record them all.
>>>> And try to do logical-resolve to find each corrupted location?
>>>>
>>>> If btrfs check reports no error, it's 100% sure scrub is to blamce.
>>>>
>>>> If btrfs check reports error, and logical-resolve failed to locate the
>>>> file and its position, it means the corruption is in bookend exntets.
>>>>
>>>> If btrfs check reports error and logical-resolve can locate the 
>>>> file and
>>>> position, it's a different problem then.
>>>
>>> OK. I understand. This is time for me to go to sleep, but I'll work on
>>> this tomorrow. I'll report as soon as check-data-sum finds something
>>> or at the end in several days if it didn't.
>>
>> There is a bit of a slowdown. btrfs check was killed a couple hours ago
>> (after running more than a day) by the OOM killer. I anticipated that it
>> would need large amounts of memory (see below for the number of
>> files/dirs/subvolumes) and started it on a VM with 32GB but it wasn't
>> enough. It stopped after printing: "[4/7] checking fs roots".
>>
>> I restarted btrfs check --check-data-csum after giving 64GB of RAM to
>> the VM hoping this will be enough.
>> If it doesn't work and the oom-killer still is triggered I'll have to
>> move other VMs around and the realistic maximum I can give the VM used
>> for runing the btrfs check is ~200GB.
>
> That's why we have --mode=lowmem exactly for that reason.
>
> But please be aware that, the low memory usage is traded off by doing a
> lot of more IO.
>

After the OOM I remembered reading discussions about the lowmem mode, but 
even then I wasn't sure how to tell at which point it would be needed: 
on one hand 32GB seems like a lot for a <20TB filesystem, but this 
filesystem has both many files and many snapshots.

The metadata uses 75.4GiB, and I assume that includes the 2 copies of 
the dup profile. If most of the RAM used by check is a copy of the 
metadata, 64GB should be enough this time.

The process is using 35.4G at the "checking fs roots" step and has 
been for several hours now.

Thanks for your patience,
Lionel

> Thanks,
> Qu
>>
>> If someone familiar with btrfs check can estimate how much RAM is
>> needed, here is some information that might be relevant:
>> - according to the latest estimations there should be a total of around
>> 50M files and 2.5M directories in the 3 main subvolumes on this 
>> filesystem.
>> - for each of these 3 subvolumes there should be approximately 30
>> snapshots.
>>
>> Here is the filesystem usage output:
>> Overall:
>>      Device size:                  40.00TiB
>>      Device allocated:             31.62TiB
>>      Device unallocated:            8.38TiB
>>      Device missing:                  0.00B
>>      Device slack:                 12.00TiB
>>      Used:                         18.72TiB
>>      Free (estimated):             20.32TiB      (min: 16.13TiB)
>>      Free (statfs, df):            20.32TiB
>>      Data ratio:                       1.00
>>      Metadata ratio:                   2.00
>>      Global reserve:              512.00MiB      (used: 0.00B)
>>      Multiple profiles:                  no
>>
>> Data,single: Size:30.51TiB, Used:18.57TiB (60.87%)
>>     /dev/sdb       30.51TiB
>>
>> Metadata,DUP: Size:565.50GiB, Used:75.40GiB (13.33%)
>>     /dev/sdb        1.10TiB
>>
>> System,DUP: Size:8.00MiB, Used:3.53MiB (44.14%)
>>     /dev/sdb       16.00MiB
>>
>> Unallocated:
>>     /dev/sdb        8.38TiB
>>
>>
>> Best regards,
>> Lionel
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-09  0:16             ` Lionel Bouton
@ 2024-06-10 12:52               ` Lionel Bouton
  0 siblings, 0 replies; 17+ messages in thread
From: Lionel Bouton @ 2024-06-10 12:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi,

Here is a quick progress update, along with information about the total 
process size that could be useful to others. See below.

On 09/06/2024 at 02:16, Lionel Bouton wrote:
> Le 09/06/2024 à 00:48, Qu Wenruo a écrit :
>>
>>
>> 在 2024/6/9 01:45, Lionel Bouton 写道:
>>> Hi,
>>>
>>> To keep this short I've removed most of past exchanges as this is just
>>> to keep people following this thread informed on my progress.
>>>
>>> Le 07/06/2024 à 01:46, Lionel Bouton a écrit :
>>>>
>>>> [...]
>>>>>> I briefly considered doing just that... but then I found out that 
>>>>>> the
>>>>>> scrub errors were themselves in error and the on disk data was 
>>>>>> matching
>>>>>> the checksums. When I tried to read the file not only didn't the
>>>>>> filesystem report an IO error (if I'm not mistaken it should if the
>>>>>> csum
>>>>>> doesn't match) but the file content matched the original file 
>>>>>> fetched
>>>>>> from its source.
>>>>>
>>>>> Got it, this is really weird now.
>>>>>
>>>>> What scrub doing is read the data from disk (without bothering page
>>>>> cache), and verify against checksums.
>>>>>
>>>>> Would it be possible to run "btrfs check --check-data-csum" on the
>>>>> unmounted/RO mounted fs?
>>>>
>>>> Yes with some caveats as the scrub takes approximately a week to
>>>> complete and I can't easily stop the services on this system for a 
>>>> week.
>>>> The block device is RBD on Ceph, so what I can do is take a block
>>>> level snapshot, map this snapshot to another system and run btrfs
>>>> check --check-data-csum there. If the IO is the same than btrfs scrub
>>>> this will probably take between 3 to 5 days to complete. I'll have to
>>>> run this on another VM with the same kernel and btrfs-progs versions,
>>>> BTRFS doesn't like having 2 devices showing up with the same internal
>>>> identifiers...
>>>>
>>>>>
>>>>> That would output the error for each corrupted sector (without
>>>>> ratelimit), so that you can record them all.
>>>>> And try to do logical-resolve to find each corrupted location?
>>>>>
>>>>> If btrfs check reports no error, it's 100% sure scrub is to blamce.
>>>>>
>>>>> If btrfs check reports error, and logical-resolve failed to locate 
>>>>> the
>>>>> file and its position, it means the corruption is in bookend exntets.
>>>>>
>>>>> If btrfs check reports error and logical-resolve can locate the 
>>>>> file and
>>>>> position, it's a different problem then.
>>>>
>>>> OK. I understand. This is time for me to go to sleep, but I'll work on
>>>> this tomorrow. I'll report as soon as check-data-sum finds something
>>>> or at the end in several days if it didn't.
>>>
>>> There is a bit of a slowdown. btrfs check was killed a couple hours ago
>>> (after running more than a day) by the OOM killer. I anticipated 
>>> that it
>>> would need large amounts of memory (see below for the number of
>>> files/dirs/subvolumes) and started it on a VM with 32GB but it wasn't
>>> enough. It stopped after printing: "[4/7] checking fs roots".
>>>
>>> I restarted btrfs check --check-data-csum after giving 64GB of RAM to
>>> the VM hoping this will be enough.
>>> If it doesn't work and the oom-killer still is triggered I'll have to
>>> move other VMs around and the realistic maximum I can give the VM used
>>> for runing the btrfs check is ~200GB.
>>
>> That's why we have --mode=lowmem exactly for that reason.
>>
>> But please be aware that, the low memory usage is traded off by doing a
>> lot of more IO.
>>
>
> After the OOM I remembered reading discussions about the lowmem mode 
> but even then I wasn't sure how to know at which point it would be 
> needed : on one hand 32GB seems like a lot for a <20TB filesystem but 
> this filesystem has both many files and snapshots.
>
> The metadata uses 75.4GB and I assume it is for the 2 copies of the 
> dup profile. If most of the RAM used by check is a copy of the 
> metadata, 64GB should be enough this time.
>
> The process is using 35.4G at the "checking fs roots" step and it has 
> been so for several hours now.

The process began to grow again later during the "[4/7] checking fs 
roots" step. It reached 59.6G and it is now at the "[5/7] checking csums 
against data" step. While the process grew I hotplugged 16GB of RAM into 
the VM, for a total of 80GB, to give it some headroom.
I hope this can be of use to someone trying to guess whether lowmem mode 
will be needed in their case. See below for some additional details 
about this filesystem. This is btrfs-progs 6.8.1 on kernel 6.6.30.

The check now reads at approximately 35-40MB/s (currently limited by 
contention on old, slow HDDs in the Ceph cluster that are scheduled to 
be replaced at the end of the week). Given the size of the filesystem, 
we should get the result in 5 or 6 days.

Best regards,
Lionel

>
> Thanks for your patience,
> Lionel
>
>> Thanks,
>> Qu
>>>
>>> If someone familiar with btrfs check can estimate how much RAM is
>>> needed, here is some information that might be relevant:
>>> - according to the latest estimations there should be a total of around
>>> 50M files and 2.5M directories in the 3 main subvolumes on this 
>>> filesystem.
>>> - for each of these 3 subvolumes there should be approximately 30
>>> snapshots.
>>>
>>> Here is the filesystem usage output:
>>> Overall:
>>>      Device size:                  40.00TiB
>>>      Device allocated:             31.62TiB
>>>      Device unallocated:            8.38TiB
>>>      Device missing:                  0.00B
>>>      Device slack:                 12.00TiB
>>>      Used:                         18.72TiB
>>>      Free (estimated):             20.32TiB      (min: 16.13TiB)
>>>      Free (statfs, df):            20.32TiB
>>>      Data ratio:                       1.00
>>>      Metadata ratio:                   2.00
>>>      Global reserve:              512.00MiB      (used: 0.00B)
>>>      Multiple profiles:                  no
>>>
>>> Data,single: Size:30.51TiB, Used:18.57TiB (60.87%)
>>>     /dev/sdb       30.51TiB
>>>
>>> Metadata,DUP: Size:565.50GiB, Used:75.40GiB (13.33%)
>>>     /dev/sdb        1.10TiB
>>>
>>> System,DUP: Size:8.00MiB, Used:3.53MiB (44.14%)
>>>     /dev/sdb       16.00MiB
>>>
>>> Unallocated:
>>>     /dev/sdb        8.38TiB
>>>
>>>
>>> Best regards,
>>> Lionel
>>
>
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-06 23:30     ` Qu Wenruo
  2024-06-06 23:46       ` Lionel Bouton
@ 2024-06-18 21:45       ` Lionel Bouton
  2024-06-22  8:51         ` Lionel Bouton
  1 sibling, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-18 21:45 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 07/06/2024 at 01:30, Qu Wenruo wrote:
>
>
> 在 2024/6/7 08:51, Lionel Bouton 写道:
>> Hi,
>>
>> Le 07/06/2024 à 01:05, Qu Wenruo a écrit :
>>>
>>>
>>> 在 2024/6/4 23:42, Lionel Bouton 写道:
>>>> Hi,
>>>>
>>>> It seems one of our system hit an (harmless ?) bug while scrubbing.
>>>>
>>>> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
>>>> ebuild).
>>>> btrfs-progs: 6.8.1
>>>> The system is a VM using QEMU/KVM.
>>>> The affected filesystem is directly on top of a single virtio-scsi 
>>>> block
>>>> device (no partition, no LVM, ...) with a Ceph RBD backend.
>>>> Profiles for metadata: dup, data: single.
>>>>
>>>> The scrub process is ongoing but it logged uncorrectable errors :
>>>>
>>>> # btrfs scrub status /mnt/store
>>>> UUID:             <our_uuid>
>>>> Scrub started:    Mon Jun  3 09:35:28 2024
>>>> Status:           running
>>>> Duration:         26:12:07
>>>> Time left:        114:45:38
>>>> ETA:              Sun Jun  9 06:33:13 2024
>>>> Total to scrub:   18.84TiB
>>>> Bytes scrubbed:   3.50TiB  (18.59%)
>>>> Rate:             38.92MiB/s
>>>> Error summary:    csum=61
>>>>    Corrected:      0
>>>>    Uncorrectable:  61
>>>>    Unverified:     0
>>>>
>>>> According to the logs all errors are linked to a single file 
>>>> (renamed to
>>>> <file> in the attached log) and happened within a couple of seconds.
>>>> Most errors are duplicates as there are many snapshots of the 
>>>> subvolume
>>>> it is in. This system is mainly used as a backup server, storing 
>>>> copies
>>>> of data for other servers and creating snapshots of them.
>>>>
>>>> I was about to overwrite the file to correct the problem by 
>>>> fetching it
>>>> from its source but I didn't get an error trying to read it (cat 
>>>> <file>
>>>>  > /dev/null)). I used diff and md5sum to double check: the file is
>>>> fine.
>>>>
>>>> Result of stat <file> on the subvolume used as base for the snapshots:
>>>>    File: <file>
>>>>    Size: 1799195         Blocks: 3520       IO Block: 4096 regular 
>>>> file
>>>> Device: 0,36    Inode: 6676106     Links: 1
>>>> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
>>>> Access: 2018-02-15 05:22:38.506204046 +0100
>>>> Modify: 2018-02-15 05:21:35.612914799 +0100
>>>> Change: 2018-02-15 05:22:38.506204046 +0100
>>>>   Birth: 2018-02-15 05:22:38.486203934 +0100
>>>>
>>>> AFAIK the kernel logged a warning for nearly each snapshot of the same
>>>> subvolume (they were all created after the file), the subvolume from
>>>> which snapshots are created is root 257 in the logs. What seems
>>>> interesting to me and might help diagnose this is the pattern for
>>>> offsets in the warnings.
>>>> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
>>>> between them).
>>>> - Most snapshots seem to report a warning for each of them.
>>>
>>> This is how btrfs reports a corrupted data sector, it would go through
>>> all inodes (including ones in different subvolumes) and report the 
>>> error.
>>>
>>> So it really means if you want to delete the file, you need to remove
>>> all the inodes from all different snapshots.
>>>
>>> So if you got a correct copy matching the csum, you can do a more
>>> dangerous but more straight forward way to fix, by directly overwriting
>>> the contents on disk.
>>
>> I briefly considered doing just that... but then I found out that the
>> scrub errors were themselves in error and the on disk data was matching
>> the checksums. When I tried to read the file not only didn't the
>> filesystem report an IO error (if I'm not mistaken it should if the csum
>> doesn't match) but the file content matched the original file fetched
>> from its source.
>
> Got it, this is really weird now.
>
> What scrub doing is read the data from disk (without bothering page
> cache), and verify against checksums.
>
> Would it be possible to run "btrfs check --check-data-csum" on the
> unmounted/RO mounted fs?
>
> That would output the error for each corrupted sector (without
> ratelimit), so that you can record them all.
> And try to do logical-resolve to find each corrupted location?
>
> If btrfs check reports no error, it's 100% sure scrub is to blamce.

The results are in; btrfs check reported no errors:

# btrfs check --check-data-csum /dev/sdb
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking csums against data
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sdb
UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
found 20626845626377 bytes used, no error found
total csum bytes: 19548745528
total tree bytes: 83019579392
total fs tree bytes: 47937191936
total extent tree bytes: 8051245056
btree space waste bytes: 18589648011
file data blocks allocated: 22539270656000
  referenced 21661227479040

I'll mount the filesystem and run a scrub again to see if I can 
reproduce the problem. It should be noticeably quicker: we made updates 
to the Ceph cluster and should get approximately 2x the I/O bandwidth.
I plan to keep the disk snapshot for at least several weeks, so if you 
want to test something else just say so.
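
Nothing fancy planned here, just a plain foreground scrub of the mounted 
copy, along the lines of (the mount point on the test VM may differ):

# btrfs scrub start -B /mnt/store
# btrfs scrub status /mnt/store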

Best regards,
Lionel

>
> If btrfs check reports error, and logical-resolve failed to locate the
> file and its position, it means the corruption is in bookend exntets.
>
> If btrfs check reports error and logical-resolve can locate the file and
> position, it's a different problem then.
>
> Thanks,
> Qu
>
>>
>> The stats output show that this file was not modified since its creation
>> more than 6 years ago. This is what led me to report a bug in scrub.
>>
>>
>>>
>>>> - Some including the original subvolume have only logged 2 warnings 
>>>> and
>>>> one (root 151335) logged only one warning.
>>>> - All of the snapshots reported a warning for offset 20480.
>>>> - When several offsets are logged their order seems random.
>>>
>>> One problem of scrub is, it may ratelimit the output, thus we can not
>>> just rely on dmesg to know the damage.
>>
>> I wondered about this: I've read other threads where ratelimiting is
>> mentioned but I was not sure if it could apply here. Thanks for 
>> clarifying.
>>
>>>
>>>>
>>>> I'm attaching kernel logs beginning with the scrub start and ending 
>>>> with
>>>> the last error. As of now there are no new errors/warnings even though
>>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>>> clean the log frombalance logs linked to the same filesystem.
>>>>
>>>> Side-note for the curious or in the unlikely case it could be 
>>>> linked to
>>>> the problem:
>>>> Historically this filesystem was mounted with ssd_spread without any
>>>> issue (I guess several years ago it made sense to me reading the
>>>> documentation and given the IO latencies I saw on the Ceph cluster). A
>>>> recent kernel filled the whole available space with nearly empty block
>>>> groups which seemed to re-appear each time we mounted with ssd_spread.
>>>> We switched to "ssd" since then and there is a mostly daily 90mn 
>>>> balance
>>>> to reach back the previous stateprogressively (this is the reason 
>>>> behind
>>>> the balance related logs). Having some space available for new block
>>>> groups seems to be a good idea but additionally as we use Ceph RBD, we
>>>> want it to be able to deallocate unused space: having many nearly 
>>>> empty
>>>> block groups could waste resources (especially if the used space in
>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>> object size).
>>>>
>>>>
>>>> More information :
>>>> The scrub is run monthly. This is the first time an error was ever
>>>> reported. The previous scrub was run successfully at the beginning of
>>>> May with a 6.6.13 kernel.
>>>>
>>>> There is a continuous defragmentation process running (it processes 
>>>> the
>>>> whole filesystem slowly ignoring snapshots and defragments file by 
>>>> file
>>>> based on a fragmentation estimation using filefrag -v). All
>>>> defragmentations are logged and I can confirm the file for which this
>>>> error was reported was not defragmented for at least a year (I checked
>>>> because I wanted to rule out a possible race condition between
>>>> defragmentation and scrub).
>>>
>>> I'm wondering if there is any direct IO conflicting with
>>> buffered/defrag IO.
>>>
>>> It's known that conflicting buffered/direct IO can lead to contents
>>> change halfway, and lead to btrfs csum mismatch.
>>> So far that's the only thing leading to known btrfs csum mismatch I can
>>> think of.
>>
>> But here it seems there isn't an actual mismatch as reading the file is
>> possible and gives the data which was written in it 6 years ago. Tthis
>> seems an error in scrub (or a neutrino flipping a bit somewhere during
>> the scrub). The VM runs on a server with ECC RAM so it is unlikely to be
>> a simple bitflip but when everything likely has been ruled out...
>>
>> Thanks for your feedback,
>> Lionel
>>
>>>
>>> But 6.x kernel should all log a warning message for it.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>> In addition to the above, among the notable IO are :
>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>> - ponctual large synchronisations using Syncthing,
>>>> - snapshot creations/removals occur approximately every 80 minutes. 
>>>> The
>>>> last snapshot operation was logged 31 minutes before the errors 
>>>> (removal
>>>> occur asynchronously but where was no unusual load at this time that
>>>> could point to a particularly long snapshot cleanup process).
>>>>
>>>> To sum up, there are many processes accessing the filesystem but
>>>> historically it saw far higher IO load during scrubs than what was
>>>> occurring here.
>>>>
>>>>
>>>> Reproducing this might be difficult: the whole scrub can take a week
>>>> depending on load. That said I can easily prepare a kernel and/or new
>>>> btrfs-progs binaries to launch scrubs or other non-destructive 
>>>> tasks the
>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>
>>>> Best regards,
>>>>
>>>> Lionel Bouton
>>>
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-18 21:45       ` Lionel Bouton
@ 2024-06-22  8:51         ` Lionel Bouton
  2024-06-22  9:41           ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-22  8:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 18/06/2024 at 23:45, Lionel Bouton wrote:
> Le 07/06/2024 à 01:30, Qu Wenruo a écrit :
>>
>>
>> 在 2024/6/7 08:51, Lionel Bouton 写道:
>>> Hi,
>>>
>>> Le 07/06/2024 à 01:05, Qu Wenruo a écrit :
>>>>
>>>>
>>>> 在 2024/6/4 23:42, Lionel Bouton 写道:
>>>>> Hi,
>>>>>
>>>>> It seems one of our system hit an (harmless ?) bug while scrubbing.
>>>>>
>>>>> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30
>>>>> ebuild).
>>>>> btrfs-progs: 6.8.1
>>>>> The system is a VM using QEMU/KVM.
>>>>> The affected filesystem is directly on top of a single virtio-scsi 
>>>>> block
>>>>> device (no partition, no LVM, ...) with a Ceph RBD backend.
>>>>> Profiles for metadata: dup, data: single.
>>>>>
>>>>> The scrub process is ongoing but it logged uncorrectable errors :
>>>>>
>>>>> # btrfs scrub status /mnt/store
>>>>> UUID:             <our_uuid>
>>>>> Scrub started:    Mon Jun  3 09:35:28 2024
>>>>> Status:           running
>>>>> Duration:         26:12:07
>>>>> Time left:        114:45:38
>>>>> ETA:              Sun Jun  9 06:33:13 2024
>>>>> Total to scrub:   18.84TiB
>>>>> Bytes scrubbed:   3.50TiB  (18.59%)
>>>>> Rate:             38.92MiB/s
>>>>> Error summary:    csum=61
>>>>>    Corrected:      0
>>>>>    Uncorrectable:  61
>>>>>    Unverified:     0
>>>>>
>>>>> According to the logs all errors are linked to a single file 
>>>>> (renamed to
>>>>> <file> in the attached log) and happened within a couple of seconds.
>>>>> Most errors are duplicates as there are many snapshots of the 
>>>>> subvolume
>>>>> it is in. This system is mainly used as a backup server, storing 
>>>>> copies
>>>>> of data for other servers and creating snapshots of them.
>>>>>
>>>>> I was about to overwrite the file to correct the problem by 
>>>>> fetching it
>>>>> from its source but I didn't get an error trying to read it (cat 
>>>>> <file>
>>>>>  > /dev/null)). I used diff and md5sum to double check: the file is
>>>>> fine.
>>>>>
>>>>> Result of stat <file> on the subvolume used as base for the 
>>>>> snapshots:
>>>>>    File: <file>
>>>>>    Size: 1799195         Blocks: 3520       IO Block: 4096 regular 
>>>>> file
>>>>> Device: 0,36    Inode: 6676106     Links: 1
>>>>> Access: (0644/-rw-r--r--)  Uid: ( 1000/<uid>)   Gid: ( 1000/<gid>)
>>>>> Access: 2018-02-15 05:22:38.506204046 +0100
>>>>> Modify: 2018-02-15 05:21:35.612914799 +0100
>>>>> Change: 2018-02-15 05:22:38.506204046 +0100
>>>>>   Birth: 2018-02-15 05:22:38.486203934 +0100
>>>>>
>>>>> AFAIK the kernel logged a warning for nearly each snapshot of the 
>>>>> same
>>>>> subvolume (they were all created after the file), the subvolume from
>>>>> which snapshots are created is root 257 in the logs. What seems
>>>>> interesting to me and might help diagnose this is the pattern for
>>>>> offsets in the warnings.
>>>>> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
>>>>> between them).
>>>>> - Most snapshots seem to report a warning for each of them.
>>>>
>>>> This is how btrfs reports a corrupted data sector, it would go through
>>>> all inodes (including ones in different subvolumes) and report the 
>>>> error.
>>>>
>>>> So it really means if you want to delete the file, you need to remove
>>>> all the inodes from all different snapshots.
>>>>
>>>> So if you got a correct copy matching the csum, you can do a more
>>>> dangerous but more straight forward way to fix, by directly 
>>>> overwriting
>>>> the contents on disk.
>>>
>>> I briefly considered doing just that... but then I found out that the
>>> scrub errors were themselves in error and the on disk data was matching
>>> the checksums. When I tried to read the file not only didn't the
>>> filesystem report an IO error (if I'm not mistaken it should if the 
>>> csum
>>> doesn't match) but the file content matched the original file fetched
>>> from its source.
>>
>> Got it, this is really weird now.
>>
>> What scrub doing is read the data from disk (without bothering page
>> cache), and verify against checksums.
>>
>> Would it be possible to run "btrfs check --check-data-csum" on the
>> unmounted/RO mounted fs?
>>
>> That would output the error for each corrupted sector (without
>> ratelimit), so that you can record them all.
>> And try to do logical-resolve to find each corrupted location?
>>
>> If btrfs check reports no error, it's 100% sure scrub is to blamce.
>
> The results are in, no error reported by btrfs check:
>
> # btrfs check --check-data-csum /dev/sdb
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space tree
> [4/7] checking fs roots
> [5/7] checking csums against data
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> Opening filesystem to check...
> Checking filesystem on /dev/sdb
> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
> found 20626845626377 bytes used, no error found
> total csum bytes: 19548745528
> total tree bytes: 83019579392
> total fs tree bytes: 47937191936
> total extent tree bytes: 8051245056
> btree space waste bytes: 18589648011
> file data blocks allocated: 22539270656000
>  referenced 21661227479040
>
> I'll mount the filesystem and run a scrub again to see if I can 
> reproduce the problem. It should be noticeably quicker, we made 
> updates to the Ceph cluster and should get approximately 2x the I/O 
> bandwidth.
> I plan to keep the disk snapshot for at least several weeks so if you 
> want to test something else just say so.


The scrub is finished; here are the results:

UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
Scrub started:    Wed Jun 19 00:01:59 2024
Status:           finished
Duration:         81:04:21
Total to scrub:   18.83TiB
Rate:             67.67MiB/s
Error summary:    no errors found

So the scrub error isn't deterministic. I'll shut down the test VM for 
now and keep the disk snapshot it uses for at least a couple of weeks in 
case it is needed for further tests.
The original filesystem is scrubbed monthly; I'll reply to this message 
if another error shows up.

>
> Best regards,
> Lionel
>
>>
>> If btrfs check reports error, and logical-resolve failed to locate the
>> file and its position, it means the corruption is in bookend exntets.
>>
>> If btrfs check reports error and logical-resolve can locate the file and
>> position, it's a different problem then.
>>
>> Thanks,
>> Qu
>>
>>>
>>> The stats output show that this file was not modified since its 
>>> creation
>>> more than 6 years ago. This is what led me to report a bug in scrub.
>>>
>>>
>>>>
>>>>> - Some including the original subvolume have only logged 2 
>>>>> warnings and
>>>>> one (root 151335) logged only one warning.
>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>> - When several offsets are logged their order seems random.
>>>>
>>>> One problem of scrub is, it may ratelimit the output, thus we can not
>>>> just rely on dmesg to know the damage.
>>>
>>> I wondered about this: I've read other threads where ratelimiting is
>>> mentioned but I was not sure if it could apply here. Thanks for 
>>> clarifying.
>>>
>>>>
>>>>>
>>>>> I'm attaching kernel logs beginning with the scrub start and 
>>>>> ending with
>>>>> the last error. As of now there are no new errors/warnings even 
>>>>> though
>>>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>>>> clean the log frombalance logs linked to the same filesystem.
>>>>>
>>>>> Side-note for the curious or in the unlikely case it could be 
>>>>> linked to
>>>>> the problem:
>>>>> Historically this filesystem was mounted with ssd_spread without any
>>>>> issue (I guess several years ago it made sense to me reading the
>>>>> documentation and given the IO latencies I saw on the Ceph 
>>>>> cluster). A
>>>>> recent kernel filled the whole available space with nearly empty 
>>>>> block
>>>>> groups which seemed to re-appear each time we mounted with 
>>>>> ssd_spread.
>>>>> We switched to "ssd" since then and there is a mostly daily 90mn 
>>>>> balance
>>>>> to reach back the previous stateprogressively (this is the reason 
>>>>> behind
>>>>> the balance related logs). Having some space available for new block
>>>>> groups seems to be a good idea but additionally as we use Ceph 
>>>>> RBD, we
>>>>> want it to be able to deallocate unused space: having many nearly 
>>>>> empty
>>>>> block groups could waste resources (especially if the used space in
>>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>>> object size).
>>>>>
>>>>>
>>>>> More information :
>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>> reported. The previous scrub was run successfully at the beginning of
>>>>> May with a 6.6.13 kernel.
>>>>>
>>>>> There is a continuous defragmentation process running (it 
>>>>> processes the
>>>>> whole filesystem slowly ignoring snapshots and defragments file by 
>>>>> file
>>>>> based on a fragmentation estimation using filefrag -v). All
>>>>> defragmentations are logged and I can confirm the file for which this
>>>>> error was reported was not defragmented for at least a year (I 
>>>>> checked
>>>>> because I wanted to rule out a possible race condition between
>>>>> defragmentation and scrub).
>>>>
>>>> I'm wondering if there is any direct IO conflicting with
>>>> buffered/defrag IO.
>>>>
>>>> It's known that conflicting buffered/direct IO can lead to contents
>>>> change halfway, and lead to btrfs csum mismatch.
>>>> So far that's the only thing leading to known btrfs csum mismatch I 
>>>> can
>>>> think of.
>>>
>>> But here it seems there isn't an actual mismatch as reading the file is
>>> possible and gives the data which was written in it 6 years ago. Tthis
>>> seems an error in scrub (or a neutrino flipping a bit somewhere during
>>> the scrub). The VM runs on a server with ECC RAM so it is unlikely 
>>> to be
>>> a simple bitflip but when everything likely has been ruled out...
>>>
>>> Thanks for your feedback,
>>> Lionel
>>>
>>>>
>>>> But 6.x kernel should all log a warning message for it.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> In addition to the above, among the notable IO are :
>>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>>> - ponctual large synchronisations using Syncthing,
>>>>> - snapshot creations/removals occur approximately every 80 
>>>>> minutes. The
>>>>> last snapshot operation was logged 31 minutes before the errors 
>>>>> (removal
>>>>> occur asynchronously but where was no unusual load at this time that
>>>>> could point to a particularly long snapshot cleanup process).
>>>>>
>>>>> To sum up, there are many processes accessing the filesystem but
>>>>> historically it saw far higher IO load during scrubs than what was
>>>>> occurring here.
>>>>>
>>>>>
>>>>> Reproducing this might be difficult: the whole scrub can take a week
>>>>> depending on load. That said I can easily prepare a kernel and/or new
>>>>> btrfs-progs binaries to launch scrubs or other non-destructive 
>>>>> tasks the
>>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Lionel Bouton
>>>>
>>>
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-22  8:51         ` Lionel Bouton
@ 2024-06-22  9:41           ` Qu Wenruo
  2024-06-30 10:59             ` Lionel Bouton
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2024-06-22  9:41 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs



On 2024/6/22 18:21, Lionel Bouton wrote:
[...]
>>
>> I'll mount the filesystem and run a scrub again to see if I can
>> reproduce the problem. It should be noticeably quicker, we made
>> updates to the Ceph cluster and should get approximately 2x the I/O
>> bandwidth.
>> I plan to keep the disk snapshot for at least several weeks so if you
>> want to test something else just say so.
>
>
> The scrub is finished, here are the results :
>
> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
> Scrub started:    Wed Jun 19 00:01:59 2024
> Status:           finished
> Duration:         81:04:21
> Total to scrub:   18.83TiB
> Rate:             67.67MiB/s
> Error summary:    no errors found
>
> So the scrub error isn't deterministic. I'll shut down the test VM for
> now and keep the disk snapshot it uses for at least a couple of week if
> it is needed for further tests.
> The original filesystem is scrubbed monthly, I'll reply to this message
> if another error shows up.

I just remembered that there was a bug related to scrub that could
report false alerts:

f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
not 64K aligned")

But that fix should have been automatically backported, and in that case
there should be some "unable to find chunk map" error messages in the
kernel log.
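
For example, something along these lines should be enough to spot them (a
rough sketch; the journalctl variant assumes systemd and a persistent journal):

  # kernel messages of the current boot
  dmesg | grep "unable to find chunk map"
  # kernel messages of the previous boot
  journalctl -k -b -1 | grep "unable to find chunk map"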

Otherwise, I have no extra clues.

Have you tried kernels like v6.8/6.9 and can you reproduce the bug in
those newer kernels?

Thanks,
Qu
>
>>
>> Best regards,
>> Lionel
>>
>>>
>>> If btrfs check reports error, and logical-resolve failed to locate the
>>> file and its position, it means the corruption is in bookend extents.
>>>
>>> If btrfs check reports error and logical-resolve can locate the file and
>>> position, it's a different problem then.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> The stat output shows that this file was not modified since its
>>>> creation
>>>> more than 6 years ago. This is what led me to report a bug in scrub.
>>>>
>>>>
>>>>>
>>>>>> - Some including the original subvolume have only logged 2
>>>>>> warnings and
>>>>>> one (root 151335) logged only one warning.
>>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>>> - When several offsets are logged their order seems random.
>>>>>
>>>>> One problem of scrub is, it may ratelimit the output, thus we can not
>>>>> just rely on dmesg to know the damage.
>>>>
>>>> I wondered about this: I've read other threads where ratelimiting is
>>>> mentioned but I was not sure if it could apply here. Thanks for
>>>> clarifying.
>>>>
>>>>>
>>>>>>
>>>>>> I'm attaching kernel logs beginning with the scrub start and
>>>>>> ending with
>>>>>> the last error. As of now there are no new errors/warnings even
>>>>>> though
>>>>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>>>>> clean the log from balance logs linked to the same filesystem.
>>>>>>
>>>>>> Side-note for the curious or in the unlikely case it could be
>>>>>> linked to
>>>>>> the problem:
>>>>>> Historically this filesystem was mounted with ssd_spread without any
>>>>>> issue (I guess several years ago it made sense to me reading the
>>>>>> documentation and given the IO latencies I saw on the Ceph
>>>>>> cluster). A
>>>>>> recent kernel filled the whole available space with nearly empty
>>>>>> block
>>>>>> groups which seemed to re-appear each time we mounted with
>>>>>> ssd_spread.
>>>>>> We switched to "ssd" since then and there is a mostly daily 90mn
>>>>>> balance
>>>>>> to reach back the previous state progressively (this is the reason
>>>>>> behind
>>>>>> the balance related logs). Having some space available for new block
>>>>>> groups seems to be a good idea but additionally as we use Ceph
>>>>>> RBD, we
>>>>>> want it to be able to deallocate unused space: having many nearly
>>>>>> empty
>>>>>> block groups could waste resources (especially if the used space in
>>>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>>>> object size).
>>>>>>
>>>>>>
>>>>>> More information :
>>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>>> reported. The previous scrub was run successfully at the beginning of
>>>>>> May with a 6.6.13 kernel.
>>>>>>
>>>>>> There is a continuous defragmentation process running (it
>>>>>> processes the
>>>>>> whole filesystem slowly ignoring snapshots and defragments file by
>>>>>> file
>>>>>> based on a fragmentation estimation using filefrag -v). All
>>>>>> defragmentations are logged and I can confirm the file for which this
>>>>>> error was reported was not defragmented for at least a year (I
>>>>>> checked
>>>>>> because I wanted to rule out a possible race condition between
>>>>>> defragmentation and scrub).
>>>>>
>>>>> I'm wondering if there is any direct IO conflicting with
>>>>> buffered/defrag IO.
>>>>>
>>>>> It's known that conflicting buffered/direct IO can lead to contents
>>>>> change halfway, and lead to btrfs csum mismatch.
>>>>> So far that's the only thing leading to known btrfs csum mismatch I
>>>>> can
>>>>> think of.
>>>>
>>>> But here it seems there isn't an actual mismatch as reading the file is
>>>> possible and gives the data which was written in it 6 years ago. This
>>>> seems to be an error in scrub (or a neutrino flipping a bit somewhere during
>>>> the scrub). The VM runs on a server with ECC RAM so it is unlikely
>>>> to be
>>>> a simple bitflip but when everything likely has been ruled out...
>>>>
>>>> Thanks for your feedback,
>>>> Lionel
>>>>
>>>>>
>>>>> But 6.x kernel should all log a warning message for it.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>>
>>>>>> In addition to the above, among the notable IO are :
>>>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>>>> - occasional large synchronisations using Syncthing,
>>>>>> - snapshot creations/removals occur approximately every 80
>>>>>> minutes. The
>>>>>> last snapshot operation was logged 31 minutes before the errors
>>>>>> (removals
>>>>>> occur asynchronously but there was no unusual load at this time that
>>>>>> could point to a particularly long snapshot cleanup process).
>>>>>>
>>>>>> To sum up, there are many processes accessing the filesystem but
>>>>>> historically it saw far higher IO load during scrubs than what was
>>>>>> occurring here.
>>>>>>
>>>>>>
>>>>>> Reproducing this might be difficult: the whole scrub can take a week
>>>>>> depending on load. That said I can easily prepare a kernel and/or new
>>>>>> btrfs-progs binaries to launch scrubs or other non-destructive
>>>>>> tasks the
>>>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Lionel Bouton
>>>>>
>>>>
>>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-22  9:41           ` Qu Wenruo
@ 2024-06-30 10:59             ` Lionel Bouton
  2024-07-04 12:21               ` Lionel Bouton
  0 siblings, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-06-30 10:59 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 22/06/2024 at 11:41, Qu Wenruo wrote:
>
>
> On 2024/6/22 18:21, Lionel Bouton wrote:
> [...]
>>>
>>> I'll mount the filesystem and run a scrub again to see if I can
>>> reproduce the problem. It should be noticeably quicker, we made
>>> updates to the Ceph cluster and should get approximately 2x the I/O
>>> bandwidth.
>>> I plan to keep the disk snapshot for at least several weeks so if you
>>> want to test something else just say so.
>>
>>
>> The scrub is finished, here are the results :
>>
>> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
>> Scrub started:    Wed Jun 19 00:01:59 2024
>> Status:           finished
>> Duration:         81:04:21
>> Total to scrub:   18.83TiB
>> Rate:             67.67MiB/s
>> Error summary:    no errors found
>>
>> So the scrub error isn't deterministic. I'll shut down the test VM for
>> now and keep the disk snapshot it uses for at least a couple of week if
>> it is needed for further tests.
>> The original filesystem is scrubbed monthly, I'll reply to this message
>> if another error shows up.
>
> I briefly remembered that there was a bug related to scrub that can
> report false alerts:
>
> f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
> not 64K aligned")
>
> But that should be automatically backported, and in that case it should
> have some errors like "unable to find chunk map" error messages in the
> kernel log.
>
> Otherwise, I have no extra clues.
>
> Have you tried kernels like v6.8/6.9 and can you reproduce the bug in
> those newer kernels?

I've just upgraded the kernel to 6.9.7 (and btrfs-progs to 6.9.2) and 
monthly scrubs with it will start next week. That said, the last
filesystem scrub with 6.6.30 ran without errors, so it might be hard to
reproduce.
One difference between the last scrub and the previous one (which reported
checksum errors) is the underlying device speed: it is getting faster as
we replace HDDs with SSDs on the Ceph cluster (it might be a cause if
there's a race condition somewhere). Other than that there's nothing I
can think of.

In fact the only 2 major changes before the scrub checksum errors were:
- a noticeable increase in constant I/O load,
- an upgrade to the 6.6 kernel.

As nobody else has reported the same behavior, I'm not ruling out a hardware
glitch either.
I'll reply to this thread if a future scrub reports a non-reproducible
checksum error again.

Lionel

>
> Thanks,
> Qu
>>
>>>
>>> Best regards,
>>> Lionel
>>>
>>>>
>>>> If btrfs check reports error, and logical-resolve failed to locate the
>>>> file and its position, it means the corruption is in bookend extents.
>>>>
>>>> If btrfs check reports error and logical-resolve can locate the 
>>>> file and
>>>> position, it's a different problem then.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> The stat output shows that this file was not modified since its
>>>>> creation
>>>>> more than 6 years ago. This is what led me to report a bug in scrub.
>>>>>
>>>>>
>>>>>>
>>>>>>> - Some including the original subvolume have only logged 2
>>>>>>> warnings and
>>>>>>> one (root 151335) logged only one warning.
>>>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>>>> - When several offsets are logged their order seems random.
>>>>>>
>>>>>> One problem of scrub is, it may ratelimit the output, thus we can 
>>>>>> not
>>>>>> just rely on dmesg to know the damage.
>>>>>
>>>>> I wondered about this: I've read other threads where ratelimiting is
>>>>> mentioned but I was not sure if it could apply here. Thanks for
>>>>> clarifying.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I'm attaching kernel logs beginning with the scrub start and
>>>>>>> ending with
>>>>>>> the last error. As of now there are no new errors/warnings even
>>>>>>> though
>>>>>>> the scrub is still ongoing, 15 hours after the last error. I didn't
>>>>>>> clean the log from balance logs linked to the same filesystem.
>>>>>>>
>>>>>>> Side-note for the curious or in the unlikely case it could be
>>>>>>> linked to
>>>>>>> the problem:
>>>>>>> Historically this filesystem was mounted with ssd_spread without 
>>>>>>> any
>>>>>>> issue (I guess several years ago it made sense to me reading the
>>>>>>> documentation and given the IO latencies I saw on the Ceph
>>>>>>> cluster). A
>>>>>>> recent kernel filled the whole available space with nearly empty
>>>>>>> block
>>>>>>> groups which seemed to re-appear each time we mounted with
>>>>>>> ssd_spread.
>>>>>>> We switched to "ssd" since then and there is a mostly daily 90mn
>>>>>>> balance
>>>>>>> to reach back the previous state progressively (this is the reason
>>>>>>> behind
>>>>>>> the balance related logs). Having some space available for new 
>>>>>>> block
>>>>>>> groups seems to be a good idea but additionally as we use Ceph
>>>>>>> RBD, we
>>>>>>> want it to be able to deallocate unused space: having many nearly
>>>>>>> empty
>>>>>>> block groups could waste resources (especially if the used space in
>>>>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>>>>> object size).
>>>>>>>
>>>>>>>
>>>>>>> More information :
>>>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>>>> reported. The previous scrub was run successfully at the 
>>>>>>> beginning of
>>>>>>> May with a 6.6.13 kernel.
>>>>>>>
>>>>>>> There is a continuous defragmentation process running (it
>>>>>>> processes the
>>>>>>> whole filesystem slowly ignoring snapshots and defragments file by
>>>>>>> file
>>>>>>> based on a fragmentation estimation using filefrag -v). All
>>>>>>> defragmentations are logged and I can confirm the file for which 
>>>>>>> this
>>>>>>> error was reported was not defragmented for at least a year (I
>>>>>>> checked
>>>>>>> because I wanted to rule out a possible race condition between
>>>>>>> defragmentation and scrub).
>>>>>>
>>>>>> I'm wondering if there is any direct IO conflicting with
>>>>>> buffered/defrag IO.
>>>>>>
>>>>>> It's known that conflicting buffered/direct IO can lead to contents
>>>>>> change halfway, and lead to btrfs csum mismatch.
>>>>>> So far that's the only thing leading to known btrfs csum mismatch I
>>>>>> can
>>>>>> think of.
>>>>>
>>>>> But here it seems there isn't an actual mismatch as reading the 
>>>>> file is
>>>>> possible and gives the data which was written in it 6 years ago. 
>>>>> This
>>>>> seems to be an error in scrub (or a neutrino flipping a bit somewhere
>>>>> during
>>>>> the scrub). The VM runs on a server with ECC RAM so it is unlikely
>>>>> to be
>>>>> a simple bitflip but when everything likely has been ruled out...
>>>>>
>>>>> Thanks for your feedback,
>>>>> Lionel
>>>>>
>>>>>>
>>>>>> But 6.x kernel should all log a warning message for it.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>>
>>>>>>> In addition to the above, among the notable IO are :
>>>>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>>>>> - occasional large synchronisations using Syncthing,
>>>>>>> - snapshot creations/removals occur approximately every 80
>>>>>>> minutes. The
>>>>>>> last snapshot operation was logged 31 minutes before the errors
>>>>>>> (removals
>>>>>>> occur asynchronously but there was no unusual load at this time
>>>>>>> that
>>>>>>> could point to a particularly long snapshot cleanup process).
>>>>>>>
>>>>>>> To sum up, there are many processes accessing the filesystem but
>>>>>>> historically it saw far higher IO load during scrubs than what was
>>>>>>> occurring here.
>>>>>>>
>>>>>>>
>>>>>>> Reproducing this might be difficult: the whole scrub can take a 
>>>>>>> week
>>>>>>> depending on load. That said I can easily prepare a kernel 
>>>>>>> and/or new
>>>>>>> btrfs-progs binaries to launch scrubs or other non-destructive
>>>>>>> tasks the
>>>>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Lionel Bouton
>>>>>>
>>>>>
>>>
>>
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-06-30 10:59             ` Lionel Bouton
@ 2024-07-04 12:21               ` Lionel Bouton
  2024-07-04 22:38                 ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Lionel Bouton @ 2024-07-04 12:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 30/06/2024 at 12:59, Lionel Bouton wrote:
> On 22/06/2024 at 11:41, Qu Wenruo wrote:
>>
>>
>> On 2024/6/22 18:21, Lionel Bouton wrote:
>> [...]
>>>>
>>>> I'll mount the filesystem and run a scrub again to see if I can
>>>> reproduce the problem. It should be noticeably quicker, we made
>>>> updates to the Ceph cluster and should get approximately 2x the I/O
>>>> bandwidth.
>>>> I plan to keep the disk snapshot for at least several weeks so if you
>>>> want to test something else just say so.
>>>
>>>
>>> The scrub is finished, here are the results :
>>>
>>> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
>>> Scrub started:    Wed Jun 19 00:01:59 2024
>>> Status:           finished
>>> Duration:         81:04:21
>>> Total to scrub:   18.83TiB
>>> Rate:             67.67MiB/s
>>> Error summary:    no errors found
>>>
>>> So the scrub error isn't deterministic. I'll shut down the test VM for
>>> now and keep the disk snapshot it uses for at least a couple of week if
>>> it is needed for further tests.
>>> The original filesystem is scrubbed monthly, I'll reply to this message
>>> if another error shows up.
>>
>> I briefly remembered that there was a bug related to scrub that can
>> report false alerts:
>>
>> f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
>> not 64K aligned")
>>
>> But that should be automatically backported, and in that case it should
>> have some errors like "unable to find chunk map" error messages in the
>> kernel log.
>>
>> Otherwise, I have no extra clues.
>>
>> Have you tried kernels like v6.8/6.9 and can you reproduce the bug in
>> those newer kernels?
>
> I've just upgraded the kernel to 6.9.7 (and btrfs-progs to 6.9.2) and 
> monthly scrubs with it will start next week. That said the last 
> filesystem scrub with 6.6.30 ran without errors so it might be hard to 
> reproduce.
> One difference with the last scrub vs the previous one which reported 
> checksum errors is the underlying device speed : it is getting faster 
> as we replace HDDs with SSDs on the Ceph cluster (it might be a cause 
> if there's a race condition somewhere). Other than that there's 
> nothing I can think of.
>
> In fact the only 2 major changes before the scrub checksum errors where :
> - a noticeable increase in constant I/O load,
> - an upgrade to the 6.6 kernel.
>
> As nobody else reported the same behavior I'm not ruling out an 
> hardware glitch either.
> I'll reply to this thread if a future scrub reports a non reproducible 
> checksum error again.

I didn't expect to have something to report so soon...
Another virtual machine running on another physical server but using the 
same Ceph cluster just reported csum errors that aren't reproducible.
This was with kernel 6.6.13 and btrfs-progs 6.8.2.
Fortunately this filesystem is small and can be scrubbed in 2 minutes:
I just ran the scrub again (less than 5 hours after the one that
reported errors) and no errors were reported this time.
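
For reference, re-checking a small filesystem like this one boils down to a
foreground scrub followed by a status query. A rough sketch, with /mnt/data
as a made-up mount point:

  btrfs scrub start -B /mnt/data   # -B: stay in the foreground and print a summary
  btrfs scrub status /mnt/data     # error summary of the last scrub
  btrfs device stats /mnt/data     # per-device error counters (persist until reset)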

I'll upgrade this VM to 6.9.7+ too. If 6.6 indeed has a scrub bug that
6.9 doesn't, it might be easier to verify than I anticipated: most of our
VMs have migrated or are in the process of migrating to 6.6, which is the
latest LTS. If the problem manifests itself on small filesystems too, I
expect other systems to fail scrubs sooner or later if 6.6 is affected
by a scrub bug.

As I wrote last time, I'll reply here if other scrubs report
non-reproducible csum errors.

>
> Lionel
>
>>
>> Thanks,
>> Qu
>>>
>>>>
>>>> Best regards,
>>>> Lionel
>>>>
>>>>>
>>>>> If btrfs check reports error, and logical-resolve failed to locate 
>>>>> the
>>>>> file and its position, it means the corruption is in bookend extents.
>>>>>
>>>>> If btrfs check reports error and logical-resolve can locate the 
>>>>> file and
>>>>> position, it's a different problem then.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> The stat output shows that this file was not modified since its
>>>>>> creation
>>>>>> more than 6 years ago. This is what led me to report a bug in scrub.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> - Some including the original subvolume have only logged 2
>>>>>>>> warnings and
>>>>>>>> one (root 151335) logged only one warning.
>>>>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>>>>> - When several offsets are logged their order seems random.
>>>>>>>
>>>>>>> One problem of scrub is, it may ratelimit the output, thus we 
>>>>>>> can not
>>>>>>> just rely on dmesg to know the damage.
>>>>>>
>>>>>> I wondered about this: I've read other threads where ratelimiting is
>>>>>> mentioned but I was not sure if it could apply here. Thanks for
>>>>>> clarifying.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I'm attaching kernel logs beginning with the scrub start and
>>>>>>>> ending with
>>>>>>>> the last error. As of now there are no new errors/warnings even
>>>>>>>> though
>>>>>>>> the scrub is still ongoing, 15 hours after the last error. I 
>>>>>>>> didn't
>>>>>>>> clean the log from balance logs linked to the same filesystem.
>>>>>>>>
>>>>>>>> Side-note for the curious or in the unlikely case it could be
>>>>>>>> linked to
>>>>>>>> the problem:
>>>>>>>> Historically this filesystem was mounted with ssd_spread 
>>>>>>>> without any
>>>>>>>> issue (I guess several years ago it made sense to me reading the
>>>>>>>> documentation and given the IO latencies I saw on the Ceph
>>>>>>>> cluster). A
>>>>>>>> recent kernel filled the whole available space with nearly empty
>>>>>>>> block
>>>>>>>> groups which seemed to re-appear each time we mounted with
>>>>>>>> ssd_spread.
>>>>>>>> We switched to "ssd" since then and there is a mostly daily 90mn
>>>>>>>> balance
>>>>>>>> to reach back the previous state progressively (this is the reason
>>>>>>>> behind
>>>>>>>> the balance related logs). Having some space available for new 
>>>>>>>> block
>>>>>>>> groups seems to be a good idea but additionally as we use Ceph
>>>>>>>> RBD, we
>>>>>>>> want it to be able to deallocate unused space: having many nearly
>>>>>>>> empty
>>>>>>>> block groups could waste resources (especially if the used 
>>>>>>>> space in
>>>>>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>>>>>> object size).
>>>>>>>>
>>>>>>>>
>>>>>>>> More information :
>>>>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>>>>> reported. The previous scrub was run successfully at the 
>>>>>>>> beginning of
>>>>>>>> May with a 6.6.13 kernel.
>>>>>>>>
>>>>>>>> There is a continuous defragmentation process running (it
>>>>>>>> processes the
>>>>>>>> whole filesystem slowly ignoring snapshots and defragments file by
>>>>>>>> file
>>>>>>>> based on a fragmentation estimation using filefrag -v). All
>>>>>>>> defragmentations are logged and I can confirm the file for 
>>>>>>>> which this
>>>>>>>> error was reported was not defragmented for at least a year (I
>>>>>>>> checked
>>>>>>>> because I wanted to rule out a possible race condition between
>>>>>>>> defragmentation and scrub).
>>>>>>>
>>>>>>> I'm wondering if there is any direct IO conflicting with
>>>>>>> buffered/defrag IO.
>>>>>>>
>>>>>>> It's known that conflicting buffered/direct IO can lead to contents
>>>>>>> change halfway, and lead to btrfs csum mismatch.
>>>>>>> So far that's the only thing leading to known btrfs csum mismatch I
>>>>>>> can
>>>>>>> think of.
>>>>>>
>>>>>> But here it seems there isn't an actual mismatch as reading the 
>>>>>> file is
>>>>>> possible and gives the data which was written in it 6 years ago. 
>>>>>> This
>>>>>> seems to be an error in scrub (or a neutrino flipping a bit somewhere
>>>>>> during
>>>>>> the scrub). The VM runs on a server with ECC RAM so it is unlikely
>>>>>> to be
>>>>>> a simple bitflip but when everything likely has been ruled out...
>>>>>>
>>>>>> Thanks for your feedback,
>>>>>> Lionel
>>>>>>
>>>>>>>
>>>>>>> But 6.x kernel should all log a warning message for it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Qu
>>>>>>>>
>>>>>>>> In addition to the above, among the notable IO are :
>>>>>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>>>>>> - occasional large synchronisations using Syncthing,
>>>>>>>> - snapshot creations/removals occur approximately every 80
>>>>>>>> minutes. The
>>>>>>>> last snapshot operation was logged 31 minutes before the errors
>>>>>>>> (removals
>>>>>>>> occur asynchronously but there was no unusual load at this time
>>>>>>>> that
>>>>>>>> could point to a particularly long snapshot cleanup process).
>>>>>>>>
>>>>>>>> To sum up, there are many processes accessing the filesystem but
>>>>>>>> historically it saw far higher IO load during scrubs than what was
>>>>>>>> occurring here.
>>>>>>>>
>>>>>>>>
>>>>>>>> Reproducing this might be difficult: the whole scrub can take a 
>>>>>>>> week
>>>>>>>> depending on load. That said I can easily prepare a kernel 
>>>>>>>> and/or new
>>>>>>>> btrfs-progs binaries to launch scrubs or other non-destructive
>>>>>>>> tasks the
>>>>>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Lionel Bouton
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-07-04 12:21               ` Lionel Bouton
@ 2024-07-04 22:38                 ` Qu Wenruo
  2024-07-04 22:49                   ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2024-07-04 22:38 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs



On 2024/7/4 21:51, Lionel Bouton wrote:
> On 30/06/2024 at 12:59, Lionel Bouton wrote:
>> On 22/06/2024 at 11:41, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/6/22 18:21, Lionel Bouton wrote:
>>> [...]
>>>>>
>>>>> I'll mount the filesystem and run a scrub again to see if I can
>>>>> reproduce the problem. It should be noticeably quicker, we made
>>>>> updates to the Ceph cluster and should get approximately 2x the I/O
>>>>> bandwidth.
>>>>> I plan to keep the disk snapshot for at least several weeks so if you
>>>>> want to test something else just say so.
>>>>
>>>>
>>>> The scrub is finished, here are the results :
>>>>
>>>> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
>>>> Scrub started:    Wed Jun 19 00:01:59 2024
>>>> Status:           finished
>>>> Duration:         81:04:21
>>>> Total to scrub:   18.83TiB
>>>> Rate:             67.67MiB/s
>>>> Error summary:    no errors found
>>>>
>>>> So the scrub error isn't deterministic. I'll shut down the test VM for
>>>> now and keep the disk snapshot it uses for at least a couple of week if
>>>> it is needed for further tests.
>>>> The original filesystem is scrubbed monthly, I'll reply to this message
>>>> if another error shows up.
>>>
>>> I briefly remembered that there was a bug related to scrub that can
>>> report false alerts:
>>>
>>> f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
>>> not 64K aligned")
>>>
>>> But that should be automatically backported, and in that case it should
>>> have some errors like "unable to find chunk map" error messages in the
>>> kernel log.
>>>
>>> Otherwise, I have no extra clues.
>>>
>>> Have you tried kernels like v6.8/6.9 and can you reproduce the bug in
>>> those newer kernels?
>>
>> I've just upgraded the kernel to 6.9.7 (and btrfs-progs to 6.9.2) and
>> monthly scrubs with it will start next week. That said the last
>> filesystem scrub with 6.6.30 ran without errors so it might be hard to
>> reproduce.
>> One difference with the last scrub vs the previous one which reported
>> checksum errors is the underlying device speed : it is getting faster
>> as we replace HDDs with SSDs on the Ceph cluster (it might be a cause
>> if there's a race condition somewhere). Other than that there's
>> nothing I can think of.
>>
>> In fact the only 2 major changes before the scrub checksum errors where :
>> - a noticeable increase in constant I/O load,
>> - an upgrade to the 6.6 kernel.
>>
>> As nobody else reported the same behavior I'm not ruling out an
>> hardware glitch either.
>> I'll reply to this thread if a future scrub reports a non reproducible
>> checksum error again.
>
> I didn't expect to have something to report so soon...
> Another virtual machine running on another physical server but using the
> same Ceph cluster just reported csum errors that aren't reproducible.
> This was with kernel 6.6.13 and btrfs-progs 6.8.2.
> Fortunately this filesystem is small and can be scrubbed in 2 minutes :
> I just ran the scrub again (less than 5 hours after the one that
> reported errors) and no error are reported this time.
>
> I'll upgrade this VM to 6.9.7+ too. If 6.6 has indeed a scrub bug and
> not 6.9 it might be easier to verify than I anticipated : most of our
> VMs have migrated or are in the process of migrating to 6.6 which is the
> latest LTS. If the problem manifest itself on a small filesystem too I
> expect other systems to fail scrubs sooner or later if 6.6 is affected
> by a scrub bug.

So far it looks like it was commit f546c4282673 ("btrfs: scrub: avoid
use-after-free when chunk length is not 64K aligned") that fixed the error.

In that case, it looks like 6.6 was EOL at that time and thus didn't get
the backport.

To verify whether that's the case, you can dump the chunk tree and check if
the chunk lengths are 64K aligned, and whether dmesg shows something like this:
"      BTRFS critical (device vdb): unable to find chunk map for logical
2214744064 length 4096".
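
A rough sketch of that check, assuming btrfs-progs is installed and
/dev/sdX stands in for the filesystem's device (hypothetical name):

  # print any chunk whose length is not a multiple of 64K (65536 bytes)
  btrfs inspect-internal dump-tree -t chunk /dev/sdX | \
      awk '$1 == "length" { if ($2 % 65536) print "not 64K aligned: " $2 }'

  # and look for the tell-tale message in the kernel log
  dmesg | grep "unable to find chunk map"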

Thanks,
Qu
>
> Like I wrote last time I'll reply here if other scrubs report non
> reproducible csum errors.
>
>>
>> Lionel
>>
>>>
>>> Thanks,
>>> Qu
>>>>
>>>>>
>>>>> Best regards,
>>>>> Lionel
>>>>>
>>>>>>
>>>>>> If btrfs check reports error, and logical-resolve failed to locate
>>>>>> the
>>>>>> file and its position, it means the corruption is in bookend extents.
>>>>>>
>>>>>> If btrfs check reports error and logical-resolve can locate the
>>>>>> file and
>>>>>> position, it's a different problem then.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>>
>>>>>>> The stat output shows that this file was not modified since its
>>>>>>> creation
>>>>>>> more than 6 years ago. This is what led me to report a bug in scrub.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> - Some including the original subvolume have only logged 2
>>>>>>>>> warnings and
>>>>>>>>> one (root 151335) logged only one warning.
>>>>>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>>>>>> - When several offsets are logged their order seems random.
>>>>>>>>
>>>>>>>> One problem of scrub is, it may ratelimit the output, thus we
>>>>>>>> can not
>>>>>>>> just rely on dmesg to know the damage.
>>>>>>>
>>>>>>> I wondered about this: I've read other threads where ratelimiting is
>>>>>>> mentioned but I was not sure if it could apply here. Thanks for
>>>>>>> clarifying.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm attaching kernel logs beginning with the scrub start and
>>>>>>>>> ending with
>>>>>>>>> the last error. As of now there are no new errors/warnings even
>>>>>>>>> though
>>>>>>>>> the scrub is still ongoing, 15 hours after the last error. I
>>>>>>>>> didn't
>>>>>>>>> clean the log from balance logs linked to the same filesystem.
>>>>>>>>>
>>>>>>>>> Side-note for the curious or in the unlikely case it could be
>>>>>>>>> linked to
>>>>>>>>> the problem:
>>>>>>>>> Historically this filesystem was mounted with ssd_spread
>>>>>>>>> without any
>>>>>>>>> issue (I guess several years ago it made sense to me reading the
>>>>>>>>> documentation and given the IO latencies I saw on the Ceph
>>>>>>>>> cluster). A
>>>>>>>>> recent kernel filled the whole available space with nearly empty
>>>>>>>>> block
>>>>>>>>> groups which seemed to re-appear each time we mounted with
>>>>>>>>> ssd_spread.
>>>>>>>>> We switched to "ssd" since then and there is a mostly daily 90mn
>>>>>>>>> balance
>>>>>>>>> to reach back the previous state progressively (this is the reason
>>>>>>>>> behind
>>>>>>>>> the balance related logs). Having some space available for new
>>>>>>>>> block
>>>>>>>>> groups seems to be a good idea but additionally as we use Ceph
>>>>>>>>> RBD, we
>>>>>>>>> want it to be able to deallocate unused space: having many nearly
>>>>>>>>> empty
>>>>>>>>> block groups could waste resources (especially if the used
>>>>>>>>> space in
>>>>>>>>> these groups is in <4MB continuous chunks which is the default RBD
>>>>>>>>> object size).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> More information :
>>>>>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>>>>>> reported. The previous scrub was run successfully at the
>>>>>>>>> beginning of
>>>>>>>>> May with a 6.6.13 kernel.
>>>>>>>>>
>>>>>>>>> There is a continuous defragmentation process running (it
>>>>>>>>> processes the
>>>>>>>>> whole filesystem slowly ignoring snapshots and defragments file by
>>>>>>>>> file
>>>>>>>>> based on a fragmentation estimation using filefrag -v). All
>>>>>>>>> defragmentations are logged and I can confirm the file for
>>>>>>>>> which this
>>>>>>>>> error was reported was not defragmented for at least a year (I
>>>>>>>>> checked
>>>>>>>>> because I wanted to rule out a possible race condition between
>>>>>>>>> defragmentation and scrub).
>>>>>>>>
>>>>>>>> I'm wondering if there is any direct IO conflicting with
>>>>>>>> buffered/defrag IO.
>>>>>>>>
>>>>>>>> It's known that conflicting buffered/direct IO can lead to contents
>>>>>>>> change halfway, and lead to btrfs csum mismatch.
>>>>>>>> So far that's the only thing leading to known btrfs csum mismatch I
>>>>>>>> can
>>>>>>>> think of.
>>>>>>>
>>>>>>> But here it seems there isn't an actual mismatch as reading the
>>>>>>> file is
>>>>>>> possible and gives the data which was written in it 6 years ago.
>>>>>>> This
>>>>>>> seems to be an error in scrub (or a neutrino flipping a bit somewhere
>>>>>>> during
>>>>>>> the scrub). The VM runs on a server with ECC RAM so it is unlikely
>>>>>>> to be
>>>>>>> a simple bitflip but when everything likely has been ruled out...
>>>>>>>
>>>>>>> Thanks for your feedback,
>>>>>>> Lionel
>>>>>>>
>>>>>>>>
>>>>>>>> But 6.x kernel should all log a warning message for it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Qu
>>>>>>>>>
>>>>>>>>> In addition to the above, among the notable IO are :
>>>>>>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>>>>>>> - occasional large synchronisations using Syncthing,
>>>>>>>>> - snapshot creations/removals occur approximately every 80
>>>>>>>>> minutes. The
>>>>>>>>> last snapshot operation was logged 31 minutes before the errors
>>>>>>>>> (removals
>>>>>>>>> occur asynchronously but there was no unusual load at this time
>>>>>>>>> that
>>>>>>>>> could point to a particularly long snapshot cleanup process).
>>>>>>>>>
>>>>>>>>> To sum up, there are many processes accessing the filesystem but
>>>>>>>>> historically it saw far higher IO load during scrubs than what was
>>>>>>>>> occurring here.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Reproducing this might be difficult: the whole scrub can take a
>>>>>>>>> week
>>>>>>>>> depending on load. That said I can easily prepare a kernel
>>>>>>>>> and/or new
>>>>>>>>> btrfs-progs binaries to launch scrubs or other non-destructive
>>>>>>>>> tasks the
>>>>>>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Lionel Bouton
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
  2024-07-04 22:38                 ` Qu Wenruo
@ 2024-07-04 22:49                   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2024-07-04 22:49 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs



On 2024/7/5 08:08, Qu Wenruo wrote:
>
>
> On 2024/7/4 21:51, Lionel Bouton wrote:
>> On 30/06/2024 at 12:59, Lionel Bouton wrote:
>>> On 22/06/2024 at 11:41, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/6/22 18:21, Lionel Bouton wrote:
>>>> [...]
>>>>>>
>>>>>> I'll mount the filesystem and run a scrub again to see if I can
>>>>>> reproduce the problem. It should be noticeably quicker, we made
>>>>>> updates to the Ceph cluster and should get approximately 2x the I/O
>>>>>> bandwidth.
>>>>>> I plan to keep the disk snapshot for at least several weeks so if you
>>>>>> want to test something else just say so.
>>>>>
>>>>>
>>>>> The scrub is finished, here are the results :
>>>>>
>>>>> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
>>>>> Scrub started:    Wed Jun 19 00:01:59 2024
>>>>> Status:           finished
>>>>> Duration:         81:04:21
>>>>> Total to scrub:   18.83TiB
>>>>> Rate:             67.67MiB/s
>>>>> Error summary:    no errors found
>>>>>
>>>>> So the scrub error isn't deterministic. I'll shut down the test VM for
>>>>> now and keep the disk snapshot it uses for at least a couple of
>>>>> week if
>>>>> it is needed for further tests.
>>>>> The original filesystem is scrubbed monthly, I'll reply to this
>>>>> message
>>>>> if another error shows up.
>>>>
>>>> I briefly remembered that there was a bug related to scrub that can
>>>> report false alerts:
>>>>
>>>> f546c4282673 ("btrfs: scrub: avoid use-after-free when chunk length is
>>>> not 64K aligned")
>>>>
>>>> But that should be automatically backported, and in that case it should
>>>> have some errors like "unable to find chunk map" error messages in the
>>>> kernel log.
>>>>
>>>> Otherwise, I have no extra clues.
>>>>
>>>> Have you tried kernels like v6.8/6.9 and can you reproduce the bug in
>>>> those newer kernels?
>>>
>>> I've just upgraded the kernel to 6.9.7 (and btrfs-progs to 6.9.2) and
>>> monthly scrubs with it will start next week. That said the last
>>> filesystem scrub with 6.6.30 ran without errors so it might be hard to
>>> reproduce.
>>> One difference with the last scrub vs the previous one which reported
>>> checksum errors is the underlying device speed : it is getting faster
>>> as we replace HDDs with SSDs on the Ceph cluster (it might be a cause
>>> if there's a race condition somewhere). Other than that there's
>>> nothing I can think of.
>>>
>>> In fact the only 2 major changes before the scrub checksum errors
>>> where :
>>> - a noticeable increase in constant I/O load,
>>> - an upgrade to the 6.6 kernel.
>>>
>>> As nobody else reported the same behavior I'm not ruling out an
>>> hardware glitch either.
>>> I'll reply to this thread if a future scrub reports a non reproducible
>>> checksum error again.
>>
>> I didn't expect to have something to report so soon...
>> Another virtual machine running on another physical server but using the
>> same Ceph cluster just reported csum errors that aren't reproducible.
>> This was with kernel 6.6.13 and btrfs-progs 6.8.2.
>> Fortunately this filesystem is small and can be scrubbed in 2 minutes :
>> I just ran the scrub again (less than 5 hours after the one that
>> reported errors) and no error are reported this time.
>>
>> I'll upgrade this VM to 6.9.7+ too. If 6.6 has indeed a scrub bug and
>> not 6.9 it might be easier to verify than I anticipated : most of our
>> VMs have migrated or are in the process of migrating to 6.6 which is the
>> latest LTS. If the problem manifest itself on a small filesystem too I
>> expect other systems to fail scrubs sooner or later if 6.6 is affected
>> by a scrub bug.
>
> So far it looks like it's the commit f546c4282673 ("btrfs: scrub: avoid
> use-after-free when chunk length is not 64K aligned") fixing the error.
>
> In that case, it looks like 6.6 is EOL at that time thus didn't got
> backports.

Nope, just as you mentioned, 6.6 is LTS, and this time I checked the
stable tree.

The fix is already merged into 6.6.15, so that is not the cause.
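
For anyone who wants to double check from a clone of the stable tree (with
tags fetched), something like this lists the 6.6.y tags that already
contain the fix:

  git tag --contains f546c4282673 | grep '^v6\.6'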

Let me dig deeper to find out why.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-07-04 22:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-04 14:12 BUG: scrub reports uncorrectable csum errors linked to readable file (data: single) Lionel Bouton
2024-06-06 22:51 ` Lionel Bouton
2024-06-06 23:05 ` Qu Wenruo
2024-06-06 23:21   ` Lionel Bouton
2024-06-06 23:30     ` Qu Wenruo
2024-06-06 23:46       ` Lionel Bouton
2024-06-08 16:15         ` Lionel Bouton
2024-06-08 22:48           ` Qu Wenruo
2024-06-09  0:16             ` Lionel Bouton
2024-06-10 12:52               ` Lionel Bouton
2024-06-18 21:45       ` Lionel Bouton
2024-06-22  8:51         ` Lionel Bouton
2024-06-22  9:41           ` Qu Wenruo
2024-06-30 10:59             ` Lionel Bouton
2024-07-04 12:21               ` Lionel Bouton
2024-07-04 22:38                 ` Qu Wenruo
2024-07-04 22:49                   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox