* Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong)
@ 2024-02-25 19:30 Tavian Barnes
2024-02-26 8:30 ` Qu Wenruo
2024-02-26 15:53 ` David Sterba
0 siblings, 2 replies; 6+ messages in thread
From: Tavian Barnes @ 2024-02-25 19:30 UTC (permalink / raw)
To: linux-btrfs
Well, bad news: I started bisecting from v6.0 and after a couple
rounds, my root fs is really corrupted:
UUID: e1902620-c206-4e34-9f24-e66cdb6b8872
Scrub started: Sun Feb 25 18:47:29 2024
Status: finished
Duration: 0:20:18
Total to scrub: 2.72TiB
Rate: 2.29GiB/s
Error summary: csum=2073625
Corrected: 0
Uncorrectable: 2073625
Unverified: 0
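As a rough sanity check on the summary above, the reported rate and the amount of damaged data can be recomputed from the other fields. This is only a sketch: it assumes each csum error corresponds to one 4096-byte block (as the "length 4096" fields in the warnings suggest), and the helper names are made up for illustration.

```python
# Sanity-check the scrub summary: rate and total size of data failing checksum.
GIB = 1024 ** 3
TIB = 1024 ** 4


def scrub_rate_gib_per_s(total_tib, duration_s):
    """Average scrub throughput in GiB/s."""
    return total_tib * TIB / duration_s / GIB


def bad_bytes(csum_errors, sector_size=4096):
    """Bytes failing checksum, ASSUMING one error per sector_size block."""
    return csum_errors * sector_size


duration_s = 20 * 60 + 18                  # "Duration: 0:20:18"
rate = scrub_rate_gib_per_s(2.72, duration_s)
damaged_gib = bad_bytes(2073625) / GIB     # "Uncorrectable: 2073625"
print(f"{rate:.2f} GiB/s, {damaged_gib:.2f} GiB failing checksum")
```

Under that assumption the rate agrees with the reported 2.29GiB/s, and the uncorrectable blocks add up to just under 8 GiB.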
All the errors seem confined to one of the four disks, which is strange:
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242230988800 on dev /dev/mapper/slash3 physical 914556321792
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242227580928 on dev /dev/mapper/slash3 physical 914555469824
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242228105216 on dev /dev/mapper/slash3 physical 914555600896
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242230202368 on dev /dev/mapper/slash3 physical 914556125184
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242233348096 on dev /dev/mapper/slash3 physical 914556911616
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242227843072 on dev /dev/mapper/slash3 physical 914555535360
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242228367360 on dev /dev/mapper/slash3 physical 914555666432
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242229415936 on dev /dev/mapper/slash3 physical 914555928576
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242229940224 on dev /dev/mapper/slash3 physical 914556059648
BTRFS error (device dm-0): unable to fixup (regular) error at logical
7242228891648 on dev /dev/mapper/slash3 physical 914555797504
BTRFS warning (device dm-0): checksum error at logical 7242227843072
on dev /dev/mapper/slash3, physical 914555535360, root 136483, inode
60736199, offset 720896, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242228891648
on dev /dev/mapper/slash3, physical 914555797504, root 136483, inode
60736199, offset 1769472, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242228105216
on dev /dev/mapper/slash3, physical 914555600896, root 136483, inode
60736199, offset 983040, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242228367360
on dev /dev/mapper/slash3, physical 914555666432, root 136483, inode
60736199, offset 1245184, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242230988800
on dev /dev/mapper/slash3, physical 914556321792, root 136483, inode
60736199, offset 3866624, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242229940224
on dev /dev/mapper/slash3, physical 914556059648, root 136483, inode
60736199, offset 2818048, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242228891648
on dev /dev/mapper/slash3, physical 914555797504, root 136483, inode
60736199, offset 1769472, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242233348096
on dev /dev/mapper/slash3, physical 914556911616, root 136483, inode
60736199, offset 6225920, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242230202368
on dev /dev/mapper/slash3, physical 914556125184, root 136483, inode
60736199, offset 3080192, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242227843072
on dev /dev/mapper/slash3, physical 914555535360, root 136483, inode
60736199, offset 720896, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
scrub_stripe_report_errors: 344892 callbacks suppressed
...
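With this many messages it may help to tally the warnings per device and per file rather than read them by eye. The following is a minimal sketch, not an official tool: the regular expression is written against the message format shown above (it may not match other kernel versions), and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches one "checksum error" warning once whitespace has been normalized.
# The pattern is derived from the messages in this thread only.
CSUM_RE = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>\S+), "
    r"physical (?P<physical>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>[^)]*)\)"
)


def parse_csum_warnings(log_text):
    """Return one dict of string fields per checksum warning in log_text."""
    flat = " ".join(log_text.split())  # undo the mail client's line wrapping
    return [m.groupdict() for m in CSUM_RE.finditer(flat)]


def count_by(records, key):
    """Count warnings grouped by one field, e.g. "dev" or "path"."""
    return Counter(r[key] for r in records)


SAMPLE = """\
BTRFS warning (device dm-0): checksum error at logical 7242227843072
on dev /dev/mapper/slash3, physical 914555535360, root 136483, inode
60736199, offset 720896, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
BTRFS warning (device dm-0): checksum error at logical 7242228891648
on dev /dev/mapper/slash3, physical 914555797504, root 136483, inode
60736199, offset 1769472, length 4096, links 1 (path:
var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
"""

records = parse_csum_warnings(SAMPLE)
print(count_by(records, "dev"))
print(count_by(records, "path"))
```

Feeding it the full dmesg output instead of SAMPLE would show whether every warning really names the same device.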
--
Tavian Barnes
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong)
2024-02-25 19:30 Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong) Tavian Barnes
@ 2024-02-26 8:30 ` Qu Wenruo
2024-02-26 15:49 ` Tavian Barnes
2024-02-26 15:53 ` David Sterba
1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2024-02-26 8:30 UTC (permalink / raw)
To: Tavian Barnes, linux-btrfs
On 2024/2/26 06:00, Tavian Barnes wrote:
> Well, bad news: I started bisecting from v6.0 and after a couple
> rounds, my root fs is really corrupted:
>
> UUID: e1902620-c206-4e34-9f24-e66cdb6b8872
> Scrub started: Sun Feb 25 18:47:29 2024
> Status: finished
> Duration: 0:20:18
> Total to scrub: 2.72TiB
> Rate: 2.29GiB/s
> Error summary: csum=2073625
> Corrected: 0
> Uncorrectable: 2073625
> Unverified: 0
>
> All the errors seem confined to one of the four disks, which is strange:
Mind sharing which commit you were on when you hit the scrub errors?
And have you tried an offline scrub (i.e. "btrfs check
--check-data-csum")?
IIRC the rework of scrub caused several regressions
(e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to
scrub_stripe infrastructure"), which landed around 6.4).
So if "btrfs check --check-data-csum" shows no errors, it would be a
false alarm and you can just ignore them for now.
Thanks,
Qu
>
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242230988800 on dev /dev/mapper/slash3 physical 914556321792
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242227580928 on dev /dev/mapper/slash3 physical 914555469824
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242228105216 on dev /dev/mapper/slash3 physical 914555600896
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242230202368 on dev /dev/mapper/slash3 physical 914556125184
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242233348096 on dev /dev/mapper/slash3 physical 914556911616
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242227843072 on dev /dev/mapper/slash3 physical 914555535360
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242228367360 on dev /dev/mapper/slash3 physical 914555666432
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242229415936 on dev /dev/mapper/slash3 physical 914555928576
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242229940224 on dev /dev/mapper/slash3 physical 914556059648
> BTRFS error (device dm-0): unable to fixup (regular) error at logical
> 7242228891648 on dev /dev/mapper/slash3 physical 914555797504
> BTRFS warning (device dm-0): checksum error at logical 7242227843072
> on dev /dev/mapper/slash3, physical 914555535360, root 136483, inode
> 60736199, offset 720896, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242228891648
> on dev /dev/mapper/slash3, physical 914555797504, root 136483, inode
> 60736199, offset 1769472, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242228105216
> on dev /dev/mapper/slash3, physical 914555600896, root 136483, inode
> 60736199, offset 983040, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242228367360
> on dev /dev/mapper/slash3, physical 914555666432, root 136483, inode
> 60736199, offset 1245184, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242230988800
> on dev /dev/mapper/slash3, physical 914556321792, root 136483, inode
> 60736199, offset 3866624, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242229940224
> on dev /dev/mapper/slash3, physical 914556059648, root 136483, inode
> 60736199, offset 2818048, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242228891648
> on dev /dev/mapper/slash3, physical 914555797504, root 136483, inode
> 60736199, offset 1769472, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242233348096
> on dev /dev/mapper/slash3, physical 914556911616, root 136483, inode
> 60736199, offset 6225920, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242230202368
> on dev /dev/mapper/slash3, physical 914556125184, root 136483, inode
> 60736199, offset 3080192, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> BTRFS warning (device dm-0): checksum error at logical 7242227843072
> on dev /dev/mapper/slash3, physical 914555535360, root 136483, inode
> 60736199, offset 720896, length 4096, links 1 (path:
> var/cache/pacman/pkg/agda-2.6.3-27-x86_64.pkg.tar.zst)
> scrub_stripe_report_errors: 344892 callbacks suppressed
> ...
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong)
2024-02-26 8:30 ` Qu Wenruo
@ 2024-02-26 15:49 ` Tavian Barnes
2024-02-26 15:56 ` David Sterba
0 siblings, 1 reply; 6+ messages in thread
From: Tavian Barnes @ 2024-02-26 15:49 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Feb 26, 2024 at 3:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> On 2024/2/26 06:00, Tavian Barnes wrote:
> > Well, bad news: I started bisecting from v6.0 and after a couple
> > rounds, my root fs is really corrupted:
> >
> > UUID: e1902620-c206-4e34-9f24-e66cdb6b8872
> > Scrub started: Sun Feb 25 18:47:29 2024
> > Status: finished
> > Duration: 0:20:18
> > Total to scrub: 2.72TiB
> > Rate: 2.29GiB/s
> > Error summary: csum=2073625
> > Corrected: 0
> > Uncorrectable: 2073625
> > Unverified: 0
> >
> > All the errors seem confined to one of the four disks, which is strange:
>
> Mind sharing which commit you were on when you hit the scrub errors?
The corruption seemed to start while testing a kernel somewhere around
6.4. I'm not sure exactly because lots of the corruption is affecting
the kernel tree I was bisecting from.
> And have you tried an offline scrub (i.e. "btrfs check
> --check-data-csum")?
Yes, it also reports a vast number of errors. But plain btrfs check succeeds.
> IIRC the rework of scrub caused several regressions
> (e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to
> scrub_stripe infrastructure"), which landed around 6.4).
I did the scrub from 6.7.2. I also mounted with -o
ro,rescue=ignoredatacsums and checked the file contents. The data
really is corrupt in these files, so it's not a problem with the
checksums.
> So if "btrfs check --check-data-csum" shows no errors, it would be a
> false alarm and you can just ignore them for now.
>
> Thanks,
> Qu
> >
> > BTRFS error (device dm-0): unable to fixup (regular) error at logical
> > 7242230988800 on dev /dev/mapper/slash3 physical 914556321792
> > BTRFS error (device dm-0): unable to fixup (regular) error at logical
> > 7242227580928 on dev /dev/mapper/slash3 physical 914555469824
> > ...
So looking at this closer, a lot of the corruption seemed to be around
the same LBA (~7242...). It seems like a couple chunks on slash3 are
corrupt. E.g. this one is in the middle of an .svg file, so it should
be human readable, yet it contains seemingly random bytes:
root@archiso ~ # dmesg | grep svg
[ 2967.927789] BTRFS warning (device dm-0): checksum error at logical
7270571376640 on dev /dev/mapper/slash3, physical 920567676928, root
136483, inode 60843632, offset 110592, length 4096, links 1 (path:
usr/share/inkscape/tutorials/tutorial-shapes.nl.svg)
root@archiso ~ # xxd -s 920567676928 -l 32 /dev/mapper/slash3
d6561c0000: 1a9c a774 a62d 61dc 96e6 fca8 0070 2326 ...t.-a......p#&
d6561c0010: 7579 99b0 096d d4f2 453d 54e1 ec76 81e0 uy...m..E=T..v..
That matches the file contents at the beginning of the corruption. At
first I thought maybe the device tree was corrupt, pointing the stripe
to the wrong disk offset and reading something random, but then I
thought to check the raw encrypted bytes at the corresponding offset:
root@archiso ~ # cryptsetup luksDump /dev/nvme3n1p2 | grep -B2 offset
Data segments:
0: crypt
offset: 16777216 [bytes]
root@archiso ~ # xxd -s $((920567676928 + 16777216)) -l 32 /dev/nvme3n1p2
d6571c0000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
d6571c0010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
So what I'm seeing is basically a whole stripe that has been zeroed
out, and LUKS is decrypting those zeros to random bytes.
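The offset arithmetic behind that check can be written out as a small sketch. It assumes, as in the luksDump output above, a single crypt segment whose data starts 16777216 bytes into the partition; the helper names are illustrative only.

```python
# Map an offset on the dm-crypt device to the offset on the raw partition,
# ASSUMING one LUKS data segment that starts at a fixed byte offset.
LUKS_DATA_OFFSET = 16777216  # "offset: 16777216 [bytes]" from luksDump


def raw_offset(mapped_offset, data_offset=LUKS_DATA_OFFSET):
    """Offset on the underlying partition for a given dm-crypt offset."""
    return mapped_offset + data_offset


def is_zeroed(block):
    """True if a non-empty block of bytes consists only of zero bytes."""
    return len(block) > 0 and not any(block)


mapped = 920567676928          # "physical" from the checksum warning
raw = raw_offset(mapped)
print(f"{mapped:x} -> {raw:x}")  # the addresses xxd shows in its left column
```

The two hex values match the addresses in the xxd output above (d6561c0000 on the mapped device, d6571c0000 on the raw partition), and is_zeroed() is the test that the raw 32 bytes pass while the decrypted ones do not.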
I wonder if I got hit by some miscalculated DISCARD or something that
wiped the wrong area of the disk. It could also be a hardware
failure, but I see nothing relevant in nvme {smart,error}-log.
--
Tavian Barnes
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong)
2024-02-26 15:49 ` Tavian Barnes
@ 2024-02-26 15:56 ` David Sterba
0 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2024-02-26 15:56 UTC (permalink / raw)
To: Tavian Barnes; +Cc: Qu Wenruo, linux-btrfs
On Mon, Feb 26, 2024 at 10:49:40AM -0500, Tavian Barnes wrote:
> On Mon, Feb 26, 2024 at 3:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > On 2024/2/26 06:00, Tavian Barnes wrote:
> root@archiso ~ # dmesg | grep svg
> [ 2967.927789] BTRFS warning (device dm-0): checksum error at logical
> 7270571376640 on dev /dev/mapper/slash3, physical 920567676928, root
> 136483, inode 60843632, offset 110592, length 4096, links 1 (path:
> usr/share/inkscape/tutorials/tutorial-shapes.nl.svg)
> root@archiso ~ # xxd -s 920567676928 -l 32 /dev/mapper/slash3
> d6561c0000: 1a9c a774 a62d 61dc 96e6 fca8 0070 2326 ...t.-a......p#&
> d6561c0010: 7579 99b0 096d d4f2 453d 54e1 ec76 81e0 uy...m..E=T..v..
>
> That matches the file contents at the beginning of the corruption. At
> first I thought maybe the device tree was corrupt, pointing the stripe
> to the wrong disk offset and reading something random, but then I
> thought to check the raw encrypted bytes at the corresponding offset:
>
> root@archiso ~ # cryptsetup luksDump /dev/nvme3n1p2 | grep -B2 offset
> Data segments:
> 0: crypt
> offset: 16777216 [bytes]
> root@archiso ~ # xxd -s $((920567676928 + 16777216)) -l 32 /dev/nvme3n1p2
> d6571c0000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> d6571c0010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
>
> So what I'm seeing is basically a whole stripe that has been zeroed
> out, and LUKS is decrypting those zeros to random bytes.
That's a good find and the explanation sounds plausible. It's not the
first time we've seen strange errors in connection with LUKS/dm-crypt
and discard.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong)
2024-02-25 19:30 Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong) Tavian Barnes
2024-02-26 8:30 ` Qu Wenruo
@ 2024-02-26 15:53 ` David Sterba
2024-02-26 16:03 ` Tavian Barnes
1 sibling, 1 reply; 6+ messages in thread
From: David Sterba @ 2024-02-26 15:53 UTC (permalink / raw)
To: Tavian Barnes; +Cc: linux-btrfs
On Sun, Feb 25, 2024 at 02:30:22PM -0500, Tavian Barnes wrote:
> Well, bad news: I started bisecting from v6.0 and after a couple
> rounds, my root fs is really corrupted:
The span of releases where you can reproduce it is quite wide, 6.0 until
6.7. I think there's a possibility that you hit the new bug in 6.7 and
the error propagated to the filesystem so that now it's detectable on
any lower version too.
We have only indirect evidence here: 2 reports of the page reference
counts, both in a short window after 6.7. The lack of other reports
would point to either one-time damage or some other factor like
hardware problems.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Corruption while bisecting (was: [PATCH] btrfs: tree-checker: dump the page status if hit something wrong)
2024-02-26 15:53 ` David Sterba
@ 2024-02-26 16:03 ` Tavian Barnes
0 siblings, 0 replies; 6+ messages in thread
From: Tavian Barnes @ 2024-02-26 16:03 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs
On Mon, Feb 26, 2024 at 10:53 AM David Sterba <dsterba@suse.cz> wrote:
> On Sun, Feb 25, 2024 at 02:30:22PM -0500, Tavian Barnes wrote:
> > Well, bad news: I started bisecting from v6.0 and after a couple
> > rounds, my root fs is really corrupted:
>
> The span of releases where you can reproduce it is quite wide, 6.0 until
> 6.7. I think there's a possibility that you hit the new bug in 6.7 and
> the error propagated to the filesystem so that now it's detectable on
> any lower version too.
To be clear I didn't reproduce the bug on v6.0. I did still see it on
v6.5. At the point just before this corruption happened, I had marked
v6.4-rc[something] as good in the bisect and was rebuilding the next
version when everything started dying to SIGBUS.
> We have only indirect evidence here: 2 reports of the page reference
> counts, both in a short window after 6.7. The lack of other reports
> would point to either one-time damage or some other factor like
> hardware problems.
--
Tavian Barnes
^ permalink raw reply [flat|nested] 6+ messages in thread