* parent transid verify failed on raid1
From: Ivo Smits @ 2025-05-07 16:33 UTC
To: linux-btrfs
Hello everyone,
After some abuse (drive going offline and unexpected shutdowns) one of
my fairly large BTRFS filesystems seems to suffer from some corruption.
The filesystem still mounts and operates mostly fine. A lot of errors
(probably caused by a drive going offline and later returning) have been
recovered from a good RAID1 mirror by scrub a little while ago, but some
problems persist.
The kernel log is repeating the following two messages about every 30
seconds:
BTRFS error (device sdg1): parent transid verify failed on logical
31419461632000 mirror 1 wanted 1240926 found 1089963
BTRFS error (device sdg1): parent transid verify failed on logical
31419461632000 mirror 2 wanted 1240926 found 1089963
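To compare the two messages field by field, a minimal Python sketch such as the following could extract the numbers. The regular expression is derived only from the two log lines quoted above, and the function name and the "behind" field are hypothetical helpers for illustration, not part of any btrfs tool.

```python
import re

# Field layout taken from the two kernel messages quoted above (an assumption
# about the format; other kernel versions may word the message differently).
TRANSID_RE = re.compile(
    r"parent transid verify failed on logical (\d+) "
    r"mirror (\d+) wanted (\d+) found (\d+)"
)


def parse_transid_errors(log_text):
    """Return one dict per 'parent transid verify failed' message."""
    # The mail client wrapped the log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    errors = []
    for match in TRANSID_RE.finditer(flat):
        logical, mirror, wanted, found = map(int, match.groups())
        errors.append({
            "logical": logical,
            "mirror": mirror,
            "wanted": wanted,
            "found": found,
            # How many generations the block on disk lags behind its parent.
            "behind": wanted - found,
        })
    return errors


sample = """
BTRFS error (device sdg1): parent transid verify failed on logical
31419461632000 mirror 1 wanted 1240926 found 1089963
BTRFS error (device sdg1): parent transid verify failed on logical
31419461632000 mirror 2 wanted 1240926 found 1089963
"""

for err in parse_transid_errors(sample):
    print(err)
```

For the sample above, both mirrors report the same logical address and the same stale generation, 150963 transactions behind the wanted one.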
I suspect this might be some background process in the kernel trying to
clean things up since it starts after mounting and doesn't stop.
While all data seems to be accessible, some filesystem operations (like
balancing partially empty block groups or shrinking the filesystem) block
forever. I suspect they end up waiting for the failing maintenance task,
or retrying indefinitely because of the same error. I think an older
kernel version even remounted the filesystem read-only when running these
operations, likely because the transid error produced a different error
code (I/O error) back then.
Scrub reports two unrecoverable errors:
BTRFS warning (device sdg1): tree block 31419461632000 mirror 2 has bad
generation, has 1089963 want 1240926
BTRFS warning (device sdg1): tree block 31419461632000 mirror 1 has bad
generation, has 1089963 want 1240926
BTRFS warning (device sdg1): checksum/header error at logical
31419461632000 on dev /dev/sdh1, physical 4363721801728: metadata leaf
(level 0) in tree 29210773045248
BTRFS warning (device sdg1): checksum/header error at logical
31419461632000 on dev /dev/sdh1, physical 4363721801728: metadata leaf
(level 0) in tree 29210773045248
BTRFS warning (device sdg1): tree block 31419461632000 mirror 0 has bad
generation, has 1089963 want 1240926
BTRFS error (device sdg1): unable to fixup (regular) error at logical
31419461632000 on dev /dev/sdh1
BTRFS warning (device sdg1): tree block 31420155953152 mirror 2 has bad
generation, has 1090179 want 1211718
BTRFS warning (device sdg1): tree block 31420155953152 mirror 1 has bad
generation, has 1090179 want 1211718
BTRFS warning (device sdg1): checksum/header error at logical
31420155953152 on dev /dev/sdh1, physical 4364416122880: metadata leaf
(level 0) in tree 14366209343488
BTRFS warning (device sdg1): checksum/header error at logical
31420155953152 on dev /dev/sdh1, physical 4364416122880: metadata leaf
(level 0) in tree 14366209343488
BTRFS warning (device sdg1): tree block 31420155953152 mirror 0 has bad
generation, has 1090179 want 1211718
BTRFS error (device sdg1): unable to fixup (regular) error at logical
31420155953152 on dev /dev/sdh1
So I tried to locate the affected file(s) for the problematic blocks:
# btrfs inspect-internal dump-tree -b 31419461632000 /dev/sdh1
leaf 31419461632000 items 1 free space 13458 generation 1089963 owner
CSUM_TREE
leaf 31419461632000 flags 0x1(WRITTEN) backref revision 1
fs uuid b044297e-9527-4d22-bb66-c09206ad8aa7
chunk uuid 943d34f6-586f-414f-91a0-67b3f04e2feb
item 0 key (EXTENT_CSUM EXTENT_CSUM 36776010444800) itemoff
13483 itemsize 2800
range start 36776010444800 end 36776013312000 length
2867200
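The range printed for this checksum item can be cross-checked with a little arithmetic. The sketch below assumes 4-byte crc32c checksums and a 4096-byte sector size; neither is stated in this thread, they are only the usual btrfs defaults, and the function name is hypothetical.

```python
# Assumptions (not stated in the thread): crc32c checksums of 4 bytes each
# and a 4096-byte sector size, the usual btrfs defaults.
CSUM_SIZE = 4
SECTOR_SIZE = 4096


def csum_item_range(key_offset, item_size,
                    csum_size=CSUM_SIZE, sector_size=SECTOR_SIZE):
    """Return (start, end, length) of the data range one EXTENT_CSUM item covers."""
    if item_size % csum_size:
        raise ValueError("item size is not a whole number of checksums")
    # One checksum per sector, so the item covers (count * sector_size) bytes.
    length = (item_size // csum_size) * sector_size
    return key_offset, key_offset + length, length


# Key offset and itemsize taken from item 0 in the dump-tree output above.
start, end, length = csum_item_range(36776010444800, 2800)
print(start, end, length)
```

Under these assumptions the item holds 700 checksums and covers 2867200 bytes, which agrees with the "range start", "end" and "length" values in the dump.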
I then managed to find some filenames for the specified range using
logical-resolve. The filenames pointed to some old backups, so I decided
to remove those subvolumes, hoping that btrfs could simply release the
corrupted block. Unfortunately this did not work: logical-resolve now
fails because it can't find the referenced subvol_id, and the kernel
errors continue.
I also tried to check the other corrupt block found by scrub. This one
seems to have a lot more items so might be more problematic, but does
not normally show up in the kernel log:
# btrfs inspect dump-tree -b 31420155953152 /dev/sdh1
leaf 31420155953152 items 127 free space 3022 generation 1090179 owner
EXTENT_TREE
leaf 31420155953152 flags 0x1(WRITTEN) backref revision 1
fs uuid b044297e-9527-4d22-bb66-c09206ad8aa7
chunk uuid 943d34f6-586f-414f-91a0-67b3f04e2feb
item 0 key (31420152774656 METADATA_ITEM 0) itemoff 16007
itemsize 276
refs 28 gen 1073175 flags TREE_BLOCK
...
I also ran btrfs check while the filesystem was unmounted. This first
discovered the two transid failures also found by scrub, and then
continued to find a lot more errors, like reference count and bytenr
mismatches. Since the filesystem appears to operate normally and scrub
did not find those errors, could this just be blocks which are no longer
part of the filesystem tree, possibly not even referenced by anything?
Is this situation something btrfs check can fix? Is it possible to only
let it fix the most problematic transid error and ignore everything
else? Could manually patching the transid value help btrfs clean things up?
Most of the data on the filesystem is backup data, or can be backed up
elsewhere, so losing some files would not be the end of the world, as
long as damaged files can be identified and there is no silent data
corruption.
Best regards,
Ivo
* Re: parent transid verify failed on raid1
From: Qu Wenruo @ 2025-05-07 21:53 UTC
To: Ivo Smits, linux-btrfs
On 2025/5/8 02:03, Ivo Smits wrote:
> Hello everyone,
>
> After some abuse (drive going offline and unexpected shutdowns) one of
> my fairly large BTRFS filesystems seems to suffer from some corruption.
> The filesystem still mounts and operates mostly fine. A lot of errors
> (probably caused by a drive going offline and later returning) have been
> recovered from a good RAID1 mirror by scrub a little while ago, but some
> problems persist.
>
> The kernel log is repeating the following two messages about every 30
> seconds:
>
> BTRFS error (device sdg1): parent transid verify failed on logical
> 31419461632000 mirror 1 wanted 1240926 found 1089963
> BTRFS error (device sdg1): parent transid verify failed on logical
> 31419461632000 mirror 2 wanted 1240926 found 1089963
A transid mismatch is mostly a death sentence for a btrfs filesystem.
This normally means bad metadata COW or bad hardware FLUSH/FUA behavior.
>
> I suspect this might be some background process in the kernel trying to
> clean things up since it starts after mounting and doesn't stop.
Nope, no regular operation should lead to such a problem.
Not to mention both mirrors share the same bad transid.
[...]
>
> I also ran btrfs check while the filesystem was unmounted. This first
> discovered the two transid failures also found by scrub, and then
> continued to find a lot more errors, like reference count and bytenr
> mismatches. Since the filesystem appears to operate normally and scrub
> did not find those errors, could this just be blocks which are no longer
> part of the filesystem tree, possibly not even referenced by anything?
When anything goes wrong on btrfs, please just run "btrfs check --readonly"
on the unmounted fs directly.
That's the only reliable way to evaluate the problem.
If the fs is too large, or you want a better way to show the errors,
"btrfs check --readonly --mode=lowmem" will also help.
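As a rough illustration of the two invocations above, a small Python wrapper might look like this. The function names are hypothetical, the device path is the one from the report, and the sketch assumes the filesystem is already unmounted and that btrfs-progs is installed.

```python
import subprocess


def btrfs_check_cmd(device, lowmem=False):
    """Build the argument list for a read-only btrfs check."""
    cmd = ["btrfs", "check", "--readonly"]
    if lowmem:
        # Low-memory mode, as suggested above for very large filesystems.
        cmd.append("--mode=lowmem")
    cmd.append(device)
    return cmd


def run_check(device, lowmem=False):
    """Run the check on an unmounted device; return (exit code, combined output)."""
    proc = subprocess.run(
        btrfs_check_cmd(device, lowmem),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


# Only print the command here; call run_check() once the fs is unmounted.
print(" ".join(btrfs_check_cmd("/dev/sdh1", lowmem=True)))
```

Since --readonly is always passed, the wrapper never asks btrfs check to repair anything; it only collects the report for later inspection.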
>
> Is this situation something btrfs check can fix? Is it possible to only
> let it fix the most problematic transid error and ignore everything
> else? Could manually patching the transid value help btrfs clean things up?
Normally no to all the questions above.
Thanks,
Qu