parent transid verify failed on raid1

* parent transid verify failed on raid1
From: Ivo Smits @ 2025-05-07 16:33 UTC
  To: linux-btrfs

Hello everyone,

After some abuse (drive going offline and unexpected shutdowns) one of 
my fairly large BTRFS filesystems seems to suffer from some corruption. 
The filesystem still mounts and operates mostly fine. A lot of errors 
(probably caused by a drive going offline and later returning) have been 
recovered from a good RAID1 mirror by scrub a little while ago, but some 
problems persist.

The kernel log is repeating the following two messages about every 30 
seconds:

BTRFS error (device sdg1): parent transid verify failed on logical 31419461632000 mirror 1 wanted 1240926 found 1089963
BTRFS error (device sdg1): parent transid verify failed on logical 31419461632000 mirror 2 wanted 1240926 found 1089963
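
A minimal sketch of how these messages could be summarized per tree block, for anyone wanting to confirm from a longer log that every mirror reports the same stale generation. The regular expression and the helper names (`parse_transid_errors`, `summarize`) are illustrative assumptions, not part of any btrfs tool; the sample text is the two lines quoted above.

```python
import re

# Matches the kernel message format quoted above; group names are illustrative.
TRANSID_RE = re.compile(
    r"parent transid verify failed on logical (?P<logical>\d+) "
    r"mirror (?P<mirror>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_errors(log_text):
    """Return one dict of integers per transid error found in log_text."""
    # Mail clients wrap long kernel lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        {name: int(value) for name, value in match.groupdict().items()}
        for match in TRANSID_RE.finditer(flat)
    ]


def summarize(errors):
    """Group errors by logical address of the affected tree block."""
    summary = {}
    for err in errors:
        entry = summary.setdefault(
            err["logical"], {"mirrors": [], "wanted": set(), "found": set()}
        )
        entry["mirrors"].append(err["mirror"])
        entry["wanted"].add(err["wanted"])
        entry["found"].add(err["found"])
    return summary


SAMPLE_LOG = """\
BTRFS error (device sdg1): parent transid verify failed on logical
31419461632000 mirror 1 wanted 1240926 found 1089963
BTRFS error (device sdg1): parent transid verify failed on logical
31419461632000 mirror 2 wanted 1240926 found 1089963
"""

summary = summarize(parse_transid_errors(SAMPLE_LOG))
for logical, entry in summary.items():
    # A single "found" value across all mirrors means no good copy is left.
    same_on_all = len(entry["found"]) == 1
    behind = max(entry["wanted"]) - min(entry["found"])
    print(logical, sorted(entry["mirrors"]), same_on_all, behind)
```

For the sample, both mirrors report the same found generation, which is 150963 generations older than the one the parent node expects.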

I suspect this might be some background process in the kernel trying to 
clean things up since it starts after mounting and doesn't stop.

While all data seems to be accessible, some filesystem operations (like 
balancing partially empty blocks or reducing the size) block forever. I 
suspect this might end up waiting for the failing maintenance or 
retrying indefinitely because of the same error. I think an older kernel 
version even remounted the filesystem read-only when running these 
operations, likely because the transid error produced a different error 
code before (I/O error).

Scrub reports two unrecoverable errors:

BTRFS warning (device sdg1): tree block 31419461632000 mirror 2 has bad generation, has 1089963 want 1240926
BTRFS warning (device sdg1): tree block 31419461632000 mirror 1 has bad generation, has 1089963 want 1240926
BTRFS warning (device sdg1): checksum/header error at logical 31419461632000 on dev /dev/sdh1, physical 4363721801728: metadata leaf (level 0) in tree 29210773045248
BTRFS warning (device sdg1): checksum/header error at logical 31419461632000 on dev /dev/sdh1, physical 4363721801728: metadata leaf (level 0) in tree 29210773045248
BTRFS warning (device sdg1): tree block 31419461632000 mirror 0 has bad generation, has 1089963 want 1240926
BTRFS error (device sdg1): unable to fixup (regular) error at logical 31419461632000 on dev /dev/sdh1

BTRFS warning (device sdg1): tree block 31420155953152 mirror 2 has bad generation, has 1090179 want 1211718
BTRFS warning (device sdg1): tree block 31420155953152 mirror 1 has bad generation, has 1090179 want 1211718
BTRFS warning (device sdg1): checksum/header error at logical 31420155953152 on dev /dev/sdh1, physical 4364416122880: metadata leaf (level 0) in tree 14366209343488
BTRFS warning (device sdg1): checksum/header error at logical 31420155953152 on dev /dev/sdh1, physical 4364416122880: metadata leaf (level 0) in tree 14366209343488
BTRFS warning (device sdg1): tree block 31420155953152 mirror 0 has bad generation, has 1090179 want 1211718
BTRFS error (device sdg1): unable to fixup (regular) error at logical 31420155953152 on dev /dev/sdh1

So I tried to locate the affected file(s) for the problematic blocks:

# btrfs inspect-internal dump-tree -b 31419461632000 /dev/sdh1
leaf 31419461632000 items 1 free space 13458 generation 1089963 owner CSUM_TREE
leaf 31419461632000 flags 0x1(WRITTEN) backref revision 1
fs uuid b044297e-9527-4d22-bb66-c09206ad8aa7
chunk uuid 943d34f6-586f-414f-91a0-67b3f04e2feb
         item 0 key (EXTENT_CSUM EXTENT_CSUM 36776010444800) itemoff 13483 itemsize 2800
                 range start 36776010444800 end 36776013312000 length 2867200
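
The numbers in this dump can be cross-checked with a little arithmetic. The sketch below assumes a 16384-byte nodesize, a 4096-byte sectorsize and 4-byte crc32c checksums; none of these are stated in the dump, so treat them as assumptions. The 101-byte leaf header and 25-byte item header are the sizes of `struct btrfs_header` and `struct btrfs_item` in the on-disk format. The function names are illustrative only.

```python
NODESIZE = 16384        # assumed: default metadata node size
SECTORSIZE = 4096       # assumed: data sector size
CSUM_SIZE = 4           # assumed: crc32c, 4 bytes per data sector
HEADER_SIZE = 101       # sizeof(struct btrfs_header)
ITEM_HEADER_SIZE = 25   # sizeof(struct btrfs_item): 17-byte key + offset + size


def csum_item_coverage(start, itemsize, sectorsize=SECTORSIZE, csum_size=CSUM_SIZE):
    """Return (start, end, length) of the data range one EXTENT_CSUM item covers."""
    sectors = itemsize // csum_size   # one checksum per data sector
    length = sectors * sectorsize
    return start, start + length, length


def leaf_free_space(item_sizes, nodesize=NODESIZE):
    """Free bytes left in a leaf holding items with the given payload sizes."""
    used = sum(ITEM_HEADER_SIZE + size for size in item_sizes)
    return nodesize - HEADER_SIZE - used


# Values taken from the dump-tree output above.
coverage = csum_item_coverage(36776010444800, 2800)
free = leaf_free_space([2800])
print(coverage, free)
```

Under these assumptions the 2800-byte item holds 700 checksums covering 2867200 bytes of data, and the leaf has 13458 bytes free, both matching the dump. So the stale leaf describes checksums for roughly 2.7 MiB of file data starting at logical address 36776010444800.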

And then managed to find some filenames for the specified range using 
logical-resolve. The filenames pointed to some old backups, so I decided 
to remove those subvolumes, hoping that btrfs could simply release the 
corrupted block. Unfortunately this did not work: logical-resolve now 
fails because it can't find the referenced subvol_id, and the kernel 
errors continue.

I also tried to check the other corrupt block found by scrub. This one 
seems to have a lot more items so might be more problematic, but does 
not normally show up in the kernel log:

# btrfs inspect dump-tree -b 31420155953152 /dev/sdh1
leaf 31420155953152 items 127 free space 3022 generation 1090179 owner EXTENT_TREE
leaf 31420155953152 flags 0x1(WRITTEN) backref revision 1
fs uuid b044297e-9527-4d22-bb66-c09206ad8aa7
chunk uuid 943d34f6-586f-414f-91a0-67b3f04e2feb
         item 0 key (31420152774656 METADATA_ITEM 0) itemoff 16007 itemsize 276
                 refs 28 gen 1073175 flags TREE_BLOCK
...

I also ran btrfs check while the filesystem was unmounted. This first 
discovered the two transid failures also found by scrub, and then 
continued to find a lot more errors, like reference count and bytenr 
mismatches. Since the filesystem appears to operate normally and scrub 
did not find those errors, could this just be blocks which are no longer 
part of the filesystem tree, possibly not even referenced by anything?

Is this situation something btrfs check can fix? Is it possible to only 
let it fix the most problematic transid error and ignore everything 
else? Could manually patching the transid value help btrfs clean things up?

Most of the data on the filesystem is backup data, or can be backed up 
elsewhere, so losing some files would not be the end of the world, as 
long as damaged files can be identified and there is no silent data 
corruption.

Best regards,

Ivo

* Re: parent transid verify failed on raid1
From: Qu Wenruo @ 2025-05-07 21:53 UTC
  To: Ivo Smits, linux-btrfs



On 2025/5/8 02:03, Ivo Smits wrote:
> Hello everyone,
> 
> After some abuse (drive going offline and unexpected shutdowns) one of 
> my fairly large BTRFS filesystems seems to suffer from some corruption. 
> The filesystem still mounts and operates mostly fine. A lot of errors 
> (probably caused by a drive going offline and later returning) have been 
> recovered from a good RAID1 mirror by scrub a little while ago, but some 
> problems persist.
> 
> The kernel log is repeating the following two messages about every 30 
> seconds:
> 
> BTRFS error (device sdg1): parent transid verify failed on logical 
> 31419461632000 mirror 1 wanted 1240926 found 1089963
> BTRFS error (device sdg1): parent transid verify failed on logical 
> 31419461632000 mirror 2 wanted 1240926 found 1089963

A transid mismatch is mostly a death sentence for a btrfs filesystem.

This normally means bad metadata COW or bad hardware FLUSH/FUA behavior.

> 
> I suspect this might be some background process in the kernel trying to 
> clean things up since it starts after mounting and doesn't stop.

Nope, no regular operation should lead to such a problem.

Not to mention both mirrors share the same bad transid.

[...]
> 
> I also ran btrfs check while the filesystem was unmounted. This first 
> discovered the two transid failures also found by scrub, and then 
> continued to find a lot more errors, like reference count and bytenr 
> mismatches. Since the filesystem appears to operate normally and scrub 
> did not find those errors, could this just be blocks which are no longer 
> part of the filesystem tree, possibly not even referenced by anything?

When anything goes wrong on btrfs, please just run "btrfs check --readonly" 
on the unmounted fs directly.

That's the only reliable way to evaluate the problem.
If the fs is too large, or you want a better way to show the errors, 
"btrfs check --readonly --mode=lowmem" will also help.
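
A small sketch of how the two invocations above could be assembled from a script, with a guard against running on a mounted device. The helper names (`is_mounted`, `build_check_command`) and the sample mount line are hypothetical; only the `--readonly` and `--mode=lowmem` options come from the advice above. The sketch only builds the argument list and does not execute anything.

```python
def is_mounted(device, mounts_text):
    """Return True if device is the source of any line in /proc/mounts-style text.

    Limitation: for a multi-device btrfs only one member device is listed,
    so a False result for another member does not prove the fs is unmounted.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            return True
    return False


def build_check_command(device, lowmem=False):
    """Build the argv for a read-only btrfs check of device."""
    cmd = ["btrfs", "check", "--readonly"]
    if lowmem:
        # Slower, but uses less memory on large filesystems.
        cmd.append("--mode=lowmem")
    cmd.append(device)
    return cmd


# Hypothetical mount table content, in the format of /proc/mounts.
SAMPLE_MOUNTS = "/dev/sdg1 /mnt/backup btrfs rw,relatime 0 0\n"

if is_mounted("/dev/sdg1", SAMPLE_MOUNTS):
    print("refusing to check a mounted filesystem")
else:
    print(" ".join(build_check_command("/dev/sdg1", lowmem=True)))
```

In real use the mount table would be read from /proc/mounts and the resulting list passed to something like `subprocess.run`, after unmounting the filesystem.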

> 
> Is this situation something btrfs check can fix? Is it possible to only 
> let it fix the most problematic transid error and ignore everything 
> else? Could manually patching the transid value help btrfs clean things up?

Normally no to all the questions above.

Thanks,
Qu

> 
> Most of the data on the filesystem is backup data, or can be backed up 
> elsewhere, so losing some files would not be the end of the world, as 
> long as damaged files can be identified and there is no silent data 
> corruption.
> 
> Best regards,
> 
> Ivo
> 
> 
