* BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
@ 2024-10-01 14:15 Peter Volkov
2024-10-01 15:09 ` David Sterba
0 siblings, 1 reply; 7+ messages in thread
From: Peter Volkov @ 2024-10-01 14:15 UTC (permalink / raw)
To: linux-btrfs
Hi! I've been using this system with this kernel (6.10.10) for a few
months already and today out of nowhere btrfs broke with this error
message:
[53923.816740] page dumped because: eb page dump
[53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
block=1035372494848 slot=364, bad key order, current (8796143471049
108 0) next (50450969 1 0)
[53923.816750] BTRFS info (device dm-0): node 1035372494848 level 1
gen 2990872 total ptrs 392 free spc 101 owner 256
[53923.816753] key 0 (50012416 1 0) block 933847334912 gen 2917413
[53923.816756] key 1 (50012429 108 0) block 1077754986496 gen 2981970
[53923.816758] key 2 (50012438 1 0) block 933899796480 gen 2917414
[53923.816759] key 3 (50012446 12 14907231) block 933847367680 gen 2917413
[53923.816761] key 4 (50012460 108 0) block 933743067136 gen 2980206
[53923.816763] key 5 (50012466 108 0) block 933743083520 gen 2980206
(Full dmesg in attachment)
With this error message btrfs went into RO mode. I've saved dmesg and
booted a livecd to investigate what happened. While I continue to
look through similar reports on the internet, I decided to ask here for
help, since maybe this problem is already known and you could point
me to the correct solution. I did find some similar reports for
similar kernel versions:
https://www.reddit.com/r/btrfs/comments/1fbepoh/btrfs_filesystem_suddenly_died/
https://lkml.org/lkml/2024/7/17/556
https://discussion.fedoraproject.org/t/kernel-6-10-9-causes-system-to-boot-to-read-only-mode-for-btrfs/131472
The difference is that those reports show "corrupt leaf",
while I have a corrupt node.
Now I'm trying to run btrfs check and here is the output I receive:
===========================================================================================
Opening filesystem to check...
Checking filesystem on /dev/mapper/dev-root
UUID: 3be5c9c5-f5be-4ba3-8405-2740e86149ef
[1/7] checking root items
Error: could not find extent items for root 256
ERROR: failed to repair root items: No such file or directory
[2/7] checking extents
ref mismatch on [294031360 8192] extent item 0, found 1
data backref 294031360 root 257 owner 1237292 offset 0 num_refs 0 not
found in extent tree
incorrect local backref count on 294031360 root 257 owner 1237292
offset 0 found 1 wanted 0 back 0x559ef6c3ce70
backpointer mismatch on [294031360 8192]
ref mismatch on [294039552 4096] extent item 0, found 1
data backref 294039552 root 257 owner 1237293 offset 0 num_refs 0 not
found in extent tree
incorrect local backref count on 294039552 root 257 owner 1237293
offset 0 found 1 wanted 0 back 0x559ef6c3cd40
backpointer mismatch on [294039552 4096]
ref mismatch on [294043648 4096] extent item 0, found 1
data backref 294043648 root 257 owner 1237294 offset 0 num_refs 0 not
found in extent tree
incorrect local backref count on 294043648 root 257 owner 1237294
offset 0 found 1 wanted 0 back 0x559ef6c3cc10
backpointer mismatch on [294043648 4096]
ref mismatch on [294047744 4096] extent item 0, found 1
data backref 294047744 root 257 owner 1237295 offset 0 num_refs 0 not
found in extent tree
incorrect local backref count on 294047744 root 257 owner 1237295
offset 0 found 1 wanted 0 back 0x559ef6c3cae0
backpointer mismatch on [294047744 4096]
ref mismatch on [294051840 8192] extent item 0, found 1
data backref 294051840 root 257 owner 1237296 offset 0 num_refs 0 not
found in extent tree
(and many, many more lines like these; actually I'm still waiting for btrfs
check to finish)
===========================================================================================
I cannot show the output of the btrfs commands from the host, but here is the
output from the liveCD I'm currently in:
calculate ~ # btrfs --version
btrfs-progs v6.0.2
~ # btrfs fi show
Label: 'btrfs-systems' uuid: d5214342-ccfc-42c1-9491-804aae1a7e1a
Total devices 1 FS bytes used 602.73MiB
devid 1 size 4.98GiB used 2.27GiB path /dev/mapper/dev-systems
Label: 'btrfs-root' uuid: 3be5c9c5-f5be-4ba3-8405-2740e86149ef
Total devices 1 FS bytes used 899.14GiB
devid 1 size 910.00GiB used 910.00GiB path /dev/mapper/dev-root
Is this a known problem? Given the output above, do you think it is
safe to run btrfs check with the --repair option?
--
Peter.
[-- Attachment #2: dmesg.xz --]
[-- Type: application/x-xz, Size: 35252 bytes --]
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
2024-10-01 14:15 BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0) Peter Volkov
@ 2024-10-01 15:09 ` David Sterba
2024-10-01 17:10 ` Peter Volkov
0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2024-10-01 15:09 UTC (permalink / raw)
To: Peter Volkov; +Cc: linux-btrfs
On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
> Hi! I've been using this system with this kernel (6.10.10) for a few
> months already and today out of nowhere btrfs broke with this error
> message:
>
> [53923.816740] page dumped because: eb page dump
> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
> block=1035372494848 slot=364, bad key order, current (8796143471049
> 108 0) next (50450969 1 0)
Quite obvious memory bitflip:
8796143471049 = 0x8000301c9c9
50450969 = 0x301d219
The first one should probably be 0x301c9c9, but it's impossible to tell
how much other data/metadata could have been hit by this or another
memory bitflip, so check can detect the problems but not fix them.
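
The arithmetic behind this diagnosis can be reproduced with a short script. This is only an illustrative sketch: the helper name single_bit_repairs is hypothetical, and it checks just the upper neighbour (slot 365), because the key at slot 363 is not shown in the excerpt above.

```python
def single_bit_repairs(bad, upper, width=64):
    """Return (bit, candidate) pairs where flipping exactly one bit of
    `bad` gives a value that sorts below `upper`."""
    repairs = []
    for bit in range(width):
        candidate = bad ^ (1 << bit)
        if candidate < upper:
            repairs.append((bit, candidate))
    return repairs


current = 8796143471049    # objectid at slot 364, from the dmesg line
following = 50450969       # objectid at slot 365, from the dmesg line

print(hex(current), hex(following))   # 0x8000301c9c9 0x301d219
for bit, candidate in single_bit_repairs(current, following):
    # Only bit 43 qualifies: 50448841 = 0x301c9c9
    print(f"bit {bit}: {candidate} = {hex(candidate)}")
```

Exactly one single-bit change restores the ordering against the next key, which is why the value looks like a flipped bit rather than random garbage. It says nothing about how many other blocks may have been affected.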
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
2024-10-01 15:09 ` David Sterba
@ 2024-10-01 17:10 ` Peter Volkov
2024-10-01 17:55 ` Matthew Warren
2024-10-01 22:12 ` Qu Wenruo
0 siblings, 2 replies; 7+ messages in thread
From: Peter Volkov @ 2024-10-01 17:10 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs
On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
> > Hi! I've been using this system with this kernel (6.10.10) for a few
> > months already and today out of nowhere btrfs broke with this error
> > message:
> >
> > [53923.816740] page dumped because: eb page dump
> > [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
> > block=1035372494848 slot=364, bad key order, current (8796143471049
> > 108 0) next (50450969 1 0)
>
> Quite obvious memory bitflip:
>
> 8796143471049 = 0x8000301c9c9
> 50450969 = 0x301d219
>
> The first one should probably be 0x301c9c9, but it's impossible to tell
> how many other data/metadata could have been hit by this or another
> memory bitflip so check can detect the things but not fix.
Thank you, David! Is my understanding correct that btrfs catches
memory problems, so this bitflip most probably means that my drive
is failing?
--
Peter.
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
2024-10-01 17:10 ` Peter Volkov
@ 2024-10-01 17:55 ` Matthew Warren
2024-10-01 22:12 ` Qu Wenruo
1 sibling, 0 replies; 7+ messages in thread
From: Matthew Warren @ 2024-10-01 17:55 UTC (permalink / raw)
To: Peter Volkov; +Cc: dsterba, linux-btrfs
> so this bitflip most probably means that my drive is failing
It could be either a failing device or a memory issue. I'd recommend
running a memory test to rule out the memory being bad. If this is a
multi-device filesystem that uses a profile with redundancy then this
is most likely a memory bitflip issue.
Matthew Warren
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
2024-10-01 17:10 ` Peter Volkov
2024-10-01 17:55 ` Matthew Warren
@ 2024-10-01 22:12 ` Qu Wenruo
2024-10-04 8:01 ` Peter Volkov
1 sibling, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2024-10-01 22:12 UTC (permalink / raw)
To: Peter Volkov, dsterba; +Cc: linux-btrfs
On 2024/10/2 02:40, Peter Volkov wrote:
> On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
>> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
>>> Hi! I've been using this system with this kernel (6.10.10) for a few
>>> months already and today out of nowhere btrfs broke with this error
>>> message:
>>>
>>> [53923.816740] page dumped because: eb page dump
>>> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
>>> block=1035372494848 slot=364, bad key order, current (8796143471049
>>> 108 0) next (50450969 1 0)
>>
>> Quite obvious memory bitflip:
>>
>> 8796143471049 = 0x8000301c9c9
>> 50450969 = 0x301d219
>>
>> The first one should probably be 0x301c9c9, but it's impossible to tell
>> how many other data/metadata could have been hit by this or another
>> memory bitflip so check can detect the things but not fix.
>
> Thank you David! Is my understanding correct, that btrfs catches
> memory problems,
> so this bitflip most probably means that my drive is failing?
In this particular case, it's your hardware memory, not the drive.
The error is happening at write time, so the metadata read from disk is
fine, thus it's not your drive returning some weird data.
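
For reference, the "bad key order" message means that two adjacent keys in a tree block are not strictly increasing. Keys compare by objectid, then type, then offset. The following is a simplified sketch of that comparison using the keys from the dmesg output, not the kernel's actual code; the function name first_bad_slot is made up for illustration.

```python
from typing import List, Optional, Tuple

Key = Tuple[int, int, int]  # (objectid, type, offset)


def first_bad_slot(keys: List[Key]) -> Optional[int]:
    """Return the index of the first key that is not strictly smaller
    than its successor, or None if the list is correctly ordered."""
    for slot in range(len(keys) - 1):
        # Python compares tuples field by field, left to right.
        if keys[slot] >= keys[slot + 1]:
            return slot
    return None


# Keys 0..5 as dumped in the original report: correctly ordered.
dumped = [
    (50012416, 1, 0),
    (50012429, 108, 0),
    (50012438, 1, 0),
    (50012446, 12, 14907231),
    (50012460, 108, 0),
    (50012466, 108, 0),
]
print(first_bad_slot(dumped))    # None

# The reported pair (slots 364 and 365 in the real node; index 0 here).
bad_pair = [(8796143471049, 108, 0), (50450969, 1, 0)]
print(first_bad_slot(bad_pair))  # 0
```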
Furthermore, it's pretty hard for a simple bitflip to pass the
internal checksums of the storage device, thus it's very unlikely it's
your drive.
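
A rough illustration of why a lone bitflip rarely survives a checksum: the sketch below flips every bit of a sample block in turn and counts how many flips leave the CRC unchanged. It uses zlib.crc32 (plain CRC-32) purely as a stand-in; drives use their own internal ECC, and btrfs metadata defaults to crc32c, neither of which is modelled here. The helper name and the sample block are assumptions for the example.

```python
import zlib


def undetected_single_bit_flips(block: bytes) -> int:
    """Count single-bit corruptions of `block` whose CRC-32 still
    matches the CRC-32 of the original data."""
    good = zlib.crc32(block)
    undetected = 0
    for bit in range(len(block) * 8):
        flipped = bytearray(block)
        flipped[bit // 8] ^= 1 << (bit % 8)   # flip exactly one bit
        if zlib.crc32(bytes(flipped)) == good:
            undetected += 1
    return undetected


sample = bytes(range(64))                   # arbitrary 64-byte block
print(undetected_single_bit_flips(sample))  # 0
```

The other side of this is that a flip in RAM that happens before the checksum is computed gets checksummed along with the rest of the block, so the stored checksum matches the corrupted contents.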
So, please do a full memtest of your system before doing anything else.
And considering your fsck result is already bad, there's no doubt that some
bitflip has already corrupted the extent tree, and I believe the csum tree
is also corrupted.
Thanks,
Qu
>
> --
> Peter.
>
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
2024-10-01 22:12 ` Qu Wenruo
@ 2024-10-04 8:01 ` Peter Volkov
2024-10-04 8:28 ` Qu Wenruo
0 siblings, 1 reply; 7+ messages in thread
From: Peter Volkov @ 2024-10-04 8:01 UTC (permalink / raw)
To: Qu Wenruo; +Cc: dsterba, linux-btrfs
On Wed, Oct 2, 2024 at 1:12 AM Qu Wenruo <wqu@suse.com> wrote:
> On 2024/10/2 02:40, Peter Volkov wrote:
> > On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
> >> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
> >>> Hi! I've been using this system with this kernel (6.10.10) for a few
> >>> months already and today out of nowhere btrfs broke with this error
> >>> message:
> >>>
> >>> [53923.816740] page dumped because: eb page dump
> >>> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
> >>> block=1035372494848 slot=364, bad key order, current (8796143471049
> >>> 108 0) next (50450969 1 0)
> >>
> >> Quite obvious memory bitflip:
> >>
> >> 8796143471049 = 0x8000301c9c9
> >> 50450969 = 0x301d219
> >>
> >> The first one should probably be 0x301c9c9, but it's impossible to tell
> >> how many other data/metadata could have been hit by this or another
> >> memory bitflip so check can detect the things but not fix.
> >
> > Thank you David! Is my understanding correct, that btrfs catches
> > memory problems,
> > so this bitflip most probably means that my drive is failing?
>
> In this particular case, it's your hardware memory, not the drive.
Thank you, guys! You are right. memtest showed memory errors.
> The error is happening at write time, so the metadata read from disk is
> fine, thus it's not your drive returning some weird data.
>
> Furthermore, it's pretty hard that a simple bitflip can pass the
> internal checksums of the storage device, thus it's very unlikely it's
> your drive.
>
> So, please do a full memtest of your system before doing anything else.
>
> And considering your fsck result is already bad, it's no doubt that some
> bitflip has already corrupted extent tree, and I believe the csum tree
> is also corrupted.
So I have to start over from the last backup. Or is it possible to fix
some of these bitflips to read at least part of the tree?
--
Peter.
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
2024-10-04 8:01 ` Peter Volkov
@ 2024-10-04 8:28 ` Qu Wenruo
0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2024-10-04 8:28 UTC (permalink / raw)
To: Peter Volkov; +Cc: dsterba, linux-btrfs
On 2024/10/4 17:31, Peter Volkov wrote:
> On Wed, Oct 2, 2024 at 1:12 AM Qu Wenruo <wqu@suse.com> wrote:
>> On 2024/10/2 02:40, Peter Volkov wrote:
>>> On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
>>>> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
>>>>> Hi! I've been using this system with this kernel (6.10.10) for a few
>>>>> months already and today out of nowhere btrfs broke with this error
>>>>> message:
>>>>>
>>>>> [53923.816740] page dumped because: eb page dump
>>>>> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
>>>>> block=1035372494848 slot=364, bad key order, current (8796143471049
>>>>> 108 0) next (50450969 1 0)
>>>>
>>>> Quite obvious memory bitflip:
>>>>
>>>> 8796143471049 = 0x8000301c9c9
>>>> 50450969 = 0x301d219
>>>>
>>>> The first one should probably be 0x301c9c9, but it's impossible to tell
>>>> how many other data/metadata could have been hit by this or another
>>>> memory bitflip so check can detect the things but not fix.
>>>
>>> Thank you David! Is my understanding correct, that btrfs catches
>>> memory problems,
>>> so this bitflip most probably means that my drive is failing?
>>
>> In this particular case, it's your hardware memory, not the drive.
>
> Thank you, guys! You are right. memtest showed memory errors.
>
>> The error is happening at write time, so the metadata read from disk is
>> fine, thus it's not your drive returning some weird data.
>>
>> Furthermore, it's pretty hard that a simple bitflip can pass the
>> internal checksums of the storage device, thus it's very unlikely it's
>> your drive.
>>
>> So, please do a full memtest of your system before doing anything else.
>>
>> And considering your fsck result is already bad, it's no doubt that some
>> bitflip has already corrupted extent tree, and I believe the csum tree
>> is also corrupted.
>
> So I have to start over from last backup. Or is it possible to fix
> some of this bitflips to read at least part of tree?
In theory, it's possible to fix the problem with complex manual
intervention.
It would be an interesting adventure if you're a btrfs developer;
otherwise it would mean weeks of back-and-forth with some developers,
and may still not fully repair everything.
I'd prefer to do a full restore onto a new fs, of course after the
hardware memory problem is solved.
Thanks,
Qu
>
> --
> Peter.