* BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: Peter Volkov @ 2024-10-01 14:15 UTC
To: linux-btrfs

Hi!

I've been using this system with this kernel (6.10.10) for a few months already, and today, out of nowhere, btrfs broke with this error message:

[53923.816740] page dumped because: eb page dump
[53923.816743] BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
[53923.816750] BTRFS info (device dm-0): node 1035372494848 level 1 gen 2990872 total ptrs 392 free spc 101 owner 256
[53923.816753] key 0 (50012416 1 0) block 933847334912 gen 2917413
[53923.816756] key 1 (50012429 108 0) block 1077754986496 gen 2981970
[53923.816758] key 2 (50012438 1 0) block 933899796480 gen 2917414
[53923.816759] key 3 (50012446 12 14907231) block 933847367680 gen 2917413
[53923.816761] key 4 (50012460 108 0) block 933743067136 gen 2980206
[53923.816763] key 5 (50012466 108 0) block 933743083520 gen 2980206

(Full dmesg in attachment)

After this error btrfs went into RO mode. I saved the dmesg output and booted a live CD to investigate what happened. While I continue to look through similar reports on the internet, I decided to ask here for help, since this problem may already be known and you could point me to the correct solution.

At least I found some similar reports for similar kernel versions:

https://www.reddit.com/r/btrfs/comments/1fbepoh/btrfs_filesystem_suddenly_died/
https://lkml.org/lkml/2024/7/17/556
https://discussion.fedoraproject.org/t/kernel-6-10-9-causes-system-to-boot-to-read-only-mode-for-btrfs/131472

The difference is that in those reports btrfs reports "corrupt leaf", while I have a corrupt node.

Now I'm trying to run btrfs check, and here is the output I receive:

===========================================================================================
Opening filesystem to check...
Checking filesystem on /dev/mapper/dev-root
UUID: 3be5c9c5-f5be-4ba3-8405-2740e86149ef
[1/7] checking root items
Error: could not find extent items for root 256
ERROR: failed to repair root items: No such file or directory
[2/7] checking extents
ref mismatch on [294031360 8192] extent item 0, found 1
data backref 294031360 root 257 owner 1237292 offset 0 num_refs 0 not found in extent tree
incorrect local backref count on 294031360 root 257 owner 1237292 offset 0 found 1 wanted 0 back 0x559ef6c3ce70
backpointer mismatch on [294031360 8192]
ref mismatch on [294039552 4096] extent item 0, found 1
data backref 294039552 root 257 owner 1237293 offset 0 num_refs 0 not found in extent tree
incorrect local backref count on 294039552 root 257 owner 1237293 offset 0 found 1 wanted 0 back 0x559ef6c3cd40
backpointer mismatch on [294039552 4096]
ref mismatch on [294043648 4096] extent item 0, found 1
data backref 294043648 root 257 owner 1237294 offset 0 num_refs 0 not found in extent tree
incorrect local backref count on 294043648 root 257 owner 1237294 offset 0 found 1 wanted 0 back 0x559ef6c3cc10
backpointer mismatch on [294043648 4096]
ref mismatch on [294047744 4096] extent item 0, found 1
data backref 294047744 root 257 owner 1237295 offset 0 num_refs 0 not found in extent tree
incorrect local backref count on 294047744 root 257 owner 1237295 offset 0 found 1 wanted 0 back 0x559ef6c3cae0
backpointer mismatch on [294047744 4096]
ref mismatch on [294051840 8192] extent item 0, found 1
data backref 294051840 root 257 owner 1237296 offset 0 num_refs 0 not found in extent tree
(and many, many more lines like these; actually I'm still waiting for btrfs check to finish)
===========================================================================================

I cannot show the output of the btrfs commands from the host, but here is the output from the live CD I'm currently in:

calculate ~ # btrfs --version
btrfs-progs v6.0.2
~ # btrfs fi show
Label: 'btrfs-systems' uuid: d5214342-ccfc-42c1-9491-804aae1a7e1a
Total devices 1 FS bytes used 602.73MiB
devid 1 size 4.98GiB used 2.27GiB path /dev/mapper/dev-systems

Label: 'btrfs-root' uuid: 3be5c9c5-f5be-4ba3-8405-2740e86149ef
Total devices 1 FS bytes used 899.14GiB
devid 1 size 910.00GiB used 910.00GiB path /dev/mapper/dev-root

Is this a known problem? What do you think: given the output above, is it safe to run btrfs check with the --repair option?

--
Peter.

[-- Attachment #2: dmesg.xz --]
[-- Type: application/x-xz, Size: 35252 bytes --]
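Editorial sketch: the "bad key order" message means that two neighbouring keys in the node are not in ascending order. The Python below only illustrates that idea and is not the kernel's tree-checker code. It assumes keys compare as (objectid, type, offset) tuples and must be strictly increasing; the helper name first_bad_slot is hypothetical. The key values are copied from the dmesg excerpt above.

```python
from typing import List, Optional, Tuple

# A btrfs key, modelled here as a plain tuple: (objectid, type, offset).
Key = Tuple[int, int, int]


def first_bad_slot(keys: List[Key]) -> Optional[int]:
    """Return the first index whose key is not strictly smaller than the
    next key, or None if the whole list is in ascending order."""
    for slot in range(len(keys) - 1):
        # Python compares tuples field by field, left to right.
        if keys[slot] >= keys[slot + 1]:
            return slot
    return None


# Keys 0..5 from the node dump: these are correctly ordered.
dumped = [
    (50012416, 1, 0),
    (50012429, 108, 0),
    (50012438, 1, 0),
    (50012446, 12, 14907231),
    (50012460, 108, 0),
    (50012466, 108, 0),
]

# The "current" and "next" keys from the error message (slot 364 and 365).
reported = [(8796143471049, 108, 0), (50450969, 1, 0)]

print(first_bad_slot(dumped))    # None
print(first_bad_slot(reported))  # 0
```

In the second call the index 0 refers to the first element of the two-key list, which corresponds to slot 364 in the real node.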
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: David Sterba @ 2024-10-01 15:09 UTC
To: Peter Volkov; +Cc: linux-btrfs

On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
> Hi! I've been using this system with this kernel (6.10.10) for a few
> months already and today out of nowhere btrfs broke with this error
> message:
>
> [53923.816740] page dumped because: eb page dump
> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
> block=1035372494848 slot=364, bad key order, current (8796143471049
> 108 0) next (50450969 1 0)

Quite obvious memory bitflip:

8796143471049 = 0x8000301c9c9
50450969 = 0x301d219

The first one should probably be 0x301c9c9, but it's impossible to tell how much other data/metadata could have been hit by this or another memory bitflip, so check can detect the problems but not fix them.
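Editorial sketch: the arithmetic behind the bitflip diagnosis can be reproduced in a few lines of Python. The variable names are illustrative only; the numbers come from the error message and from the reply above. XOR-ing the reported objectid with the suspected original value shows which bits differ.

```python
current = 8796143471049   # objectid reported as "current" at slot 364
next_key = 50450969       # objectid of the "next" key

print(hex(current))       # 0x8000301c9c9
print(hex(next_key))      # 0x301d219

suspected = 0x301c9c9     # value suggested in the reply

# XOR leaves a 1 in every bit position where the two values differ.
diff = current ^ suspected
flipped_bits = [bit for bit in range(64) if (diff >> bit) & 1]

print(flipped_bits)                      # [43]
print(suspected, suspected < next_key)   # 50448841 True
```

Exactly one bit (bit 43) differs, and the suspected value sorts below the next key, which is consistent with a single-bit flip.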
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: Peter Volkov @ 2024-10-01 17:10 UTC
To: dsterba; +Cc: linux-btrfs

On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
> > Hi! I've been using this system with this kernel (6.10.10) for a few
> > months already and today out of nowhere btrfs broke with this error
> > message:
> >
> > [53923.816740] page dumped because: eb page dump
> > [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
> > block=1035372494848 slot=364, bad key order, current (8796143471049
> > 108 0) next (50450969 1 0)
>
> Quite obvious memory bitflip:
>
> 8796143471049 = 0x8000301c9c9
> 50450969 = 0x301d219
>
> The first one should probably be 0x301c9c9, but it's impossible to tell
> how many other data/metadata could have been hit by this or another
> memory bitflip so check can detect the things but not fix.

Thank you David! Is my understanding correct, that btrfs catches memory problems, so this bitflip most probably means that my drive is failing?

--
Peter.
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: Matthew Warren @ 2024-10-01 17:55 UTC
To: Peter Volkov; +Cc: dsterba, linux-btrfs

> so this bitflip most probably means that my drive is failing

It could be either a failing device or a memory issue. I'd recommend running a memory test to rule out the memory being bad. If this is a multi-device filesystem that uses a profile with redundancy then this is most likely a memory bitflip issue.

Matthew Warren
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: Qu Wenruo @ 2024-10-01 22:12 UTC
To: Peter Volkov, dsterba; +Cc: linux-btrfs

On 2024/10/2 02:40, Peter Volkov wrote:
> On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
>> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
>>> Hi! I've been using this system with this kernel (6.10.10) for a few
>>> months already and today out of nowhere btrfs broke with this error
>>> message:
>>>
>>> [53923.816740] page dumped because: eb page dump
>>> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
>>> block=1035372494848 slot=364, bad key order, current (8796143471049
>>> 108 0) next (50450969 1 0)
>>
>> Quite obvious memory bitflip:
>>
>> 8796143471049 = 0x8000301c9c9
>> 50450969 = 0x301d219
>>
>> The first one should probably be 0x301c9c9, but it's impossible to tell
>> how many other data/metadata could have been hit by this or another
>> memory bitflip so check can detect the things but not fix.
>
> Thank you David! Is my understanding correct, that btrfs catches
> memory problems,
> so this bitflip most probably means that my drive is failing?

In this particular case, it's your hardware memory, not the drive.

The error is happening at write time, so the metadata read from disk is fine; thus it is not your drive returning some weird data.

Furthermore, it's pretty hard for a simple bitflip to pass the internal checksums of the storage device, so it's very unlikely to be your drive.

So, please do a full memtest of your system before doing anything else.

And considering your fsck result is already bad, there is no doubt that some bitflip has already corrupted the extent tree, and I believe the csum tree is also corrupted.

Thanks,
Qu
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: Peter Volkov @ 2024-10-04 8:01 UTC
To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Wed, Oct 2, 2024 at 1:12 AM Qu Wenruo <wqu@suse.com> wrote:
> On 2024/10/2 02:40, Peter Volkov wrote:
> > On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
> >> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
> >>> Hi! I've been using this system with this kernel (6.10.10) for a few
> >>> months already and today out of nowhere btrfs broke with this error
> >>> message:
> >>>
> >>> [53923.816740] page dumped because: eb page dump
> >>> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
> >>> block=1035372494848 slot=364, bad key order, current (8796143471049
> >>> 108 0) next (50450969 1 0)
> >>
> >> Quite obvious memory bitflip:
> >>
> >> 8796143471049 = 0x8000301c9c9
> >> 50450969 = 0x301d219
> >>
> >> The first one should probably be 0x301c9c9, but it's impossible to tell
> >> how many other data/metadata could have been hit by this or another
> >> memory bitflip so check can detect the things but not fix.
> >
> > Thank you David! Is my understanding correct, that btrfs catches
> > memory problems,
> > so this bitflip most probably means that my drive is failing?
>
> In this particular case, it's your hardware memory, not the drive.

Thank you, guys! You are right: memtest showed memory errors.

> The error is happening at write time, so the metadata read from disk is
> fine, thus not your driver returning some weird data.
>
> Furthermore, it's pretty hard that a simple bitflip can pass the
> internal checksums of the storage device, thus it's very unlikely it's
> your drive.
>
> So, please do a full memtest of your system before doing anything else.
>
> And considering your fsck result is already bad, it's no doubt that some
> bitflip has already corrupted extent tree, and I believe the csum tree
> is also corrupted.

So I have to start over from the last backup. Or is it possible to fix some of these bitflips to read at least part of the tree?

--
Peter.
* Re: BTRFS critical (device dm-0): corrupt node: root=256 block=1035372494848 slot=364, bad key order, current (8796143471049 108 0) next (50450969 1 0)
From: Qu Wenruo @ 2024-10-04 8:28 UTC
To: Peter Volkov; +Cc: dsterba, linux-btrfs

On 2024/10/4 17:31, Peter Volkov wrote:
> On Wed, Oct 2, 2024 at 1:12 AM Qu Wenruo <wqu@suse.com> wrote:
>> On 2024/10/2 02:40, Peter Volkov wrote:
>>> On Tue, Oct 1, 2024 at 3:09 PM David Sterba <dsterba@suse.cz> wrote:
>>>> On Tue, Oct 01, 2024 at 02:15:51PM +0000, Peter Volkov wrote:
>>>>> Hi! I've been using this system with this kernel (6.10.10) for a few
>>>>> months already and today out of nowhere btrfs broke with this error
>>>>> message:
>>>>>
>>>>> [53923.816740] page dumped because: eb page dump
>>>>> [53923.816743] BTRFS critical (device dm-0): corrupt node: root=256
>>>>> block=1035372494848 slot=364, bad key order, current (8796143471049
>>>>> 108 0) next (50450969 1 0)
>>>>
>>>> Quite obvious memory bitflip:
>>>>
>>>> 8796143471049 = 0x8000301c9c9
>>>> 50450969 = 0x301d219
>>>>
>>>> The first one should probably be 0x301c9c9, but it's impossible to tell
>>>> how many other data/metadata could have been hit by this or another
>>>> memory bitflip so check can detect the things but not fix.
>>>
>>> Thank you David! Is my understanding correct, that btrfs catches
>>> memory problems,
>>> so this bitflip most probably means that my drive is failing?
>>
>> In this particular case, it's your hardware memory, not the drive.
>
> Thank you, guys! You are right. memtest showed memory errors.
>
>> The error is happening at write time, so the metadata read from disk is
>> fine, thus not your driver returning some weird data.
>>
>> Furthermore, it's pretty hard that a simple bitflip can pass the
>> internal checksums of the storage device, thus it's very unlikely it's
>> your drive.
>>
>> So, please do a full memtest of your system before doing anything else.
>>
>> And considering your fsck result is already bad, it's no doubt that some
>> bitflip has already corrupted extent tree, and I believe the csum tree
>> is also corrupted.
>
> So I have to start over from last backup. Or is it possible to fix
> some of this bitflips to read at least part of tree?

In theory, it's possible to fix the problem with complex manual intervention.

It would be an interesting adventure if you're a btrfs developer; otherwise it would mean weeks of communicating with some developers, and it may still not fully repair everything.

I'd prefer to do a full restore onto a new fs, of course after all the hardware memory problems are solved.

Thanks,
Qu
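Editorial sketch: to give a feel for what the first step of such manual analysis might look like, the Python below enumerates single-bit flips of the corrupted objectid and keeps those that would sort below the next key. This is an illustration under stated assumptions, not a repair tool: it does not touch the filesystem, the function name single_bit_candidates is hypothetical, and only the upper bound from the error message is used because the preceding key is not shown in the quoted excerpt.

```python
from typing import List, Tuple


def single_bit_candidates(bad: int, upper: int, width: int = 64) -> List[Tuple[int, int]]:
    """Return (bit, value) pairs where flipping exactly one bit of `bad`
    yields a value strictly below `upper`."""
    candidates = []
    for bit in range(width):
        value = bad ^ (1 << bit)  # toggle one bit
        if value < upper:
            candidates.append((bit, value))
    return candidates


# Corrupted objectid and the objectid of the following key, from the log.
print(single_bit_candidates(8796143471049, 50450969))  # [(43, 50448841)]
```

Only one candidate survives here, which matches the value proposed earlier in the thread. On a real filesystem there is no guarantee that this is the only damaged field, which is why a restore from backup is the recommended path.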