errors found in extent allocation tree or chunk allocation


* errors found in extent allocation tree or chunk allocation
@ 2023-01-10 12:49 Frankie Fisher
  2023-01-12 22:59 ` Frankie Fisher
  0 siblings, 1 reply; 14+ messages in thread
From: Frankie Fisher @ 2023-01-10 12:49 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I upgraded a box's kernel from 5.4 to 5.15 then restarted it. The box had been up for 2 months before the restart and after the restart the btrfs filesystem wouldn't mount. I suppose there are two possibilities - the issue occurred during the 2 months of uptime or as a consequence of starting up with the newer kernel.

uname -a:

Linux basie 5.4.0-136-generic #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux basie 5.15.0-57-generic #63~20.04.1-Ubuntu SMP Wed Nov 30 13:40:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux 

I first restarted with the older kernel 5.4 and the problem recurred. dmesg output (filtered for btrfs/BTRFS) is similar with both kernels:



[    4.607811] Btrfs loaded, crc32c=crc32c-intel, zoned=yes, fsverity=yes
[   22.257868] BTRFS: device fsid 0f4a1bba-fbd1-4007-88f8-5c288a8eb161 devid 11 transid 4797718 /dev/sdh2 scanned by btrfs (561)
[   22.257977] BTRFS: device fsid 0f4a1bba-fbd1-4007-88f8-5c288a8eb161 devid 8 transid 4797718 /dev/sdg2 scanned by btrfs (561)
[   22.258313] BTRFS: device fsid 0f4a1bba-fbd1-4007-88f8-5c288a8eb161 devid 10 transid 4797718 /dev/sdf2 scanned by btrfs (561)
[   22.258420] BTRFS: device fsid 0f4a1bba-fbd1-4007-88f8-5c288a8eb161 devid 7 transid 4797718 /dev/sde2 scanned by btrfs (561)
[   22.258531] BTRFS: device fsid 0f4a1bba-fbd1-4007-88f8-5c288a8eb161 devid 9 transid 4797718 /dev/sdd2 scanned by btrfs (561)
[   29.581350] BTRFS info (device sde2): disk space caching is enabled
[   31.414167] BTRFS info (device sde2): bdev /dev/sde2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[   33.735212] BTRFS critical (device sde2): corrupt leaf: block=30874802077696 slot=176 extent bytenr=21866556121088 len=4096 previous extent [21866556112896 168 4503599627378688] overlaps current extent [21866556121088 168 4096]
[   33.735234] BTRFS error (device sde2): block=30874802077696 read time tree block corruption detected
[   33.751471] BTRFS critical (device sde2): corrupt leaf: block=30874802077696 slot=176 extent bytenr=21866556121088 len=4096 previous extent [21866556112896 168 4503599627378688] overlaps current extent [21866556121088 168 4096]
[   33.751484] BTRFS error (device sde2): block=30874802077696 read time tree block corruption detected
[   33.751517] BTRFS error (device sde2): failed to read block groups: -5
[   33.757126] BTRFS error (device sde2): open_ctree failed

I ran btrfs check with btrfs-progs v5.4.1

Checking filesystem on /dev/sde2
UUID: 0f4a1bba-fbd1-4007-88f8-5c288a8eb161
[1/7] checking root items

[2/7] checking extents
ref mismatch on [21866556112896 4503599627378688] extent item 0, found 1
backref bytes do not match extent backref, bytenr=21866556112896, ref bytes=4503599627378688, backref bytes=8192
backpointer mismatch on [21866556112896 4503599627378688]
extent item 22704514924544 has multiple extent items
ref mismatch on [28106103517184 8192] extent item 4503599627370497, found 1
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
there is no free space entry for 4525466183491584-21866556121088
cache appears valid but isn't 21865483468800
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 4511516231163904 bytes used, error(s) found
total csum bytes: 7142228136
total tree bytes: 11304239104
total fs tree bytes: 3378511872
total extent tree bytes: 386826240
btree space waste bytes: 930753844
file data blocks allocated: 28547414216704
 referenced 7990763888640


I also installed btrfs-progs v6.1.2 and the output was similar, other than section [3/7]:

[3/7] checking free space cache
There are still entries left in the space cache
cache appears valid but isn't 21866557210624
There are still entries left in the space cache
cache appears valid but isn't 21867630952448
.... (similar lines removed)



Any suggestions to recover this filesystem are gratefully received!

Cheers,

Frankie Fisher


* Re: errors found in extent allocation tree or chunk allocation
  2023-01-10 12:49 errors " Frankie Fisher
@ 2023-01-12 22:59 ` Frankie Fisher
  0 siblings, 0 replies; 14+ messages in thread
From: Frankie Fisher @ 2023-01-12 22:59 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 10 Jan 2023, at 12:49 PM, Frankie Fisher wrote:

> [   33.735212] BTRFS critical (device sde2): corrupt leaf: block=30874802077696 slot=176 extent bytenr=21866556121088 len=4096 previous extent [21866556112896 168 4503599627378688] overlaps current extent [21866556121088 168 4096]

> [2/7] checking extents
> ref mismatch on [21866556112896 4503599627378688] extent item 0, found 1
> backref bytes do not match extent backref, bytenr=21866556112896, ref bytes=4503599627378688, backref bytes=8192
> backpointer mismatch on [21866556112896 4503599627378688]
> extent item 22704514924544 has multiple extent items
> ref mismatch on [28106103517184 8192] extent item 4503599627370497, found 1

Based on the dmesg and btrfs check excerpts above, I have concluded that the likely cause of the corruption was a bit flip in the recorded length of an extent. This triggers the "previous extent overlaps current extent" kernel message, because the previous extent's length is recorded as exactly 4 PiB + 8192 B. The gap between the two extents in the corrupt leaf kernel message is 8192 B, and the btrfs check output also lists the backref bytes as 8192 B. So all of this points to a bit flip in memory before this part of the tree was written to disc.
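
As a quick sanity check of the arithmetic above, here is a minimal Python sketch that compares the recorded length with the gap between the two extent start addresses and reports whether they differ in exactly one bit. The helper name single_bit_flip is made up for illustration and is not part of btrfs-progs; the numbers are copied from the kernel message and the btrfs check output quoted above.

```python
def single_bit_flip(observed: int, expected: int):
    """Return the index of the differing bit if the two values differ in
    exactly one bit, otherwise None."""
    diff = observed ^ expected
    if diff != 0 and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None

# Values copied from the "corrupt leaf" kernel message.
prev_start = 21866556112896
next_start = 21866556121088
recorded_len = 4503599627378688

# The previous extent can be at most this long without overlapping the next one.
gap = next_start - prev_start

flipped_bit = single_bit_flip(recorded_len, gap)
print(f"gap={gap} flipped_bit={flipped_bit}")

# The odd ref count in the btrfs check output (4503599627370497, found 1)
# can be tested the same way.
refs_flipped_bit = single_bit_flip(4503599627370497, 1)
print(f"refs_flipped_bit={refs_flipped_bit}")
```

Both values differ from the plausible one only in bit 52, i.e. by 2^52 bytes (4 PiB) in the case of the length, which is consistent with the single bit flip theory but does not prove where the flip happened.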

The output of dump-tree puts the above in context:

        item 174 key (21866556104704 EXTENT_ITEM 8192) itemoff 7024 itemsize 53
                refs 1 gen 2228553 flags DATA
                extent data backref root 258 objectid 3633423 offset 0 count 1
        item 175 key (21866556112896 EXTENT_ITEM 4503599627378688) itemoff 6971 itemsize 53
                refs 1 gen 2228553 flags DATA
                extent data backref root 258 objectid 3633429 offset 0 count 1
        item 176 key (21866556121088 EXTENT_ITEM 4096) itemoff 6918 itemsize 53
                refs 1 gen 2228553 flags DATA
                extent data backref root 258 objectid 3633434 offset 0 count 1

I have run memtest86+ for some time which has demonstrated that if the RAM is faulty, it's a rare fault, so I feel hopeful that most/all of the rest of the data on the filesystem is intact.

In theory then, I can fix the filesystem by unflipping this bit (easy), and then updating the checksum in the csum tree (slightly more complicated but doable). I'm then planning to cobble together a programme based on some of the code in btrfs-progs to update the data on my disc. Running "btrfs check --repair" seems an uncertain option to me, as I don't know exactly what changes it might make to the disc, while I have a good idea of the changes I want to make to the btrfs structure.

My questions are:

* does this approach sound workable?
* are there any pitfalls that I might naively run into?
* are there any tools or libraries that will do some/all of this fix already? Or is there a simpler approach?
* are there any other things I should check in the filesystem structure before I plough on with my attempted repair?

Regards,

Frankie


* Errors found in extent allocation tree or chunk allocation
@ 2024-12-04  0:02 Nicolas Gnyra
  2024-12-04  2:50 ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Gnyra @ 2024-12-04  0:02 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

I seem to have messed up my btrfs filesystem after adding a new (3rd) 
drive and running `btrfs balance start -dconvert=raid5 -mconvert=raid1c3 
/path/to/mount`. It ran for a while and I thought it had finished 
successfully but after a reboot it's stuck mounting as read-only. I 
seemingly am able to mount it as read/write if I add `-o skip_balance` 
but if I try to write to it, it locks up again. I managed to run a scrub 
in this state but it found no errors.

Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd, 
UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
`btrfs check`: https://pastebin.com/7SJZS3Yv
`btrfs check --repair` (ran after a discussion in Libera Chat, failed): 
https://pastebin.com/BGLSx6GM

I'm currently running btrfs-progs v6.12 but the balance was originally 
run on v5.10.1. Is there any way to recover from this or should I just 
nuke the filesystem and restart from scratch? There's nothing super 
important on there, it's just going to be annoying to restore from a 
backup, and I thought it'd be interesting to try to figure out what 
happened here.

Thanks!


* Re: Errors found in extent allocation tree or chunk allocation
  2024-12-04  0:02 Errors found in extent allocation tree or chunk allocation Nicolas Gnyra
@ 2024-12-04  2:50 ` Qu Wenruo
  2024-12-04  3:58   ` Nicolas Gnyra
  2025-01-29 19:33   ` Nicolas Gnyra
  0 siblings, 2 replies; 14+ messages in thread
From: Qu Wenruo @ 2024-12-04  2:50 UTC (permalink / raw)
  To: Nicolas Gnyra, linux-btrfs



On 2024/12/4 10:32, Nicolas Gnyra wrote:
> Hi all,
>
> I seem to have messed up my btrfs filesystem after adding a new (3rd)
> drive and running `btrfs balance start -dconvert=raid5 -
> mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
> had finished successfully but after a reboot it's stuck mounting as
> read-only. I seemingly am able to mount it as read/write if I add `-o
> skip_balance` but if I try to write to it, it locks up again. I managed
> to run a scrub in this state but it found no errors.
>
> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)

The dmesg shows the problem quite clearly:

   item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
     extent refs 1 gen 84178 flags 1
     ref#0: shared data backref parent 32399126528000 count 0 <<<
     ref#1: shared data backref parent 31808973717504 count 1

Notice the count number: it should never be 0, because when a ref's count
drops to zero the ref should be removed from the extent item.

I believe the correct value should just be 1, and a 1 -> 0 change is also
a possible indicator of a hardware runtime bitflip.

This is a new corner case we have never seen, so I'll send a new patch
to handle such cases in tree-checker.
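
To illustrate the rule described above (a backref with count 0 should not exist in an extent item), here is a minimal Python sketch that scans a dump in the format shown and lists any such refs. It is only an illustration of the check, not the kernel tree-checker code; the regular expression and the function name zero_count_refs are assumptions made up for this example.

```python
import re

# Hypothetical pattern for the "ref#N: <type> parent <bytenr> count <n>"
# lines as they appear in the dump quoted above.
REF_LINE = re.compile(r"ref#(\d+): (.+?) parent (\d+) count (\d+)")

def zero_count_refs(dump: str):
    """Return (ref index, parent bytenr) for every backref whose count is 0."""
    bad = []
    for line in dump.splitlines():
        m = REF_LINE.search(line)
        if m and int(m.group(4)) == 0:
            bad.append((int(m.group(1)), int(m.group(3))))
    return bad

DUMP = """\
item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
  extent refs 1 gen 84178 flags 1
  ref#0: shared data backref parent 32399126528000 count 0 <<<
  ref#1: shared data backref parent 31808973717504 count 1
"""

suspicious = zero_count_refs(DUMP)
print(suspicious)
```

For the item quoted above this flags only ref#0, the line marked with "<<<".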

> `btrfs check`: https://pastebin.com/7SJZS3Yv
> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
> https://pastebin.com/BGLSx6GM

In theory, btrfs should be able to repair this particular error,
but the error message seems to indicate ENOSPC, meaning there is not
enough space for metadata at least.

>
> I'm currently running btrfs-progs v6.12 but the balance was originally
> run on v5.10.1. Is there any way to recover from this or should I just
> nuke the filesystem and restart from scratch? There's nothing super
> important on there, it's just going to be annoying to restore from a
> backup, and I thought it'd be interesting to try to figure out what
> happened here.

I recommend running a full memtest before doing anything, just to verify
whether it's really a hardware bitflip.

Thanks,
Qu

>
> Thanks!
>



* Re: Errors found in extent allocation tree or chunk allocation
  2024-12-04  2:50 ` Qu Wenruo
@ 2024-12-04  3:58   ` Nicolas Gnyra
  2024-12-04  4:23     ` Qu Wenruo
  2025-01-29 19:33   ` Nicolas Gnyra
  1 sibling, 1 reply; 14+ messages in thread
From: Nicolas Gnyra @ 2024-12-04  3:58 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Thank you for replying so quickly!

> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>> Hi all,
>>
>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>> drive and running `btrfs balance start -dconvert=raid5 -
>> mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>> had finished successfully but after a reboot it's stuck mounting as
>> read-only. I seemingly am able to mount it as read/write if I add `-o
>> skip_balance` but if I try to write to it, it locks up again. I managed
>> to run a scrub in this state but it found no errors.
>>
>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
> 
> The dmesg shows the problem very straightforward:
> 
>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>      extent refs 1 gen 84178 flags 1
>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>      ref#1: shared data backref parent 31808973717504 count 1
> 
> Notice the count number, it should never be 0, as if one ref goes zero
> it should be removed from the extent item.
> 
> I believe the correct value should just be 1, and 0 -> 1 is also
> possibly an indicator of hardware runtime bitflip.
> 
> This is a new corner case we have never seen, thus I'll send a new patch
> to handle such case in tree-checker.

Ah okay, interesting! I'm glad I reported this then haha.

>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>> https://pastebin.com/BGLSx6GM
> 
> In theory, btrfs should be able to repair this particular error,
> but the error message seems to indicate ENOSPC, meaning there is not
> enough space for metadata at least.

Oh, I just remembered I copied a rather large file (just under 400 GiB) 
onto the array while it was doing the balance without thinking about it. 
I think I had around 600 GiB of space left when I first started the 
balance, so I might've messed it up by doing that?

>>
>> I'm currently running btrfs-progs v6.12 but the balance was originally
>> run on v5.10.1. Is there any way to recover from this or should I just
>> nuke the filesystem and restart from scratch? There's nothing super
>> important on there, it's just going to be annoying to restore from a
>> backup, and I thought it'd be interesting to try to figure out what
>> happened here.
> 
> Recommended to run a full memtest before doing anything, just to verify
> if it's really a hardware bitflip.

I started Memtest86+ ~3.5 hours ago (it's on the 7th pass) based on a 
recommendation when I asked in the IRC channel; no errors yet, but I'll 
let it run overnight at least and let you know if it fails.

> Thanks,
> Qu
> 
>>
>> Thanks!
>>
> 



* Re: Errors found in extent allocation tree or chunk allocation
  2024-12-04  3:58   ` Nicolas Gnyra
@ 2024-12-04  4:23     ` Qu Wenruo
  2024-12-04  4:43       ` Nicolas Gnyra
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-12-04  4:23 UTC (permalink / raw)
  To: Nicolas Gnyra, linux-btrfs



On 2024/12/4 14:28, Nicolas Gnyra wrote:
> Thank you for replying so quickly!
>
[...]
>>
>> This is a new corner case we have never seen, thus I'll send a new patch
>> to handle such case in tree-checker.
>
> Ah okay, interesting! I'm glad I reported this then haha.
>
>>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>>> https://pastebin.com/BGLSx6GM
>>
>> In theory, btrfs should be able to repair this particular error,
>> but the error message seems to indicate ENOSPC, meaning there is not
>> enough space for metadata at least.
>
> Oh, I just remembered I copied a rather large file (just under 400 GiB)
> onto the array while it was doing the balance without thinking about it.
> I think I had around 600 GiB of space left when I first started the
> balance, so I might've messed it up by doing that?

That's totally fine, and it should not cause any problem.
(As long as hardware and software are working as expected).

>
>>>
>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>> run on v5.10.1. Is there any way to recover from this or should I just
>>> nuke the filesystem and restart from scratch? There's nothing super
>>> important on there, it's just going to be annoying to restore from a
>>> backup, and I thought it'd be interesting to try to figure out what
>>> happened here.
>>
>> Recommended to run a full memtest before doing anything, just to verify
>> if it's really a hardware bitflip.
>
> I started Memtest86+ ~3.5 hours ago (it's on the 7th pass) based on a
> recommendation when I asked in the IRC channel; no errors yet, but I'll
> let it run overnight at least and let you know if it fails.

Just in case, have you tried memtester?

There used to be an AMD SFH driver bug that caused random memory corruption.

Tools like memtest86+ run as their own EFI payload, so they will not
detect problems caused by kernel drivers.

Anyway, 7 passes already look good enough to me.

Then the cause will be much harder to pin down.

Thanks,
Qu
>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks!
>>>
>>
>



* Re: Errors found in extent allocation tree or chunk allocation
  2024-12-04  4:23     ` Qu Wenruo
@ 2024-12-04  4:43       ` Nicolas Gnyra
  2024-12-04 13:38         ` Nicolas Gnyra
  0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Gnyra @ 2024-12-04  4:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

[...]
>>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>>> run on v5.10.1. Is there any way to recover from this or should I just
>>>> nuke the filesystem and restart from scratch? There's nothing super
>>>> important on there, it's just going to be annoying to restore from a
>>>> backup, and I thought it'd be interesting to try to figure out what
>>>> happened here.
>>>
>>> Recommended to run a full memtest before doing anything, just to verify
>>> if it's really a hardware bitflip.
>>
>> I started Memtest86+ ~3.5 hours ago (it's on the 7th pass) based on a
>> recommendation when I asked in the IRC channel; no errors yet, but I'll
>> let it run overnight at least and let you know if it fails.
> 
> Just in case, have you tried memtester?
> 
> There used to be a AMD SFH driver bug that causes random memory corruption.
> 
> Tools like memtest86+ are doing its own EFI payload so that it will
> detect problems caused by kernel drivers.
> 
> Anyway, 7 passes already look good enough to me.
> 
> Then the cause will be much harder to pin down.

Oh alright! I haven't tried memtester - I'll give it a shot and get back 
to you. Thanks again!



* Re: Errors found in extent allocation tree or chunk allocation
  2024-12-04  4:43       ` Nicolas Gnyra
@ 2024-12-04 13:38         ` Nicolas Gnyra
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Gnyra @ 2024-12-04 13:38 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

> [...]
>>>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>>>> run on v5.10.1. Is there any way to recover from this or should I just
>>>>> nuke the filesystem and restart from scratch? There's nothing super
>>>>> important on there, it's just going to be annoying to restore from a
>>>>> backup, and I thought it'd be interesting to try to figure out what
>>>>> happened here.
>>>>
>>>> Recommended to run a full memtest before doing anything, just to verify
>>>> if it's really a hardware bitflip.
>>>
>>> I started Memtest86+ ~3.5 hours ago (it's on the 7th pass) based on a
>>> recommendation when I asked in the IRC channel; no errors yet, but I'll
>>> let it run overnight at least and let you know if it fails.
>>
>> Just in case, have you tried memtester?
>>
>> There used to be a AMD SFH driver bug that causes random memory 
>> corruption.
>>
>> Tools like memtest86+ are doing its own EFI payload so that it will
>> detect problems caused by kernel drivers.
>>
>> Anyway, 7 passes already look good enough to me.
>>
>> Then the cause will be much harder to pin down.
> 
> Oh alright! I haven't tried memtester - I'll give it a shot and get back 
> to you. Thanks again!

I let memtester run overnight; it's now at loop 20 and still running.


* Re: Errors found in extent allocation tree or chunk allocation
  2024-12-04  2:50 ` Qu Wenruo
  2024-12-04  3:58   ` Nicolas Gnyra
@ 2025-01-29 19:33   ` Nicolas Gnyra
  2025-01-29 23:35     ` Qu Wenruo
  2025-03-15 16:52     ` Nicolas Gnyra
  1 sibling, 2 replies; 14+ messages in thread
From: Nicolas Gnyra @ 2025-01-29 19:33 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 2024-12-03 at 21:50, Qu Wenruo wrote:
> 
> 
> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>> Hi all,
>>
>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>> drive and running `btrfs balance start -dconvert=raid5 -
>> mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>> had finished successfully but after a reboot it's stuck mounting as
>> read-only. I seemingly am able to mount it as read/write if I add `-o
>> skip_balance` but if I try to write to it, it locks up again. I managed
>> to run a scrub in this state but it found no errors.
>>
>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
> 
> The dmesg shows the problem very straightforward:
> 
>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>      extent refs 1 gen 84178 flags 1
>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>      ref#1: shared data backref parent 31808973717504 count 1
> 
> Notice the count number, it should never be 0, as if one ref goes zero
> it should be removed from the extent item.
> 
> I believe the correct value should just be 1, and 0 -> 1 is also
> possibly an indicator of hardware runtime bitflip.
> 
> This is a new corner case we have never seen, thus I'll send a new patch
> to handle such case in tree-checker.
> 
>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>> https://pastebin.com/BGLSx6GM
> 
> In theory, btrfs should be able to repair this particular error,
> but the error message seems to indicate ENOSPC, meaning there is not
> enough space for metadata at least.

I finally had some time to try out a version of the kernel with your fix 
(built locally from commit 0afd22092df4d3473569c197e317f91face7e51b) and 
I can now see the modified error message (see new dmesg contents: 
https://pastebin.com/t7J5TJ0Z). Unfortunately, apart from that, 
behaviour seems to be identical to before. `btrfs check --repair` still 
fails in the exact same way. Is this expected? For some reason I had 
assumed your change would fix it, but I had forgotten this mention of 
ENOSPC. Is there any chance of getting back into a writable state, or 
should I just reformat the drives?

>> I'm currently running btrfs-progs v6.12 but the balance was originally
>> run on v5.10.1. Is there any way to recover from this or should I just
>> nuke the filesystem and restart from scratch? There's nothing super
>> important on there, it's just going to be annoying to restore from a
>> backup, and I thought it'd be interesting to try to figure out what
>> happened here.
> 
> Recommended to run a full memtest before doing anything, just to verify
> if it's really a hardware bitflip.
> 
> Thanks,
> Qu
> 
>>
>> Thanks!
>>
> 



* Re: Errors found in extent allocation tree or chunk allocation
  2025-01-29 19:33   ` Nicolas Gnyra
@ 2025-01-29 23:35     ` Qu Wenruo
  2025-01-30  3:49       ` Nicolas Gnyra
  2025-03-15 16:52     ` Nicolas Gnyra
  1 sibling, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2025-01-29 23:35 UTC (permalink / raw)
  To: Nicolas Gnyra, linux-btrfs



On 2025/1/30 06:03, Nicolas Gnyra wrote:
> On 2024-12-03 at 21:50, Qu Wenruo wrote:
>>
>>
>> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>>> Hi all,
>>>
>>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>>> drive and running `btrfs balance start -dconvert=raid5 -
>>> mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>>> had finished successfully but after a reboot it's stuck mounting as
>>> read-only. I seemingly am able to mount it as read/write if I add `-o
>>> skip_balance` but if I try to write to it, it locks up again. I managed
>>> to run a scrub in this state but it found no errors.
>>>
>>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
>>
>> The dmesg shows the problem very straightforward:
>>
>>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>>      extent refs 1 gen 84178 flags 1
>>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>>      ref#1: shared data backref parent 31808973717504 count 1
>>
>> Notice the count number, it should never be 0, as if one ref goes zero
>> it should be removed from the extent item.
>>
>> I believe the correct value should just be 1, and 0 -> 1 is also
>> possibly an indicator of hardware runtime bitflip.
>>
>> This is a new corner case we have never seen, thus I'll send a new patch
>> to handle such case in tree-checker.
>>
>>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>>> https://pastebin.com/BGLSx6GM
>>
>> In theory, btrfs should be able to repair this particular error,
>> but the error message seems to indicate ENOSPC, meaning there is not
>> enough space for metadata at least.
>
> I finally had some time to try out a version of the kernel with your fix
> (built locally from commit 0afd22092df4d3473569c197e317f91face7e51b) and
> I can now see the modified error message (see new dmesg contents:
> https://pastebin.com/t7J5TJ0Z). Unfortunately, apart from that,
> behaviour seems to be identical to before. `btrfs check --repair` still
> fails in the exact same way. Is this expected? For some reason I had
> assumed your change would fix it, but I had forgotten this mention of
> ENOSPC so is there any chance of getting back into a writable state or
> should I just reformat the drives?

For the ENOSPC problem, please provide the `btrfs fi usage` output for the
mounted fs.

I believe that once the ENOSPC problem is resolved, we can let
`btrfs check --repair` fix the problem.

Thanks,
Qu

>
>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>> run on v5.10.1. Is there any way to recover from this or should I just
>>> nuke the filesystem and restart from scratch? There's nothing super
>>> important on there, it's just going to be annoying to restore from a
>>> backup, and I thought it'd be interesting to try to figure out what
>>> happened here.
>>
>> Recommended to run a full memtest before doing anything, just to verify
>> if it's really a hardware bitflip.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks!
>>>
>>
>



* Re: Errors found in extent allocation tree or chunk allocation
  2025-01-29 23:35     ` Qu Wenruo
@ 2025-01-30  3:49       ` Nicolas Gnyra
  2025-01-30  4:19         ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Gnyra @ 2025-01-30  3:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 2025-01-29 at 18:35, Qu Wenruo wrote:
> 
> 
> On 2025/1/30 06:03, Nicolas Gnyra wrote:
>> On 2024-12-03 at 21:50, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>>>> Hi all,
>>>>
>>>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>>>> drive and running `btrfs balance start -dconvert=raid5 -
>>>> mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>>>> had finished successfully but after a reboot it's stuck mounting as
>>>> read-only. I seemingly am able to mount it as read/write if I add `-o
>>>> skip_balance` but if I try to write to it, it locks up again. I managed
>>>> to run a scrub in this state but it found no errors.
>>>>
>>>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>>>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
>>>
>>> The dmesg shows the problem very straightforward:
>>>
>>>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>>>      extent refs 1 gen 84178 flags 1
>>>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>>>      ref#1: shared data backref parent 31808973717504 count 1
>>>
>>> Notice the count number, it should never be 0, as if one ref goes zero
>>> it should be removed from the extent item.
>>>
>>> I believe the correct value should just be 1, and 0 -> 1 is also
>>> possibly an indicator of hardware runtime bitflip.
>>>
>>> This is a new corner case we have never seen, thus I'll send a new patch
>>> to handle such case in tree-checker.
>>>
>>>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>>>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>>>> https://pastebin.com/BGLSx6GM
>>>
>>> In theory, btrfs should be able to repair this particular error,
>>> but the error message seems to indicate ENOSPC, meaning there is not
>>> enough space for metadata at least.
>>
>> I finally had some time to try out a version of the kernel with your fix
>> (built locally from commit 0afd22092df4d3473569c197e317f91face7e51b) and
>> I can now see the modified error message (see new dmesg contents:
>> https://pastebin.com/t7J5TJ0Z). Unfortunately, apart from that,
>> behaviour seems to be identical to before. `btrfs check --repair` still
>> fails in the exact same way. Is this expected? For some reason I had
>> assumed your change would fix it, but I had forgotten this mention of
>> ENOSPC so is there any chance of getting back into a writable state or
>> should I just reformat the drives?
> 
> For the ENOSPC problem, please provide `btrfs fi usage` output for the
> mount fs.
> 
> I believe with the ENOSPC problem resolved, we can let btrfs check
> --repair to fix the problem.
> 
> Thanks,
> Qu

Thanks for the quick reply! Here's the output of `btrfs fi usage`:

    Overall:
       Device size:                  21.83TiB
       Device allocated:             12.50TiB
       Device unallocated:            9.33TiB
       Device missing:                  0.00B
       Device slack:                    0.00B
       Used:                         11.35TiB
       Free (estimated):              6.89TiB      (min: 3.85TiB)
       Free (statfs, df):             6.78TiB
       Data ratio:                       1.52
       Metadata ratio:                   2.88
       Global reserve:              512.00MiB      (used: 0.00B)
       Multiple profiles:                 yes      (data, metadata, system)

    Data,RAID1: Size:324.00GiB, Used:299.59GiB (92.47%)
       /dev/sdd      324.00GiB
       /dev/sde      324.00GiB

    Data,RAID5: Size:7.88TiB, Used:7.16TiB (90.84%)
       /dev/sdd        3.94TiB
       /dev/sde        3.94TiB
       /dev/sdf        3.94TiB

    Metadata,RAID1: Size:2.00GiB, Used:73.25MiB (3.58%)
       /dev/sdd        2.00GiB
       /dev/sde        2.00GiB

    Metadata,RAID1C3: Size:14.00GiB, Used:8.69GiB (62.08%)
       /dev/sdd       14.00GiB
       /dev/sde       14.00GiB
       /dev/sdf       14.00GiB

    System,RAID1: Size:32.00MiB, Used:48.00KiB (0.15%)
       /dev/sdd       32.00MiB
       /dev/sde       32.00MiB

    System,RAID1C3: Size:32.00MiB, Used:736.00KiB (2.25%)
       /dev/sdd       32.00MiB
       /dev/sde       32.00MiB
       /dev/sdf       32.00MiB

    Unallocated:
       /dev/sdd        3.00TiB
       /dev/sde        3.00TiB
       /dev/sdf        3.32TiB

Thanks,
Nicolas
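
For reference, the "Multiple profiles: yes" line above can be cross-checked
from the section headers alone. The following is a minimal sketch, not part
of btrfs-progs: the helper names (`profiles_by_type`, `mixed_types`) are
hypothetical, and the input is simply the header lines copied from the
output above.

```python
import re
from collections import defaultdict

# Section headers copied from the `btrfs fi usage` output above.
USAGE = """\
Data,RAID1: Size:324.00GiB, Used:299.59GiB (92.47%)
Data,RAID5: Size:7.88TiB, Used:7.16TiB (90.84%)
Metadata,RAID1: Size:2.00GiB, Used:73.25MiB (3.58%)
Metadata,RAID1C3: Size:14.00GiB, Used:8.69GiB (62.08%)
System,RAID1: Size:32.00MiB, Used:48.00KiB (0.15%)
System,RAID1C3: Size:32.00MiB, Used:736.00KiB (2.25%)
"""

# Matches headers such as "Metadata,RAID1C3:" and captures type and profile.
HEADER = re.compile(r"^\s*(Data|Metadata|System),(\w+):", re.MULTILINE)


def profiles_by_type(text):
    """Map each block group type to the list of profiles seen for it."""
    found = defaultdict(list)
    for kind, profile in HEADER.findall(text):
        found[kind].append(profile)
    return dict(found)


def mixed_types(text):
    """Return the block group types that use more than one profile."""
    return sorted(k for k, v in profiles_by_type(text).items() if len(set(v)) > 1)


print(mixed_types(USAGE))
```

With the headers above, all three types (data, metadata, system) show up as
mixed, which is consistent with a conversion that was interrupted partway.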

>>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>>> run on v5.10.1. Is there any way to recover from this or should I just
>>>> nuke the filesystem and restart from scratch? There's nothing super
>>>> important on there, it's just going to be annoying to restore from a
>>>> backup, and I thought it'd be interesting to try to figure out what
>>>> happened here.
>>>
>>> Recommended to run a full memtest before doing anything, just to verify
>>> if it's really a hardware bitflip.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Thanks!
>>>>
>>>
>>
> 



* Re: Errors found in extent allocation tree or chunk allocation
  2025-01-30  3:49       ` Nicolas Gnyra
@ 2025-01-30  4:19         ` Qu Wenruo
  2025-01-30  5:21           ` Nicolas Gnyra
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2025-01-30  4:19 UTC (permalink / raw)
  To: Nicolas Gnyra, Qu Wenruo, linux-btrfs



On 2025/1/30 14:19, Nicolas Gnyra wrote:
> On 2025-01-29 at 18:35, Qu Wenruo wrote:
>>
>>
>> On 2025/1/30 06:03, Nicolas Gnyra wrote:
>>> On 2024-12-03 at 21:50, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>>>>> Hi all,
>>>>>
>>>>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>>>>> drive and running `btrfs balance start -dconvert=raid5
>>>>> -mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>>>>> had finished successfully but after a reboot it's stuck mounting as
>>>>> read-only. I seemingly am able to mount it as read/write if I add
>>>>> `-o skip_balance` but if I try to write to it, it locks up again.
>>>>> I managed to run a scrub in this state but it found no errors.
>>>>>
>>>>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>>>>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
>>>>
>>>> The dmesg shows the problem very straightforward:
>>>>
>>>>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>>>>      extent refs 1 gen 84178 flags 1
>>>>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>>>>      ref#1: shared data backref parent 31808973717504 count 1
>>>>
>>>> Notice the count number, it should never be 0, as if one ref goes zero
>>>> it should be removed from the extent item.
>>>>
>>>> I believe the correct value should just be 1, and 0 -> 1 is also
>>>> possibly an indicator of hardware runtime bitflip.
>>>>
>>>> This is a new corner case we have never seen, thus I'll send a new
>>>> patch to handle such case in tree-checker.
>>>>
>>>>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>>>>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>>>>> https://pastebin.com/BGLSx6GM
>>>>
>>>> In theory, btrfs should be able to repair this particular error,
>>>> but the error message seems to indicate ENOSPC, meaning there is not
>>>> enough space for metadata at least.
>>>
>>> I finally had some time to try out a version of the kernel with your fix
>>> (built locally from commit 0afd22092df4d3473569c197e317f91face7e51b) and
>>> I can now see the modified error message (see new dmesg contents:
>>> https://pastebin.com/t7J5TJ0Z). Unfortunately, apart from that,
>>> behaviour seems to be identical to before. `btrfs check --repair` still
>>> fails in the exact same way. Is this expected? For some reason I had
>>> assumed your change would fix it, but I had forgotten this mention of
>>> ENOSPC so is there any chance of getting back into a writable state or
>>> should I just reformat the drives?
>>
>> For the ENOSPC problem, please provide `btrfs fi usage` output for the
>> mount fs.
>>
>> I believe with the ENOSPC problem resolved, we can let btrfs check
>> --repair to fix the problem.
>>
>> Thanks,
>> Qu
> 
> Thanks for the quick reply! Here's the output of `btrfs fi usage`:
> 
>     Overall:
>        Device size:                  21.83TiB
>        Device allocated:             12.50TiB
>        Device unallocated:            9.33TiB
>        Device missing:                  0.00B
>        Device slack:                    0.00B
>        Used:                         11.35TiB
>        Free (estimated):              6.89TiB      (min: 3.85TiB)
>        Free (statfs, df):             6.78TiB
>        Data ratio:                       1.52
>        Metadata ratio:                   2.88
>        Global reserve:              512.00MiB      (used: 0.00B)
>        Multiple profiles:                 yes      (data, metadata, system)
> 
>     Data,RAID1: Size:324.00GiB, Used:299.59GiB (92.47%)
>        /dev/sdd      324.00GiB
>        /dev/sde      324.00GiB
> 
>     Data,RAID5: Size:7.88TiB, Used:7.16TiB (90.84%)
>        /dev/sdd        3.94TiB
>        /dev/sde        3.94TiB
>        /dev/sdf        3.94TiB
> 
>     Metadata,RAID1: Size:2.00GiB, Used:73.25MiB (3.58%)
>        /dev/sdd        2.00GiB
>        /dev/sde        2.00GiB

The mixed metadata profile may be the problem.

Have you tried to convert the remaining 2GiB RAID1 metadata into RAID1C3?

Or is the problem you're hitting preventing the full conversion to RAID1C3?


Anyway, this also looks like a bug in btrfs-progs; I'll need to dig deeper
to fix it.

Thanks,
Qu
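
The effect of the mixed profiles can be seen in the reported ratios. The
sketch below is only a worked check, assuming the ratio is raw allocated
space divided by logical chunk size (that assumption reproduces the
reported 1.52 and 2.88); the `ratio` helper and the tuple layout are
hypothetical, and the numbers are taken from the output quoted above.

```python
TIB = 1024.0  # GiB per TiB; all sizes below are expressed in GiB

# Each entry: (logical size, raw size per device, number of devices),
# copied from the `btrfs fi usage` output quoted in this thread.
data = [
    (324.00, 324.00, 2),            # Data,RAID1
    (7.88 * TIB, 3.94 * TIB, 3),    # Data,RAID5
]
metadata = [
    (2.00, 2.00, 2),                # Metadata,RAID1
    (14.00, 14.00, 3),              # Metadata,RAID1C3
]


def ratio(groups):
    """Raw allocated space divided by logical size (assumed definition)."""
    raw = sum(per_dev * ndev for _, per_dev, ndev in groups)
    logical = sum(size for size, _, _ in groups)
    return raw / logical


print(f"data ratio:     {ratio(data):.4f}")
print(f"metadata ratio: {ratio(metadata):.4f}")
# If only the RAID1C3 metadata chunks existed, the ratio would be 3.
print(f"metadata ratio, RAID1C3 only: {ratio(metadata[1:]):.4f}")
```

Under that assumption, the metadata ratio sits below 3 only because of the
remaining 2GiB of RAID1 metadata chunks.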
> 
>     Metadata,RAID1C3: Size:14.00GiB, Used:8.69GiB (62.08%)
>        /dev/sdd       14.00GiB
>        /dev/sde       14.00GiB
>        /dev/sdf       14.00GiB
> 
>     System,RAID1: Size:32.00MiB, Used:48.00KiB (0.15%)
>        /dev/sdd       32.00MiB
>        /dev/sde       32.00MiB
> 
>     System,RAID1C3: Size:32.00MiB, Used:736.00KiB (2.25%)
>        /dev/sdd       32.00MiB
>        /dev/sde       32.00MiB
>        /dev/sdf       32.00MiB
> 
>     Unallocated:
>        /dev/sdd        3.00TiB
>        /dev/sde        3.00TiB
>        /dev/sdf        3.32TiB
> 
> Thanks,
> Nicolas
> 
>>>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>>>> run on v5.10.1. Is there any way to recover from this or should I just
>>>>> nuke the filesystem and restart from scratch? There's nothing super
>>>>> important on there, it's just going to be annoying to restore from a
>>>>> backup, and I thought it'd be interesting to try to figure out what
>>>>> happened here.
>>>>
>>>> Recommended to run a full memtest before doing anything, just to verify
>>>> if it's really a hardware bitflip.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>
>>
> 
> 



* Re: Errors found in extent allocation tree or chunk allocation
  2025-01-30  4:19         ` Qu Wenruo
@ 2025-01-30  5:21           ` Nicolas Gnyra
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Gnyra @ 2025-01-30  5:21 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs



On 2025-01-29 at 23:19, Qu Wenruo wrote:
> 
> 
> On 2025/1/30 14:19, Nicolas Gnyra wrote:
>> On 2025-01-29 at 18:35, Qu Wenruo wrote:
>>>
>>>
>>> On 2025/1/30 06:03, Nicolas Gnyra wrote:
>>>> On 2024-12-03 at 21:50, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>>>>>> drive and running `btrfs balance start -dconvert=raid5
>>>>>> -mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>>>>>> had finished successfully but after a reboot it's stuck mounting as
>>>>>> read-only. I seemingly am able to mount it as read/write if I add
>>>>>> `-o skip_balance` but if I try to write to it, it locks up again.
>>>>>> I managed to run a scrub in this state but it found no errors.
>>>>>>
>>>>>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>>>>>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
>>>>>
>>>>> The dmesg shows the problem very straightforward:
>>>>>
>>>>>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>>>>>      extent refs 1 gen 84178 flags 1
>>>>>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>>>>>      ref#1: shared data backref parent 31808973717504 count 1
>>>>>
>>>>> Notice the count number, it should never be 0, as if one ref goes zero
>>>>> it should be removed from the extent item.
>>>>>
>>>>> I believe the correct value should just be 1, and 0 -> 1 is also
>>>>> possibly an indicator of hardware runtime bitflip.
>>>>>
>>>>> This is a new corner case we have never seen, thus I'll send a new
>>>>> patch to handle such case in tree-checker.
>>>>>
>>>>>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>>>>>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>>>>>> https://pastebin.com/BGLSx6GM
>>>>>
>>>>> In theory, btrfs should be able to repair this particular error,
>>>>> but the error message seems to indicate ENOSPC, meaning there is not
>>>>> enough space for metadata at least.
>>>>
>>>> I finally had some time to try out a version of the kernel with your fix
>>>> (built locally from commit 0afd22092df4d3473569c197e317f91face7e51b) and
>>>> I can now see the modified error message (see new dmesg contents:
>>>> https://pastebin.com/t7J5TJ0Z). Unfortunately, apart from that,
>>>> behaviour seems to be identical to before. `btrfs check --repair` still
>>>> fails in the exact same way. Is this expected? For some reason I had
>>>> assumed your change would fix it, but I had forgotten this mention of
>>>> ENOSPC so is there any chance of getting back into a writable state or
>>>> should I just reformat the drives?
>>>
>>> For the ENOSPC problem, please provide `btrfs fi usage` output for the
>>> mount fs.
>>>
>>> I believe with the ENOSPC problem resolved, we can let btrfs check
>>> --repair to fix the problem.
>>>
>>> Thanks,
>>> Qu
>>
>> Thanks for the quick reply! Here's the output of `btrfs fi usage`:
>>
>>     Overall:
>>        Device size:                  21.83TiB
>>        Device allocated:             12.50TiB
>>        Device unallocated:            9.33TiB
>>        Device missing:                  0.00B
>>        Device slack:                    0.00B
>>        Used:                         11.35TiB
>>        Free (estimated):              6.89TiB      (min: 3.85TiB)
>>        Free (statfs, df):             6.78TiB
>>        Data ratio:                       1.52
>>        Metadata ratio:                   2.88
>>        Global reserve:              512.00MiB      (used: 0.00B)
>>        Multiple profiles:                 yes      (data, metadata, system)
>>
>>     Data,RAID1: Size:324.00GiB, Used:299.59GiB (92.47%)
>>        /dev/sdd      324.00GiB
>>        /dev/sde      324.00GiB
>>
>>     Data,RAID5: Size:7.88TiB, Used:7.16TiB (90.84%)
>>        /dev/sdd        3.94TiB
>>        /dev/sde        3.94TiB
>>        /dev/sdf        3.94TiB
>>
>>     Metadata,RAID1: Size:2.00GiB, Used:73.25MiB (3.58%)
>>        /dev/sdd        2.00GiB
>>        /dev/sde        2.00GiB
> 
> The mixed metadata profile may be the problem.
> 
> Have you tried to convert the remaining 2GiB RAID1 metadata into RAID1C3?
> 
> Or is the problem you're hitting preventing the full conversion to RAID1C3?
> 
> 
> Anyway, it also looks like a bug in btrfs-progs, I'll need to dig deeper 
> to fix it.
> 
> Thanks,
> Qu

Just to make sure, you mean running `btrfs balance start
-mconvert=raid1c3,soft`, right? If so, unfortunately it just triggers
those same "invalid shared data ref count, should have non-zero value"
errors and then forces the filesystem into read-only mode, so I can't get
it to run.
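
The rule behind that error message, as Qu described it earlier in the
thread (a shared data backref whose count reaches zero should have been
removed from the extent item), can be illustrated with a small sketch. This
is not the kernel's tree-checker code: the `SharedDataRef` class and
`check_shared_data_refs` function are hypothetical names, and the values
are the two refs from item 166 in the quoted dmesg.

```python
from dataclasses import dataclass


@dataclass
class SharedDataRef:
    parent: int  # bytenr of the parent tree block
    count: int   # number of references held through that parent


def check_shared_data_refs(refs):
    """Return one error string per ref with a zero count (empty if sane)."""
    errors = []
    for i, ref in enumerate(refs):
        if ref.count == 0:
            errors.append(
                f"ref#{i}: invalid shared data ref count, "
                f"should have non-zero value (parent {ref.parent})"
            )
    return errors


# The two inline refs from item 166 in the quoted dmesg output.
refs = [
    SharedDataRef(parent=32399126528000, count=0),
    SharedDataRef(parent=31808973717504, count=1),
]

for err in check_shared_data_refs(refs):
    print(err)
```

Only ref#0 is flagged here, matching the line marked with `<<<` in the
quoted dmesg.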

>>
>>     Metadata,RAID1C3: Size:14.00GiB, Used:8.69GiB (62.08%)
>>        /dev/sdd       14.00GiB
>>        /dev/sde       14.00GiB
>>        /dev/sdf       14.00GiB
>>
>>     System,RAID1: Size:32.00MiB, Used:48.00KiB (0.15%)
>>        /dev/sdd       32.00MiB
>>        /dev/sde       32.00MiB
>>
>>     System,RAID1C3: Size:32.00MiB, Used:736.00KiB (2.25%)
>>        /dev/sdd       32.00MiB
>>        /dev/sde       32.00MiB
>>        /dev/sdf       32.00MiB
>>
>>     Unallocated:
>>        /dev/sdd        3.00TiB
>>        /dev/sde        3.00TiB
>>        /dev/sdf        3.32TiB
>>
>> Thanks,
>> Nicolas
>>
>>>>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>>>>> run on v5.10.1. Is there any way to recover from this or should I just
>>>>>> nuke the filesystem and restart from scratch? There's nothing super
>>>>>> important on there, it's just going to be annoying to restore from a
>>>>>> backup, and I thought it'd be interesting to try to figure out what
>>>>>> happened here.
>>>>>
>>>>> Recommended to run a full memtest before doing anything, just to verify
>>>>> if it's really a hardware bitflip.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>
>>>>
>>>
>>
>>
> 



* Re: Errors found in extent allocation tree or chunk allocation
  2025-01-29 19:33   ` Nicolas Gnyra
  2025-01-29 23:35     ` Qu Wenruo
@ 2025-03-15 16:52     ` Nicolas Gnyra
  1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Gnyra @ 2025-03-15 16:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 2025-01-29 at 14:33, Nicolas Gnyra wrote:
> On 2024-12-03 at 21:50, Qu Wenruo wrote:
>>
>>
>> On 2024/12/4 10:32, Nicolas Gnyra wrote:
>>> Hi all,
>>>
>>> I seem to have messed up my btrfs filesystem after adding a new (3rd)
>>> drive and running `btrfs balance start -dconvert=raid5
>>> -mconvert=raid1c3 /path/to/mount`. It ran for a while and I thought it
>>> had finished successfully but after a reboot it's stuck mounting as
>>> read-only. I seemingly am able to mount it as read/write if I add `-o
>>> skip_balance` but if I try to write to it, it locks up again. I managed
>>> to run a scrub in this state but it found no errors.
>>>
>>> Kernel logs: https://pastebin.com/Cs06sNnr (drives sdb, sdc, and sdd,
>>> UUID dfa2779b-b7d1-4658-89f7-dabe494e67c8)
>>
>> The dmesg shows the problem very straightforward:
>>
>>    item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
>>      extent refs 1 gen 84178 flags 1
>>      ref#0: shared data backref parent 32399126528000 count 0 <<<
>>      ref#1: shared data backref parent 31808973717504 count 1
>>
>> Notice the count number, it should never be 0, as if one ref goes zero
>> it should be removed from the extent item.
>>
>> I believe the correct value should just be 1, and 0 -> 1 is also
>> possibly an indicator of hardware runtime bitflip.
>>
>> This is a new corner case we have never seen, thus I'll send a new patch
>> to handle such case in tree-checker.
>>
>>> `btrfs check`: https://pastebin.com/7SJZS3Yv
>>> `btrfs check --repair` (ran after a discussion in Libera Chat, failed):
>>> https://pastebin.com/BGLSx6GM
>>
>> In theory, btrfs should be able to repair this particular error,
>> but the error message seems to indicate ENOSPC, meaning there is not
>> enough space for metadata at least.
> 
> I finally had some time to try out a version of the kernel with your fix 
> (built locally from commit 0afd22092df4d3473569c197e317f91face7e51b) and 
> I can now see the modified error message (see new dmesg contents: 
> https://pastebin.com/t7J5TJ0Z). Unfortunately, apart from that, 
> behaviour seems to be identical to before. `btrfs check --repair` still 
> fails in the exact same way. Is this expected? For some reason I had 
> assumed your change would fix it, but I had forgotten this mention of 
> ENOSPC so is there any chance of getting back into a writable state or 
> should I just reformat the drives?

Just wanted to check in one last time before formatting the drives. Is 
there any chance of recovery here? I just tried with kernel v6.14-rc6 
(80e54e8) and the latest btrfs-progs from GitHub as of writing (721df6f 
on the devel branch) but I'm still getting the same error with `btrfs 
check --repair`.

>>> I'm currently running btrfs-progs v6.12 but the balance was originally
>>> run on v5.10.1. Is there any way to recover from this or should I just
>>> nuke the filesystem and restart from scratch? There's nothing super
>>> important on there, it's just going to be annoying to restore from a
>>> backup, and I thought it'd be interesting to try to figure out what
>>> happened here.
>>
>> Recommended to run a full memtest before doing anything, just to verify
>> if it's really a hardware bitflip.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks!
>>>
>>
> 



Thread overview: 14+ messages
2024-12-04  0:02 Errors found in extent allocation tree or chunk allocation Nicolas Gnyra
2024-12-04  2:50 ` Qu Wenruo
2024-12-04  3:58   ` Nicolas Gnyra
2024-12-04  4:23     ` Qu Wenruo
2024-12-04  4:43       ` Nicolas Gnyra
2024-12-04 13:38         ` Nicolas Gnyra
2025-01-29 19:33   ` Nicolas Gnyra
2025-01-29 23:35     ` Qu Wenruo
2025-01-30  3:49       ` Nicolas Gnyra
2025-01-30  4:19         ` Qu Wenruo
2025-01-30  5:21           ` Nicolas Gnyra
2025-03-15 16:52     ` Nicolas Gnyra
  -- strict thread matches above, loose matches on Subject: below --
2023-01-10 12:49 errors " Frankie Fisher
2023-01-12 22:59 ` Frankie Fisher
