Linux Btrfs filesystem development
* Fwd: "BTRFS critical: ... corrupt leaf" due to defective RAM
       [not found] <6f36a628-21f9-ca21-bae3-2a4150245ec2@avgustinov.eu>
@ 2020-12-21 10:08 ` Nik.
  2020-12-21 11:44   ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Nik. @ 2020-12-21 10:08 UTC (permalink / raw)
  To: Btrfs BTRFS

Dear all,

the forwarded mail below came back yesterday with the error 
"Diagnostic-Code: X-Postfix; TLS is required, but was not offered by 
host vger.kernel.org[23.128.96.18]".

Is it really intended that your mail server does not offer TLS?

Kind regards,

Nik.

--

15.12.2020 18:40, Nik.:
> Dear all,
>
> after almost a year without problems I again need your advice about
> the same computer, but this time it is (hopefully only) the root FS
> that failed. I have backups of everything except a couple of files in
> /etc, so nothing critical, but it would probably be interesting for
> somebody to see how btrfs behaved in such a situation.
>
> The story in short:
>
> - the FS switched to ro mode. Initially I thought that it was due to
> insufficient free space (I have already had similar situations) and
> deleted some old snapshots. Within half a day it happened 3 more
> times, though.
>
> - so I booted in memtest86 and it gave me a lot of errors! This NAS is 
> 9 years old and I was already looking for replacement, but it is not 
> easy to find 8-bay NAS for 2,5" drives...
>
> - took the drive out from the failed system and tried to mount it on 
> another (healthy?) PC. I am getting:
>
> root@ubrun:~# mount -t btrfs -o subvol=@ /dev/sdb1 /mnt/sd
> mount: /mnt/sd: wrong fs type, bad option, bad superblock on 
> /dev/sdb1, missing codepage or helper program, or other error.
> root@ubrun:~# dmesg |tail
> [   50.672561] Policy zone: Normal
> [  185.190764] BTRFS info (device sdb1): disk space caching is enabled
> [  185.190767] BTRFS info (device sdb1): has skinny extents
> [  185.199331] BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 
> 0, flush 0, corrupt 65, gen 0
> [  185.246051] BTRFS critical (device sdb1): corrupt leaf: 
> block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown 
> inline ref type: 54
> [  185.246055] BTRFS error (device sdb1): block=50850988032 read time 
> tree block corruption detected
> [  185.247070] BTRFS critical (device sdb1): corrupt leaf: 
> block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown 
> inline ref type: 54
> [  185.247073] BTRFS error (device sdb1): block=50850988032 read time 
> tree block corruption detected
> [  185.247093] BTRFS error (device sdb1): failed to read block groups: -5
> [  185.281382] BTRFS error (device sdb1): open_ctree failed
> root@ubrun:~#
>
> How should one proceed?
>
> Kind regards
>
> Nik.
>


* Re: Fwd: "BTRFS critical: ... corrupt leaf" due to defective RAM
  2020-12-21 10:08 ` Fwd: "BTRFS critical: ... corrupt leaf" due to defective RAM Nik.
@ 2020-12-21 11:44   ` Qu Wenruo
       [not found]     ` <67ee3588-b18b-c9aa-5f33-ae7bbde10e7c@avgustinov.eu>
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2020-12-21 11:44 UTC (permalink / raw)
  To: Nik., Btrfs BTRFS



On 2020/12/21 6:08 PM, Nik. wrote:
> Dear all,
>
> the forwarded mail below came back yesterday with the error
> "Diagnostic-Code: X-Postfix; TLS is required, but was not offered by
> host vger.kernel.org[23.128.96.18]".
>
> Is it really intended that your mail server does not offer TLS?

I can't help with that: I'm not a vger admin and know nothing about its
setup. (Most, if not all, kernel mailing lists are hosted by vger, so an
individual list can't do much about it.)

But I can definitely answer some of your btrfs questions.
>
> Kind regards,
>
> Nik.
>
> --
>
> 15.12.2020 18:40, Nik.:
>> Dear all,
>>
>> after almost a year without problems I again need your advice about
>> the same computer, but this time it is (hopefully only) the root FS
>> that failed. I have backups of everything except a couple of files in
>> /etc, so nothing critical, but it would probably be interesting for
>> somebody to see how btrfs behaved in such a situation.
>>
>> The story in short:
>>
>> - the FS switched to ro mode. Initially I thought that it was due to
>> insufficient free space (I have already had similar situations) and
>> deleted some old snapshots. Within half a day it happened 3 more
>> times, though.

Do you have a detailed report (e.g. dmesg) of that RO remount?
We should have addressed that upstream already. If you still hit it, I
guess we need more investigation (provided it's not caused by memory
corruption).

>>
>> - so I booted in memtest86 and it gave me a lot of errors! This NAS is
>> 9 years old and I was already looking for replacement, but it is not
>> easy to find 8-bay NAS for 2,5" drives...
>>
>> - took the drive out from the failed system and tried to mount it on
>> another (healthy?) PC. I am getting:
>>
>> root@ubrun:~# mount -t btrfs -o subvol=@ /dev/sdb1 /mnt/sd
>> mount: /mnt/sd: wrong fs type, bad option, bad superblock on
>> /dev/sdb1, missing codepage or helper program, or other error.
>> root@ubrun:~# dmesg |tail
>> [   50.672561] Policy zone: Normal
>> [  185.190764] BTRFS info (device sdb1): disk space caching is enabled
>> [  185.190767] BTRFS info (device sdb1): has skinny extents
>> [  185.199331] BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 0, rd
>> 0, flush 0, corrupt 65, gen 0
>> [  185.246051] BTRFS critical (device sdb1): corrupt leaf:
>> block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown
>> inline ref type: 54

This is indeed a memory bitflip, and your original kernel was not new
enough to detect it at write time.

With a new enough kernel, such corrupted metadata shouldn't even reach
disk. (Although it would still mean the fs goes RO.)

There are only 4 valid types for extent refs:

TREE_BLOCK_REF   176 (0xb0)
EXTENT_DATA_REF  178 (0xb2)
SHARED_BLOCK_REF 182 (0xb6)
SHARED_DATA_REF  184 (0xb8)

The invalid type is:

                  54 (0x36)

The diff to SHARED_BLOCK_REF is 0x80, so indeed exactly one bit flipped.

>> [  185.246055] BTRFS error (device sdb1): block=50850988032 read time
>> tree block corruption detected
>> [  185.247070] BTRFS critical (device sdb1): corrupt leaf:
>> block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown
>> inline ref type: 54
>> [  185.247073] BTRFS error (device sdb1): block=50850988032 read time
>> tree block corruption detected
>> [  185.247093] BTRFS error (device sdb1): failed to read block groups: -5
>> [  185.281382] BTRFS error (device sdb1): open_ctree failed
>> root@ubrun:~#
>>
>> How should one proceed?

Since it's caused by a bitflip and you mentioned the system has tons of
memory errors, I believe there will be tons of similar problems
scattered around your fs.

As for repair, I don't believe btrfs-check can fix bitflips in general,
or ever will be able to, let alone as many as there may be here.

It's better to just use your backup.

BTW, detection of extent tree bitflips was introduced in v5.4.
Next time you can at least catch the faulty hardware before it screws up
your data.
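
To see whether a given machine meets that v5.4 threshold, a rough Python
sketch follows. It only compares the version number in a release string;
distribution kernels may backport or omit patches, so treat the result as
a hint. The function names and sample release strings are hypothetical.

```python
import re

# Threshold taken from the note above: extent tree bitflip detection in v5.4.
MIN_VERSION = (5, 4)


def parse_kernel_version(release):
    """Extract (major, minor) from a release string like '5.4.0-58-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_extent_tree_checks(release):
    """True if the version number is at least MIN_VERSION."""
    return parse_kernel_version(release) >= MIN_VERSION


# Sample release strings, for illustration only.
for sample in ("4.15.0-128-generic", "5.4.0-58-generic", "5.10.0"):
    print(sample, has_extent_tree_checks(sample))
```

On a running Linux system the release string could come from `uname -r`
or Python's platform.release().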

Thanks,
Qu

>>
>> Kind regards
>>
>> Nik.
>>


* Re: Fwd: "BTRFS critical: ... corrupt leaf" due to defective RAM
       [not found]     ` <67ee3588-b18b-c9aa-5f33-ae7bbde10e7c@avgustinov.eu>
@ 2020-12-22 23:02       ` Qu Wenruo
  0 siblings, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2020-12-22 23:02 UTC (permalink / raw)
  To: Nik., linux-btrfs@vger.kernel.org



On 2020/12/22 11:59 PM, Nik. wrote:
> Hi,
>
> Thank you very much for the quick reply.
> Ok, I am going to use the backups (as you suggested).
>
> Just a quick question for understanding the background better:
>    -given a btrfs with many intact subvolumes and
>    -say, one defective sector within the subvolume "@" (Ubuntu specific),
>     which causes this subvolume to be (automatically) remounted as RO
>    -am I getting it right that none of the other subvolumes can be
> mounted properly (i.e., RW)?

Unfortunately, it's not the subvolume tree itself that got corrupted, but
the extent tree.

The extent tree is shared across the whole fs, so you may still be unable
to mount other subvolumes as long as mounting involves reading the extent
tree.

> Wouldn't it be interesting to have an
> option, allowing this to work?

We have new rescue= mount options. IIRC there is rescue=all, which will
try to ignore any non-critical trees.

In that case, you may be able to mount the subvolume RO, as long as
there are no bitflips in that subvolume.
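
As an illustration, here is a small Python sketch that only builds (and
does not run) such a mount command, reusing the device, mount point and
subvolume from the original report. It assumes the running kernel
actually supports rescue=all; an older kernel would be expected to reject
the option. The helper name rescue_mount_command is hypothetical.

```python
import shlex


def rescue_mount_command(device, mountpoint, subvol, rescue="all"):
    """Build the argv for a read-only btrfs mount with a rescue= option."""
    # rescue= is combined with ro, since the goal is only to read data out.
    options = ["ro", f"rescue={rescue}", f"subvol={subvol}"]
    return ["mount", "-t", "btrfs", "-o", ",".join(options), device, mountpoint]


# Values taken from the mount attempt quoted earlier in the thread.
cmd = rescue_mount_command("/dev/sdb1", "/mnt/sd", "@")
print(shlex.join(cmd))
# prints: mount -t btrfs -o ro,rescue=all,subvol=@ /dev/sdb1 /mnt/sd
```

Printing the command rather than executing it leaves the decision to run
it, as root and on a healthy machine, to the reader.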

Thanks,
Qu

> There will be, of course, a processing
> overhead, but probably not as expensive as with RAID 1?
>
> Thank you in advance and I wish you all to be happy and healthy!
>
> Nik.
