Linux Btrfs filesystem development
From: Ellie <el@horse64.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs corruption issue on Pine64 PinePhone
Date: Thu, 17 Oct 2024 22:17:34 +0200	[thread overview]
Message-ID: <4e928770-efc9-4849-9471-e6379d4fe08f@horse64.org> (raw)
In-Reply-To: <33f0ecec-585d-4a02-a8a5-319759401e5f@gmx.com>



On 8/19/24 7:29 AM, Qu Wenruo wrote:
> 
> 
> On 2024/8/19 13:28, ellie wrote:
>> Is there something else I could provide to help track this down? I
>> assume just because the file contents happen to be fine, doesn't mean
>> there wasn't corruption, like for example in the metadata. My apologies
>> for taking up your time.
> 
> This means that the data checksum is somehow incorrect.
> 
> This doesn't sound sane to me, so I can only come up with two possible reasons:
> 
> 1. The checksum algorithm on the platform is insane
>     IIRC the SOC is pretty mature (although it also means old), this
>     doesn't sound possible to me.
> 
> 2. The memory hardware is faulty
>     Thus causing a bitflip in the data csum.
> 
> Other than the above two reasons, I unfortunately cannot come up with
> any others.
> 
> Thanks,
> Qu
> 

I did let a memtest run on this device recently, which didn't reveal 
anything suspicious. However, this device was known to have memory 
hiccups: https://forum.pine64.org/showthread.php?tid=9832&page=10 As far 
as I know they were supposedly resolved, but I wouldn't be able to 
judge. I would assume memtest would show them if still present, but 
again I'm not sure.

The checksum errors seem to be permanent once they happen. I can 
test this again if needed, but I'm fairly sure I reran the btrfs 
checks and the same error came back. I can only make very 
uninformed guesses about what this means, but it might 
imply that there is a problem writing the metadata while the actual 
file is written correctly.
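
One way to narrow this down would be to recompute the checksum of the single affected block by hand. The following is only a minimal sketch, assuming the filesystem uses the default crc32c data checksum and that it corresponds to the standard CRC-32C (Castagnoli) value; `block_crc32c` is a hypothetical helper name, and the offset and length would come from the scrub warning quoted further down (offset 102400, length 4096).

```python
BLOCK_SIZE = 4096  # matches "length 4096" in the scrub warning (assumed block size)


def _make_crc32c_table():
    """Build the lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Standard CRC-32C: initial value all ones, final bit inversion."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def block_crc32c(path, offset, length=BLOCK_SIZE):
    """Hypothetical helper: CRC-32C of one block of a file at a byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return crc32c(f.read(length))
```

Running `block_crc32c` with offset 102400 on both the copy from the card and the original would show whether the data in that block yields the same checksum on both sides. The value stored on disk would have to be read from the csum tree separately, which this sketch does not attempt.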

I hope some of this helps spark some ideas.

Regards,

Ellie

>>>>>>>> On 2024/8/5 15:25, ellie wrote:
>>>>>>>>> On 8/5/24 07:39, ellie wrote:
>>>>>>>>>> Dear kernel list,
>>>>>>>>>>
>>>>>>>>>> I'm hoping this is the right place to send this, but there seems
>>>>>>>>>> to be a btrfs corruption issue on the Pine64 PinePhone:
>>>>>>>>>>
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>>>>>>>>>
>>>>>>>>>> The kernel is 6.9.10. I don't know exactly which additional
>>>>>>>>>> patches postmarketOS (which is based on Alpine) may apply. The
>>>>>>>>>> device is the PinePhone revision 1.2a or newer:
>>>>>>>>>> https://wiki.pine64.org/wiki/PinePhone#Hardware_revisions
>>>>>>>>>> Sadly, there doesn't seem to be a way to check in software
>>>>>>>>>> whether it's 1.2a or 1.2b, and I don't remember which it is.
>>>>>>>>>>
>>>>>>>>>> This is on an SD Card, so an inherently rather unreliable storage
>>>>>>>>>> medium. However, I tried two cards from what I believe to be two
>>>>>>>>>> different vendors, Lexar and SanDisk, and I'm seeing this with
>>>>>>>>>> both.
>>>>>>>>>>
>>>>>>>>>> The PinePhone had various chipset instability issues before, like
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/805, which I
>>>>>>>>>> believe has since been fixed. I have no idea if that's relevant,
>>>>>>>>>> I'm just pointing it out. I also don't know if other filesystems,
>>>>>>>>>> like ext4, which I used before, might also have had corruption
>>>>>>>>>> and just didn't detect it. Not that I ever noticed anything, but
>>>>>>>>>> I'm not sure I necessarily ever would have.
>>>>>>>>
>>>>>>>> In the detailed report in the pmOS issue, you mentioned it's a
>>>>>>>> video file.
>>>>>>>>
>>>>>>>> I'm wondering if all the corruptions you see are in video files,
>>>>>>>> especially if the video files were all recorded on the device.
>>>>>>>>
>>>>>>>> If that's the case, it may be related to the IO pattern,
>>>>>>>> especially if the recording tool is using direct IO and didn't
>>>>>>>> properly wait for writeback of that direct IO.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Qu
>>>>>>>>
>>>>>>>
>>>>>>> Thanks so much for the quick input!
>>>>>>>
>>>>>>> All the files I mentioned in bug reports were written by
>>>>>>> syncthing, so there wasn't any on-device video recording involved.
>>>>>>> I once saw Nheko's database file get corrupted, however, so it's
>>>>>>> apparently not limited to syncthing. I'm guessing video files are
>>>>>>> affected so often simply due to their large size.
>>>>>>
>>>>>> I did a quick clone and search of syncthing.
>>>>>>
>>>>>> There is no direct use of O_DIRECT, so I guess it's not the known
>>>>>> csum mismatch caused by bad sync of direct IO writeback.
>>>>>>
>>>>>> In that case, since the corrupted file was synchronized by
>>>>>> syncthing, can you do a diff of the binary data?
>>>>>>
>>>>>> To avoid the EIO from btrfs, you can use "-o rescue=all,ro" to
>>>>>> mount the sdcard on another system, then compare the binary.
>>>>>> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")
>>>>>>
>>>>>> At this stage, we need to find out what's really causing the problem:
>>>>>> btrfs itself or something lower-level.
>>>>>> (I strongly hope it's not btrfs, but either way it's not going to
>>>>>> end well)
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>> Thanks for your detailed instructions! I was about to do as you said
>>>>> and ran the sync for a few hours, stopped it, and planned to run
>>>>> btrfs scrub this evening. However, I then ran into a hard shutdown
>>>>> due to what might be an upower bug (won't lie, was very annoyed at
>>>>> that point):
>>>>>
>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3073
>>>>>
>>>>> Should I still attach a diff for an affected file I find now? Or are
>>>>> the results going to be worthless if there was a hard shutdown in
>>>>> between, and I need to first fix the filesystem, repeat the sync
>>>>> test, and repeat finding a new corruption error to diff?
>>>>
>>>> As long as you didn't touch those files, and scrub still reports errors
>>>> on that file, the diff is still very helpful to provide some clue.
>>>>
>>>
>>> I finally had a new corrupted file pop up, this was actually after any
>>> unintended sudden shutdown so there shouldn't be any interference from
>>> that:
>>>
>>> [128958.860335] BTRFS error (device dm-0): unable to fixup (regular) error at logical 133906497536 on dev /dev/mapper/root physical 135089684480
>>> [128958.862548] BTRFS warning (device dm-0): checksum error at logical 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257, inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)
>>>
>>> However, when manually mounting the filesystem on the computer where
>>> the file originates from and where the undamaged original file is:
>>>
>>> /mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/mapper/blamap p64
>>> /mnt # ls p64/
>>> ellie
>>> /mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3 ./
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt #
>>>
>>> It seems like the file is exactly the same, which I assume isn't
>>> meant to happen.
>>>
>>> I'm not sure what that implies, but I hope it's helpful info!
>>>
>>> Regards,
>>>
>>> Ellie
>>>
>>
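
The binary comparison suggested in the quoted thread (an xxd dump of each file followed by diff) can also be done block by block, which reports results at the same 4096-byte granularity as the scrub warning. This is a minimal sketch, not a tool used anywhere in the thread; `differing_blocks` is a hypothetical name and the 4096-byte block size is an assumption taken from the "length 4096" in the log.

```python
def differing_blocks(path_a, path_b, block_size=4096):
    """Return the byte offsets of all blocks whose contents differ.

    A length mismatch shows up as a differing final block.
    """
    offsets = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                offsets.append(offset)
            offset += block_size
    return offsets
```

An empty list means the two files are byte-identical, which is what the plain diff in the quoted transcript already showed. If the block at offset 102400 had been damaged, that offset would be expected to appear in the list.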

