From: Ellie <el@horse64.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs corruption issue on Pine64 PinePhone
Date: Thu, 17 Oct 2024 22:17:34 +0200
Message-ID: <4e928770-efc9-4849-9471-e6379d4fe08f@horse64.org>
In-Reply-To: <33f0ecec-585d-4a02-a8a5-319759401e5f@gmx.com>
On 8/19/24 7:29 AM, Qu Wenruo wrote:
>
>
> On 2024/8/19 13:28, ellie wrote:
>> Is there something else I could provide to help track this down? I
>> assume just because the file contents happen to be fine, doesn't mean
>> there wasn't corruption, like for example in the metadata. My apologies
>> for taking up your time.
>
> This means the data checksum is somehow incorrect.
>
> This doesn't sound sane to me, so I can only come up with two possible reasons:
>
> 1. The checksum algorithm on the platform is insane
> IIRC the SOC is pretty mature (although it also means old), this
> doesn't sound possible to me.
>
> 2. Memory hardware is incorrect
> Thus causing bitflip for data csum.
>
> Other than above two reasons, I can not come up with other reasons
> unfortunately.
>
> Thanks,
> Qu
>
I recently ran a memtest on this device, which didn't reveal
anything suspicious. However, this device was known to have memory
hiccups: https://forum.pine64.org/showthread.php?tid=9832&page=10
As far as I know they were supposedly resolved, but I wouldn't be able
to judge. I would assume memtest would show them if still present, but
again I'm not sure.

The checksum errors seem to be permanent whenever they happen. I can
test this again if needed, but I'm pretty sure I recall rerunning btrfs
checks and the same error coming back up. I can only make very
uninformed guesses about what this means, but I guess this could
imply there is a problem writing the metadata while the actual file is
written correctly.
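
For reference, here is a minimal sketch of how the rescued copy and the
original could be compared block by block, so that the 4096-byte block
flagged in the scrub warning quoted below (offset 102400) can be checked
directly. The function name differing_blocks and the file names in the
usage note are hypothetical, and the 4096-byte block size is only taken
from the "length 4096" in that log line. This is an illustration, not
the exact procedure that was run.

```python
BLOCK_SIZE = 4096  # assumption: taken from "length 4096" in the scrub warning


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the byte offsets of every block that differs between two files."""
    offsets = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                # Both files are exhausted.
                break
            if a != b:
                # Also catches the case where one file is shorter.
                offsets.append(offset)
            offset += block_size
    return offsets
```

Calling differing_blocks("rescued.mp3", "original.mp3") with the two
copies would return an empty list if the contents are identical, which
would match the empty diff output quoted below. If 102400 showed up in
the list, the flagged block itself would differ.
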
I hope some of this helps spark some ideas.

Regards,
Ellie

>>>>>>>> On 2024/8/5 15:25, ellie wrote:
>>>>>>>>> On 8/5/24 07:39, ellie wrote:
>>>>>>>>>> Dear kernel list,
>>>>>>>>>>
>>>>>>>>>> I'm hoping this is the right place to send this. But there seems to be
>>>>>>>>>> a btrfs corruption issue on the Pine64 PinePhone:
>>>>>>>>>>
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>>>>>>>>>
>>>>>>>>>> The kernel is 6.9.10, I wouldn't know what exact additional patches
>>>>>>>>>> may be used by postmarketOS (which is based on Alpine). The device is
>>>>>>>>>> the PinePhone revision 1.2a or newer
>>>>>>>>>> https://wiki.pine64.org/wiki/PinePhone#Hardware_revisions
>>>>>>>>>> sadly there doesn't seem to be a way to check in software if it's
>>>>>>>>>> 1.2a or 1.2b, and I don't remember which it is.
>>>>>>>>>>
>>>>>>>>>> This is on an SD Card, so an inherently rather unreliable storage
>>>>>>>>>> medium. However, I tried two cards from what I believe to be two
>>>>>>>>>> different vendors, Lexar and SanDisk, and I'm seeing this with
>>>>>>>>>> both.
>>>>>>>>>>
>>>>>>>>>> The PinePhone had various chipset instability issues before, like
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I
>>>>>>>>>> believe
>>>>>>>>>> has however been fixed since. I have no idea if that's
>>>>>>>>>> relevant, I'm
>>>>>>>>>> just pointing it out. I also don't know if other filesystems,
>>>>>>>>>> like
>>>>>>>>>> ext4 that I used before, might have also had corruption and just
>>>>>>>>>> didn't detect it. Not that I ever noticed anything, but I'm not
>>>>>>>>>> sure I
>>>>>>>>>> necessarily ever would have.
>>>>>>>>
>>>>>>>> In the detailed report in pmOS issue, you mentioned it's a video
>>>>>>>> file.
>>>>>>>>
>>>>>>>> I'm wondering if all the corruptions you see are from video files,
>>>>>>>> especially if the video files are all recorded on the file.
>>>>>>>>
>>>>>>>> If that's the case, it may be related to the IO pattern,
>>>>>>>> especially if
>>>>>>>> the recording tool is using direct IO and didn't have proper
>>>>>>>> writeback
>>>>>>>> wait for those direct IO.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Qu
>>>>>>>>
>>>>>>>
>>>>>>> Thanks so much for the quick input!
>>>>>>>
>>>>>>> All the files I mentioned in bug reports were written by
>>>>>>> syncthing, so
>>>>>>> there wasn't any on-device video recording involved. I once saw
>>>>>>> Nheko's
>>>>>>> database file corrupt however, so it's apparently not limited to
>>>>>>> syncthing. I'm guessing video files are affected so often simply
>>>>>>> due to
>>>>>>> their large size.
>>>>>>
>>>>>> I did a quick clone and search of syncthing.
>>>>>>
>>>>>> There is no usage of O_DIRECT directly, so I guess it's not the known
>>>>>> csum mismatch caused by bad sync of direct IO writeback.
>>>>>>
>>>>>> In that case, since the corrupted file is syncthing synchronized, can
>>>>>> you do a diff of the binary data?
>>>>>>
>>>>>> To avoid the EIO from btrfs, you can use "-o rescue=all,ro" to
>>>>>> mount the
>>>>>> sdcard on another system, then compare the binary.
>>>>>> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")
>>>>>>
>>>>>> At this stage, we need to find out what's really causing the problem,
>>>>>> btrfs itself or something at a lower level.
>>>>>> (I strongly hope it's not btrfs, but either way it's not going to
>>>>>> end up
>>>>>> well)
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>> Thanks for your detailed instructions! I was about to do as you said
>>>>> and
>>>>> ran the sync for a few hours, stopped it, and planned to run btrfs
>>>>> scrub
>>>>> this evening. However, I then ran into a hard shutdown due to what
>>>>> might
>>>>> be an upower bug (won't lie, was very annoyed at that point):
>>>>>
>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3073
>>>>>
>>>>> Should I still attach a diff for an affected file I find now? Or are
>>>>> the
>>>>> results going to be worthless if there was a hard shutdown in between,
>>>>> and I need to first fix the filesystem, repeat the sync test, and
>>>>> repeat
>>>>> finding a new corruption error to diff?
>>>>
>>>> As long as you didn't touch those files, and scrub still reports errors
>>>> on that file, the diff is still very helpful to provide some clue.
>>>>
>>>
>>> I finally had a new corrupted file pop up, this was actually after any
>>> unintended sudden shutdown so there shouldn't be any interference from
>>> that:
>>>
>>> [128958.860335] BTRFS error (device dm-0): unable to fixup (regular) error at logical 133906497536 on dev /dev/mapper/root physical 135089684480
>>> [128958.862548] BTRFS warning (device dm-0): checksum error at logical 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257, inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)
>>>
>>> However, when manually mounting the file on the computer where it
>>> originates from and where the undamaged original file is:
>>>
>>> /mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/mapper/blamap p64
>>> /mnt # ls p64/
>>> ellie
>>> /mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3 ./
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt #
>>>
>>> It seems like the file is exactly the same, which I assume isn't meant to
>>> happen.
>>>
>>> I'm not sure what that implies, but I hope it's helpful info!
>>>
>>> Regards,
>>>
>>> Ellie
>>>
>>