* btrfs corruption issue on Pine64 PinePhone
@ 2024-08-05 5:39 ellie
2024-08-05 5:55 ` ellie
2024-10-02 7:20 ` ellie
0 siblings, 2 replies; 14+ messages in thread
From: ellie @ 2024-08-05 5:39 UTC (permalink / raw)
To: linux-btrfs
Dear kernel list,
I'm hoping this is the right place to send this. But there seems to be a
btrfs corruption issue on the Pine64 PinePhone:
https://gitlab.com/postmarketOS/pmaports/-/issues/3058
The kernel is 6.9.10. I don't know exactly which additional patches
postmarketOS (which is based on Alpine) applies. The device is the
PinePhone, hardware revision 1.2a or newer:
https://wiki.pine64.org/wiki/PinePhone#Hardware_revisions
Sadly, there doesn't seem to be a way to check in software whether it's
1.2a or 1.2b, and I don't remember which it is.
This is on an SD Card, so an inherently rather unreliable storage
medium. However, I tried two cards from what I believe to be two
different vendors, Lexar and SanDisk, and I'm seeing this with both.
The PinePhone had various chipset instability issues before, like
https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I believe
has however been fixed since. I have no idea if that's relevant, I'm
just pointing it out. I also don't know if other filesystems, like ext4
that I used before, might have also had corruption and just didn't
detect it. Not that I ever noticed anything, but I'm not sure I
necessarily ever would have.
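On a filesystem without data checksums, such as ext4, silent corruption
can in principle be caught at the application level by recording file
hashes and re-checking them later. A minimal sketch in Python, not part
of the original report; the function names (sha256_of, build_manifest,
changed_files) are made up for illustration:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every regular file under root (by relative path) to its SHA-256."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are missing or whose hash no longer matches."""
    bad = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel)
    return bad
```

Building the manifest on the machine that holds the originals and running
changed_files() against the copy on the SD card would flag files whose
contents changed. On btrfs, reading a block that fails its checksum
normally returns an I/O error instead of data, which this sketch does
not handle.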
Regards,
Ellie
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-05 5:39 btrfs corruption issue on Pine64 PinePhone ellie
@ 2024-08-05 5:55 ` ellie
2024-08-05 6:10 ` Qu Wenruo
2024-10-02 7:20 ` ellie
1 sibling, 1 reply; 14+ messages in thread
From: ellie @ 2024-08-05 5:55 UTC (permalink / raw)
To: linux-btrfs
I forgot to specify one testing detail: reproducing this seems to require
writing a couple of gigabytes to the SD card. That's an additional
difficulty, since I assume doing that too often will simply kill the
card for real, which limits how quickly and how often this can be tested.
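One way to keep such a test repeatable without a reference copy is to
write a deterministic pattern and verify it on read-back. The following
Python sketch is only an illustration under assumptions: the 4096-byte
block size, the SHAKE-256 based generator, and all names (block_for,
write_pattern, bad_blocks) are hypothetical and not taken from the
report.

```python
import hashlib
import os
from pathlib import Path

BLOCK = 4096  # bytes per generated block (assumed, not from the report)


def block_for(seed: bytes, index: int) -> bytes:
    """Derive block `index` deterministically from `seed` using SHAKE-256."""
    return hashlib.shake_256(seed + index.to_bytes(8, "little")).digest(BLOCK)


def write_pattern(path: Path, seed: bytes, blocks: int) -> None:
    """Write `blocks` deterministic blocks and push them to the device."""
    with path.open("wb") as fh:
        for i in range(blocks):
            fh.write(block_for(seed, i))
        fh.flush()
        os.fsync(fh.fileno())


def bad_blocks(path: Path, seed: bytes, blocks: int) -> list[int]:
    """Return the indices of blocks that do not match the expected pattern."""
    bad = []
    with path.open("rb") as fh:
        for i in range(blocks):
            if fh.read(BLOCK) != block_for(seed, i):
                bad.append(i)
    return bad
```

A read-back right after writing is usually served from the page cache,
so the filesystem would need to be remounted (or the caches dropped)
before bad_blocks() says anything about what is actually on the card.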
Regards,
Ellie
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-05 5:55 ` ellie
@ 2024-08-05 6:10 ` Qu Wenruo
2024-08-05 6:20 ` ellie
0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-08-05 6:10 UTC (permalink / raw)
To: ellie, linux-btrfs
In the detailed report in the pmOS issue, you mentioned it's a video file.
I'm wondering if all the corruption you see is in video files,
especially video files that were recorded on the device.
If that's the case, it may be related to the IO pattern, especially if
the recording tool uses direct IO and doesn't properly wait for the
writeback of that direct IO.
Thanks,
Qu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-05 6:10 ` Qu Wenruo
@ 2024-08-05 6:20 ` ellie
2024-08-05 6:34 ` Qu Wenruo
0 siblings, 1 reply; 14+ messages in thread
From: ellie @ 2024-08-05 6:20 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
Thanks so much for the quick input!
All the files I mentioned in the bug reports were written by syncthing, so
there wasn't any on-device video recording involved. I once saw Nheko's
database file get corrupted, however, so it's apparently not limited to
syncthing. I'm guessing video files are affected so often simply due to
their large size.
Regards,
Ellie
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-05 6:20 ` ellie
@ 2024-08-05 6:34 ` Qu Wenruo
2024-08-06 16:02 ` ellie
0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-08-05 6:34 UTC (permalink / raw)
To: ellie, linux-btrfs
I did a quick clone and search of syncthing.
There is no direct use of O_DIRECT, so I guess it's not the known
csum mismatch caused by badly synchronized direct IO writeback.
In that case, since the corrupted file was synchronized by syncthing, can
you do a diff of the binary data?
To avoid the EIO from btrfs, you can mount the SD card on another system
with "-o rescue=all,ro", then compare the binaries.
(e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff good.xxd bad.xxd")
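For large files, a block-wise comparison can be easier to read than a
full hex diff. A minimal Python sketch of that idea follows; the
4096-byte block size is an assumption about the filesystem's data block
size, and differing_blocks is a made-up helper name.

```python
from pathlib import Path

BLOCK = 4096  # assumed to match the filesystem's data block size


def differing_blocks(good: Path, bad: Path, block: int = BLOCK) -> list[int]:
    """Return the byte offsets of all blocks that differ between two files."""
    offsets = []
    with good.open("rb") as g, bad.open("rb") as b:
        offset = 0
        while True:
            good_block, bad_block = g.read(block), b.read(block)
            if not good_block and not bad_block:
                break  # both files are exhausted
            if good_block != bad_block:
                offsets.append(offset)
            offset += block
    return offsets
```

An empty result means the two files are byte-for-byte identical; a length
mismatch shows up as a differing final block.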
At this stage, we need to find out what's really causing the problem:
btrfs itself or something at a lower level.
(I strongly hope it's not btrfs, but either way it's not going to end
well)
Thanks,
Qu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-05 6:34 ` Qu Wenruo
@ 2024-08-06 16:02 ` ellie
2024-08-06 21:55 ` Qu Wenruo
0 siblings, 1 reply; 14+ messages in thread
From: ellie @ 2024-08-06 16:02 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
Thanks for your detailed instructions! I was about to do as you said and
ran the sync for a few hours, stopped it, and planned to run btrfs scrub
this evening. However, I then ran into a hard shutdown due to what might
be an upower bug (won't lie, was very annoyed at that point):
https://gitlab.com/postmarketOS/pmaports/-/issues/3073
Should I still attach a diff for an affected file I find now? Or are the
results going to be worthless if there was a hard shutdown in between,
and I need to first fix the filesystem, repeat the sync test, and repeat
finding a new corruption error to diff?
Regards,
Ellie
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-06 16:02 ` ellie
@ 2024-08-06 21:55 ` Qu Wenruo
2024-08-08 11:31 ` ellie
0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-08-06 21:55 UTC (permalink / raw)
To: ellie, linux-btrfs
As long as you didn't touch those files, and scrub still reports errors
on that file, the diff is still very helpful to provide some clue.
Thanks,
Qu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-06 21:55 ` Qu Wenruo
@ 2024-08-08 11:31 ` ellie
2024-08-19 3:58 ` ellie
0 siblings, 1 reply; 14+ messages in thread
From: ellie @ 2024-08-08 11:31 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
I finally had a new corrupted file pop up, this was actually after any
unintended sudden shutdown so there shouldn't be any interference from that:
[128958.860335] BTRFS error (device dm-0): unable to fixup (regular) error at logical 133906497536 on dev /dev/mapper/root physical 135089684480
[128958.862548] BTRFS warning (device dm-0): checksum error at logical 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257, inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)
However, when manually mounting the filesystem on the computer the file
originates from, where the undamaged original file is:
/mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/mapper/blamap p64
/mnt # ls p64/
ellie
/mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3 ./
/mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
/mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
/mnt #
It seems like the file is exactly the same, which I assume isn't meant to
happen.
I'm not sure what that implies, but I hope it's helpful info!
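The scrub warning above names the exact 4 KiB range of the file that
failed its checksum. As a rough illustration, the fields can be pulled
out of the log line with a small Python parser. The regular expression
is tailored to the single warning quoted in this thread and may not match
other kernel versions; parse_csum_warning and affected_range are
hypothetical helper names.

```python
import re

# Pattern written against the one warning line quoted in this thread.
_CSUM_WARNING = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>[^,\s]+), "
    r"physical (?P<physical>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links (?P<links>\d+) "
    r"\(path: (?P<path>.*)\)\s*$"
)
_NUMERIC = ("logical", "physical", "root", "inode", "offset", "length", "links")


def parse_csum_warning(line: str) -> dict:
    """Extract the fields of a btrfs scrub checksum warning into a dict."""
    match = _CSUM_WARNING.search(line)
    if match is None:
        raise ValueError("not a btrfs checksum warning in the expected format")
    fields = match.groupdict()
    for key in _NUMERIC:
        fields[key] = int(fields[key])
    return fields


def affected_range(fields: dict) -> range:
    """Byte range of the file covered by the reported block."""
    return range(fields["offset"], fields["offset"] + fields["length"])
```

For the warning above this yields file offsets 102400 up to (but not
including) 106496, so only that one block of the mp3 needs a close look.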
Regards,
Ellie
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-08 11:31 ` ellie
@ 2024-08-19 3:58 ` ellie
2024-08-19 5:29 ` Qu Wenruo
0 siblings, 1 reply; 14+ messages in thread
From: ellie @ 2024-08-19 3:58 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
Is there something else I could provide to help track this down? I
assume that just because the file contents happen to be fine doesn't mean
there wasn't corruption, for example in the metadata. My apologies
for taking up your time.
Regards,
Ellie
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-19 3:58 ` ellie
@ 2024-08-19 5:29 ` Qu Wenruo
2024-08-19 8:16 ` ellie
2024-10-17 20:17 ` Ellie
0 siblings, 2 replies; 14+ messages in thread
From: Qu Wenruo @ 2024-08-19 5:29 UTC (permalink / raw)
To: ellie, linux-btrfs
On 2024/8/19 13:28, ellie wrote:
> Is there something else I could provide to help track this down? I
> assume just because the file contents happen to be fine, doesn't mean
> there wasn't corruption, like for example in the metadata. My apologies
> for taking up your time.
This means that, somehow, the data checksum itself is incorrect.
This doesn't sound sane to me, so I can only come up with two possible reasons:
1. The checksum algorithm on the platform is insane
   IIRC the SoC is pretty mature (although that also means old), so this
   doesn't sound likely to me.
2. The memory hardware is faulty
   Thus causing bitflips in the data csum.
Other than the above two reasons, I unfortunately can not come up with
any others.
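A rough way to probe the first reason from userspace is to recompute the checksum of the reported block (offset 102400, length 4096 in the scrub warning) on both the phone and the desktop and compare the two results. The sketch below is only an illustration: it assumes the filesystem uses crc32c, the btrfs default, which the thread does not confirm, and the helper names are made up for this example.

```python
# Sketch only: assumes crc32c (the btrfs default) is the checksum in use.
# The thread does not state which algorithm the filesystem was created with.

def _make_crc32c_table():
    """Build the 256-entry lookup table for CRC-32C (Castagnoli)."""
    poly = 0x82F63B78  # bit-reflected form of polynomial 0x1EDC6F41
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Return the CRC-32C of data (initial value and final XOR 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def block_crc32c(path, offset=102400, length=4096):
    """Checksum one block of a file; defaults match the scrub warning."""
    with open(path, "rb") as f:
        f.seek(offset)
        return crc32c(f.read(length))
```

Running `block_crc32c()` on the copied file on each machine and comparing the numbers would show whether both platforms compute the same value for identical bytes. This pure-Python code does not go through the kernel's crc32c implementation, so a match does not clear the kernel code path.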
Thanks,
Qu
>
> Regards,
>
> Ellie
>
> On 8/8/24 13:31, ellie wrote:
>> On 8/6/24 23:55, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/8/7 01:32, ellie wrote:
>>>>
>>>>
>>>> On 8/5/24 08:34, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2024/8/5 15:50, ellie wrote:
>>>>>>
>>>>>>
>>>>>> On 8/5/24 08:10, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/8/5 15:25, ellie wrote:
>>>>>>>> On 8/5/24 07:39, ellie wrote:
>>>>>>>>> Dear kernel list,
>>>>>>>>>
>>>>>>>>> I'm hoping this is the right place to sent this. But there seems
>>>>>>>>> to be
>>>>>>>>> a btrfs corruption issue on the Pine64 PinePhone:
>>>>>>>>>
>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>>>>>>>>
>>>>>>>>> The kernel is 6.9.10, I wouldn't know what exact additional
>>>>>>>>> patches
>>>>>>>>> may be used by postmarketOS (which is based on Alpine). The
>>>>>>>>> device is
>>>>>>>>> the PinePhone revision 1.2a or newer https://wiki.pine64.org/wiki/
>>>>>>>>> PinePhone#Hardware_revisions sadly there doesn't seem to be a
>>>>>>>>> way to
>>>>>>>>> check in software if it's 1.2a or 1.2b, and I don't remember which
>>>>>>>>> it is.
>>>>>>>>>
>>>>>>>>> This is on an SD Card, so an inherently rather unreliable storage
>>>>>>>>> medium. However, I tried two cards from what I believe to be two
>>>>>>>>> different vendors, Lexar and SanDisk, and I'm seeing this with
>>>>>>>>> both.
>>>>>>>>>
>>>>>>>>> The PinePhone had various chipset instability issues before, like
>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I
>>>>>>>>> believe
>>>>>>>>> has however been fixed since. I have no idea if that's
>>>>>>>>> relevant, I'm
>>>>>>>>> just pointing it out. I also don't know if other filesystems, like
>>>>>>>>> ext4 that I used before, might have also had corruption and just
>>>>>>>>> didn't detect it. Not that I ever noticed anything, but I'm not
>>>>>>>>> sure I
>>>>>>>>> necessarily ever would have.
>>>>>>>
>>>>>>> In the detailed report in pmOS issue, you mentioned it's a video
>>>>>>> file.
>>>>>>>
>>>>>>> I'm wondering if all the corruptions you see are from video files,
>>>>>>> especially if the video files are all recorded on the file.
>>>>>>>
>>>>>>> If that's the case, it may be related to the IO pattern,
>>>>>>> especially if
>>>>>>> the recording tool is using direct IO and didn't have proper
>>>>>>> writeback
>>>>>>> wait for those direct IO.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Qu
>>>>>>>
>>>>>>
>>>>>> Thanks so much for the quick input!
>>>>>>
>>>>>> All the files I mentioned in bug reports were written by
>>>>>> syncthing, so
>>>>>> there wasn't any on-device video recording involved. I once saw
>>>>>> Nheko's
>>>>>> database file corrupt however, so it's apparently not limited to
>>>>>> syncthing. I'm guessing video files are affected so often simply
>>>>>> due to
>>>>>> their large size.
>>>>>
>>>>> I did a quick clone and search of syncthing.
>>>>>
>>>>> There is no usage of O_DIRECT directly, so I guess it's not the known
>>>>> csum mismatch caused by bad sync of direct IO writeback.
>>>>>
>>>>> In that case, since the corrupted file is syncthing synchronized, can
>>>>> you do a diff of the binary data?
>>>>>
>>>>> To avoid the EIO from btrfs, you can use "-o rescue=all,ro" to
>>>>> mount the
>>>>> sdcard on another system, then compare the binary.
>>>>> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")
>>>>>
>>>>> At this stage, we need to find out what's really causing the problem,
>>>>> the btrfs itself or some thing lower level.
>>>>> (I strongly hope it's not btrfs, but either way it's not going to
>>>>> end up
>>>>> well)
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>> Thanks for your detailed instructions! I was about to do as you said
>>>> and
>>>> ran the sync for a few hours, stopped it, and planned to run btrfs
>>>> scrub
>>>> this evening. However, I then ran into a hard shutdown due to what
>>>> might
>>>> be an upower bug (won't lie, was very annoyed at that point):
>>>>
>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3073
>>>>
>>>> Should I still attach a diff for an affected file I find now? Or are
>>>> the
>>>> results going to be worthless if there was a hard shutdown in between,
>>>> and I need to first fix the filesystem, repeat the sync test, and
>>>> repeat
>>>> finding a new corruption error to diff?
>>>
>>> As long as you didn't touch those files, and scrub still reports errors
>>> on that file, the diff is still very helpful to provide some clue.
>>>
>>
>> I finally had a new corrupted file pop up, this was actually after any
>> unintended sudden shutdown so there shouldn't be any interference from
>> that:
>>
>> [128958.860335] BTRFS error (device dm-0): unable to fixup (regular)
>> error at logical 133906497536 on dev /dev/mapper/root physical
>> 135089684480
>> [128958.862548] BTRFS warning (device dm-0): checksum error at logical
>> 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257,
>> inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/
>> Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)
>>
>> However, when manually mounting the file on the computer where it
>> originates from and where the undamaged original file is:
>>
>> /mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/
>> mapper/blamap p64
>> /mnt # ls p64/
>> ellie
>> /mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\
>> Amn\ \(2000\)/06\ City\ Gates.mp3 ./
>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\
>> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\
>> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>> /mnt #
>>
>> It seems like file is exactly the same, which I assume isn't meant to
>> happen.
>>
>> I'm not sure what that implies, but I hope it's helpful info!
>>
>> Regards,
>>
>> Ellie
>>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-19 5:29 ` Qu Wenruo
@ 2024-08-19 8:16 ` ellie
2024-10-17 20:17 ` Ellie
1 sibling, 0 replies; 14+ messages in thread
From: ellie @ 2024-08-19 8:16 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 8/19/24 07:29, Qu Wenruo wrote:
>
>
> On 2024/8/19 13:28, ellie wrote:
>> Is there something else I could provide to help track this down? I
>> assume just because the file contents happen to be fine, doesn't mean
>> there wasn't corruption, like for example in the metadata. My apologies
>> for taking up your time.
>
> This means that, somehow, the data checksum itself is incorrect.
>
> This doesn't sound sane to me, so I can only come up with two possible reasons:
>
> 1. The checksum algorithm on the platform is insane
>    IIRC the SoC is pretty mature (although that also means old), so this
>    doesn't sound likely to me.
>
> 2. The memory hardware is faulty
>    Thus causing bitflips in the data csum.
>
> Other than the above two reasons, I unfortunately can not come up with
> any others.
>
> Thanks,
> Qu
Thanks so much for your reply! I was curious and did some more tests.
At first glance, the memory seems to be doing fine (the device has
3 GB, so what I can test from userspace is a bit limited):
# memtester 1024
memtester version 4.6.0 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got 1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 3:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 4:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
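Since memtester only covers the 1024 MB it managed to lock, a complementary userspace check is to checksum the same buffer many times and count results that differ. The sketch below is a minimal illustration with made-up function names. It uses `zlib.crc32` (plain CRC-32, not the crc32c that btrfs uses by default), so it exercises memory and CPU stability in general rather than the kernel's checksum code.

```python
import os
import zlib


def count_checksum_mismatches(buf: bytes, rounds: int = 200) -> int:
    """Checksum the same buffer repeatedly and count differing results."""
    reference = zlib.crc32(buf)
    mismatches = 0
    for _ in range(rounds):
        if zlib.crc32(buf) != reference:
            mismatches += 1
    return mismatches


def run_check(size_mib: int = 64, rounds: int = 200) -> int:
    """Fill a buffer with random data and run the repeated checksum test."""
    buf = os.urandom(size_mib * 1024 * 1024)
    return count_checksum_mismatches(buf, rounds)
```

Any non-zero result would hint at unstable memory or computation. A zero result proves little, because the buffer may never land on an affected memory region.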
>
>>
>> Regards,
>>
>> Ellie
>>
>> On 8/8/24 13:31, ellie wrote:
>>> On 8/6/24 23:55, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/8/7 01:32, ellie wrote:
>>>>>
>>>>>
>>>>> On 8/5/24 08:34, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/8/5 15:50, ellie wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/5/24 08:10, Qu Wenruo wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/8/5 15:25, ellie wrote:
>>>>>>>>> On 8/5/24 07:39, ellie wrote:
>>>>>>>>>> Dear kernel list,
>>>>>>>>>>
>>>>>>>>>> I'm hoping this is the right place to sent this. But there seems
>>>>>>>>>> to be
>>>>>>>>>> a btrfs corruption issue on the Pine64 PinePhone:
>>>>>>>>>>
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>>>>>>>>>
>>>>>>>>>> The kernel is 6.9.10, I wouldn't know what exact additional
>>>>>>>>>> patches
>>>>>>>>>> may be used by postmarketOS (which is based on Alpine). The
>>>>>>>>>> device is
>>>>>>>>>> the PinePhone revision 1.2a or newer https://wiki.pine64.org/
>>>>>>>>>> wiki/
>>>>>>>>>> PinePhone#Hardware_revisions sadly there doesn't seem to be a
>>>>>>>>>> way to
>>>>>>>>>> check in software if it's 1.2a or 1.2b, and I don't remember
>>>>>>>>>> which
>>>>>>>>>> it is.
>>>>>>>>>>
>>>>>>>>>> This is on an SD Card, so an inherently rather unreliable storage
>>>>>>>>>> medium. However, I tried two cards from what I believe to be two
>>>>>>>>>> different vendors, Lexar and SanDisk, and I'm seeing this with
>>>>>>>>>> both.
>>>>>>>>>>
>>>>>>>>>> The PinePhone had various chipset instability issues before, like
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I
>>>>>>>>>> believe
>>>>>>>>>> has however been fixed since. I have no idea if that's
>>>>>>>>>> relevant, I'm
>>>>>>>>>> just pointing it out. I also don't know if other filesystems,
>>>>>>>>>> like
>>>>>>>>>> ext4 that I used before, might have also had corruption and just
>>>>>>>>>> didn't detect it. Not that I ever noticed anything, but I'm not
>>>>>>>>>> sure I
>>>>>>>>>> necessarily ever would have.
>>>>>>>>
>>>>>>>> In the detailed report in pmOS issue, you mentioned it's a video
>>>>>>>> file.
>>>>>>>>
>>>>>>>> I'm wondering if all the corruptions you see are from video files,
>>>>>>>> especially if the video files are all recorded on the file.
>>>>>>>>
>>>>>>>> If that's the case, it may be related to the IO pattern,
>>>>>>>> especially if
>>>>>>>> the recording tool is using direct IO and didn't have proper
>>>>>>>> writeback
>>>>>>>> wait for those direct IO.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Qu
>>>>>>>>
>>>>>>>
>>>>>>> Thanks so much for the quick input!
>>>>>>>
>>>>>>> All the files I mentioned in bug reports were written by
>>>>>>> syncthing, so
>>>>>>> there wasn't any on-device video recording involved. I once saw
>>>>>>> Nheko's
>>>>>>> database file corrupt however, so it's apparently not limited to
>>>>>>> syncthing. I'm guessing video files are affected so often simply
>>>>>>> due to
>>>>>>> their large size.
>>>>>>
>>>>>> I did a quick clone and search of syncthing.
>>>>>>
>>>>>> There is no usage of O_DIRECT directly, so I guess it's not the known
>>>>>> csum mismatch caused by bad sync of direct IO writeback.
>>>>>>
>>>>>> In that case, since the corrupted file is syncthing synchronized, can
>>>>>> you do a diff of the binary data?
>>>>>>
>>>>>> To avoid the EIO from btrfs, you can use "-o rescue=all,ro" to
>>>>>> mount the
>>>>>> sdcard on another system, then compare the binary.
>>>>>> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")
>>>>>>
>>>>>> At this stage, we need to find out what's really causing the problem,
>>>>>> the btrfs itself or some thing lower level.
>>>>>> (I strongly hope it's not btrfs, but either way it's not going to
>>>>>> end up
>>>>>> well)
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>> Thanks for your detailed instructions! I was about to do as you said
>>>>> and
>>>>> ran the sync for a few hours, stopped it, and planned to run btrfs
>>>>> scrub
>>>>> this evening. However, I then ran into a hard shutdown due to what
>>>>> might
>>>>> be an upower bug (won't lie, was very annoyed at that point):
>>>>>
>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3073
>>>>>
>>>>> Should I still attach a diff for an affected file I find now? Or are
>>>>> the
>>>>> results going to be worthless if there was a hard shutdown in between,
>>>>> and I need to first fix the filesystem, repeat the sync test, and
>>>>> repeat
>>>>> finding a new corruption error to diff?
>>>>
>>>> As long as you didn't touch those files, and scrub still reports errors
>>>> on that file, the diff is still very helpful to provide some clue.
>>>>
>>>
>>> I finally had a new corrupted file pop up, this was actually after any
>>> unintended sudden shutdown so there shouldn't be any interference from
>>> that:
>>>
>>> [128958.860335] BTRFS error (device dm-0): unable to fixup (regular)
>>> error at logical 133906497536 on dev /dev/mapper/root physical
>>> 135089684480
>>> [128958.862548] BTRFS warning (device dm-0): checksum error at logical
>>> 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257,
>>> inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/
>>> Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)
>>>
>>> However, when manually mounting the file on the computer where it
>>> originates from and where the undamaged original file is:
>>>
>>> /mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/
>>> mapper/blamap p64
>>> /mnt # ls p64/
>>> ellie
>>> /mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\
>>> Amn\ \(2000\)/06\ City\ Gates.mp3 ./
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\
>>> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\
>>> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt #
>>>
>>> It seems like file is exactly the same, which I assume isn't meant to
>>> happen.
>>>
>>> I'm not sure what that implies, but I hope it's helpful info!
>>>
>>> Regards,
>>>
>>> Ellie
>>>
>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-05 5:39 btrfs corruption issue on Pine64 PinePhone ellie
2024-08-05 5:55 ` ellie
@ 2024-10-02 7:20 ` ellie
2024-12-16 22:53 ` BTRFS hangs and causes semi-freezes on PinePhone Ellie
1 sibling, 1 reply; 14+ messages in thread
From: ellie @ 2024-10-02 7:20 UTC (permalink / raw)
To: linux-btrfs
An update: I've largely ignored the corruption issue for now, which is
somewhat feasible since it doesn't seem to happen outside of large,
sustained write loads.
But there seems to be another, larger issue with btrfs on this device.
When syncthing scans on all its threads, seemingly maxing out I/O (at
least according to iotop), all other apps, including Phosh (the window
manager), freeze every 5 seconds or so for long durations of 10+ seconds.
It seems like syncthing's reads alone block vital reads needed for even
basic continued operation of other processes. Something about its tuning
seems to fundamentally not work on this low-spec hardware. With ext4, I
had no such issues.
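To put numbers on these freezes, one option is the kernel's pressure stall information, which reports the share of time tasks were stalled on I/O. The sketch below parses the text format of `/proc/pressure/io`. It assumes the kernel was built with PSI support, which is not confirmed for the postmarketOS kernel in this thread, and the function names are illustrative.

```python
def parse_pressure(text: str) -> dict:
    """Parse PSI text such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            values[key] = float(value)
        result[kind] = values
    return result


def read_io_pressure(path="/proc/pressure/io"):
    """Read and parse the I/O pressure file (requires PSI in the kernel)."""
    with open(path) as f:
        return parse_pressure(f.read())
```

Sampling `read_io_pressure()["full"]["avg10"]` while syncthing scans, once on btrfs and once on ext4, would give a comparable figure for how much of the last 10 seconds all non-idle tasks spent stalled on I/O.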
Sorry if my rambling isn't useful, I'm not experienced at reporting
filesystem hiccups.
Regards,
Ellie
On 8/5/24 07:39, ellie wrote:
> Dear kernel list,
>
> I'm hoping this is the right place to sent this. But there seems to be a
> btrfs corruption issue on the Pine64 PinePhone:
>
> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>
> The kernel is 6.9.10, I wouldn't know what exact additional patches may
> be used by postmarketOS (which is based on Alpine). The device is the
> PinePhone revision 1.2a or newer https://wiki.pine64.org/wiki/
> PinePhone#Hardware_revisions sadly there doesn't seem to be a way to
> check in software if it's 1.2a or 1.2b, and I don't remember which it is.
>
> This is on an SD Card, so an inherently rather unreliable storage
> medium. However, I tried two cards from what I believe to be two
> different vendors, Lexar and SanDisk, and I'm seeing this with both.
>
> The PinePhone had various chipset instability issues before, like
> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I believe
> has however been fixed since. I have no idea if that's relevant, I'm
> just pointing it out. I also don't know if other filesystems, like ext4
> that I used before, might have also had corruption and just didn't
> detect it. Not that I ever noticed anything, but I'm not sure I
> necessarily ever would have.
>
> Regards,
>
> Ellie
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
2024-08-19 5:29 ` Qu Wenruo
2024-08-19 8:16 ` ellie
@ 2024-10-17 20:17 ` Ellie
1 sibling, 0 replies; 14+ messages in thread
From: Ellie @ 2024-10-17 20:17 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 8/19/24 7:29 AM, Qu Wenruo wrote:
>
>
> On 2024/8/19 13:28, ellie wrote:
>> Is there something else I could provide to help track this down? I
>> assume just because the file contents happen to be fine, doesn't mean
>> there wasn't corruption, like for example in the metadata. My apologies
>> for taking up your time.
>
> This means that, somehow, the data checksum itself is incorrect.
>
> This doesn't sound sane to me, so I can only come up with two possible reasons:
>
> 1. The checksum algorithm on the platform is insane
>    IIRC the SoC is pretty mature (although that also means old), so this
>    doesn't sound likely to me.
>
> 2. The memory hardware is faulty
>    Thus causing bitflips in the data csum.
>
> Other than the above two reasons, I unfortunately can not come up with
> any others.
>
> Thanks,
> Qu
>
I recently let a memtest run on this device, which didn't reveal
anything suspicious. However, this device was known to have memory
hiccups: https://forum.pine64.org/showthread.php?tid=9832&page=10 As far
as I know they were supposedly resolved, but I wouldn't be able to
judge. I would assume memtest would show them if still present, but
again I'm not sure.
The checksum errors seem to be permanent whenever they happen. I can
test this again if needed, but I'm pretty sure I recall rerunning btrfs
checks and the same error coming back up again. I can only make very
uninformed guesses about what this means, but I guess this could
imply there is a problem writing the metadata while the actual file is
written correctly.
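To confirm that a rerun really reports the same block, the fields of the scrub warning can be extracted and compared between runs. The sketch below is based only on the warning line quoted earlier in this thread (kernel 6.9.10). Other kernel versions may word the message differently, and the function names are illustrative. Lines wrapped by the mail client need to be joined back into one line first.

```python
import re

# Pattern derived from the single warning line quoted in this thread.
_CSUM_WARNING = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>[^,\s]+), "
    r"physical (?P<physical>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>.*)\)"
)

_INT_FIELDS = ("logical", "physical", "root", "inode", "offset", "length")


def parse_csum_warning(line):
    """Return the warning's fields as a dict, or None if the line differs."""
    match = _CSUM_WARNING.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in _INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


def same_block(a, b):
    """True if two parsed warnings refer to the same block of the same file."""
    return all(a[k] == b[k] for k in ("logical", "inode", "offset", "length"))
```

If two scrub runs yield warnings for which `same_block()` is true, the error is tied to that stored block and its checksum rather than to a transient read problem.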
I hope some of this is helpful for some ideas.
Regards,
Ellie
>>>>>>>> On 2024/8/5 15:25, ellie wrote:
>>>>>>>>> On 8/5/24 07:39, ellie wrote:
>>>>>>>>>> Dear kernel list,
>>>>>>>>>>
>>>>>>>>>> I'm hoping this is the right place to sent this. But there seems
>>>>>>>>>> to be
>>>>>>>>>> a btrfs corruption issue on the Pine64 PinePhone:
>>>>>>>>>>
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>>>>>>>>>
>>>>>>>>>> The kernel is 6.9.10, I wouldn't know what exact additional
>>>>>>>>>> patches
>>>>>>>>>> may be used by postmarketOS (which is based on Alpine). The
>>>>>>>>>> device is
>>>>>>>>>> the PinePhone revision 1.2a or newer https://wiki.pine64.org/
>>>>>>>>>> wiki/
>>>>>>>>>> PinePhone#Hardware_revisions sadly there doesn't seem to be a
>>>>>>>>>> way to
>>>>>>>>>> check in software if it's 1.2a or 1.2b, and I don't remember
>>>>>>>>>> which
>>>>>>>>>> it is.
>>>>>>>>>>
>>>>>>>>>> This is on an SD Card, so an inherently rather unreliable storage
>>>>>>>>>> medium. However, I tried two cards from what I believe to be two
>>>>>>>>>> different vendors, Lexar and SanDisk, and I'm seeing this with
>>>>>>>>>> both.
>>>>>>>>>>
>>>>>>>>>> The PinePhone had various chipset instability issues before, like
>>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I
>>>>>>>>>> believe
>>>>>>>>>> has however been fixed since. I have no idea if that's
>>>>>>>>>> relevant, I'm
>>>>>>>>>> just pointing it out. I also don't know if other filesystems,
>>>>>>>>>> like
>>>>>>>>>> ext4 that I used before, might have also had corruption and just
>>>>>>>>>> didn't detect it. Not that I ever noticed anything, but I'm not
>>>>>>>>>> sure I
>>>>>>>>>> necessarily ever would have.
>>>>>>>>
>>>>>>>> In the detailed report in pmOS issue, you mentioned it's a video
>>>>>>>> file.
>>>>>>>>
>>>>>>>> I'm wondering if all the corruptions you see are from video files,
>>>>>>>> especially if the video files are all recorded on the file.
>>>>>>>>
>>>>>>>> If that's the case, it may be related to the IO pattern,
>>>>>>>> especially if
>>>>>>>> the recording tool is using direct IO and didn't have proper
>>>>>>>> writeback
>>>>>>>> wait for those direct IO.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Qu
>>>>>>>>
>>>>>>>
>>>>>>> Thanks so much for the quick input!
>>>>>>>
>>>>>>> All the files I mentioned in bug reports were written by
>>>>>>> syncthing, so
>>>>>>> there wasn't any on-device video recording involved. I once saw
>>>>>>> Nheko's
>>>>>>> database file corrupt however, so it's apparently not limited to
>>>>>>> syncthing. I'm guessing video files are affected so often simply
>>>>>>> due to
>>>>>>> their large size.
>>>>>>
>>>>>> I did a quick clone and search of syncthing.
>>>>>>
>>>>>> There is no usage of O_DIRECT directly, so I guess it's not the known
>>>>>> csum mismatch caused by bad sync of direct IO writeback.
>>>>>>
>>>>>> In that case, since the corrupted file is syncthing synchronized, can
>>>>>> you do a diff of the binary data?
>>>>>>
>>>>>> To avoid the EIO from btrfs, you can use "-o rescue=all,ro" to
>>>>>> mount the
>>>>>> sdcard on another system, then compare the binary.
>>>>>> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")
>>>>>>
>>>>>> At this stage, we need to find out what's really causing the problem,
>>>>>> the btrfs itself or some thing lower level.
>>>>>> (I strongly hope it's not btrfs, but either way it's not going to
>>>>>> end up
>>>>>> well)
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>> Thanks for your detailed instructions! I was about to do as you said
>>>>> and
>>>>> ran the sync for a few hours, stopped it, and planned to run btrfs
>>>>> scrub
>>>>> this evening. However, I then ran into a hard shutdown due to what
>>>>> might
>>>>> be an upower bug (won't lie, was very annoyed at that point):
>>>>>
>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3073
>>>>>
>>>>> Should I still attach a diff for an affected file I find now? Or are
>>>>> the
>>>>> results going to be worthless if there was a hard shutdown in between,
>>>>> and I need to first fix the filesystem, repeat the sync test, and
>>>>> repeat
>>>>> finding a new corruption error to diff?
>>>>
>>>> As long as you didn't touch those files, and scrub still reports errors
>>>> on that file, the diff is still very helpful to provide some clue.
>>>>
>>>
>>> I finally had a new corrupted file pop up, this was actually after any
>>> unintended sudden shutdown so there shouldn't be any interference from
>>> that:
>>>
>>> [128958.860335] BTRFS error (device dm-0): unable to fixup (regular)
>>> error at logical 133906497536 on dev /dev/mapper/root physical
>>> 135089684480
>>> [128958.862548] BTRFS warning (device dm-0): checksum error at logical
>>> 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257,
>>> inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/
>>> Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)
>>>
>>> However, when manually mounting the file on the computer where it
>>> originates from and where the undamaged original file is:
>>>
>>> /mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/
>>> mapper/blamap p64
>>> /mnt # ls p64/
>>> ellie
>>> /mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\
>>> Amn\ \(2000\)/06\ City\ Gates.mp3 ./
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\
>>> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\
>>> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
>>> /mnt #
>>>
>>> It seems like file is exactly the same, which I assume isn't meant to
>>> happen.
>>>
>>> I'm not sure what that implies, but I hope it's helpful info!
>>>
>>> Regards,
>>>
>>> Ellie
>>>
>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* BTRFS hangs and causes semi-freezes on PinePhone
2024-10-02 7:20 ` ellie
@ 2024-12-16 22:53 ` Ellie
0 siblings, 0 replies; 14+ messages in thread
From: Ellie @ 2024-12-16 22:53 UTC (permalink / raw)
To: linux-btrfs
I've had to hard reset the device a few times because it became so slow
that I couldn't easily log in anymore. The culprit seems to be BTRFS read
performance, because when I manage to terminate read-heavy applications it
usually recovers, and I didn't have this problem with EXT4.
(It's not like the reads ever fully freeze, but they're so slow that
apps, including the lock screen, will hang for multiple seconds up to a
minute or so whenever they seem to try to access things on disk.)
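A rough way to quantify these hangs is to time individual reads of a file and record the slowest one. The sketch below is a minimal illustration with a made-up function name. It reads through the page cache, so a file that is already cached will look fast regardless of the filesystem; a large file that has not been read recently gives a more meaningful result.

```python
import time


def slowest_read(path, chunk_size=1 << 20):
    """Read a file in chunks; return (total bytes read, slowest read in seconds)."""
    worst = 0.0
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.monotonic()
            chunk = f.read(chunk_size)
            elapsed = time.monotonic() - start
            if not chunk:
                break
            total += len(chunk)
            worst = max(worst, elapsed)
    return total, worst
```

Running this on the same file while a read-heavy application is active, once on btrfs and once on EXT4, would show whether single 1 MiB reads really stall for seconds on one filesystem but not the other.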
Regards,
Ellie
On 10/2/24 9:20 AM, ellie wrote:
> An update: I've largely ignored the corruption issue for now, which is
> somewhat feasible since it doesn't seem to happen outside of large,
> sustained write loads.
>
> But there seems to be another, larger issue with btrfs on this device.
> When syncthing scans on all its threads, seemingly maxing out I/O (at
> least according to iotop), all other apps, including Phosh (the window
> manager), freeze every 5 seconds or so for long durations of 10+ seconds.
> It seems like syncthing's reads alone block vital reads needed for even
> basic continued operation of other processes. Something about its tuning
> seems to fundamentally not work on this low-spec hardware. With ext4, I
> had no such issues.
>
> Sorry if my rambling isn't useful, I'm not experienced at reporting
> filesystem hiccups.
>
> Regards,
>
> Ellie
>
> On 8/5/24 07:39, ellie wrote:
>> Dear kernel list,
>>
>> I'm hoping this is the right place to sent this. But there seems to be
>> a btrfs corruption issue on the Pine64 PinePhone:
>>
>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>
>> The kernel is 6.9.10, I wouldn't know what exact additional patches
>> may be used by postmarketOS (which is based on Alpine). The device is
>> the PinePhone revision 1.2a or newer https://wiki.pine64.org/wiki/
>> PinePhone#Hardware_revisions sadly there doesn't seem to be a way to
>> check in software if it's 1.2a or 1.2b, and I don't remember which it is.
>>
>> This is on an SD Card, so an inherently rather unreliable storage
>> medium. However, I tried two cards from what I believe to be two
>> different vendors, Lexar and SanDisk, and I'm seeing this with both.
>>
>> The PinePhone had various chipset instability issues before, like
>> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I believe
>> has however been fixed since. I have no idea if that's relevant, I'm
>> just pointing it out. I also don't know if other filesystems, like
>> ext4 that I used before, might have also had corruption and just
>> didn't detect it. Not that I ever noticed anything, but I'm not sure I
>> necessarily ever would have.
>>
>> Regards,
>>
>> Ellie
>>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-12-16 22:53 UTC | newest]
Thread overview: 14+ messages
2024-08-05 5:39 btrfs corruption issue on Pine64 PinePhone ellie
2024-08-05 5:55 ` ellie
2024-08-05 6:10 ` Qu Wenruo
2024-08-05 6:20 ` ellie
2024-08-05 6:34 ` Qu Wenruo
2024-08-06 16:02 ` ellie
2024-08-06 21:55 ` Qu Wenruo
2024-08-08 11:31 ` ellie
2024-08-19 3:58 ` ellie
2024-08-19 5:29 ` Qu Wenruo
2024-08-19 8:16 ` ellie
2024-10-17 20:17 ` Ellie
2024-10-02 7:20 ` ellie
2024-12-16 22:53 ` BTRFS hangs and causes semi-freezes on PinePhone Ellie