* btrfs corruption issue on Pine64 PinePhone
@ 2024-08-05 5:39 ellie
  2024-08-05 5:55 ` ellie
  2024-10-02 7:20 ` ellie
  0 siblings, 2 replies; 14+ messages in thread
From: ellie @ 2024-08-05 5:39 UTC (permalink / raw)
To: linux-btrfs

Dear kernel list,

I'm hoping this is the right place to send this. There seems to be a btrfs corruption issue on the Pine64 PinePhone:

https://gitlab.com/postmarketOS/pmaports/-/issues/3058

The kernel is 6.9.10. I don't know exactly which additional patches postmarketOS (which is based on Alpine) may be using. The device is the PinePhone revision 1.2a or newer (https://wiki.pine64.org/wiki/PinePhone#Hardware_revisions). Sadly, there doesn't seem to be a way to check in software whether it's 1.2a or 1.2b, and I don't remember which it is.

This is on an SD card, so an inherently rather unreliable storage medium. However, I tried two cards from what I believe to be two different vendors, Lexar and SanDisk, and I'm seeing this with both.

The PinePhone had various chipset instability issues before, like https://gitlab.com/postmarketOS/pmaports/-/issues/805, which I believe has been fixed since. I have no idea if that's relevant, I'm just pointing it out. I also don't know whether other filesystems, like the ext4 I used before, might have had corruption too and just didn't detect it. Not that I ever noticed anything, but I'm not sure I necessarily would have.

Regards,

Ellie

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-08-05 5:55 UTC
To: linux-btrfs

On 8/5/24 07:39, ellie wrote:
> This is on an SD Card, so an inherently rather unreliable storage
> medium. However, I tried two cards from what I believe to be two
> different vendors, Lexar and SanDisk, and I'm seeing this with both.

I forgot to specify one testing detail: reproducing this seems to require writing a couple of gigabytes to the SD card. That's an additional difficulty, since I assume doing that too often will simply kill the card for real, which limits how quickly and how often this can be tested.

Regards,

Ellie
* Re: btrfs corruption issue on Pine64 PinePhone
From: Qu Wenruo @ 2024-08-05 6:10 UTC
To: ellie, linux-btrfs

On 2024/8/5 15:25, ellie wrote:
> On 8/5/24 07:39, ellie wrote:
>> But there seems to be a btrfs corruption issue on the Pine64 PinePhone:
>>
>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058

In the detailed report in the pmOS issue, you mentioned it's a video file.

I'm wondering if all the corruptions you see are in video files, especially if the video files were all recorded on the device.

If that's the case, it may be related to the IO pattern, especially if the recording tool uses direct IO and doesn't properly wait for writeback of that direct IO.

Thanks,
Qu
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-08-05 6:20 UTC
To: Qu Wenruo, linux-btrfs

On 8/5/24 08:10, Qu Wenruo wrote:
> I'm wondering if all the corruptions you see are in video files,
> especially if the video files were all recorded on the device.
>
> If that's the case, it may be related to the IO pattern, especially if
> the recording tool uses direct IO and doesn't properly wait for
> writeback of that direct IO.

Thanks so much for the quick input!

All the files I mentioned in the bug reports were written by syncthing, so there wasn't any on-device video recording involved. However, I once saw Nheko's database file get corrupted, so it's apparently not limited to syncthing. I'm guessing video files are affected so often simply due to their large size.

Regards,

Ellie
* Re: btrfs corruption issue on Pine64 PinePhone
From: Qu Wenruo @ 2024-08-05 6:34 UTC
To: ellie, linux-btrfs

On 2024/8/5 15:50, ellie wrote:
> All the files I mentioned in bug reports were written by syncthing, so
> there wasn't any on-device video recording involved. I once saw Nheko's
> database file corrupt however, so it's apparently not limited to
> syncthing. I'm guessing video files are affected so often simply due to
> their large size.

I did a quick clone and search of syncthing.

There is no direct usage of O_DIRECT, so I guess it's not the known csum mismatch caused by bad synchronization of direct IO writeback.

In that case, since the corrupted file is synchronized by syncthing, can you do a diff of the binary data?

To avoid the EIO from btrfs, you can mount the SD card on another system with "-o rescue=all,ro", then compare the binaries.
(e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")

At this stage, we need to find out what's really causing the problem: btrfs itself or something at a lower level.
(I strongly hope it's not btrfs, but either way it's not going to end well)

Thanks,
Qu
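The binary comparison suggested above can also be done block by block, which shows directly which regions of the file differ instead of producing a long hex diff. The following is only a minimal sketch: the function name `differing_blocks` and the 4 KiB block size are assumptions made for illustration, and the two paths stand in for the good copy and the copy read from the rescue-mounted card.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed comparison granularity (4 KiB); adjust as needed


def differing_blocks(good_path, bad_path, block_size=BLOCK_SIZE):
    """Return the byte offsets of all blocks whose contents differ."""
    offsets = []
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        offset = 0
        while True:
            a = good.read(block_size)
            b = bad.read(block_size)
            if not a and not b:
                break  # both files are exhausted
            if a != b:
                offsets.append(offset)
            offset += block_size
    return offsets


if __name__ == "__main__":
    # Self-contained demo: two temporary files that differ in a single byte.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "file.good")
        bad = os.path.join(tmp, "file.bad")
        data = bytearray(3 * BLOCK_SIZE)
        with open(good, "wb") as f:
            f.write(data)
        data[5000] ^= 0x01  # flip one bit inside the second block
        with open(bad, "wb") as f:
            f.write(data)
        print(differing_blocks(good, bad))  # prints [4096]
```

An empty result means the two files are byte-for-byte identical, the same thing an empty `diff` of the two xxd dumps would indicate.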
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-08-06 16:02 UTC
To: Qu Wenruo, linux-btrfs

On 8/5/24 08:34, Qu Wenruo wrote:
> In that case, since the corrupted file is synchronized by syncthing, can
> you do a diff of the binary data?
>
> To avoid the EIO from btrfs, you can mount the SD card on another system
> with "-o rescue=all,ro", then compare the binaries.
> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd")

Thanks for your detailed instructions! I was about to do as you said: I ran the sync for a few hours, stopped it, and planned to run btrfs scrub this evening. However, I then ran into a hard shutdown due to what might be an upower bug (won't lie, I was very annoyed at that point):

https://gitlab.com/postmarketOS/pmaports/-/issues/3073

Should I still attach a diff for an affected file I find now? Or are the results going to be worthless because there was a hard shutdown in between, so that I first need to fix the filesystem, repeat the sync test, and find a new corruption error to diff?

Regards,

Ellie
* Re: btrfs corruption issue on Pine64 PinePhone
From: Qu Wenruo @ 2024-08-06 21:55 UTC
To: ellie, linux-btrfs

On 2024/8/7 01:32, ellie wrote:
> Should I still attach a diff for an affected file I find now? Or are the
> results going to be worthless if there was a hard shutdown in between,
> and I need to first fix the filesystem, repeat the sync test, and repeat
> finding a new corruption error to diff?

As long as you didn't touch those files, and scrub still reports errors on that file, the diff is still very helpful to provide some clue.

Thanks,
Qu
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-08-08 11:31 UTC
To: Qu Wenruo, linux-btrfs

On 8/6/24 23:55, Qu Wenruo wrote:
> As long as you didn't touch those files, and scrub still reports errors
> on that file, the diff is still very helpful to provide some clue.

I finally had a new corrupted file pop up. This was after the unintended sudden shutdown, so there shouldn't be any interference from that:

[128958.860335] BTRFS error (device dm-0): unable to fixup (regular) error at logical 133906497536 on dev /dev/mapper/root physical 135089684480
[128958.862548] BTRFS warning (device dm-0): checksum error at logical 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257, inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)

However, this is what I get when manually mounting the filesystem on the computer the file originates from, which still has the undamaged original:

/mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/mapper/blamap p64
/mnt # ls p64/
ellie
/mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3 ./
/mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
/mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3
/mnt #

It seems like the file is exactly the same, which I assume isn't meant to happen.

I'm not sure what that implies, but I hope it's helpful info!

Regards,

Ellie
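The scrub warning quoted in this message already names the file, the byte offset within it, and the length of the affected range. The sketch below pulls those fields out of such a line so the affected block can be located in both copies of the file. The regular expression is an assumption derived only from the single log line shown in this thread, not from a documented format, and `parse_csum_warning` is a hypothetical helper name.

```python
import re

# Pattern modelled on the one warning line quoted in this thread (an assumption,
# other kernel versions may word the message differently).
CSUM_WARNING = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>[^,]+), "
    r"physical (?P<physical>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links (?P<links>\d+) "
    r"\(path: (?P<path>.*)\)"
)

NUMERIC_FIELDS = ("logical", "physical", "root", "inode", "offset", "length", "links")


def parse_csum_warning(line):
    """Return the fields of a checksum warning as a dict, or None if no match."""
    match = CSUM_WARNING.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in NUMERIC_FIELDS:
        fields[name] = int(fields[name])
    return fields


SAMPLE = (
    "[128958.862548] BTRFS warning (device dm-0): checksum error at logical "
    "133906497536 on dev /dev/mapper/root, physical 135089684480, root 257, "
    "inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/"
    "Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3)"
)

if __name__ == "__main__":
    info = parse_csum_warning(SAMPLE)
    # File offset divided by the reported length gives the index of the block.
    print(info["path"], info["offset"], info["offset"] // info["length"])
```

For the line above, the reported range is the 4096 bytes starting at file offset 102400, so a targeted comparison only needs to look at that one block in each copy.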
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-08-19 3:58 UTC
To: Qu Wenruo, linux-btrfs

Is there something else I could provide to help track this down? I assume that just because the file contents happen to be fine doesn't mean there wasn't corruption, for example in the metadata. My apologies for taking up your time.

Regards,

Ellie

On 8/8/24 13:31, ellie wrote:
> It seems like the file is exactly the same, which I assume isn't meant to
> happen.
>
> I'm not sure what that implies, but I hope it's helpful info!
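btrfs keeps the checksum of each data block separately from the block itself, and scrub reports an error whenever the two disagree. A mismatch can therefore come from either side: damaged data, or a damaged checksum over intact data. The sketch below illustrates this with a small pure-Python CRC-32C. It assumes the filesystem uses the default crc32c checksum, which the thread does not state, and the names `crc32c` and `block_matches` are made up for this illustration. It does not read anything from a real btrfs filesystem.

```python
def _make_table():
    """Build the lookup table for CRC-32C (Castagnoli, reflected 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Compute the standard CRC-32C of a bytes-like object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def block_matches(block, stored_csum):
    """True if the checksum recomputed from the block equals the stored one."""
    return crc32c(block) == stored_csum


if __name__ == "__main__":
    block = bytes(4096)      # stand-in for one intact 4 KiB data block
    stored = crc32c(block)   # the checksum as it should have been recorded

    # Case 1: a bit flips in the data, the stored checksum is fine.
    damaged = bytearray(block)
    damaged[100] ^= 0x01
    print(block_matches(bytes(damaged), stored))  # False

    # Case 2: the data is intact, but a bit flipped in the stored checksum.
    print(block_matches(block, stored ^ 0x80))    # False
```

Both cases fail verification in the same way, but only the first would show up in a byte-wise comparison against a known-good copy of the file.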
* Re: btrfs corruption issue on Pine64 PinePhone
From: Qu Wenruo @ 2024-08-19 5:29 UTC
To: ellie, linux-btrfs

On 2024/8/19 13:28, ellie wrote:
> Is there something else I could provide to help track this down? I
> assume just because the file contents happen to be fine, doesn't mean
> there wasn't corruption, like for example in the metadata. My apologies
> for taking up your time.

This means that somehow the data checksum is incorrect.

This doesn't sound sane to me, so I can only come up with two possible reasons:

1. The checksum algorithm on the platform is broken
   IIRC the SoC is pretty mature (although that also means old), so this doesn't sound likely to me.

2. The memory hardware is faulty
   Thus causing bitflips in the data csum.

Other than these two, I unfortunately cannot come up with other reasons.

Thanks,
Qu

>
> Regards,
>
> Ellie
>
> On 8/8/24 13:31, ellie wrote:
>> On 8/6/24 23:55, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/8/7 01:32, ellie wrote:
>>>>
>>>>
>>>> On 8/5/24 08:34, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2024/8/5 15:50, ellie wrote:
>>>>>>
>>>>>>
>>>>>> On 8/5/24 08:10, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/8/5 15:25, ellie wrote:
>>>>>>>> On 8/5/24 07:39, ellie wrote:
>>>>>>>>> Dear kernel list,
>>>>>>>>>
>>>>>>>>> I'm hoping this is the right place to sent this. But there seems
>>>>>>>>> to be
>>>>>>>>> a btrfs corruption issue on the Pine64 PinePhone:
>>>>>>>>>
>>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3058
>>>>>>>>>
>>>>>>>>> The kernel is 6.9.10, I wouldn't know what exact additional
>>>>>>>>> patches
>>>>>>>>> may be used by postmarketOS (which is based on Alpine).
The >>>>>>>>> device is >>>>>>>>> the PinePhone revision 1.2a or newer https://wiki.pine64.org/wiki/ >>>>>>>>> PinePhone#Hardware_revisions sadly there doesn't seem to be a >>>>>>>>> way to >>>>>>>>> check in software if it's 1.2a or 1.2b, and I don't remember which >>>>>>>>> it is. >>>>>>>>> >>>>>>>>> This is on an SD Card, so an inherently rather unreliable storage >>>>>>>>> medium. However, I tried two cards from what I believe to be two >>>>>>>>> different vendors, Lexar and SanDisk, and I'm seeing this with >>>>>>>>> both. >>>>>>>>> >>>>>>>>> The PinePhone had various chipset instability issues before, like >>>>>>>>> https://gitlab.com/postmarketOS/pmaports/-/issues/805 which I >>>>>>>>> believe >>>>>>>>> has however been fixed since. I have no idea if that's >>>>>>>>> relevant, I'm >>>>>>>>> just pointing it out. I also don't know if other filesystems, like >>>>>>>>> ext4 that I used before, might have also had corruption and just >>>>>>>>> didn't detect it. Not that I ever noticed anything, but I'm not >>>>>>>>> sure I >>>>>>>>> necessarily ever would have. >>>>>>> >>>>>>> In the detailed report in pmOS issue, you mentioned it's a video >>>>>>> file. >>>>>>> >>>>>>> I'm wondering if all the corruptions you see are from video files, >>>>>>> especially if the video files are all recorded on the file. >>>>>>> >>>>>>> If that's the case, it may be related to the IO pattern, >>>>>>> especially if >>>>>>> the recording tool is using direct IO and didn't have proper >>>>>>> writeback >>>>>>> wait for those direct IO. >>>>>>> >>>>>>> Thanks, >>>>>>> Qu >>>>>>> >>>>>> >>>>>> Thanks so much for the quick input! >>>>>> >>>>>> All the files I mentioned in bug reports were written by >>>>>> syncthing, so >>>>>> there wasn't any on-device video recording involved. I once saw >>>>>> Nheko's >>>>>> database file corrupt however, so it's apparently not limited to >>>>>> syncthing. I'm guessing video files are affected so often simply >>>>>> due to >>>>>> their large size. 
>>>>> >>>>> I did a quick clone and search of syncthing. >>>>> >>>>> There is no usage of O_DIRECT directly, so I guess it's not the known >>>>> csum mismatch caused by bad sync of direct IO writeback. >>>>> >>>>> In that case, since the corrupted file is syncthing synchronized, can >>>>> you do a diff of the binary data? >>>>> >>>>> To avoid the EIO from btrfs, you can use "-o rescue=all,ro" to >>>>> mount the >>>>> sdcard on another system, then compare the binary. >>>>> (e.g. "xxd file.good > good.xxd; xxd file.bad > bad.xxd; diff *.xxd") >>>>> >>>>> At this stage, we need to find out what's really causing the problem, >>>>> the btrfs itself or some thing lower level. >>>>> (I strongly hope it's not btrfs, but either way it's not going to >>>>> end up >>>>> well) >>>>> >>>>> Thanks, >>>>> Qu >>>> Thanks for your detailed instructions! I was about to do as you said >>>> and >>>> ran the sync for a few hours, stopped it, and planned to run btrfs >>>> scrub >>>> this evening. However, I then ran into a hard shutdown due to what >>>> might >>>> be an upower bug (won't lie, was very annoyed at that point): >>>> >>>> https://gitlab.com/postmarketOS/pmaports/-/issues/3073 >>>> >>>> Should I still attach a diff for an affected file I find now? Or are >>>> the >>>> results going to be worthless if there was a hard shutdown in between, >>>> and I need to first fix the filesystem, repeat the sync test, and >>>> repeat >>>> finding a new corruption error to diff? >>> >>> As long as you didn't touch those files, and scrub still reports errors >>> on that file, the diff is still very helpful to provide some clue. 
>>> >> >> I finally had a new corrupted file pop up, this was actually after any >> unintended sudden shutdown so there shouldn't be any interference from >> that: >> >> [128958.860335] BTRFS error (device dm-0): unable to fixup (regular) >> error at logical 133906497536 on dev /dev/mapper/root physical >> 135089684480 >> [128958.862548] BTRFS warning (device dm-0): checksum error at logical >> 133906497536 on dev /dev/mapper/root, physical 135089684480, root 257, >> inode 331715, offset 102400, length 4096, links 1 (path: ellie/Music/ >> Baldur's Gate (2) II Shadows of Amn (2000)/06 City Gates.mp3) >> >> However, when manually mounting the file on the computer where it >> originates from and where the undamaged original file is: >> >> /mnt # mount -t btrfs -o rescue=all,ro,subvol=/@home,defaults /dev/ >> mapper/blamap p64 >> /mnt # ls p64/ >> ellie >> /mnt # cp p64/ellie/Music/Baldur\'s\ Gate\ \(2\)\ II\ Shadows\ of\ >> Amn\ \(2000\)/06\ City\ Gates.mp3 ./ >> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ >> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3 >> /mnt # diff 06\ City\ Gates.mp3 /home/ellie/Music/Baldur\'s\ Gate\ >> \(2\)\ II\ Shadows\ of\ Amn\ \(2000\)/06\ City\ Gates.mp3 >> /mnt # >> >> It seems like file is exactly the same, which I assume isn't meant to >> happen. >> >> I'm not sure what that implies, but I hope it's helpful info! >> >> Regards, >> >> Ellie >> > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-08-19 8:16 UTC
To: Qu Wenruo, linux-btrfs

On 8/19/24 07:29, Qu Wenruo wrote:
> This means that, somehow, the data checksum is incorrect.
>
> This doesn't sound sane to me, so I can only come up with two possible reasons:
>
> 1. The checksum algorithm on the platform is broken
>    IIRC the SoC is pretty mature (although that also means old), so this
>    doesn't sound likely to me.
>
> 2. The memory hardware is faulty
>    Thus causing a bitflip in the data csum.
>
> Other than these two reasons, I unfortunately cannot come up with any others.
>
> Thanks,
> Qu

Thanks so much for your reply! I was curious and did some more tests. At first glance, the memory seems to be doing fine (the device has 3GB, so what I can test from userspace is a bit limited):

# memtester 1024
memtester version 4.6.0 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got 1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 2:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 3:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 4:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok
* Re: btrfs corruption issue on Pine64 PinePhone
From: Ellie @ 2024-10-17 20:17 UTC
To: Qu Wenruo, linux-btrfs

On 8/19/24 7:29 AM, Qu Wenruo wrote:
> This means that, somehow, the data checksum is incorrect.
>
> This doesn't sound sane to me, so I can only come up with two possible reasons:
>
> 1. The checksum algorithm on the platform is broken
>    IIRC the SoC is pretty mature (although that also means old), so this
>    doesn't sound likely to me.
>
> 2. The memory hardware is faulty
>    Thus causing a bitflip in the data csum.
>
> Other than these two reasons, I unfortunately cannot come up with any others.
>
> Thanks,
> Qu

I let a memtest run on this device recently, and it didn't reveal anything suspicious. However, this device was known to have memory hiccups:

https://forum.pine64.org/showthread.php?tid=9832&page=10

As far as I know they were supposedly resolved, but I wouldn't be able to judge. I would assume memtest would show them if they were still present, but again I'm not sure.

The checksum errors seem to be permanent once they happen. I can test this again if needed, but I'm fairly sure I recall rerunning the btrfs checks and getting the same error again. I can only make uninformed guesses about what this means, but I suppose it could imply a problem writing the metadata while the actual file is written correctly.

I hope some of this is helpful for some ideas.

Regards,

Ellie
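Note for readers: a clean userspace memtest only covers the RAM it could lock, and a btrfs checksum mismatch with intact data does not show whether the storage path itself ever returns wrong bytes. One filesystem-independent check, which would also work on ext4 where no data checksums exist, is to write known data, read it back, and compare digests. The sketch below is an illustration under stated assumptions, not something proposed in the thread: `write_verify` is a hypothetical helper name, it assumes `head -c`, `tee`, `sha256sum`, and `cut` are available, and dropping the page cache (needed so the read-back comes from the card rather than RAM) requires root and is therefore opt-in.

```shell
#!/bin/sh
# Sketch only: write_verify is a hypothetical helper, not an existing tool.
# Usage: write_verify DIR SIZE_MB [drop]
# Writes SIZE_MB MiB of random data into DIR, reads it back, and prints
# "match" or "MISMATCH". Pass "drop" as the third argument (as root) to
# drop the page cache before the read-back.
write_verify() {
    dir=$1
    size_mb=$2
    drop=${3:-no}
    target="$dir/write_verify.$$"
    # Hash the data on its way to disk so no second copy is needed.
    written=$(head -c "$((size_mb * 1024 * 1024))" /dev/urandom \
        | tee "$target" | sha256sum | cut -d' ' -f1)
    sync
    if [ "$drop" = "drop" ]; then
        # Ignore failure: without root the cache simply stays warm.
        { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    fi
    read_back=$(sha256sum < "$target" | cut -d' ' -f1)
    rm -f "$target"
    if [ "$written" = "$read_back" ]; then
        echo match
    else
        echo MISMATCH
    fi
}
```

Since every run writes to the SD card, and the thread already notes that heavy test writes wear the card, a small size repeated a few times is probably a better trade-off than one multi-gigabyte pass.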
* Re: btrfs corruption issue on Pine64 PinePhone
From: ellie @ 2024-10-02 7:20 UTC
To: linux-btrfs

An update: I've largely ignored the corruption issue for now, which is somewhat feasible since it doesn't seem to happen outside of large, sustained write loads.

But there seems to be another, larger issue with btrfs on this device. When syncthing scans on all its threads, seemingly maxing out I/O (at least according to iotop), all other apps including Phosh (the window manager) freeze every 5 seconds or so for long stretches of 10+ seconds. It seems like syncthing's reads alone block reads that are vital for even basic continued operation of other processes. Something about the btrfs tuning seems to fundamentally not work on this low-spec hardware. With ext4, I had no such issues.

Sorry if my rambling isn't useful, I'm not experienced at reporting filesystem hiccups.

Regards,

Ellie
* BTRFS hangs and causes semi-freezes on PinePhone
From: Ellie @ 2024-12-16 22:53 UTC
To: linux-btrfs

I've had to hard reset the device a few times because it became so slow that I couldn't easily log in anymore. The culprit seems to be BTRFS read performance: when I manage to terminate read-heavy applications, the device usually recovers, and I didn't have this problem with EXT4.

(The reads never fully freeze, but they're so slow that apps, including the lock screen, hang for anywhere from multiple seconds up to a minute or so whenever they seem to try to access things on disk.)

Regards,

Ellie
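Note for readers: the semi-freezes described above look like other processes being starved of reads while one application saturates the SD card. One thing that could be checked, as an assumption rather than anything suggested in the thread, is which block I/O scheduler the card is using, since I/O priority classes set with `ionice` are honored by BFQ and may have little or no effect under other schedulers. The sketch below only extracts the active scheduler from the sysfs format, where the active entry is shown in square brackets. `active_scheduler` is a hypothetical helper name and `mmcblk0` is an assumed device name.

```shell
#!/bin/sh
# Sketch only: active_scheduler is a hypothetical helper.
# Reads the contents of /sys/block/<dev>/queue/scheduler on stdin and
# prints the active scheduler, i.e. the entry inside square brackets.
# Prints nothing if no bracketed entry is present.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Example (device name is an assumption, adjust to the actual SD card):
#   active_scheduler < /sys/block/mmcblk0/queue/scheduler
```

If the result is `bfq`, running the read-heavy process in the idle class, for example `ionice -c 3 -p PID` with the PID of syncthing, would be one experiment to see whether the lock screen and Phosh stay responsive. Whether this helps with the btrfs behavior reported here is untested.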
Thread overview: 14+ messages
2024-08-05  5:39 btrfs corruption issue on Pine64 PinePhone ellie
2024-08-05  5:55 ` ellie
2024-08-05  6:10 ` Qu Wenruo
2024-08-05  6:20 ` ellie
2024-08-05  6:34 ` Qu Wenruo
2024-08-06 16:02 ` ellie
2024-08-06 21:55 ` Qu Wenruo
2024-08-08 11:31 ` ellie
2024-08-19  3:58 ` ellie
2024-08-19  5:29 ` Qu Wenruo
2024-08-19  8:16 ` ellie
2024-10-17 20:17 ` Ellie
2024-10-02  7:20 ` ellie
2024-12-16 22:53 ` BTRFS hangs and causes semi-freezes on PinePhone Ellie