From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Juhyung Park <qkrwngud825@gmail.com>
Cc: Gao Xiang <xiang@kernel.org>,
linux-erofs@lists.ozlabs.org,
linux-f2fs-devel@lists.sourceforge.net,
linux-crypto@vger.kernel.org,
Yann Collet <yann.collet.73@gmail.com>
Subject: Re: Weird EROFS data corruption
Date: Mon, 4 Dec 2023 01:21:58 +0800
Message-ID: <649a3bc4-58bb-1dc8-85fb-a56e47b3d5c9@linux.alibaba.com>
In-Reply-To: <CAD14+f2G-buxTaWgb23DYW-HSd1sch6tJNKV2strt=toASZXQQ@mail.gmail.com>

On 2023/12/4 01:01, Juhyung Park wrote:
> Hi Gao,
>
> On Mon, Dec 4, 2023 at 1:52 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi Juhyung,
>>
>> On 2023/12/4 00:22, Juhyung Park wrote:
>>> (Cc'ing f2fs and crypto as I've noticed something similar with f2fs a
>>> while ago, which may mean that this is not specific to EROFS:
>>> https://lore.kernel.org/all/CAD14+f2nBZtLfLC6CwNjgCOuRRRjwzttp3D3iK4Of+1EEjK+cw@mail.gmail.com/
>>> )
>>>
>>> Hi.
>>>
>>> I'm encountering a very weird EROFS data corruption.
>>>
>>> I noticed when I build an EROFS image for AOSP development, the device
>>> would randomly not boot from a certain build.
>>> After inspecting the log, I noticed that a file got corrupted.
>>
>> Is it observed on your laptop (i7-1185G7), yes? or some other arm64
>> device?
>
> Yes, only on my laptop. The arm64 device seems fine.
> The reason that it would not boot was that the host machine (my
> laptop) was repacking the EROFS image wrongfully.
>
> The workflow is something like this:
> Server-built EROFS AOSP image -> Image copied to laptop -> Laptop
> mounts the EROFS image -> Copies the entire content to a scratch
> directory (CORRUPT!) -> Changes some files -> mkfs.erofs
>
> So the device is not responsible for the corruption, the laptop is.

Ok.

>
>>
>>>
>>> After adding a hash check during the build flow, I noticed that EROFS
>>> would randomly read data wrong.
>>>
>>> I now have a reliable method of reproducing the issue, but here's the
>>> funny/weird part: it's only happening on my laptop (i7-1185G7). This
>>> is not happening with my 128 cores buildfarm machine (Threadripper
>>> 3990X).
>>>
>>> I first suspected a hardware issue, but:
>>> a. The laptop had its motherboard replaced recently (due to a failing
>>> physical Type-C port).
>>> b. The laptop passes memory test (memtest86).
>>> c. This happens on all kernel versions from v5.4 to the latest v6.6
>>> including my personal custom builds and Canonical's official Ubuntu
>>> kernels.
>>> d. This happens on different host SSDs and file-system combinations.
>>> e. This only happens on LZ4. LZ4HC doesn't trigger the issue.
>>> f. This only happens when mounting the image natively by the kernel.
>>> Using fuse with erofsfuse is fine.
>>
>> I think it's a weird issue with inplace decompression because you said
>> it depends on the hardware. In addition, with your dataset sadly I
>> cannot reproduce on my local server (Xeon(R) CPU E5-2682 v4).
>
> As I feared. Bummer :(
>
>>
>> What is the difference between these two machines? Just a different CPU, or
>> do they have some other difference, like different compilers?
>
> I fully and exclusively control both devices, and the setup is almost the same.
> Same Ubuntu version, kernel/compiler version.
>
> But as I said, on my laptop, the issue happens on kernels that someone
> else (Canonical) built, so I don't think it matters.

The only thing I can say is that, compared to fuse, the kernel side
has optimized in-place decompression: it reuses the same buffer for
decompression, but with a safe margin (chosen according to the
current lz4 decompression implementation). It shouldn't behave
differently just because of a different CPU. Let me look for more
clues later; maybe we should also introduce a way for users to turn
this off if needed.
Thanks,
Gao Xiang