From: "HAGIO KAZUHITO(萩尾 一仁)" <k-hagio-ab@nec.com>
To: Tao Liu <ltao@redhat.com>
Cc: "YAMAZAKI MASAMITSU(山崎 真光)" <yamazaki-msmt@nec.com>,
"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"sourabhjain@linux.ibm.com" <sourabhjain@linux.ibm.com>
Subject: Re: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N)
Date: Wed, 2 Jul 2025 00:13:15 +0000
Message-ID: <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com>
In-Reply-To: <CAO7dBbXpkshxguPbHDA2YFZeYsDAbyswBCQ7x0UKY__4H1K2jQ@mail.gmail.com>

On 2025/07/01 16:59, Tao Liu wrote:
> Hi Kazu,
>
> Thanks for your comments!
>
> On Tue, Jul 1, 2025 at 7:38 PM HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab@nec.com> wrote:
>>
>> Hi Tao,
>>
>> Thank you for the patch.
>>
>> On 2025/06/25 11:23, Tao Liu wrote:
>>> A vmcore corruption issue has been noticed on the powerpc arch [1]. It
>>> can be reproduced with upstream makedumpfile.
>>>
>>> When analyzing the corrupt vmcore using crash, the following error
>>> messages are output:
>>>
>>> crash: compressed kdump: uncompress failed: 0
>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type:
>>> "hardirq thread_union"
>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000
>>> crash: compressed kdump: uncompress failed: 0
>>>
>>> If the vmcore is generated without the --num-threads option, no such
>>> errors are seen.
>>>
>>> With --num-threads=N enabled, N sub-threads are created. All
>>> sub-threads are producers, which are responsible for mm page
>>> processing, e.g. compression. The main thread is the consumer, which is
>>> responsible for writing the compressed data to the file.
>>> page_flag_buf->ready is used to synchronize the main thread and the
>>> sub-threads. When a sub-thread finishes processing a page, it sets the
>>> ready flag to FLAG_READY. Meanwhile, the main thread repeatedly checks
>>> the ready flags of all threads and breaks out of the loop when it finds
>>> FLAG_READY.
>>
>> I've tried to reproduce the issue, but I couldn't on x86_64.
>
> Yes, I cannot reproduce it on x86_64 either, but the issue is very
> easily reproduced on the ppc64 arch, which is where our QE reported it.
> Recently we have enabled --num-threads=N in RHEL by default, with N ==
> nr_cpus in the 2nd kernel, so QE noticed the issue.

I see, thank you for the information.

>
>>
>> Do you have any possible scenario in mind that breaks a vmcore? I could
>> not think of one just by looking at the code.
>
> I guess the issue has only been observed on ppc because of ppc's
> memory model, multi-thread scheduling algorithm, etc. I'm not an expert
> on those, so I cannot give a clear explanation, sorry...

OK, I can't think of a good way to debug this either.

>
> page_flag_buf->ready is an integer that is read and written by the main
> thread and the sub-threads simultaneously. An assignment such as
> page_flag_buf->ready = 1 might be compiled into several assembly
> instructions. Without atomic r/w (memory) protection, reads and writes
> might race within those few instructions, causing data inconsistency.
> Frankly, the ppc assembly consists of more instructions than x86_64 for
> the same C code, which increases the likelihood of a data race.
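
For illustration only (this is not code from the patch or from makedumpfile): a minimal sketch of a ready-flag handshake between one producer thread and one consumer, written with C11 atomics. It assumes the concern is that the payload and the flag must become visible in a defined order, which a plain int does not guarantee. The names struct slot, SLOT_BYTES, run_handshake() and the value of FLAG_UNUSED are hypothetical; only FLAG_READY is named in the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>

#define FLAG_UNUSED 0   /* hypothetical value, chosen for this sketch */
#define FLAG_READY  1   /* name taken from the thread; value assumed */
#define SLOT_BYTES  64  /* hypothetical payload size */

struct slot {
    atomic_int ready;               /* handshake flag */
    unsigned char data[SLOT_BYTES]; /* payload guarded by the flag */
};

struct producer_arg {
    struct slot *slot;
    int n_pages;
};

/* Producer: fill the payload, then publish it with a release store. */
static void *producer(void *p)
{
    struct producer_arg *arg = p;
    struct slot *s = arg->slot;

    for (int i = 0; i < arg->n_pages; i++) {
        /* Wait until the consumer has handed the slot back. */
        while (atomic_load_explicit(&s->ready, memory_order_acquire)
               != FLAG_UNUSED)
            sched_yield();
        memset(s->data, (unsigned char)i, SLOT_BYTES);
        /* Release: payload writes are ordered before the flag store. */
        atomic_store_explicit(&s->ready, FLAG_READY, memory_order_release);
    }
    return NULL;
}

/*
 * Consumer: returns the number of pages whose payload did not match what
 * the producer wrote for that page, or -1 if the thread could not start.
 */
int run_handshake(int n_pages)
{
    struct slot s;
    struct producer_arg arg = { &s, n_pages };
    pthread_t tid;
    int bad = 0;

    assert(n_pages > 0);
    atomic_init(&s.ready, FLAG_UNUSED);
    if (pthread_create(&tid, NULL, producer, &arg) != 0)
        return -1;

    for (int i = 0; i < n_pages; i++) {
        /* Acquire: payload reads are ordered after seeing FLAG_READY. */
        while (atomic_load_explicit(&s.ready, memory_order_acquire)
               != FLAG_READY)
            sched_yield();
        for (int j = 0; j < SLOT_BYTES; j++) {
            if (s.data[j] != (unsigned char)i) {
                bad++;
                break;
            }
        }
        /* Hand the slot back to the producer. */
        atomic_store_explicit(&s.ready, FLAG_UNUSED, memory_order_release);
    }
    pthread_join(tid, NULL);
    return bad;
}
```

With the release/acquire pair in place, run_handshake() is expected to return 0 on any architecture. Whether missing ordering of this kind is what actually corrupts the vmcore on ppc64 is not established in this thread.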
>
> We can observe the issue without the help of crash: generate two
> vmcores from the same core file, one with and one without the
> --num-threads option, and compare them with "cmp vmcore1 vmcore2".
> cmp reports that bytes differ between the 2 vmcores, which is
> unexpected.
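
For illustration only: a minimal C sketch of the same kind of byte-wise check that cmp performs, which could be handy in a test harness. The function name first_difference() and its return convention are hypothetical. Unlike cmp, which reports 1-based byte numbers, this sketch returns a 0-based offset.

```c
#include <stdio.h>

/*
 * Compare two files byte by byte.
 * Returns the 0-based offset of the first differing byte (a file ending
 * early counts as a difference), -1 if the files are identical, or -2 if
 * either file cannot be opened.
 */
long first_difference(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    long offset = 0;
    long result = -1;

    if (!a || !b) {
        result = -2;
        goto out;
    }
    for (;;) {
        int ca = getc(a);
        int cb = getc(b);

        if (ca != cb) {     /* also true when only one file hit EOF */
            result = offset;
            break;
        }
        if (ca == EOF)      /* both files ended at the same point */
            break;
        offset++;
    }
out:
    if (a)
        fclose(a);
    if (b)
        fclose(b);
    return result;
}
```

For vmcores larger than LONG_MAX bytes on a given platform, a wider offset type would be needed; the sketch ignores that case.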
>
>>
>> And just out of curiosity, is the issue also reproduced with
>> makedumpfile compiled with -O0?
>
> Sorry, I haven't done the -O0 experiment yet. I can do it tomorrow and
> share my findings...

Thanks. We have to fix this anyway, but I'd like a clue to help me think
through a possible scenario.

Thanks,
Kazu