From: "HAGIO KAZUHITO(萩尾 一仁)" <k-hagio-ab@nec.com>
To: Tao Liu <ltao@redhat.com>
Cc: "YAMAZAKI MASAMITSU(山崎 真光)" <yamazaki-msmt@nec.com>,
"kexec@lists.infradead.org" <kexec@lists.infradead.org>,
"sourabhjain@linux.ibm.com" <sourabhjain@linux.ibm.com>
Subject: Re: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N)
Date: Wed, 2 Jul 2025 00:13:15 +0000
Message-ID: <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com>
In-Reply-To: <CAO7dBbXpkshxguPbHDA2YFZeYsDAbyswBCQ7x0UKY__4H1K2jQ@mail.gmail.com>

On 2025/07/01 16:59, Tao Liu wrote:
> Hi Kazu,
>
> Thanks for your comments!
>
> On Tue, Jul 1, 2025 at 7:38 PM HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab@nec.com> wrote:
>>
>> Hi Tao,
>>
>> thank you for the patch.
>>
>> On 2025/06/25 11:23, Tao Liu wrote:
>>> A vmcore corruption issue has been noticed on the powerpc arch [1]. It
>>> can be reproduced with upstream makedumpfile.
>>>
>>> When analyzing the corrupt vmcore using crash, the following error
>>> messages are output:
>>>
>>> crash: compressed kdump: uncompress failed: 0
>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type:
>>> "hardirq thread_union"
>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000
>>> crash: compressed kdump: uncompress failed: 0
>>>
>>> If the vmcore is generated without the --num-threads option, no such
>>> errors are seen.
>>>
>>> With --num-threads=N enabled, N sub-threads are created. All
>>> sub-threads are producers, which are responsible for mm page
>>> processing, e.g. compression. The main thread is the consumer, which is
>>> responsible for writing the compressed data into the file.
>>> page_flag_buf->ready is used to sync the main thread and the
>>> sub-threads. When a sub-thread finishes page processing, it sets the
>>> ready flag to FLAG_READY. In the meantime, the main thread loops over
>>> the ready flags of all threads and breaks out of the loop when it finds
>>> FLAG_READY.
>>
>> I've tried to reproduce the issue, but I couldn't on x86_64.
>
> Yes, I cannot reproduce it on x86_64 either, but the issue is very
> easily reproduced on the ppc64 arch, which is where our QE reported it.
> Recently we enabled --num-threads=N in RHEL by default, with N ==
> nr_cpus in the 2nd kernel, so QE noticed the issue.

I see, thank you for the information.

>
>>
>> Do you have any possible scenario that breaks a vmcore? I could not
>> think of one just by looking at the code.
>
> I guess the issue has only been observed on ppc because of ppc's
> memory model, multi-thread scheduling algorithm, etc. I'm not an expert
> on those, so I cannot give a clear explanation, sorry...

OK, I can't think of a good way to debug this either.

>
> page_flag_buf->ready is an integer that is read and written by the
> main thread and the sub-threads simultaneously. An assignment such as
> page_flag_buf->ready = 1 might be compiled into several assembly
> instructions. Without atomic r/w (memory) protection, reads and writes
> might race within those few instructions, which causes the data
> inconsistency. Frankly, ppc assembly consists of more instructions
> than x86_64 for the same C code, which increases the likelihood of a
> data race.
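
For illustration, here is a minimal sketch of the ready-flag handshake
described above, written with C11 atomics. It assumes the flag is
published with release semantics and polled with acquire semantics, so
that the data written before the flag is visible once FLAG_READY is
seen. struct slot, payload, FLAG_UNUSED, producer() and run_handshake()
are hypothetical names for this sketch, not makedumpfile code, and this
is not necessarily how the patch fixes the issue.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_THREADS 16

/* Hypothetical flag values; only FLAG_READY is named in the thread. */
enum { FLAG_UNUSED = 0, FLAG_READY = 1 };

struct slot {
	long payload;      /* stands in for the compressed page data */
	atomic_int ready;  /* handshake flag between producer and consumer */
};

struct producer_arg {
	struct slot *slot;
	long count;
};

/* Sub-thread: produce count items (1..count), one at a time. */
static void *producer(void *p)
{
	struct producer_arg *arg = p;
	struct slot *s = arg->slot;

	for (long i = 1; i <= arg->count; i++) {
		/* Wait until the consumer has handed the slot back. */
		while (atomic_load_explicit(&s->ready,
					    memory_order_acquire) != FLAG_UNUSED)
			sched_yield();
		s->payload = i;
		/* Release: the payload store is visible before the flag. */
		atomic_store_explicit(&s->ready, FLAG_READY,
				      memory_order_release);
	}
	return NULL;
}

/* Main thread: poll all slots and sum up every payload it consumes. */
long run_handshake(int nthreads, long count)
{
	struct slot slots[MAX_THREADS];
	struct producer_arg args[MAX_THREADS];
	pthread_t tids[MAX_THREADS];
	long remaining = (long)nthreads * count;
	long sum = 0;

	assert(nthreads > 0 && nthreads <= MAX_THREADS);

	for (int i = 0; i < nthreads; i++) {
		slots[i].payload = 0;
		atomic_init(&slots[i].ready, FLAG_UNUSED);
		args[i].slot = &slots[i];
		args[i].count = count;
		if (pthread_create(&tids[i], NULL, producer, &args[i]) != 0)
			abort();
	}

	while (remaining > 0) {
		int progressed = 0;

		for (int i = 0; i < nthreads; i++) {
			/* Acquire: pairs with the producer's release store. */
			if (atomic_load_explicit(&slots[i].ready,
						 memory_order_acquire) != FLAG_READY)
				continue;
			sum += slots[i].payload;
			atomic_store_explicit(&slots[i].ready, FLAG_UNUSED,
					      memory_order_release);
			remaining--;
			progressed = 1;
		}
		if (!progressed)
			sched_yield();
	}

	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return sum;
}
```

Built with something like "cc -std=c11 -pthread", run_handshake(4, 1000)
should return 4 * (1000 * 1001 / 2). With a plain int flag and no
ordering, a weakly ordered CPU is allowed to make the flag store visible
before the payload store, which is one way the consumer could read stale
data.
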
>
> We can observe the issue without the help of crash: generate two
> vmcores from the same core file, one compressed with and one without
> the --num-threads option, and compare them with "cmp vmcore1 vmcore2".
> cmp reports differing bytes between the 2 vmcores, which is
> unexpected.
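
The same check can be sketched in C when cmp is not at hand. This is
only an illustration of the byte-for-byte comparison; first_difference()
is a hypothetical helper, not part of makedumpfile or crash.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the 0-based offset of the first byte at which the two streams
 * differ (a stream ending early counts as a difference), or -1 if they
 * are identical. Note that cmp(1) reports 1-based byte numbers.
 */
long long first_difference(FILE *a, FILE *b)
{
	long long offset = 0;

	assert(a != NULL && b != NULL);

	for (;;) {
		int ca = fgetc(a);
		int cb = fgetc(b);

		if (ca != cb)
			return offset;  /* differing byte, or one file is shorter */
		if (ca == EOF)
			return -1;      /* both ended at the same point */
		offset++;
	}
}
```

To use it, open both dump files with fopen(path, "rb") and pass the two
streams; a return value other than -1 means the outputs differ.
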
>
>>
>> And just out of curiosity, is the issue also reproduced with
>> makedumpfile compiled with -O0?
>
> Sorry, I haven't done the -O0 experiment. I can do it tomorrow and
> share my findings...

Thanks. We have to fix this anyway, so I'd like a clue to help think
through a possible scenario.

Thanks,
Kazu