From: Sourabh Jain <sourabhjain@linux.ibm.com>
To: Eric DeVolder <eric.devolder@oracle.com>,
Thomas Gleixner <tglx@linutronix.de>,
linux-kernel@vger.kernel.org, x86@kernel.org,
kexec@lists.infradead.org, ebiederm@xmission.com,
dyoung@redhat.com, bhe@redhat.com, vgoyal@redhat.com
Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
hpa@zytor.com, nramas@linux.microsoft.com,
thomas.lendacky@amd.com, robh@kernel.org, efault@gmx.de,
rppt@kernel.org, david@redhat.com, konrad.wilk@oracle.com,
boris.ostrovsky@oracle.com
Subject: Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
Date: Fri, 24 Feb 2023 14:04:54 +0530
Message-ID: <fc9f8ba4-967a-6e48-df73-67427c145141@linux.ibm.com>
In-Reply-To: <ee58fcfa-1734-2992-d6b7-83f365a9e16a@oracle.com>
On 24/02/23 02:04, Eric DeVolder wrote:
>
>
> On 2/10/23 00:29, Sourabh Jain wrote:
>>
>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>
>>>
>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>> Hello Eric,
>>>>
>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>> Eric!
>>>>>>
>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>
>>>>>>> So my latest solution is introduce two new CPUHP states,
>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm
>>>>>>> open to better names.
>>>>>>>
>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after
>>>>>>> CPUHP_BRINGUP_CPU. My
>>>>>>> attempts at locating this state failed when inside the STARTING
>>>>>>> section, so I located
>>>>>>> this just inside the ONLINE section. The crash hotplug handler
>>>>>>> is registered on
>>>>>>> this state as the callback for the .startup method.
>>>>>>>
>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before
>>>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>>>> placed it at the end of the PREPARE section. This crash hotplug
>>>>>>> handler is also
>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>
>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>
>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>> {
>>>>>> 	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>
>>>>>> 	return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>> }
>>>>>>
>>>>>> and use this to query the actual state at crash time. That spares
>>>>>> all
>>>>>> those callback heuristics.
>>>>>>
>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr,
>>>>>>> vmcoreinfo,
>>>>>>> makedumpfile and (the consumer of it all) the userspace crash
>>>>>>> utility,
>>>>>>> in order to understand the impact of moving from
>>>>>>> for_each_present_cpu()
>>>>>>> to for_each_online_cpu().
>>>>>>
>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> tglx
>>>>>>
>>>>>>
>>>>>
>>>>> Thomas,
>>>>> I've investigated the passing of crash notes through the vmcore.
>>>>> What I've learned is that:
>>>>>
>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its
>>>>> job) does
>>>>> not care what the contents of cpu PT_NOTES are, but it does
>>>>> coalesce them together.
>>>>>
>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to
>>>>> determine its
>>>>> nr_cpus variable, which is reported in a header, but otherwise
>>>>> unused (except
>>>>> for sadump method).
>>>>>
>>>>> - the crash utility, for the purposes of determining the cpus,
>>>>> does not appear to
>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from
>>>>> that, and also of
>>>>> course which are online. In addition, when crash does reference
>>>>> the cpu PT_NOTE,
>>>>> to get its prstatus, it does so by using a percpu technique
>>>>> directly in the vmcore
>>>>> image memory, not via the ELF structure. Said differently, it
>>>>> appears to me that
>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather
>>>>> it obtains them
>>>>> via kernel cpumasks and the memory within the vmcore.
>>>>>
>>>>> With this understanding, I did some testing. Perhaps the most
>>>>> telling test was that I
>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>> crash_prepare_elf64_headers() to just 1,
>>>>> hot plugged some cpus, then also took a few offline sparsely via
>>>>> chcpu, then generated a
>>>>> vmcore. The crash utility had no problem loading the vmcore, it
>>>>> reported the proper number
>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and
>>>>> changing to a different
>>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>>
>>>>> My take away is that crash utility does not rely upon ELF cpu
>>>>> PT_NOTEs, it obtains the
>>>>> cpu information directly from kernel data structures. Perhaps at
>>>>> one time crash relied
>>>>> upon the ELF information, but no more. (Perhaps there are other
>>>>> crash dump analyzers
>>>>> that might rely on the ELF info?)
>>>>>
>>>>> So, all this to say that I see no need to change
>>>>> crash_prepare_elf64_headers(). There
>>>>> is no compelling reason to move away from for_each_present_cpu(),
>>>>> or modify the list for
>>>>> online/offline.
>>>>>
>>>>> Which then leaves the topic of the cpuhp state on which to
>>>>> register. Perhaps reverting
>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There
>>>>> does not appear to
>>>>> be a compelling need to accurately track whether the cpu went
>>>>> online/offline for the
>>>>> purposes of creating the elfcorehdr, as ultimately the crash
>>>>> utility pulls that from
>>>>> kernel data structures, not the elfcorehdr.
>>>>>
>>>>> I think this is what Sourabh has known and has been advocating for
>>>>> an optimization
>>>>> path that allows not regenerating the elfcorehdr on cpu changes
>>>>> (because all the percpu
>>>>> structs are all laid out). I do think it best to leave that as an
>>>>> arch choice.
>>>>
>>>> Since things are clear on how the PT_NOTES are consumed in kdump
>>>> kernel [fs/proc/vmcore.c],
>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>
>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>> If yes, can you please list the elfcorehdr components that change
>>>> due to CPU hotplug.
>>> Due to the use of for_each_present_cpu(), it is possible for the
>>> number of cpu PT_NOTEs
>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does
>>> not impact the
>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>
>>>>
>>>> From what I understood, crash notes are prepared for possible CPUs
>>>> as system boots and
>>>> could be used to create a PT_NOTE section for each possible CPU
>>>> while generating the elfcorehdr
>>>> during the kdump kernel load.
>>>>
>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible
>>>> CPU there is no need to
>>>> regenerate it for CPU hotplug events. Or do we?
>>>
>>> For onlining/offlining of cpus, there is no need to regenerate the
>>> elfcorehdr. However,
>>> for actual hot un/plug of cpus, the answer is yes due to
>>> for_each_present_cpu(). The
>>> caveat here of course is that if crash utility is the only coredump
>>> analyzer of concern,
>>> then it doesn't care about these cpu PT_NOTEs and there would be no
>>> need to re-generate them.
>>>
>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into
>>> mainstream, impacts
>>> any of this.
>>>
>>> Perhaps the one item that might help here is to distinguish between
>>> actual hot un/plug of
>>> cpus, versus onlining/offlining. At the moment, I can not
>>> distinguish between a hot plug
>>> event and an online event (and unplug/offline). If those were
>>> distinguishable, then we
>>> could only regenerate on un/plug events.
>>>
>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>
>> Yes, because once elfcorehdr is built with possible CPUs we don't
>> have to worry about
>> hot[un]plug case.
>>
>> Here is my view on how things should be handled if a core-dump
>> analyzer is dependent on
>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>
>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash
>> notes (kernel has
>> one crash note per CPU for every possible CPU). Though the crash
>> notes are allocated
>> during the boot time they are populated when the system is on the
>> crash path.
>>
>> This is how crash notes are populated on PowerPC and I am expecting
>> it would be something
>> similar on other architectures too.
>>
>> The crashing CPU sends IPI to every other online CPU with a callback
>> function that updates the
>> crash notes of that specific CPU. Once the IPI completes the crashing
>> CPU updates its own crash
>> note and proceeds further.
>>
>> The crash notes of CPUs remain uninitialized if the CPUs were offline
>> or hot unplugged at the time of the
>> system crash. The core-dump analyzer should be able to identify
>> initialized and uninitialized crash notes
>> and display the information accordingly.
>>
>> Thoughts?
>>
>> - Sourabh
>
> I've been examining what it would mean to move to
> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it
> means:
>
> - Changing for_each_present_cpu() to for_each_possible_cpu() in
> crash_prepare_elf64_headers().
> - For kexec_load() syscall path, rewrite the incoming/supplied
> elfcorehdr immediately on the load with the elfcorehdr generated by
> crash_prepare_elf64_headers().
> - Eliminate/remove the cpuhp machinery for handling crash hotplug events.
If for_each_present_cpu() is replaced with for_each_possible_cpu(), I
will still need the cpuhp machinery to update the FDT kexec segment in
the CPU hot-add case.
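
To illustrate why the elfcorehdr itself would stop depending on hotplug
with for_each_possible_cpu(), here is a minimal userspace sketch. It is
not kernel code: struct cpu_masks, NR_CPUS_SKETCH and the helper names
are made up for illustration and only model the possible and present
masks as plain arrays. The note count over present CPUs moves when a CPU
is hot added, while the count over possible CPUs stays fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 8	/* hypothetical number of possible CPUs */

/* Toy stand-ins for cpu_possible_mask and cpu_present_mask. */
struct cpu_masks {
	bool possible[NR_CPUS_SKETCH];
	bool present[NR_CPUS_SKETCH];
};

/* Assumed boot state: 8 possible CPUs, the first 4 present. */
struct cpu_masks sketch_boot_masks(void)
{
	struct cpu_masks m;

	for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		m.possible[cpu] = true;
		m.present[cpu] = cpu < 4;
	}
	return m;
}

/* Per-CPU PT_NOTE count when iterating over present CPUs. */
size_t count_notes_present(const struct cpu_masks *m)
{
	size_t n = 0;

	assert(m != NULL);
	for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (m->present[cpu])
			n++;
	return n;
}

/* Per-CPU PT_NOTE count when iterating over possible CPUs. */
size_t count_notes_possible(const struct cpu_masks *m)
{
	size_t n = 0;

	assert(m != NULL);
	for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (m->possible[cpu])
			n++;
	return n;
}

/* Hot add: a possible CPU becomes present. Fails for an invalid CPU. */
bool hot_add_cpu(struct cpu_masks *m, size_t cpu)
{
	assert(m != NULL);
	if (cpu >= NR_CPUS_SKETCH || !m->possible[cpu])
		return false;
	m->present[cpu] = true;
	return true;
}
```

Calling hot_add_cpu() on the boot masks changes what
count_notes_present() returns but not what count_notes_possible()
returns, which is the property that makes regeneration of the
elfcorehdr unnecessary. It says nothing about other kexec segments such
as the FDT.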
>
> This would then set up PT_NOTEs for all possible cpus, which should in
> theory accommodate crash analyzers that rely on ELF PT_NOTEs for
> crash_notes.
>
> If staying with for_each_present_cpu() is ultimately decided, then I
> think leaving the cpuhp machinery in place and each arch could decide
> how to handle crash cpu hotplug events. The overhead for doing this is
> very minimal, and the events are likely very infrequent.
I agree. Some architectures may need the cpuhp machinery to update kexec
segment[s] other than the elfcorehdr, for example the FDT on PowerPC.
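
As a rough sketch of what "each arch decides" could look like, the
following userspace C models the choice as a function from a CPU
hotplug event to the set of kexec segments that need refreshing. All
names here (hp_event, seg_flags, arch_sketch, segments_to_update()) are
hypothetical and do not exist in the kernel. The policy encoded is only
what is discussed in this thread: iterating present CPUs means the
elfcorehdr changes on hot add and remove, and an arch with an FDT
segment needs it updated on hot add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers, for illustration only. */
enum hp_event { HP_CPU_ADD, HP_CPU_REMOVE };

enum seg_flags {
	SEG_NONE       = 0,
	SEG_ELFCOREHDR = 1 << 0,
	SEG_FDT        = 1 << 1,
};

enum note_policy { NOTES_PRESENT_CPUS, NOTES_POSSIBLE_CPUS };

struct arch_sketch {
	enum note_policy notes;	/* which CPU mask the PT_NOTEs cover */
	int has_fdt;		/* arch passes an FDT kexec segment */
};

/* Return the segments an arch would refresh for a CPU hotplug event. */
unsigned int segments_to_update(const struct arch_sketch *arch,
				enum hp_event ev)
{
	unsigned int segs = SEG_NONE;

	assert(arch != NULL);

	/* PT_NOTE count only fluctuates when built from present CPUs. */
	if (arch->notes == NOTES_PRESENT_CPUS)
		segs |= SEG_ELFCOREHDR;

	/* Assumption taken from this thread: FDT update on hot add only. */
	if (arch->has_fdt && ev == HP_CPU_ADD)
		segs |= SEG_FDT;

	return segs;
}
```

With { NOTES_POSSIBLE_CPUS, 1 } the function returns SEG_FDT for
HP_CPU_ADD and SEG_NONE for HP_CPU_REMOVE, so the hotplug callback is
still needed even though the elfcorehdr is never regenerated.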
- Sourabh Jain