From: Eric DeVolder <eric.devolder@oracle.com>
To: Sourabh Jain <sourabhjain@linux.ibm.com>,
Thomas Gleixner <tglx@linutronix.de>,
linux-kernel@vger.kernel.org, x86@kernel.org,
kexec@lists.infradead.org, ebiederm@xmission.com,
dyoung@redhat.com, bhe@redhat.com, vgoyal@redhat.com
Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
hpa@zytor.com, nramas@linux.microsoft.com,
thomas.lendacky@amd.com, robh@kernel.org, efault@gmx.de,
rppt@kernel.org, david@redhat.com, konrad.wilk@oracle.com,
boris.ostrovsky@oracle.com
Subject: Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
Date: Tue, 28 Feb 2023 15:50:39 -0600 [thread overview]
Message-ID: <4fbdbb3a-dcab-cbf1-0d24-a7be1858b048@oracle.com> (raw)
In-Reply-To: <8aa21721-6951-80d1-ae20-db4198b743ff@linux.ibm.com>
On 2/27/23 00:11, Sourabh Jain wrote:
>
> On 25/02/23 01:46, Eric DeVolder wrote:
>>
>>
>> On 2/24/23 02:34, Sourabh Jain wrote:
>>>
>>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>
>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>> Hello Eric,
>>>>>>>
>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>> Eric!
>>>>>>>>>
>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>
>>>>>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on
>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>>>>
>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>
>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>> {
>>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>
>>>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> and use this to query the actual state at crash time. That spares all
>>>>>>>>> those callback heuristics.
>>>>>>>>>
>>>>>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>
>>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> tglx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thomas,
>>>>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>>>>>
>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>>>>> not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>>>>>
>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>>>>> nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>>>>> for sadump method).
>>>>>>>>
>>>>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>>>>> course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>>>>> to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>>>>> image memory, not via the ELF structure. Said differently, it appears to me that
>>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>>>>> via kernel cpumasks and the memory within the vmcore.
>>>>>>>>
>>>>>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>>>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number
>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>>>>>
>>>>>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>>>>>> that might rely on the ELF info?)
>>>>>>>>
>>>>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>>>>>> online/offline.
>>>>>>>>
>>>>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>>>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>
>>>>>>>> I think this is what Sourabh has known and has been advocating for an optimization
>>>>>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>>>>>>>> structs are all laid out). I do think it best to leave that as an arch choice.
>>>>>>>
>>>>>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>>>
>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug.
>>>>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>
>>>>>>>
>>>>>>> From what I understood, crash notes are prepared for possible CPUs as system boots and
>>>>>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>>>>>> during the kdump kernel load.
>>>>>>>
>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>
>>>>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>>>>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>>>>>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>>>>>
>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>>>>>> any of this.
>>>>>>
>>>>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>>>>>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>>>>>> event and an online event (and unplug/offline). If those were distinguishable, then we
>>>>>> could only regenerate on un/plug events.
>>>>>>
>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>
>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
>>>>> hot[un]plug case.
>>>>>
>>>>> Here is my view on how things should be handled if a core-dump analyzer is dependent on
>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>
>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has
>>>>> one crash note per CPU for every possible CPU). Though the crash notes are allocated
>>>>> during the boot time they are populated when the system is on the crash path.
>>>>>
>>>>> This is how crash notes are populated on PowerPC and I am expecting it would be something
>>>>> similar on other architectures too.
>>>>>
>>>>> The crashing CPU sends IPI to every other online CPU with a callback function that updates the
>>>>> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash
>>>>> note and proceeds further.
>>>>>
>>>>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time
>>>>> system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes
>>>>> and display the information accordingly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> - Sourabh
>>>>
>>>> I've been examining what it would mean to move to for_each_possible_cpu() in
>>>> crash_prepare_elf64_headers(). I think it means:
>>>>
>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers().
>>>> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the
>>>> load with the elfcorehdr generated by crash_prepare_elf64_headers().
>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug events.
>>>
>>> If for_each_present_cpu is replaced with for_each_possible_cpu I still need cpuhp machinery
>>> to update FDT kexec segment for CPU hot add case.
>>
>> Ah, ok, that's important! So the cpuhp callbacks are still needed.
>>>
>>>
>>>>
>>>> This would then setup PT_NOTEs for all possible cpus, which should in theory accommodate crash
>>>> analyzers that rely on ELF PT_NOTEs for crash_notes.
>>>>
>>>> If staying with for_each_present_cpu() is ultimately decided, then I think leaving the cpuhp
>>>> machinery in place and each arch could decide how to handle crash cpu hotplug events. The
>>>> overhead for doing this is very minimal, and the events are likely very infrequent.
>>>
>>> I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other then
>>> elfcorehdr. For example FDT on PowerPC.
>>>
>>> - Sourabh Jain
>>
>> OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire is
>> to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all possible
>> cpus upon kernel boot, there seems to be no reason to not do this?
>>
>> HOWEVER...
>>
>> It's not clear to me that this particular change needs to be part of this series. It's inclusion
>> would facilitate PPC support, but doesn't "solve" anything in general. In fact it causes
>> kexec_load and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of
>> for_each_present_cpu() where as with this change kexec_file_load would do for_each_possible_cpu();
>> until a hot plug event then both would do for_each_possible_cpu()). And if this change were to
>> arrive as part of Sourabh's PPC support, then it does not appear to impact x86 (not sure about
>> other arches). And the 'crash' dump analyzer doesn't care either way.
>>
>> Including this change would enable an optimization path (for x86 at least) that short-circuits cpu
>> hotplug changes in the arch crash handler, for example:
>>
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index aca3f1817674..0883f6b11de4 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>> unsigned long mem, memsz;
>> unsigned long elfsz = 0;
>>
>> + if (image->file_mode && (
>> + image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
>> + image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
>> + return;
>> +
>> /*
>> * Create the new elfcorehdr reflecting the changes to CPU and/or
>> * memory resources.
>>
>> I'm not sure that is compelling given the infrequent nature of cpu hotplug events.
> It certainly closes/reduces the window where kdump is not active due kexec segment update.|
Fair enough. I plan to include this change in v19.
>
>>
>> In my mind I still have a question about kexec_load() path. The userspace kexec can not do the
>> equivalent of for_each_possible_cpu(). It can obtain max possible cpus from
>> /sys/devices/system/cpu/possible, but for those cpus not present the /sys/devices/system/cpu/cpuXX
>> is not available and so the crash_notes entries is not available. My attempts to expose all cpuXX
>> lead to odd behavior that was requiring changes in ACPI and arch code that looked untenable.
>>
>> There seem to be these options available for kexec_load() path:
>> - immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've
>> made this work with the following, as proof of concept:
> Yes regenerating/patching the elfcorehdr could be an option for kexec_load syscall.
So this is not needed by x86, but more so by ppc. Should this change be in the ppc set or this set?
>
>>
>> diff --git a/kernel/kexec.c b/kernel/kexec.c
>> index cb8e6e6f983c..4eb201270f97 100644
>> --- a/kernel/kexec.c
>> +++ b/kernel/kexec.c
>> @@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, unsigned long
>> kimage_free(image);
>> out_unlock:
>> kexec_unlock();
>> + if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
>> + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
>> + crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, KEXEC_CRASH_HP_INVALID_CPU);
>> + }
>> + }
>> return ret;
>> }
>>
>> - Another option is spend the time to determine whether exposing all cpuXX is a viable solution; I
>> have no idea what impacts to userspace would be for possible-but-not-yet-present cpuXX entries
>> would be. It might also mean requiring a 'present' entry available within the cpuXX.
>>
>> - Another option is to simply let the hot plug events rewrite the elfcorehdr on demand. This is
>> what I've originally put forth, but not sure how this impacts PPC given for_each_possible_cpu()
>> change.
> Given that /sys/devices/system/cpu/cpuXX is not present for possbile-but-not-yet-present CPUs, I am
> wondering do we even have crash notes for possible CPUs on x86?
Yes there are crash_notes for all possible cpus on x86.
eric
>>
>> The concern is that today, both kexec_load and kexec_file_load mirror each other with respect to
>> for_each_present_cpu(); that is userspace kexec is able to generate the elfcorehdr the same as
>> would kexec_file_load, for cpus. But by changing to for_each_possible_cpu(), the two would deviate.
>
> Thanks,
> Sourabh Jain
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
next prev parent reply other threads:[~2023-02-28 21:51 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-31 22:42 [PATCH v18 0/7] crash: Kernel handling of CPU and memory hot un/plug Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 1/7] crash: move a few code bits to setup support of crash hotplug Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 2/7] crash: prototype change for crash_prepare_elf64_headers() Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 3/7] crash: add generic infrastructure for crash hotplug support Eric DeVolder
2023-02-09 19:10 ` Sourabh Jain
2023-02-10 16:51 ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 4/7] kexec: exclude elfcorehdr from the segment digest Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes Eric DeVolder
2023-02-01 11:33 ` Thomas Gleixner
2023-02-06 8:12 ` Sourabh Jain
2023-02-06 13:03 ` Thomas Gleixner
2023-02-07 17:23 ` Eric DeVolder
2023-02-08 13:44 ` Thomas Gleixner
2023-02-09 17:31 ` Eric DeVolder
2023-02-09 18:43 ` Sourabh Jain
2023-02-09 19:39 ` Eric DeVolder
2023-02-10 6:29 ` Sourabh Jain
2023-02-11 0:35 ` Eric DeVolder
2023-02-13 4:40 ` Sourabh Jain
2023-02-13 12:52 ` Thomas Gleixner
2023-02-15 2:53 ` Sourabh Jain
2023-02-28 12:44 ` Baoquan He
2023-02-28 18:52 ` Eric DeVolder
2023-03-01 15:48 ` Eric DeVolder
2023-03-02 10:51 ` Baoquan He
2023-03-02 5:23 ` Sourabh Jain
2023-02-23 20:34 ` Eric DeVolder
2023-02-24 8:34 ` Sourabh Jain
2023-02-24 20:16 ` Eric DeVolder
2023-02-27 6:11 ` Sourabh Jain
2023-02-28 21:50 ` Eric DeVolder [this message]
2023-03-01 6:22 ` Sourabh Jain
2023-03-01 14:16 ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 6/7] crash: memory and cpu hotplug sysfs attributes Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 7/7] x86/crash: add x86 crash hotplug support Eric DeVolder
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4fbdbb3a-dcab-cbf1-0d24-a7be1858b048@oracle.com \
--to=eric.devolder@oracle.com \
--cc=bhe@redhat.com \
--cc=boris.ostrovsky@oracle.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=dyoung@redhat.com \
--cc=ebiederm@xmission.com \
--cc=efault@gmx.de \
--cc=hpa@zytor.com \
--cc=kexec@lists.infradead.org \
--cc=konrad.wilk@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=nramas@linux.microsoft.com \
--cc=robh@kernel.org \
--cc=rppt@kernel.org \
--cc=sourabhjain@linux.ibm.com \
--cc=tglx@linutronix.de \
--cc=thomas.lendacky@amd.com \
--cc=vgoyal@redhat.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox