Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

From: Sourabh Jain <sourabhjain@linux.ibm.com>
To: Eric DeVolder <eric.devolder@oracle.com>, Baoquan He <bhe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel@vger.kernel.org, x86@kernel.org,
	kexec@lists.infradead.org, ebiederm@xmission.com,
	dyoung@redhat.com, vgoyal@redhat.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	nramas@linux.microsoft.com, thomas.lendacky@amd.com,
	robh@kernel.org, efault@gmx.de, rppt@kernel.org,
	david@redhat.com, konrad.wilk@oracle.com,
	boris.ostrovsky@oracle.com
Subject: Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
Date: Thu, 2 Mar 2023 10:53:26 +0530	[thread overview]
Message-ID: <c91066ed-07b9-4e29-6fd0-683c788564be@linux.ibm.com> (raw)
In-Reply-To: <7b0e3676-40e1-3ea1-6d58-6d934970aa27@oracle.com>

On 01/03/23 00:22, Eric DeVolder wrote:
>
>
> On 2/28/23 06:44, Baoquan He wrote:
>> On 02/13/23 at 10:10am, Sourabh Jain wrote:
>>>
>>> On 11/02/23 06:05, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>
>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>> Hello Eric,
>>>>>>>
>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>> Eric!
>>>>>>>>>
>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>
>>>>>>>>>> So my latest solution is introduce two new CPUHP
>>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
>>>>>>>>>> offlining. I'm open to better names.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
>>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>>> attempts at locating this state failed when
>>>>>>>>>> inside the STARTING section, so I located
>>>>>>>>>> this just inside the ONLINE sectoin. The crash
>>>>>>>>>> hotplug handler is registered on
>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
>>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>> placed it at the end of the PREPARE section.
>>>>>>>>>> This crash hotplug handler is also
>>>>>>>>>> registered on this state as the callback for the .teardown 
>>>>>>>>>> method.
>>>>>>>>>
>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>
>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>> {
>>>>>>>>>      struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>
>>>>>>>>>      return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> and use this to query the actual state at crash
>>>>>>>>> time. That spares all
>>>>>>>>> those callback heuristics.
>>>>>>>>>
>>>>>>>>>> I'm making my way though percpu crash_notes,
>>>>>>>>>> elfcorehdr, vmcoreinfo,
>>>>>>>>>> makedumpfile and (the consumer of it all) the
>>>>>>>>>> userspace crash utility,
>>>>>>>>>> in order to understand the impact of moving from
>>>>>>>>>> for_each_present_cpu()
>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>
>>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>           tglx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thomas,
>>>>>>>> I've investigated the passing of crash notes through the
>>>>>>>> vmcore. What I've learned is that:
>>>>>>>>
>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references
>>>>>>>> to do its job) does
>>>>>>>>    not care what the contents of cpu PT_NOTES are, but it
>>>>>>>> does coalesce them together.
>>>>>>>>
>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in
>>>>>>>> order to determine its
>>>>>>>>    nr_cpus variable, which is reported in a header, but
>>>>>>>> otherwise unused (except
>>>>>>>>    for sadump method).
>>>>>>>>
>>>>>>>> - the crash utility, for the purposes of determining the
>>>>>>>> cpus, does not appear to
>>>>>>>>    reference the elfcorehdr PT_NOTEs. Instead it locates the 
>>>>>>>> various
>>>>>>>>    cpu_[possible|present|online]_mask and computes
>>>>>>>> nr_cpus from that, and also of
>>>>>>>>    course which are online. In addition, when crash does
>>>>>>>> reference the cpu PT_NOTE,
>>>>>>>>    to get its prstatus, it does so by using a percpu
>>>>>>>> technique directly in the vmcore
>>>>>>>>    image memory, not via the ELF structure. Said
>>>>>>>> differently, it appears to me that
>>>>>>>>    crash utility doesn't rely on the ELF PT_NOTEs for
>>>>>>>> cpus; rather it obtains them
>>>>>>>>    via kernel cpumasks and the memory within the vmcore.
>>>>>>>>
>>>>>>>> With this understanding, I did some testing. Perhaps the
>>>>>>>> most telling test was that I
>>>>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>>> hot plugged some cpus, then also took a few offline
>>>>>>>> sparsely via chcpu, then generated a
>>>>>>>> vmcore. The crash utility had no problem loading the
>>>>>>>> vmcore, it reported the proper number
>>>>>>>> of cpus and the number offline (despite only one cpu
>>>>>>>> PT_NOTE), and changing to a different
>>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>>>>>
>>>>>>>> My take away is that crash utility does not rely upon
>>>>>>>> ELF cpu PT_NOTEs, it obtains the
>>>>>>>> cpu information directly from kernel data structures.
>>>>>>>> Perhaps at one time crash relied
>>>>>>>> upon the ELF information, but no more. (Perhaps there
>>>>>>>> are other crash dump analyzers
>>>>>>>> that might rely on the ELF info?)
>>>>>>>>
>>>>>>>> So, all this to say that I see no need to change
>>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>>> is no compelling reason to move away from
>>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>>> online/offline.
>>>>>>>>
>>>>>>>> Which then leaves the topic of the cpuhp state on which
>>>>>>>> to register. Perhaps reverting
>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right
>>>>>>>> answer. There does not appear to
>>>>>>>> be a compelling need to accurately track whether the cpu
>>>>>>>> went online/offline for the
>>>>>>>> purposes of creating the elfcorehdr, as ultimately the
>>>>>>>> crash utility pulls that from
>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>
>>>>>>>> I think this is what Sourabh has known and has been
>>>>>>>> advocating for an optimization
>>>>>>>> path that allows not regenerating the elfcorehdr on cpu
>>>>>>>> changes (because all the percpu
>>>>>>>> structs are all laid out). I do think it best to leave
>>>>>>>> that as an arch choice.
>>>>>>>
>>>>>>> Since things are clear on how the PT_NOTES are consumed in
>>>>>>> kdump kernel [fs/proc/vmcore.c],
>>>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>>>
>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>>> changes due to CPU hotplug.
>>>>>> Due to the use of for_each_present_cpu(), it is possible for the
>>>>>> number of cpu PT_NOTEs
>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
>>>>>> does not impact the
>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>
>>>>>>>
>>>>>>>   From what I understood, crash notes are prepared for
>>>>>>> possible CPUs as system boots and
>>>>>>> could be used to create a PT_NOTE section for each possible
>>>>>>> CPU while generating the elfcorehdr
>>>>>>> during the kdump kernel load.
>>>>>>>
>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every
>>>>>>> possible CPU there is no need to
>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>
>>>>>> For onlining/offlining of cpus, there is no need to regenerate
>>>>>> the elfcorehdr. However,
>>>>>> for actual hot un/plug of cpus, the answer is yes due to
>>>>>> for_each_present_cpu(). The
>>>>>> caveat here of course is that if crash utility is the only
>>>>>> coredump analyzer of concern,
>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be
>>>>>> no need to re-generate them.
>>>>>>
>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming
>>>>>> into mainstream, impacts
>>>>>> any of this.
>>>>>>
>>>>>> Perhaps the one item that might help here is to distinguish
>>>>>> between actual hot un/plug of
>>>>>> cpus, versus onlining/offlining. At the moment, I can not
>>>>>> distinguish between a hot plug
>>>>>> event and an online event (and unplug/offline). If those were
>>>>>> distinguishable, then we
>>>>>> could only regenerate on un/plug events.
>>>>>>
>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>
>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't
>>>>> have to worry about
>>>>> hot[un]plug case.
>>>>>
>>>>> Here is my view on how things should be handled if a core-dump
>>>>> analyzer is dependent on
>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>
>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash
>>>>> notes (kernel has
>>>>> one crash note per CPU for every possible CPU). Though the crash
>>>>> notes are allocated
>>>>> during the boot time they are populated when the system is on the
>>>>> crash path.
>>>>>
>>>>> This is how crash notes are populated on PowerPC and I am expecting
>>>>> it would be something
>>>>> similar on other architectures too.
>>>>>
>>>>> The crashing CPU sends IPI to every other online CPU with a callback
>>>>> function that updates the
>>>>> crash notes of that specific CPU. Once the IPI completes the
>>>>> crashing CPU updates its own crash
>>>>> note and proceeds further.
>>>>>
>>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>>> offline or hot unplugged at the time
>>>>> system crash. The core-dump analyzer should be able to identify
>>>>> [un]/initialized crash notes
>>>>> and display the information accordingly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> - Sourabh
>>>>
>>>> In general, I agree with your points. You've presented a strong 
>>>> case to
>>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and
>>>> those crash notes would always be present, and we can ignore 
>>>> changes to
>>>> cpus wrt/ elfcorehdr updates.
>>>>
>>>> But what do we do about kexec_load() syscall? The way the userspace
>>>> utility works is it determines cpus by:
>>>>   nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
>>>> which is not the equivalent of possible_cpus. So the complete list of
>>>> cpu PT_NOTEs is not generated up front. We would need a solution for
>>>> that?
>>> Hello Eric,
>>>
>>> The sysconf document says _SC_NPROCESSORS_CONF is processors 
>>> configured,
>>> isn't that equivalent to possible CPUs?
>>>
>>> What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on 
>>> powerPC
>>> it is possible CPUs.
>>
> Baoquan,
>
>>  From sysconf man page, with my understanding, _SC_NPROCESSORS_CONF is
>> returning the possible cpus, while _SC_NPROCESSORS_ONLN returns present
>> cpus. If these are true, we can use them.
>
> Thomas Gleixner has pointed out that:
>
>  glibc tries to evaluate that in the following order:
>   1) /sys/devices/system/cpu/cpu*
>      That's present CPUs not possible CPUs
>   2) /proc/stat
>      That's online CPUs
>   3) sched_getaffinity()
>      That's online CPUs at best. In the worst case it's an affinity mask
>      which is set on a process group
>
> meaning that _SC_NPROCESSORS_CONF is not equivalent to 
> possible_cpus(). Furthermore, the /sys/system/devices/cpus/cpuXX 
> entries are not available for not-present-but-possible cpus; thus 
> userspace kexec utility can not write out the elfcorehdr with all 
> possible cpus listed.
>
>>
>> But I am wondering why the existing present cpu way is going to be
>> discarded. Sorry, I tried to go through this thread, it's too long, can
>> anyone summarize the reason with shorter and clear sentences. Sorry
>> again for that.
>
Hello Eric,

> By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), 
> in the case of the kexec_file_load(), this change would simplify some 
> issues Sourabh has encountered for PPC support.

Things are fine even with for_each_present_cpu on PPC. It is just that I 
want to avoid
the regeneration of elfcorehdr for every CPU change by packing possible 
CPUs at once.

Thanks,
Sourabh Jain

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec