public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Baoquan He <bhe@redhat.com>
To: Eric DeVolder <eric.devolder@oracle.com>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
	kexec@lists.infradead.org, ebiederm@xmission.com,
	dyoung@redhat.com, vgoyal@redhat.com, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, nramas@linux.microsoft.com,
	thomas.lendacky@amd.com, robh@kernel.org, efault@gmx.de,
	rppt@kernel.org, konrad.wilk@oracle.com,
	boris.ostrovsky@oracle.com
Subject: Re: [RFC v1 0/8] RFC v1: Kernel handling of CPU and memory hot un/plug for crash
Date: Fri, 19 Nov 2021 10:37:35 +0800	[thread overview]
Message-ID: <20211119023735.GH21646@MiWiFi-R3L-srv> (raw)
In-Reply-To: <20211118174948.37435-1-eric.devolder@oracle.com>

Hi Eric,

On 11/18/21 at 12:49pm, Eric DeVolder wrote:
> When the kdump service is loaded, if a CPU or memory is hot
> un/plugged, the crash elfcorehdr which describes the CPUs and memory
> in the system, must also be updated, else the resulting vmcore is
> inaccurate (eg. missing either CPU context or memory regions).
> 
> The current solution utilizes udev to initiate an unload-then-reload
> of the kdump image (e. kernel, initrd, boot_params, puratory and
> elfcorehdr) by the userspace kexec utility.
> 
> In the post https://lkml.org/lkml/2020/12/14/532 I outlined two
> problems with this userspace-initiated unload-then-reload approach as
> it pertains to supporting CPU and memory hot un/plug for kdump.
> (Note in that post, I erroneously called the elfcorehdr the vmcoreinfo
> structure. There is a vmcoreinfo structure, but it has a different
> purpose. So in that post substitute "elfcorehdr" for "vmcoreinfo".)

It's great you finally make this patchset to address the cpu/mem hotplug
issues raised before, I will review it carefully. And I have to say
sorry, because I ever promised you to do this but didn't keep it due to
personal reasons.

Thanks again for doing this.

> 
> The first problem being the time needed to complete the unload-then-
> reload of the kdump image, and the second being the effective race
> window that unload-then-reload effort creates.
> 
> The scenario I measured was a 32GiB guest being resized to 512GiB and
> observing it took over 4 minutes for udev to "settle down" and
> complete the unload-then-reload of the resulting 3840 hot plug events.
> Empirical evidence within our fleet substantiates this problem.
> 
> Each unload-then-reload creates a race window the size of which is the
> time it takes to reload the complete kdump image. Within the race
> window, kdump is not loaded and should a panic occur, the kernel halts
> rather than dumping core via kdump.
> 
> This patchset significantly improves upon the current solution by
> enabling the kernel to update only the necessary items of the kdump
> image. In the case of x86_64, that is just the elfcorehdr and the
> purgatory segments. These updates occur as fast as the hot un/plug
> events and significantly reduce the size of the race window.
> 
> This patchset introduces a generic crash hot un/plug handler that
> registers with the CPU and memory notifiers. Upon CPU or memory
> changes, this generic handler is invoked and performs important
> housekeeping, for example obtaining the appropriate lock, and then
> invokes an architecture specific handler to do the appropriate
> updates.
> 
> In the case of x86_64, the arch specific handler generates a new
> elfcorehdr, which reflects the current CPUs and memory regions, into a
> buffer. Since purgatory also does an integrity check via hash digests
> of the loaded segments, purgatory must also be updated with the new
> digests. The arch handler also generates a new purgatory into a
> buffer, performs the hash digests of the new memory segments, and then
> patches purgatory with the new digests.  If all succeeds, then the
> elfcorehdr and purgatory buffers over write the existing buffers and
> the new kdump image is live and ready to go. No involvement with
> userspace at all.
> 
> To accommodate a growing number of resources via hotplug, the
> elfcorehdr memory must be sufficiently large enough to accommodate
> changes. The CRASH_HOTPLUG_ELFCOREHDR_SZ configure item does just
> this.
> 
> To realize the benefits/test this patchset, one must make a couple
> of minor changes to userspace:
> 
>  - Disable the udev rule for updating kdump on hot un/plug changes
>    Eg. on RHEL: rm -f /usr/lib/udev/rules.d/98-kexec.rules
>    or other technique to neuter the rule.
> 
>  - Change to the kexec_file_load for loading the kdump kernel:
>    Eg. on RHEL: in /usr/bin/kdumpctl, change to:
>     standard_kexec_args="-p -d -s"
>    which adds the -s to select kexec_file_load syscall.
> 
> This work has raised the following questions for me:
> 
> First and foremost, this patch only works for the kexec_file_load
> syscall path (via "kexec -s -p" utility). The reason being that, for
> x86_64 anyway, the purgatory blob provided by userspace can not be
> readily decoded in order to update the hash digest. (The
> kexec_file_load purgatory is actually a small ELF object with symbols,
> so can be patched at run time.) With no way to update purgatory, the
> integrity check will always fail and and cause purgatory to hang at
> panic time.
> 
> That being said, I actually developed this against the kexec_load path
> and did have that working by making two one-line changes to userspace
> kexec utility: one change that effectively is
> CRASH_HOTPLUG_ELFCOREHDR_SZ and the other to disable the integrity
> check. But that does not seem to be a long term solution. A possible
> long term solution would be to allow the use of the kexec_file_load
> purgatory ELF object with the kexec_load path. While I believe that
> would work, I am unsure if there are any downsides to doing so.
> 
> The second problem is the use of CPUHP_AP_ONLINE_DYN.  The
> cpuhp_setup_state_nocalls() is invoked with parameter
> CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
> the CPU still shows up in foreach_present_cpu() during the
> regeneration of the elfcorehdr, thus the need to explicitly check and
> exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
> Perhaps if value(s) new/different than CPUHP_AP_ONLINE_DYN to
> cpuhp_setup_state() was utilized, then the offline cpu would no longer
> be in foreach_present_cpu(), and this change could be eliminated. I do
> not understand cpuhp_setup_state() well enough to choose, or create,
> appropriate value(s).
> 
> The third problem is the number of memory hot un/plug events.  If, for
> example, a 1GiB DIMM is hotplugged, that generate 8 memory events, one
> for each 128MiB memblock, yet the walk_system_ram_res() that is used
> to obtain the list of memory regions reports the single 1GiB; thus
> there are 7 un-necessary trips through this crash hotplug handler.
> Perhaps there is another way of handling memory events that would see
> the single 1GiB DIMM rather than each memblock?
> 
> Regards,
> eric
> 
> Eric DeVolder (8):
>   crash: fix minor typo/bug in debug message
>   crash hp: Introduce CRASH_HOTPLUG configuration options
>   crash hp: definitions and prototypes for crash hotplug support
>   crash hp: generic crash hotplug support infrastructure
>   crash hp: kexec_file changes for use by crash hotplug handler
>   crash hp: Add x86 crash hotplug state items to kimage
>   crash hp: Add x86 crash hotplug support for kexec_file_load
>   crash hp: Add x86 crash hotplug support for bzImage
> 
>  arch/x86/Kconfig                  |  26 +++
>  arch/x86/include/asm/kexec.h      |  10 ++
>  arch/x86/kernel/crash.c           | 257 +++++++++++++++++++++++++++++-
>  arch/x86/kernel/kexec-bzimage64.c |  12 ++
>  include/linux/kexec.h             |  22 ++-
>  kernel/crash_core.c               | 118 ++++++++++++++
>  kernel/kexec_file.c               |  19 ++-
>  7 files changed, 455 insertions(+), 9 deletions(-)
> 
> -- 
> 2.27.0
> 


  parent reply	other threads:[~2021-11-19  2:37 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-18 17:49 [RFC v1 0/8] RFC v1: Kernel handling of CPU and memory hot un/plug for crash Eric DeVolder
2021-11-18 17:49 ` [RFC v1 1/8] crash: fix minor typo/bug in debug message Eric DeVolder
2021-11-24  1:17   ` Baoquan He
2021-11-18 17:49 ` [RFC v1 2/8] crash hp: Introduce CRASH_HOTPLUG configuration options Eric DeVolder
2021-11-18 17:49 ` [RFC v1 3/8] crash hp: definitions and prototypes for crash hotplug support Eric DeVolder
2021-11-18 17:49 ` [RFC v1 4/8] crash hp: generic crash hotplug support infrastructure Eric DeVolder
2021-11-18 17:49 ` [RFC v1 5/8] crash hp: kexec_file changes for use by crash hotplug handler Eric DeVolder
2021-11-18 17:49 ` [RFC v1 6/8] crash hp: Add x86 crash hotplug state items to kimage Eric DeVolder
2021-11-18 17:49 ` [RFC v1 7/8] crash hp: Add x86 crash hotplug support for kexec_file_load Eric DeVolder
2021-11-18 17:49 ` [RFC v1 8/8] crash hp: Add x86 crash hotplug support for bzImage Eric DeVolder
2021-11-19  2:37 ` Baoquan He [this message]
2021-11-24  9:02 ` [RFC v1 0/8] RFC v1: Kernel handling of CPU and memory hot un/plug for crash Baoquan He
2021-11-29 19:42   ` Eric DeVolder
2021-12-01 12:59     ` Baoquan He
2021-12-07 20:04       ` Eric DeVolder
2021-11-29  8:45 ` Sourabh Jain
2021-11-29 20:00   ` Eric DeVolder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211119023735.GH21646@MiWiFi-R3L-srv \
    --to=bhe@redhat.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dyoung@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=efault@gmx.de \
    --cc=eric.devolder@oracle.com \
    --cc=hpa@zytor.com \
    --cc=kexec@lists.infradead.org \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=nramas@linux.microsoft.com \
    --cc=robh@kernel.org \
    --cc=rppt@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=vgoyal@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox