* [PATCH] mce: include cmci during intel feature clearing
@ 2025-06-17 21:47 JP Kobryn
2025-06-25 18:15 ` Luck, Tony
2025-06-25 19:43 ` Borislav Petkov
0 siblings, 2 replies; 4+ messages in thread
From: JP Kobryn @ 2025-06-17 21:47 UTC (permalink / raw)
To: tony.luck, bp, tglx, mingo, dave.hansen, hpa, aijay
Cc: x86, linux-edac, linux-kernel, kernel-team
It was found that after a kexec on an Intel CPU, MCE reporting was no
longer active. The root cause has been found to be that ownership of CMCI
banks is not cleared during the shutdown phase. As a result, when CPUs
come back online, they are unable to rediscover these occupied banks. If we
clear these CPU associations before booting into the new kernel, the CMCI
banks can be reclaimed and MCE reporting will become functional once more.
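
The ownership problem described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: the ctl2
array stands in for the per-bank MCi_CTL2 registers, bit 30 is taken as the
CMCI enable bit, and every function and type name below is hypothetical
rather than taken from the kernel.

```c
/*
 * User-space model of CMCI bank ownership across a kexec.
 * Not kernel code: all names here are hypothetical.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_BANKS 4
#define CMCI_EN  (1ULL << 30)	/* assumed CMCI enable bit in MCi_CTL2 */

/* Simulated hardware registers: they persist across a kexec. */
static uint64_t ctl2[NR_BANKS];

/* Per-kernel bookkeeping: it starts empty in every newly booted kernel. */
struct kernel_state {
	bool owned[NR_BANKS];
};

/* Claim every bank whose enable bit is not yet set; return the count. */
static int discover_banks(struct kernel_state *k)
{
	int claimed = 0;

	for (int i = 0; i < NR_BANKS; i++) {
		if (ctl2[i] & CMCI_EN)
			continue;	/* looks owned by someone else */
		ctl2[i] |= CMCI_EN;
		k->owned[i] = true;
		claimed++;
	}
	return claimed;
}

/* Release every bank this kernel owns by clearing its enable bit. */
static void clear_banks(struct kernel_state *k)
{
	for (int i = 0; i < NR_BANKS; i++) {
		if (!k->owned[i])
			continue;
		ctl2[i] &= ~CMCI_EN;
		k->owned[i] = false;
	}
}

/* Boot, optionally clear on shutdown, kexec, then rediscover. */
static int kexec_and_rediscover(bool clear_on_shutdown)
{
	struct kernel_state old_kernel = {0}, next_kernel = {0};
	int first;

	for (int i = 0; i < NR_BANKS; i++)
		ctl2[i] = 0;	/* cold boot: no bank is claimed */

	first = discover_banks(&old_kernel);
	assert(first == NR_BANKS);
	(void)first;

	if (clear_on_shutdown)
		clear_banks(&old_kernel);

	/* The new kernel has fresh memory but sees the old register state. */
	return discover_banks(&next_kernel);
}
```

In this model, kexec_and_rediscover(false) leaves the new kernel with no
banks it can claim, while kexec_and_rediscover(true) lets it claim all of
them, which mirrors the behaviour the patch aims for.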
The existing code does seem to have the intention of clearing MCE-related
features via mcheck_cpu_clear(). During a kexec reboot, there are two
sequences that reach a call to mcheck_cpu_clear(). They are:
1) Stopping other (remote) CPUs via IPI:

  native_machine_shutdown()
    stop_other_cpus()
      smp_ops.stop_other_cpus(1)
        x86 smp: native_stop_other_cpus(1)
          apic_send_IPI_allbutself(REBOOT_VECTOR)

  ...IPI is received on remote CPUs and IDT sysvec_reboot invoked:
    stop_this_cpu()
      mcheck_cpu_clear(this_cpu_ptr(&cpu_info))

2) Sequence of stopping the active CPU (the one performing the kexec):

  native_machine_shutdown()
    stop_other_cpus()
      smp_ops.stop_other_cpus(1)
        x86 smp: native_stop_other_cpus(1)
          mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
In both cases, the call to mcheck_cpu_clear() leads to the vendor-specific
call to mce_intel_feature_clear():

  mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
    __mcheck_cpu_clear_vendor(c)
      switch (c->x86_vendor)
      case X86_VENDOR_INTEL:
        mce_intel_feature_clear(c)
Now looking at the pair of functions mce_intel_feature_{init,clear}, there
are three MCE features set up on the init side:

  mce_intel_feature_init(c)
    intel_init_cmci()
    intel_init_lmce()
    intel_imc_init(c)

On the other side, in the clear function, only one of these features is
actually cleared:

  mce_intel_feature_clear(c)
    intel_clear_lmce()
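
The init/clear asymmetry listed above can be expressed as a tiny audit. This
is a hypothetical user-space sketch, not how the kernel tracks features: the
struct, the tables, and count_unpaired() are invented for illustration, and
the table contents only restate the two call lists above plus the one call
this patch adds.

```c
/*
 * Hypothetical audit of init/clear pairing for the Intel MCE features.
 * The tables restate the call lists from the commit message.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct feature {
	const char *name;
	bool has_init;	/* set up in mce_intel_feature_init() */
	bool has_clear;	/* torn down in mce_intel_feature_clear() */
};

/* Count the features that are initialized but never cleared. */
static size_t count_unpaired(const struct feature *f, size_t n)
{
	size_t unpaired = 0;

	assert(f != NULL);
	for (size_t i = 0; i < n; i++) {
		if (f[i].has_init && !f[i].has_clear)
			unpaired++;
	}
	return unpaired;
}

static const struct feature before_patch[] = {
	{ "cmci", true, false },
	{ "lmce", true, true  },
	{ "imc",  true, false },
};

static const struct feature after_patch[] = {
	{ "cmci", true, true  },	/* cmci_clear() added by this patch */
	{ "lmce", true, true  },
	{ "imc",  true, false },
};
```

Calling count_unpaired() on before_patch reports two features without a
matching clear; on after_patch it reports one, since this patch only
addresses CMCI.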
Just focusing on the feature pertaining to the root cause of the kexec
issue, there would be a benefit if we additionally cleared the CMCI feature
within this routine: the banks would be free for acquisition on the boot-up
side of a kexec. This patch adds the call to clear CMCI to this Intel
routine.
Reported-by: Aijay Adams <aijay@meta.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
arch/x86/kernel/cpu/mce/intel.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index efcf21e9552e..9b149b9c4109 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -478,6 +478,7 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
 void mce_intel_feature_clear(struct cpuinfo_x86 *c)
 {
 	intel_clear_lmce();
+	cmci_clear();
 }
 
 bool intel_filter_mce(struct mce *m)
--
2.47.1
* Re: [PATCH] mce: include cmci during intel feature clearing
2025-06-17 21:47 [PATCH] mce: include cmci during intel feature clearing JP Kobryn
@ 2025-06-25 18:15 ` Luck, Tony
2025-06-25 19:43 ` Borislav Petkov
1 sibling, 0 replies; 4+ messages in thread
From: Luck, Tony @ 2025-06-25 18:15 UTC (permalink / raw)
To: JP Kobryn
Cc: bp, tglx, mingo, dave.hansen, hpa, aijay, x86, linux-edac,
linux-kernel, kernel-team
On Tue, Jun 17, 2025 at 02:47:52PM -0700, JP Kobryn wrote:
> It was found that after a kexec on an intel CPU, MCE reporting was no
> longer active. The root cause has been found to be that ownership of CMCI
> banks is not cleared during the shutdown phase. As a result, when CPU's
> come back online, they are unable to rediscover these occupied banks. If we
> clear these CPU associations before booting into the new kernel, the CMCI
> banks can be reclaimed and MCE reporting will become functional once more.
>
> The existing code does seem to have the intention of clearing MCE-related
> features via mcheck_cpu_clear(). During a kexec reboot, there are two
> sequences that reach a call to mcheck_cpu_clear(). They are:
>
> 1) Stopping other (remote) CPU's via IPI:
> native_machine_shutdown()
> stop_other_cpus()
> smp_ops.stop_other_cpus(1)
> x86 smp: native_stop_other_cpus(1)
> apic_send_IPI_allbutself(REBOOT_VECTOR)
>
> ...IPI is received on remote CPU's and IDT sysvec_reboot invoked:
> stop_this_cpu()
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
>
> 2) Seqence of stopping the active CPU (the one performing the kexec):
> native_machine_shutdown()
> stop_other_cpus()
> smp_ops.stop_other_cpus(1)
> x86 smp: native_stop_other_cpus(1)
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
>
> In both cases, the call to mcheck_cpu_clear() leads to the vendor specific
> call to intel_feature_clear():
>
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
> __mcheck_cpu_clear_vendor(c)
> switch (c->x86_vendor)
> case X86_VENDOR_INTEL:
> mce_intel_feature_clear(c)
>
> Now looking at the pair of functions mce_intel_feature_{init,clear}, there
> are 3 MCE features setup on the init side:
>
> mce_intel_feature_init(c)
> intel_init_cmci()
> intel_init_lmce()
> intel_imc_init(c)
>
> On the other side in the clear function, only one of these features is
> actually cleared:
>
> mce_intel_feature_clear(c)
> intel_clear_lmce()
>
> Just focusing on the feature pertaining to the root cause of the kexec
> issue, there would be a benefit if we additionally cleared the CMCI feature
> within this routine - the banks would be free for acquisition on the boot
> up side of a kexec. This patch adds the call to clear CMCI to this intel
> routine.
>
> Reported-by: Aijay Adams <aijay@meta.com>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
LGTM
Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/mce/intel.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index efcf21e9552e..9b149b9c4109 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -478,6 +478,7 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
> void mce_intel_feature_clear(struct cpuinfo_x86 *c)
> {
> intel_clear_lmce();
> + cmci_clear();
I particularly like that you found a one-line fix!
> }
>
> bool intel_filter_mce(struct mce *m)
> --
> 2.47.1
>
* Re: [PATCH] mce: include cmci during intel feature clearing
2025-06-17 21:47 [PATCH] mce: include cmci during intel feature clearing JP Kobryn
2025-06-25 18:15 ` Luck, Tony
@ 2025-06-25 19:43 ` Borislav Petkov
2025-06-26 2:18 ` Zhuo, Qiuxu
1 sibling, 1 reply; 4+ messages in thread
From: Borislav Petkov @ 2025-06-25 19:43 UTC (permalink / raw)
To: JP Kobryn
Cc: tony.luck, tglx, mingo, dave.hansen, hpa, aijay, x86, linux-edac,
linux-kernel, kernel-team
On Tue, Jun 17, 2025 at 02:47:52PM -0700, JP Kobryn wrote:
> It was found that after a kexec on an intel CPU, MCE reporting was no
> longer active. The root cause has been found to be that ownership of CMCI
> banks is not cleared during the shutdown phase. As a result, when CPU's
> come back online, they are unable to rediscover these occupied banks. If we
> clear these CPU associations before booting into the new kernel, the CMCI
> banks can be reclaimed and MCE reporting will become functional once more.
>
> The existing code does seem to have the intention of clearing MCE-related
> features via mcheck_cpu_clear(). During a kexec reboot, there are two
> sequences that reach a call to mcheck_cpu_clear(). They are:
>
> 1) Stopping other (remote) CPU's via IPI:
> native_machine_shutdown()
> stop_other_cpus()
> smp_ops.stop_other_cpus(1)
> x86 smp: native_stop_other_cpus(1)
> apic_send_IPI_allbutself(REBOOT_VECTOR)
>
> ...IPI is received on remote CPU's and IDT sysvec_reboot invoked:
> stop_this_cpu()
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
>
> 2) Seqence of stopping the active CPU (the one performing the kexec):
> native_machine_shutdown()
> stop_other_cpus()
> smp_ops.stop_other_cpus(1)
> x86 smp: native_stop_other_cpus(1)
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
>
> In both cases, the call to mcheck_cpu_clear() leads to the vendor specific
> call to intel_feature_clear():
>
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
> __mcheck_cpu_clear_vendor(c)
> switch (c->x86_vendor)
> case X86_VENDOR_INTEL:
> mce_intel_feature_clear(c)
>
> Now looking at the pair of functions mce_intel_feature_{init,clear}, there
> are 3 MCE features setup on the init side:
>
> mce_intel_feature_init(c)
> intel_init_cmci()
> intel_init_lmce()
> intel_imc_init(c)
>
> On the other side in the clear function, only one of these features is
> actually cleared:
>
> mce_intel_feature_clear(c)
> intel_clear_lmce()
>
> Just focusing on the feature pertaining to the root cause of the kexec
> issue, there would be a benefit if we additionally cleared the CMCI feature
> within this routine - the banks would be free for acquisition on the boot
> up side of a kexec. This patch adds the call to clear CMCI to this intel
> routine.
Please:
- shorten this commit message - there really is no need to explain in such
detail that mcheck_cpu_clear() has simply forgotten to clear CMCI banks
too.
- run it through a spellchecker
- drop all personal pronouns
- write it in imperative tone
Some hints:
Section "2) Describe your changes" in
Documentation/process/submitting-patches.rst for more details.
Also, see section "Changelog" in
Documentation/process/maintainer-tip.rst
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* RE: [PATCH] mce: include cmci during intel feature clearing
2025-06-25 19:43 ` Borislav Petkov
@ 2025-06-26 2:18 ` Zhuo, Qiuxu
0 siblings, 0 replies; 4+ messages in thread
From: Zhuo, Qiuxu @ 2025-06-26 2:18 UTC (permalink / raw)
To: Borislav Petkov, JP Kobryn
Cc: Luck, Tony, tglx@linutronix.de, mingo@redhat.com,
dave.hansen@linux.intel.com, hpa@zytor.com, aijay@meta.com,
x86@kernel.org, linux-edac@vger.kernel.org,
linux-kernel@vger.kernel.org, kernel-team@meta.com
Hi JP,
> [...]
> > Just focusing on the feature pertaining to the root cause of the kexec
> > issue, there would be a benefit if we additionally cleared the CMCI
> > feature within this routine - the banks would be free for acquisition
> > on the boot up side of a kexec. This patch adds the call to clear CMCI
> > to this intel routine.
>
> Please:
>
> - shorten this commit message - there really is no need to explain in such
> detail that mcheck_cpu_clear() has simply forgotten to clear CMCI banks
> too.
>
> - run it through a spellchecker
>
> - drop all personal pronouns
>
> - write it in imperative tone
>
> Some hints:
>
> Section "2) Describe your changes" in
> Documentation/process/submitting-patches.rst for more details.
>
> Also, see section "Changelog" in
> Documentation/process/maintainer-tip.rst
>
Please address Boris' comments, other than that
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>