[patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles
@ 2023-06-15 20:33 Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman

This is the third version of the stop_other_cpus() / kexec()
vs. mwait_play_dead() series. Version 2 can be found here:

  https://lore.kernel.org/r/20230613115353.599087484@linutronix.de

The two issues addressed are:

  1) stop_other_cpus() continues after observing num_online_cpus() == 1.

     This is problematic because the to be stopped CPUs clear their online
     bit first and then invoke eventually WBINVD, which can take a long
     time. There seems to be an interaction between the WBINVD and the
     reboot mechanics as this intermittendly results in hangs.

  2) kexec() kernel can overwrite the memory locations which "offline" CPUs
     are monitoring. This write brings them out of MWAIT and they resume
     execution on overwritten text, page tables, data and stacks resulting
     in triple faults.

Cure them by:

  #1 Synchronizing stop_other_cpus() with a CPU mask which is updated in
     stop_this_cpu() _after_ WBINVD completes.

  #2 Bringing offline CPUs out of MWAIT and move them into HLT before
     starting the kexec() kernel. Optionaly send them an INIT IPI so they
     go back into wait for startup state.

Changes vs. V2:

  - Use a CPU mask instead of an atomic counter and send the NMI only to
    CPUs which did not report that they reached HLT. That's still not race
    free vs. a late handling of the reboot vector, but that's not fixable.

Interestingly enough testing the NMI mechanics unearthed that after soft
disabling the local APIC the CPU is _not_ handling the NMI despite the
SDM claiming:

  "The operation and response of a local APIC while in this software-disabled
   state is as follows:

   * The local APIC will respond normally to INIT, NMI, SMI, and SIPI messages."

I validated that even without handling the NMI, the CPU is kicked out of
HLT reliably.

It's unclear whether that's X2APIC specific and I neither verified that
behaviour on AMD. Nor is it clear what "respond normally" actually means.

The AMD APM is not helpful either:

  "SMI, NMI, INIT, Startup, and Remote Read interrupts may be accepted"

Oh well.

The series is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/kexec

Thanks,

	tglx
---
 include/asm/cpu.h |    2 
 include/asm/smp.h |    4 +
 kernel/process.c  |   25 +++++++--
 kernel/smp.c      |  111 +++++++++++++++++++++++++++++-----------
 kernel/smpboot.c  |  149 ++++++++++++++++++++++++++++++++++++++++--------------
 5 files changed, 220 insertions(+), 71 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
@ 2023-06-15 20:33 ` Thomas Gleixner
  2023-06-16  1:58   ` Ashok Raj
  2023-06-15 20:33 ` [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman

Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.

That wbindv() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.

That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:

CPU0					CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);			stop_this_cpu()
   while (num_online_cpus() > 1);         set_online(false);
   proceed... -> hang
				          wbinvd()

WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.

This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.

Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().

The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only to
the CPUs which have not reported within the timeout window.

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
V3: Use a cpumask to make the NMI case slightly safer - Ashok
---
 arch/x86/include/asm/cpu.h |    2 +
 arch/x86/kernel/process.c  |   23 +++++++++++++-
 arch/x86/kernel/smp.c      |   71 +++++++++++++++++++++++++++++++--------------
 3 files changed, 73 insertions(+), 23 deletions(-)


--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
 
+extern struct cpumask cpus_stop_mask;
+
 #endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,13 +759,23 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+struct cpumask cpus_stop_mask;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
+	unsigned int cpu = smp_processor_id();
+
 	local_irq_disable();
+
 	/*
-	 * Remove this CPU:
+	 * Remove this CPU from the online mask and disable it
+	 * unconditionally. This might be redundant in case that the reboot
+	 * vector was handled late and stop_other_cpus() sent an NMI.
+	 *
+	 * According to SDM and APM NMIs can be accepted even after soft
+	 * disabling the local APIC.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	set_cpu_online(cpu, false);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
@@ -783,6 +793,15 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	if (cpuid_eax(0x8000001f) & BIT(0))
 		native_wbinvd();
+
+	/*
+	 * This brings a cache line back and dirties it, but
+	 * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+	 * observed that all CPUs reported stop. This write will invalidate
+	 * the related cache line on this CPU.
+	 */
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -146,31 +147,43 @@ static int register_stop_handler(void)
 
 static void native_stop_other_cpus(int wait)
 {
-	unsigned long flags;
-	unsigned long timeout;
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags, timeout;
 
 	if (reboot_force)
 		return;
 
-	/*
-	 * Use an own vector here because smp_call_function
-	 * does lots of things not suitable in a panic situation.
-	 */
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+		return;
 
 	/*
-	 * We start by using the REBOOT_VECTOR irq.
-	 * The irq is treated as a sync point to allow critical
-	 * regions of code on other cpus to release their spin locks
-	 * and re-enable irqs.  Jumping straight to an NMI might
-	 * accidentally cause deadlocks with further shutdown/panic
-	 * code.  By syncing, we give the cpus up to one second to
-	 * finish their work before we force them off with the NMI.
+	 * 1) Send an IPI on the reboot vector to all other CPUs.
+	 *
+	 *    The other CPUs should react on it after leaving critical
+	 *    sections and re-enabling interrupts. They might still hold
+	 *    locks, but there is nothing which can be done about that.
+	 *
+	 * 2) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * 3) If #2 timed out send an NMI to the CPUs which did not
+	 *    yet report
+	 *
+	 * 4) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * That CPU will have reported already and the "have all CPUs
+	 * reached HLT" condition will be true despite the fact that the
+	 * other CPU is still handling the NMI. Again, there is no
+	 * protection against that as "disabled" APICs still respond to
+	 * NMIs.
 	 */
-	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
+	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/* sync above data before sending IRQ */
 		wmb();
 
@@ -183,24 +196,34 @@ static void native_stop_other_cpus(int w
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (!cpumask_empty(&cpus_stop_mask) && timeout--)
 			udelay(1);
 	}
 
 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
 		 * CPUs to stop.
 		 */
 		if (!smp_no_nmi_ipi && !register_stop_handler()) {
+			u32 dm;
+
 			/* Sync above data before sending IRQ */
 			wmb();
 
 			pr_emerg("Shutting down cpus with NMI\n");
 
-			apic_send_IPI_allbutself(NMI_VECTOR);
+			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
+			dm |= APIC_DM_NMI;
+
+			for_each_cpu(cpu, &cpus_stop_mask) {
+				u32 apicid = apic->cpu_present_to_apicid(cpu);
+
+				apic_icr_write(dm, apicid);
+				apic_wait_icr_idle();
+			}
 		}
 		/*
 		 * Don't wait longer than 10 ms if the caller didn't
@@ -208,7 +231,7 @@ static void native_stop_other_cpus(int w
 		 * one or more CPUs do not reach shutdown state.
 		 */
 		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
 			udelay(1);
 	}
 
@@ -216,6 +239,12 @@ static void native_stop_other_cpus(int w
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cpus_stop_mask cache lines are invalidated on
+	 * the other CPUs. See comment vs. SME in stop_this_cpu().
+	 */
+	cpumask_clear(&cpus_stop_mask);
 }
 
 /*


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-15 20:33 ` [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
@ 2023-06-16  1:58   ` Ashok Raj
  2023-06-16  7:53     ` Thomas Gleixner
  2023-06-16 16:36     ` Tony Battersby
  0 siblings, 2 replies; 37+ messages in thread
From: Ashok Raj @ 2023-06-16  1:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

Hi Thomas,

On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
> 
> That wbindv() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
> 
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
> 
> CPU0					CPU1
> 
>  stop_other_cpus()
>    send_IPIs(REBOOT);			stop_this_cpu()
>    while (num_online_cpus() > 1);         set_online(false);
>    proceed... -> hang
> 				          wbinvd()
> 
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.
> 
> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
> 
> This issue exists independent of WBINVD, but the delays caused by WBINVD
> make it more prominent.
> 
> Make this more robust by adding a cpumask which is initialized to the
> online CPU mask before sending the IPIs and CPUs clear their bit in
> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
> become empty in stop_other_cpus() instead of watching num_online_cpus().
> 
> The cpumask cannot plug all holes either, but it's better than a raw
> counter and allows to restrict the NMI fallback IPI to be sent only to
> the CPUs which have not reported within the timeout window.
> 
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@cybernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
> V3: Use a cpumask to make the NMI case slightly safer - Ashok
> ---
>  arch/x86/include/asm/cpu.h |    2 +
>  arch/x86/kernel/process.c  |   23 +++++++++++++-
>  arch/x86/kernel/smp.c      |   71 +++++++++++++++++++++++++++++++--------------
>  3 files changed, 73 insertions(+), 23 deletions(-)

I tested them and seems to work fine on my system.

Maybe Tony can check in his setup would be great.

One thought on sending NMI below.

[snip]

>  
>  	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> -	if (num_online_cpus() > 1) {
> +	if (!cpumask_empty(&cpus_stop_mask)) {
>  		/*
>  		 * If NMI IPI is enabled, try to register the stop handler
>  		 * and send the IPI. In any case try to wait for the other
>  		 * CPUs to stop.
>  		 */
>  		if (!smp_no_nmi_ipi && !register_stop_handler()) {
> +			u32 dm;
> +
>  			/* Sync above data before sending IRQ */
>  			wmb();
>  
>  			pr_emerg("Shutting down cpus with NMI\n");
>  
> -			apic_send_IPI_allbutself(NMI_VECTOR);
> +			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> +			dm |= APIC_DM_NMI;
> +
> +			for_each_cpu(cpu, &cpus_stop_mask) {
> +				u32 apicid = apic->cpu_present_to_apicid(cpu);
> +
> +				apic_icr_write(dm, apicid);
> +				apic_wait_icr_idle();

can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16  1:58   ` Ashok Raj
@ 2023-06-16  7:53     ` Thomas Gleixner
  2023-06-16 14:13       ` Ashok Raj
  2023-06-16 16:36     ` Tony Battersby
  1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-16  7:53 UTC (permalink / raw)
  To: Ashok Raj
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Thu, Jun 15 2023 at 18:58, Ashok Raj wrote:
> On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
>> +			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
>> +			dm |= APIC_DM_NMI;
>> +
>> +			for_each_cpu(cpu, &cpus_stop_mask) {
>> +				u32 apicid = apic->cpu_present_to_apicid(cpu);
>> +
>> +				apic_icr_write(dm, apicid);
>> +				apic_wait_icr_idle();
>
> can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??

That would not set APIC_DM_NMI in delivery mode and the IPI would be
sent with APIC_DM_FIXED.

Unfortunately we don't have apic->send_NMI() ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16  7:53     ` Thomas Gleixner
@ 2023-06-16 14:13       ` Ashok Raj
  2023-06-16 18:01         ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Ashok Raj @ 2023-06-16 14:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
> On Thu, Jun 15 2023 at 18:58, Ashok Raj wrote:
> > On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
> >> +			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> >> +			dm |= APIC_DM_NMI;
> >> +
> >> +			for_each_cpu(cpu, &cpus_stop_mask) {
> >> +				u32 apicid = apic->cpu_present_to_apicid(cpu);
> >> +
> >> +				apic_icr_write(dm, apicid);
> >> +				apic_wait_icr_idle();
> >
> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
> 
> That would not set APIC_DM_NMI in delivery mode and the IPI would be
> sent with APIC_DM_FIXED.

That's correct. 

Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)

apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
   __x2apic_send_IPI_mask() 
      __x2apic_send_IPI_dest()
	  unsigned long cfg = __prepare_ICR(0, vector, dest);
	  native_x2apic_icr_write(cfg, apicid);

__prepare_ICR() seems to have the magic for APIC_DM_NMI?

I suppose this works for non-x2apic parts as well

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16 14:13       ` Ashok Raj
@ 2023-06-16 18:01         ` Thomas Gleixner
  2023-06-16 20:57           ` Ashok Raj
  2023-06-20  8:09           ` Borislav Petkov
  0 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-16 18:01 UTC (permalink / raw)
  To: Ashok Raj
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Fri, Jun 16 2023 at 07:13, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
>> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
>> 
>> That would not set APIC_DM_NMI in delivery mode and the IPI would be
>> sent with APIC_DM_FIXED.
>
> That's correct. 
>
> Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
>
> apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
>    __x2apic_send_IPI_mask() 
>       __x2apic_send_IPI_dest()
> 	  unsigned long cfg = __prepare_ICR(0, vector, dest);
> 	  native_x2apic_icr_write(cfg, apicid);
>
> __prepare_ICR() seems to have the magic for APIC_DM_NMI?

Bah. I clearly neither can read nor can remember any of the internals of
that code:

apic->send_IPI()
  x2apic_send_IPI()
    __x2apic_send_IPI_dest()

It also works for all other APIC incarnations. 

Thanks,

        tglx
---
From: Thomas Gleixner <tglx@linutronix.de>
Subject: x86/smp: Make stop_other_cpus() more robust
Date: Wed, 26 Apr 2023 18:37:00 +0200

Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.

That wbindv() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.

That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:

CPU0					CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);			stop_this_cpu()
   while (num_online_cpus() > 1);         set_online(false);
   proceed... -> hang
				          wbinvd()

WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.

This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.

Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().

The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window.

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
V3: Use a cpumask to make the NMI case slightly safer - Ashok
V4: Simplify the NMI IPI code - Ashok
---
 arch/x86/include/asm/cpu.h |    2 +
 arch/x86/kernel/process.c  |   23 +++++++++++++++-
 arch/x86/kernel/smp.c      |   62 +++++++++++++++++++++++++++++----------------
 3 files changed, 64 insertions(+), 23 deletions(-)


--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
 
+extern struct cpumask cpus_stop_mask;
+
 #endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,13 +759,23 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+struct cpumask cpus_stop_mask;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
+	unsigned int cpu = smp_processor_id();
+
 	local_irq_disable();
+
 	/*
-	 * Remove this CPU:
+	 * Remove this CPU from the online mask and disable it
+	 * unconditionally. This might be redundant in case that the reboot
+	 * vector was handled late and stop_other_cpus() sent an NMI.
+	 *
+	 * According to SDM and APM NMIs can be accepted even after soft
+	 * disabling the local APIC.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	set_cpu_online(cpu, false);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
@@ -783,6 +793,15 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	if (cpuid_eax(0x8000001f) & BIT(0))
 		native_wbinvd();
+
+	/*
+	 * This brings a cache line back and dirties it, but
+	 * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+	 * observed that all CPUs reported stop. This write will invalidate
+	 * the related cache line on this CPU.
+	 */
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -146,31 +147,43 @@ static int register_stop_handler(void)
 
 static void native_stop_other_cpus(int wait)
 {
-	unsigned long flags;
-	unsigned long timeout;
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags, timeout;
 
 	if (reboot_force)
 		return;
 
-	/*
-	 * Use an own vector here because smp_call_function
-	 * does lots of things not suitable in a panic situation.
-	 */
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+		return;
 
 	/*
-	 * We start by using the REBOOT_VECTOR irq.
-	 * The irq is treated as a sync point to allow critical
-	 * regions of code on other cpus to release their spin locks
-	 * and re-enable irqs.  Jumping straight to an NMI might
-	 * accidentally cause deadlocks with further shutdown/panic
-	 * code.  By syncing, we give the cpus up to one second to
-	 * finish their work before we force them off with the NMI.
+	 * 1) Send an IPI on the reboot vector to all other CPUs.
+	 *
+	 *    The other CPUs should react on it after leaving critical
+	 *    sections and re-enabling interrupts. They might still hold
+	 *    locks, but there is nothing which can be done about that.
+	 *
+	 * 2) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * 3) If #2 timed out send an NMI to the CPUs which did not
+	 *    yet report
+	 *
+	 * 4) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * That CPU will have reported already and the "have all CPUs
+	 * reached HLT" condition will be true despite the fact that the
+	 * other CPU is still handling the NMI. Again, there is no
+	 * protection against that as "disabled" APICs still respond to
+	 * NMIs.
 	 */
-	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
+	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/* sync above data before sending IRQ */
 		wmb();
 
@@ -183,12 +196,12 @@ static void native_stop_other_cpus(int w
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (!cpumask_empty(&cpus_stop_mask) && timeout--)
 			udelay(1);
 	}
 
 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
@@ -200,7 +213,8 @@ static void native_stop_other_cpus(int w
 
 			pr_emerg("Shutting down cpus with NMI\n");
 
-			apic_send_IPI_allbutself(NMI_VECTOR);
+			for_each_cpu(cpu, &cpus_stop_mask)
+				apic->send_IPI(cpu, NMI_VECTOR);
 		}
 		/*
 		 * Don't wait longer than 10 ms if the caller didn't
@@ -208,7 +222,7 @@ static void native_stop_other_cpus(int w
 		 * one or more CPUs do not reach shutdown state.
 		 */
 		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
 			udelay(1);
 	}
 
@@ -216,6 +230,12 @@ static void native_stop_other_cpus(int w
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cpus_stop_mask cache lines are invalidated on
+	 * the other CPUs. See comment vs. SME in stop_this_cpu().
+	 */
+	cpumask_clear(&cpus_stop_mask);
 }
 
 /*

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16 18:01         ` Thomas Gleixner
@ 2023-06-16 20:57           ` Ashok Raj
  2023-06-19 17:51             ` Ashok Raj
  2023-06-20  8:09           ` Borislav Petkov
  1 sibling, 1 reply; 37+ messages in thread
From: Ashok Raj @ 2023-06-16 20:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Fri, Jun 16, 2023 at 08:01:56PM +0200, Thomas Gleixner wrote:
> On Fri, Jun 16 2023 at 07:13, Ashok Raj wrote:
> > On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
> >> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
> >> 
> >> That would not set APIC_DM_NMI in delivery mode and the IPI would be
> >> sent with APIC_DM_FIXED.
> >
> > That's correct. 
> >
> > Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
> >
> > apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
> >    __x2apic_send_IPI_mask() 
> >       __x2apic_send_IPI_dest()
> > 	  unsigned long cfg = __prepare_ICR(0, vector, dest);
> > 	  native_x2apic_icr_write(cfg, apicid);
> >
> > __prepare_ICR() seems to have the magic for APIC_DM_NMI?
> 
> Bah. I clearly neither can read nor can remember any of the internals of
> that code:
> 
> apic->send_IPI()
>   x2apic_send_IPI()
>     __x2apic_send_IPI_dest()
> 
> It also works for all other APIC incarnations. 

Works!


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16 20:57           ` Ashok Raj
@ 2023-06-19 17:51             ` Ashok Raj
  0 siblings, 0 replies; 37+ messages in thread
From: Ashok Raj @ 2023-06-19 17:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Fri, Jun 16, 2023 at 01:57:59PM -0700, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 08:01:56PM +0200, Thomas Gleixner wrote:
> > On Fri, Jun 16 2023 at 07:13, Ashok Raj wrote:
> > > On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
> > >> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
> > >> 
> > >> That would not set APIC_DM_NMI in delivery mode and the IPI would be
> > >> sent with APIC_DM_FIXED.
> > >
> > > That's correct. 
> > >
> > > Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
> > >
> > > apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
> > >    __x2apic_send_IPI_mask() 
> > >       __x2apic_send_IPI_dest()
> > > 	  unsigned long cfg = __prepare_ICR(0, vector, dest);
> > > 	  native_x2apic_icr_write(cfg, apicid);
> > >
> > > __prepare_ICR() seems to have the magic for APIC_DM_NMI?
> > 
> > Bah. I clearly neither can read nor can remember any of the internals of
> > that code:
> > 
> > apic->send_IPI()
> >   x2apic_send_IPI()
> >     __x2apic_send_IPI_dest()
> > 
> > It also works for all other APIC incarnations. 
> 
> Works!
> 

Forgot to add 

Reviewed-by: Ashok Raj <ashok.raj@intel.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16 18:01         ` Thomas Gleixner
  2023-06-16 20:57           ` Ashok Raj
@ 2023-06-20  8:09           ` Borislav Petkov
  1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20  8:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ashok Raj, LKML, x86, Mario Limonciello, Tom Lendacky,
	Tony Battersby, Ashok Raj, Tony Luck, Arjan van de Veen,
	Eric Biederman

On Fri, Jun 16, 2023 at 08:01:56PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> Subject: x86/smp: Make stop_other_cpus() more robust
> Date: Wed, 26 Apr 2023 18:37:00 +0200
> 
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
> 
> That wbindv() is conditional on the SME feature bit which is read directly

wbinvd()

Otherwise:

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
  2023-06-16  1:58   ` Ashok Raj
  2023-06-16  7:53     ` Thomas Gleixner
@ 2023-06-16 16:36     ` Tony Battersby
  1 sibling, 0 replies; 37+ messages in thread
From: Tony Battersby @ 2023-06-16 16:36 UTC (permalink / raw)
  To: Ashok Raj, Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Ashok Raj, Tony Luck,
	Arjan van de Veen, Eric Biederman

On 6/15/23 21:58, Ashok Raj wrote:
> Hi Thomas,
>
> On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
>> Tony reported intermittent lockups on poweroff. His analysis identified the
>> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
>> on SME enabled machines a kexec() does not leave any stale data in the
>> caches when switching from encrypted to non-encrypted mode or vice versa.
>>
>> That wbindv() is conditional on the SME feature bit which is read directly
>> from CPUID. But that readout does not check whether the CPUID leaf is
>> available or not. If it's not available the CPU will return the value of
>> the highest supported leaf instead. Depending on the content the "SME" bit
>> might be set or not.
>>
>> That's incorrect but harmless. Making the CPUID readout conditional makes
>> the observed hangs go away, but it does not fix the underlying problem:
>>
>> CPU0					CPU1
>>
>>  stop_other_cpus()
>>    send_IPIs(REBOOT);			stop_this_cpu()
>>    while (num_online_cpus() > 1);         set_online(false);
>>    proceed... -> hang
>> 				          wbinvd()
>>
>> WBINVD is an expensive operation and if multiple CPUs issue it at the same
>> time the resulting delays are even larger.
>>
>> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
>> which causes the system to hang.
>>
>> This issue exists independent of WBINVD, but the delays caused by WBINVD
>> make it more prominent.
>>
>> Make this more robust by adding a cpumask which is initialized to the
>> online CPU mask before sending the IPIs and CPUs clear their bit in
>> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
>> become empty in stop_other_cpus() instead of watching num_online_cpus().
>>
>> The cpumask cannot plug all holes either, but it's better than a raw
>> counter and allows to restrict the NMI fallback IPI to be sent only to
>> the CPUs which have not reported within the timeout window.
>>
>> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
>> Reported-by: Tony Battersby <tonyb@cybernetics.com>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
>> ---
>> V3: Use a cpumask to make the NMI case slightly safer - Ashok
>> ---
>>  arch/x86/include/asm/cpu.h |    2 +
>>  arch/x86/kernel/process.c  |   23 +++++++++++++-
>>  arch/x86/kernel/smp.c      |   71 +++++++++++++++++++++++++++++++--------------
>>  3 files changed, 73 insertions(+), 23 deletions(-)
> I tested them and seems to work fine on my system.
>
> Maybe Tony can check in his setup would be great.
>
plain 6.4-rc6: 50% failure rate
  poweroff success: 2
  poweroff fail:    2

6.4-rc6 with tglx v3 patch #1 only: 0% failure rate
  poweroff success: 10
  poweroff fail:    0

6.4-rc6 with all 7 tglx v3 patches: 0% failure rate
  poweroff success: 10
  poweroff fail:    0

Fixes my problem.

Tested-by: Tony Battersby <tonyb@cybernetics.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
@ 2023-06-15 20:33 ` Thomas Gleixner
  2023-06-19 17:02   ` Limonciello, Mario
  2023-06-20  8:20   ` Borislav Petkov
  2023-06-15 20:33 ` [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus() Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman

From: Tony Battersby <tonyb@cybernetics.com>

stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. CPUs
return the content of the highest supported leaf when a non-existing leaf
is read. So the result of the test is lottery except on AMD CPUs which
support that leaf.

While harmless it's incorrect and causes the conditional wbinvd() to be
issued where not required.

Check whether the leaf is supported before reading it.

[ tglx: Adjusted changelog ]

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
 arch/x86/kernel/process.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -763,6 +763,7 @@ struct cpumask cpus_stop_mask;
 
 void __noreturn stop_this_cpu(void *dummy)
 {
+	struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
 	unsigned int cpu = smp_processor_id();
 
 	local_irq_disable();
@@ -777,7 +778,7 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	set_cpu_online(cpu, false);
 	disable_local_APIC();
-	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+	mcheck_cpu_clear(c);
 
 	/*
 	 * Use wbinvd on processors that support SME. This provides support
@@ -791,7 +792,7 @@ void __noreturn stop_this_cpu(void *dumm
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
 		native_wbinvd();
 
 	/*


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf
  2023-06-15 20:33 ` [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
@ 2023-06-19 17:02   ` Limonciello, Mario
  2023-06-19 17:15     ` Thomas Gleixner
  2023-06-20  8:20   ` Borislav Petkov
  1 sibling, 1 reply; 37+ messages in thread
From: Limonciello, Mario @ 2023-06-19 17:02 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Tom Lendacky, Tony Battersby, Ashok Raj, Tony Luck,
	Arjan van de Veen, Eric Biederman


On 6/15/2023 3:33 PM, Thomas Gleixner wrote:
> From: Tony Battersby <tonyb@cybernetics.com>
>
> stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. CPUs
> return the content of the highest supported leaf when a non-existing leaf
> is read. So the result of the test is lottery except on AMD CPUs which
> support that leaf.
>
> While harmless it's incorrect and causes the conditional wbinvd() to be
> issued where not required.
>
> Check whether the leaf is supported before reading it.
>
> [ tglx: Adjusted changelog ]
>
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")

Thanks for this fix.
This particular patch should probably also CC to stable.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>

> Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
>   arch/x86/kernel/process.c |    5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -763,6 +763,7 @@ struct cpumask cpus_stop_mask;
>   
>   void __noreturn stop_this_cpu(void *dummy)
>   {
> +	struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
>   	unsigned int cpu = smp_processor_id();
>   
>   	local_irq_disable();
> @@ -777,7 +778,7 @@ void __noreturn stop_this_cpu(void *dumm
>   	 */
>   	set_cpu_online(cpu, false);
>   	disable_local_APIC();
> -	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
> +	mcheck_cpu_clear(c);
>   
>   	/*
>   	 * Use wbinvd on processors that support SME. This provides support
> @@ -791,7 +792,7 @@ void __noreturn stop_this_cpu(void *dumm
>   	 * Test the CPUID bit directly because the machine might've cleared
>   	 * X86_FEATURE_SME due to cmdline options.
>   	 */
> -	if (cpuid_eax(0x8000001f) & BIT(0))
> +	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
>   		native_wbinvd();
>   
>   	/*
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf
  2023-06-19 17:02   ` Limonciello, Mario
@ 2023-06-19 17:15     ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-19 17:15 UTC (permalink / raw)
  To: Limonciello, Mario, LKML
  Cc: x86, Tom Lendacky, Tony Battersby, Ashok Raj, Tony Luck,
	Arjan van de Veen, Eric Biederman

On Mon, Jun 19 2023 at 12:02, Limonciello, Mario wrote:
> On 6/15/2023 3:33 PM, Thomas Gleixner wrote:
>> From: Tony Battersby <tonyb@cybernetics.com>
>>
>> stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. CPUs
>> return the content of the highest supported leaf when a non-existing leaf
>> is read. So the result of the test is lottery except on AMD CPUs which
>> support that leaf.
>>
>> While harmless it's incorrect and causes the conditional wbinvd() to be
>> issued where not required.
>>
>> Check whether the leaf is supported before reading it.
>>
>> [ tglx: Adjusted changelog ]
>>
>> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
>
> Thanks for this fix.
> This particular patch should probably also CC to stable.

It's pretty much all stable material.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf
  2023-06-15 20:33 ` [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
  2023-06-19 17:02   ` Limonciello, Mario
@ 2023-06-20  8:20   ` Borislav Petkov
  1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20  8:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman

On Thu, Jun 15, 2023 at 10:33:52PM +0200, Thomas Gleixner wrote:

> Subject: Re: [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf

"Do not access a non-existing... "

> From: Tony Battersby <tonyb@cybernetics.com>
> 
> stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. CPUs
> return the content of the highest supported leaf when a non-existing leaf
> is read.

This should be:

"On Intel, querying an invalid extended CPUID leaf returns the values of the
maximum basic CPUID leaf. On AMD, invalid CPUID leafs return zeros."

Other than that:

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
@ 2023-06-15 20:33 ` Thomas Gleixner
  2023-06-20  8:47   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman

The wmb()s before sending the IPIs are not synchronizing anything.

If at all then the apic IPI functions have to provide or act as appropriate
barriers.

Remove these cargo cult barriers which have no explanation of what they are
synchronizing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3: Remove second instance and reword changelog - PeterZ
---
 arch/x86/kernel/smp.c |    6 ------
 1 file changed, 6 deletions(-)

--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -184,9 +184,6 @@ static void native_stop_other_cpus(int w
 	cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
 	if (!cpumask_empty(&cpus_stop_mask)) {
-		/* sync above data before sending IRQ */
-		wmb();
-
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
 
 		/*
@@ -210,9 +207,6 @@ static void native_stop_other_cpus(int w
 		if (!smp_no_nmi_ipi && !register_stop_handler()) {
 			u32 dm;
 
-			/* Sync above data before sending IRQ */
-			wmb();
-
 			pr_emerg("Shutting down cpus with NMI\n");
 
 			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  2023-06-15 20:33 ` [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus() Thomas Gleixner
@ 2023-06-20  8:47   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20  8:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman

On Thu, Jun 15, 2023 at 10:33:54PM +0200, Thomas Gleixner wrote:
> The wmb()s before sending the IPIs are not synchronizing anything.
> 
> If at all then the apic IPI functions have to provide or act as appropriate
> barriers.
> 
> Remove these cargo cult barriers which have no explanation of what they are
> synchronizing.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V3: Remove second instance and reword changelog - PeterZ
> ---
>  arch/x86/kernel/smp.c |    6 ------
>  1 file changed, 6 deletions(-)

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: x86/core] x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  2023-06-15 20:33 ` [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus() Thomas Gleixner
  2023-06-20  8:47   ` Borislav Petkov
@ 2023-06-20 13:00   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-06-20 13:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Borislav Petkov (AMD), stable, x86, linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     2affa6d6db28855e6340b060b809c23477aa546e
Gitweb:        https://git.kernel.org/tip/2affa6d6db28855e6340b060b809c23477aa546e
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 15 Jun 2023 22:33:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 20 Jun 2023 14:51:46 +02:00

x86/smp: Remove pointless wmb()s from native_stop_other_cpus()

The wmb()s before sending the IPIs are not synchronizing anything.

If at all then the apic IPI functions have to provide or act as appropriate
barriers.

Remove these cargo cult barriers which have no explanation of what they are
synchronizing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.378358382@linutronix.de

---
 arch/x86/kernel/smp.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 935bc65..d842875 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -184,9 +184,6 @@ static void native_stop_other_cpus(int wait)
 	cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
 	if (!cpumask_empty(&cpus_stop_mask)) {
-		/* sync above data before sending IRQ */
-		wmb();
-
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
 
 		/*
@@ -208,9 +205,6 @@ static void native_stop_other_cpus(int wait)
 		 * CPUs to stop.
 		 */
 		if (!smp_no_nmi_ipi && !register_stop_handler()) {
-			/* Sync above data before sending IRQ */
-			wmb();
-
 			pr_emerg("Shutting down cpus with NMI\n");
 
 			for_each_cpu(cpu, &cpus_stop_mask)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead()
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
                   ` (2 preceding siblings ...)
  2023-06-15 20:33 ` [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus() Thomas Gleixner
@ 2023-06-15 20:33 ` Thomas Gleixner
  2023-06-20  9:01   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj

Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all what is needed is a cache line which is not written
by other CPUs.

But there is a use case where a "dead" CPU needs to be brought out of that
mwait(): kexec().

The CPU needs to be brought out of mwait before kexec() as kexec() can
overwrite text, pagetables, stacks and the monitored cacheline of the
original kernel. The latter causes mwait to resume execution which
obviously causes havoc on the kexec kernel which results usually in triple
faults.

Use a dedicated per CPU storage to prepare for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
 arch/x86/kernel/smpboot.c |   24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
+struct mwait_cpu_dead {
+	unsigned int	control;
+	unsigned int	status;
+};
+
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
+
 /* Logical package management. We might want to allocate that dynamically */
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
@@ -1758,10 +1769,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0);
  */
 static inline void mwait_play_dead(void)
 {
+	struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
-	void *mwait_ptr;
 	int i;
 
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
@@ -1796,13 +1807,6 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
-	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors.  The actual
-	 * content is immaterial as it is not actually modified in any way.
-	 */
-	mwait_ptr = &current_thread_info()->flags;
-
 	wbinvd();
 
 	while (1) {
@@ -1814,9 +1818,9 @@ static inline void mwait_play_dead(void)
 		 * case where we return around the loop.
 		 */
 		mb();
-		clflush(mwait_ptr);
+		clflush(md);
 		mb();
-		__monitor(mwait_ptr, 0, 0);
+		__monitor(md, 0, 0);
 		mb();
 		__mwait(eax, 0);
 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead()
  2023-06-15 20:33 ` [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
@ 2023-06-20  9:01   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20  9:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Thu, Jun 15, 2023 at 10:33:55PM +0200, Thomas Gleixner wrote:
> Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
> obvious choice as all what is needed is a cache line which is not written
> by other CPUs.
> 
> But there is a use case where a "dead" CPU needs to be brought out of that
> mwait(): kexec().

s/mwait()/wait state/

I guess.

> The CPU needs to be brought out of mwait before kexec() as kexec() can

Ditto.

> overwrite text, pagetables, stacks and the monitored cacheline of the
> original kernel.

Yikes.

> The latter causes mwait to resume execution which
> obviously causes havoc on the kexec kernel which results usually in triple
> faults.

This sounds like a stable fix, no?

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: x86/core] x86/smp: Use dedicated cache-line for mwait_play_dead()
  2023-06-15 20:33 ` [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
  2023-06-20  9:01   ` Borislav Petkov
@ 2023-06-20 13:00   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-06-20 13:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Ashok Raj, Borislav Petkov (AMD), stable, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     f9c9987bf52f4e42e940ae217333ebb5a4c3b506
Gitweb:        https://git.kernel.org/tip/f9c9987bf52f4e42e940ae217333ebb5a4c3b506
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 15 Jun 2023 22:33:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 20 Jun 2023 14:51:47 +02:00

x86/smp: Use dedicated cache-line for mwait_play_dead()

Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all what is needed is a cache line which is not written
by other CPUs.

But there is a use case where a "dead" CPU needs to be brought out of
MWAIT: kexec().

This is required as kexec() can overwrite text, pagetables, stacks and the
monitored cacheline of the original kernel. The latter causes MWAIT to
resume execution which obviously causes havoc on the kexec kernel which
results usually in triple faults.

Use a dedicated per CPU storage to prepare for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de

---
 arch/x86/kernel/smpboot.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce..c5ac5d7 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
+struct mwait_cpu_dead {
+	unsigned int	control;
+	unsigned int	status;
+};
+
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
+
 /* Logical package management. We might want to allocate that dynamically */
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
@@ -1758,10 +1769,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0);
  */
 static inline void mwait_play_dead(void)
 {
+	struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
-	void *mwait_ptr;
 	int i;
 
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
@@ -1796,13 +1807,6 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
-	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors.  The actual
-	 * content is immaterial as it is not actually modified in any way.
-	 */
-	mwait_ptr = &current_thread_info()->flags;
-
 	wbinvd();
 
 	while (1) {
@@ -1814,9 +1818,9 @@ static inline void mwait_play_dead(void)
 		 * case where we return around the loop.
 		 */
 		mb();
-		clflush(mwait_ptr);
+		clflush(md);
 		mb();
-		__monitor(mwait_ptr, 0, 0);
+		__monitor(md, 0, 0);
 		mb();
 		__mwait(eax, 0);
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
                   ` (3 preceding siblings ...)
  2023-06-15 20:33 ` [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
@ 2023-06-15 20:33 ` Thomas Gleixner
  2023-06-20  9:23   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  2023-06-15 20:33 ` [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
  2023-06-15 20:34 ` [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
  6 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj

TLDR: It's a mess.

When kexec() is executed on a system with "offline" CPUs, which are parked
in mwait_play_dead() it can end up in a triple fault during the bootup of
the kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernels text,
page tables, data and stack, If it writes to the cache line which is
monitored by an previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.

Cure this by bringing the offline CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offline CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
those make it come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
 arch/x86/include/asm/smp.h |    2 +
 arch/x86/kernel/smp.c      |    5 +++
 arch/x86/kernel/smpboot.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 void cond_wakeup_cpu0(void);
 
+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -157,6 +158,10 @@ static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
 		return;
 
+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
+
 	/*
 	 * 1) Send an IPI on the reboot vector to all other CPUs.
 	 *
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@
 #include <linux/tboot.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/kexec.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
@@ -106,6 +107,9 @@ struct mwait_cpu_dead {
 	unsigned int	status;
 };
 
+#define CPUDEAD_MWAIT_WAIT	0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT	0x4A17DEAD
+
 /*
  * Cache line aligned data for mwait_play_dead(). Separate on purpose so
  * that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@ static void smp_callin(void)
 {
 	int cpuid;
 
+	/* Mop up eventual mwait_play_dead() wreckage */
+	this_cpu_write(mwait_cpu_dead.status, 0);
+	this_cpu_write(mwait_cpu_dead.control, 0);
+
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
 	 * cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
 	wbinvd();
 
 	while (1) {
@@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
 		mb();
 		__mwait(eax, 0);
 
+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while(1)
+				native_halt();
+		}
+
 		cond_wakeup_cpu0();
 	}
 }
 
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+	struct mwait_cpu_dead *md;
+	unsigned int cpu, i;
+
+	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+		/* Does it sit in mwait_play_dead() ? */
+		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+			continue;
+
+		/* Wait maximal 5ms */
+		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+			/* Bring it out of mwait */
+			WRITE_ONCE(md->control, newstate);
+			udelay(5);
+		}
+
+		if (READ_ONCE(md->status) != newstate)
+			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
+	}
+}
+
 void __noreturn hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-15 20:33 ` [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
@ 2023-06-20  9:23   ` Borislav Petkov
  2023-06-20 12:25     ` Thomas Gleixner
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20  9:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
> TLDR: It's a mess.

LOL. You don't need the rest of the commit message - this says it all.
:-P

> When kexec() is executed on a system with "offline" CPUs, which are parked

I'd say simply "with offlined CPUs"

> in mwait_play_dead() it can end up in a triple fault during the bootup of
> the kexec kernel or cause hard to diagnose data corruption.
> 
> The reason is that kexec() eventually overwrites the previous kernels text,

kernel's

> page tables, data and stack, If it writes to the cache line which is

... and stack. If it ...

> monitored by an previously offlined CPU, MWAIT resumes execution and ends

... by a previously...

> up executing the wrong text, dereferencing overwritten page tables or
> corrupting the kexec kernels data.

Lovely.

> Cure this by bringing the offline CPUs out of MWAIT into HLT.

... offlined CPUs... :

> Write to the monitored cache line of each offline CPU, which makes MWAIT

ditto.

> resume execution. The written control word tells the offline CPUs to issue

ditto and so on.

> HLT, which does not have the MWAIT problem.
> 
> That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
> those make it come out of HLT.
> 
> A follow up change will put them into INIT, which protects at least against
> NMI and SMI.

...

> @@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
>  			(highest_subcstate - 1);
>  	}
>  
> +	/* Set up state for the kexec() hack below */
> +	md->status = CPUDEAD_MWAIT_WAIT;
> +	md->control = CPUDEAD_MWAIT_WAIT;
> +
>  	wbinvd();
>  
>  	while (1) {
> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)

JFYI: that last hunk has some conflicts applying to latest tip/master.
Might need merge resolving...

>  		mb();
>  		__mwait(eax, 0);
>  
> +		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> +			/*
> +			 * Kexec is about to happen. Don't go back into mwait() as
> +			 * the kexec kernel might overwrite text and data including
> +			 * page tables and stack. So mwait() would resume when the
> +			 * monitor cache line is written to and then the CPU goes
> +			 * south due to overwritten text, page tables and stack.
> +			 *
> +			 * Note: This does _NOT_ protect against a stray MCE, NMI,
> +			 * SMI. They will resume execution at the instruction
> +			 * following the HLT instruction and run into the problem
> +			 * which this is trying to prevent.
> +			 */
> +			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> +			while(1)
> +				native_halt();
> +		}
> +
>  		cond_wakeup_cpu0();
>  	}
>  }
>  
> +/*
> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
> + * mwait_play_dead().
> + */
> +void smp_kick_mwait_play_dead(void)
> +{
> +	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;

Do you even need this newstate thing?

> +	struct mwait_cpu_dead *md;
> +	unsigned int cpu, i;
> +
> +	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
> +		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
> +
> +		/* Does it sit in mwait_play_dead() ? */
> +		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
> +			continue;
> +
> +		/* Wait maximal 5ms */

		/* Wait 5ms maximum */

> +		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
> +			/* Bring it out of mwait */
> +			WRITE_ONCE(md->control, newstate);
> +			udelay(5);
> +		}
> +
> +		if (READ_ONCE(md->status) != newstate)
> +			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);

Shouldn't this be a pr_err_once thing so that it doesn't flood the
console unnecessarily?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-20  9:23   ` Borislav Petkov
@ 2023-06-20 12:25     ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-20 12:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Tue, Jun 20 2023 at 11:23, Borislav Petkov wrote:
> On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
>> TLDR: It's a mess.
>>  	while (1) {
>> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
>
> JFYI: that last hunk has some conflicts applying to latest tip/master.
> Might need merge resolving...

Yes, I know.

>> +/*
>> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
>> + * mwait_play_dead().
>> + */
>> +void smp_kick_mwait_play_dead(void)
>> +{
>> +	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
>
> Do you even need this newstate thing?

Yes, for two reasons:

  1) To explicitely tell the other CPU to go into HLT. MWAIT can resume
     execution due to SMIs or NMIs, so we don't want to go them into HLT
     unconditionally. TLD; .... :)

  2) Two have the state feedback from the other CPU.

>> +
>> +		if (READ_ONCE(md->status) != newstate)
>> +			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
>
> Shouldn't this be a pr_err_once thing so that it doesn't flood the
> console unnecessarily?

Yes, no, do not know :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: x86/core] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-15 20:33 ` [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
  2023-06-20  9:23   ` Borislav Petkov
@ 2023-06-20 13:00   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-06-20 13:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Ashok Raj, Thomas Gleixner, stable, x86, linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     d7893093a7417527c0d73c9832244e65c9d0114f
Gitweb:        https://git.kernel.org/tip/d7893093a7417527c0d73c9832244e65c9d0114f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 15 Jun 2023 22:33:57 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 20 Jun 2023 14:51:47 +02:00

x86/smp: Cure kexec() vs. mwait_play_dead() breakage

TLDR: It's a mess.

When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.

Cure this by bringing the offlined CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de

---
 arch/x86/include/asm/smp.h |  2 +-
 arch/x86/kernel/smp.c      |  5 +++-
 arch/x86/kernel/smpboot.c  | 59 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4e91054..d4ce5cb 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 void cond_wakeup_cpu0(void);
 
+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d842875..174d623 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -157,6 +158,10 @@ static void native_stop_other_cpus(int wait)
 	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
 		return;
 
+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
+
 	/*
 	 * 1) Send an IPI on the reboot vector to all other CPUs.
 	 *
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c5ac5d7..483df04 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@
 #include <linux/tboot.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/kexec.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
@@ -106,6 +107,9 @@ struct mwait_cpu_dead {
 	unsigned int	status;
 };
 
+#define CPUDEAD_MWAIT_WAIT	0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT	0x4A17DEAD
+
 /*
  * Cache line aligned data for mwait_play_dead(). Separate on purpose so
  * that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@ static void smp_callin(void)
 {
 	int cpuid;
 
+	/* Mop up eventual mwait_play_dead() wreckage */
+	this_cpu_write(mwait_cpu_dead.status, 0);
+	this_cpu_write(mwait_cpu_dead.control, 0);
+
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
 	 * cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
 	wbinvd();
 
 	while (1) {
@@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
 		mb();
 		__mwait(eax, 0);
 
+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while(1)
+				native_halt();
+		}
+
 		cond_wakeup_cpu0();
 	}
 }
 
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+	struct mwait_cpu_dead *md;
+	unsigned int cpu, i;
+
+	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+		/* Does it sit in mwait_play_dead() ? */
+		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+			continue;
+
+		/* Wait up to 5ms */
+		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+			/* Bring it out of mwait */
+			WRITE_ONCE(md->control, newstate);
+			udelay(5);
+		}
+
+		if (READ_ONCE(md->status) != newstate)
+			pr_err_once("CPU%u is stuck in mwait_play_dead()\n", cpu);
+	}
+}
+
 void __noreturn hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
                   ` (4 preceding siblings ...)
  2023-06-15 20:33 ` [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
@ 2023-06-15 20:33 ` Thomas Gleixner
  2023-06-20  9:29   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  2023-06-15 20:34 ` [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
  6 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:33 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj

Putting CPUs into INIT is a safer place during kexec() to park CPUs.

Split the INIT assert/deassert sequence out so it can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
V2: Fix rebase screwup
---
 arch/x86/kernel/smpboot.c |   49 ++++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -853,47 +853,38 @@ wakeup_secondary_cpu_via_nmi(int apicid,
 	return (send_status | accept_status);
 }
 
-static int
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+static void send_init_sequence(int phys_apicid)
 {
-	unsigned long send_status = 0, accept_status = 0;
-	int maxlvt, num_starts, j;
-
-	maxlvt = lapic_get_maxlvt();
+	int maxlvt = lapic_get_maxlvt();
 
-	/*
-	 * Be paranoid about clearing APIC errors.
-	 */
+	/* Be paranoid about clearing APIC errors. */
 	if (APIC_INTEGRATED(boot_cpu_apic_version)) {
-		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP.  */
+		/* Due to the Pentium erratum 3AP.  */
+		if (maxlvt > 3)
 			apic_write(APIC_ESR, 0);
 		apic_read(APIC_ESR);
 	}
 
-	pr_debug("Asserting INIT\n");
-
-	/*
-	 * Turn INIT on target chip
-	 */
-	/*
-	 * Send IPI
-	 */
-	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
-		       phys_apicid);
-
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	/* Assert INIT on the target CPU */
+	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
 
 	udelay(init_udelay);
 
-	pr_debug("Deasserting INIT\n");
-
-	/* Target chip */
-	/* Send IPI */
+	/* Deassert INIT on the target CPU */
 	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
+}
+
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+{
+	unsigned long send_status = 0, accept_status = 0;
+	int num_starts, j, maxlvt = lapic_get_maxlvt();
 
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	send_init_sequence(phys_apicid);
 
 	mb();
 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-15 20:33 ` [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
@ 2023-06-20  9:29   ` Borislav Petkov
  2023-06-20 12:30     ` Thomas Gleixner
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20  9:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Thu, Jun 15, 2023 at 10:33:58PM +0200, Thomas Gleixner wrote:
> Putting CPUs into INIT is a safer place during kexec() to park CPUs.
> 
> Split the INIT assert/deassert sequence out so it can be reused.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
> ---
> V2: Fix rebase screwup
> ---
>  arch/x86/kernel/smpboot.c |   49 ++++++++++++++++++----------------------------
>  1 file changed, 20 insertions(+), 29 deletions(-)
> 
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -853,47 +853,38 @@ wakeup_secondary_cpu_via_nmi(int apicid,
>  	return (send_status | accept_status);
>  }
>  
> -static int
> -wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> +static void send_init_sequence(int phys_apicid)
>  {
> -	unsigned long send_status = 0, accept_status = 0;
> -	int maxlvt, num_starts, j;
> -
> -	maxlvt = lapic_get_maxlvt();
> +	int maxlvt = lapic_get_maxlvt();

Whoops:

arch/x86/kernel/smpboot.c: In function ‘send_init_sequence’:
arch/x86/kernel/smpboot.c:860:9: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  860 |         int maxlvt = lapic_get_maxlvt();
      |         ^~~

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-20  9:29   ` Borislav Petkov
@ 2023-06-20 12:30     ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-20 12:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Tue, Jun 20 2023 at 11:29, Borislav Petkov wrote:
> On Thu, Jun 15, 2023 at 10:33:58PM +0200, Thomas Gleixner wrote:
>>  
>> -static int
>> -wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
>> +static void send_init_sequence(int phys_apicid)
>>  {
>> -	unsigned long send_status = 0, accept_status = 0;
>> -	int maxlvt, num_starts, j;
>> -
>> -	maxlvt = lapic_get_maxlvt();
>> +	int maxlvt = lapic_get_maxlvt();
>
> Whoops:
>
> arch/x86/kernel/smpboot.c: In function ‘send_init_sequence’:
> arch/x86/kernel/smpboot.c:860:9: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
>   860 |         int maxlvt = lapic_get_maxlvt();
>       |         ^~~

How so? It's the first thing in the function, no?

And in the other hunk it's _before_ code too.

Conflict resolution artifact?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: x86/core] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-15 20:33 ` [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
  2023-06-20  9:29   ` Borislav Petkov
@ 2023-06-20 13:00   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-06-20 13:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Ashok Raj, x86, linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     6087dd5e86ff03a8cd4cffdf463a7f457e65cbff
Gitweb:        https://git.kernel.org/tip/6087dd5e86ff03a8cd4cffdf463a7f457e65cbff
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 15 Jun 2023 22:33:58 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 20 Jun 2023 14:51:47 +02:00

x86/smp: Split sending INIT IPI out into a helper function

Putting CPUs into INIT is a safer place during kexec() to park CPUs.

Split the INIT assert/deassert sequence out so it can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20230615193330.551157083@linutronix.de
---
 arch/x86/kernel/smpboot.c | 49 +++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 483df04..b403ead 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -853,47 +853,38 @@ wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip)
 	return (send_status | accept_status);
 }
 
-static int
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+static void send_init_sequence(int phys_apicid)
 {
-	unsigned long send_status = 0, accept_status = 0;
-	int maxlvt, num_starts, j;
+	int maxlvt = lapic_get_maxlvt();
 
-	maxlvt = lapic_get_maxlvt();
-
-	/*
-	 * Be paranoid about clearing APIC errors.
-	 */
+	/* Be paranoid about clearing APIC errors. */
 	if (APIC_INTEGRATED(boot_cpu_apic_version)) {
-		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP.  */
+		/* Due to the Pentium erratum 3AP.  */
+		if (maxlvt > 3)
 			apic_write(APIC_ESR, 0);
 		apic_read(APIC_ESR);
 	}
 
-	pr_debug("Asserting INIT\n");
-
-	/*
-	 * Turn INIT on target chip
-	 */
-	/*
-	 * Send IPI
-	 */
-	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
-		       phys_apicid);
-
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	/* Assert INIT on the target CPU */
+	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
 
 	udelay(init_udelay);
 
-	pr_debug("Deasserting INIT\n");
-
-	/* Target chip */
-	/* Send IPI */
+	/* Deassert INIT on the target CPU */
 	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
+}
 
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+{
+	unsigned long send_status = 0, accept_status = 0;
+	int num_starts, j, maxlvt = lapic_get_maxlvt();
+
+	send_init_sequence(phys_apicid);
 
 	mb();
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible
  2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
                   ` (5 preceding siblings ...)
  2023-06-15 20:33 ` [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
@ 2023-06-15 20:34 ` Thomas Gleixner
  2023-06-20 10:27   ` Borislav Petkov
                     ` (2 more replies)
  6 siblings, 3 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-06-15 20:34 UTC (permalink / raw)
  To: LKML
  Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
	Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj

Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.

So chose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which are not
using INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
V3: Renamed the function to smp_park_other_cpus_in_init() so it can
    be reused for crash eventually.
---
 arch/x86/include/asm/smp.h |    2 ++
 arch/x86/kernel/smp.c      |   39 ++++++++++++++++++++++++++++++++-------
 arch/x86/kernel/smpboot.c  |   19 +++++++++++++++++++
 3 files changed, 53 insertions(+), 7 deletions(-)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -139,6 +139,8 @@ void native_send_call_func_ipi(const str
 void native_send_call_func_single_ipi(int cpu);
 void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
+bool smp_park_other_cpus_in_init(void);
+
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
 
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -131,7 +131,7 @@ static int smp_stop_nmi_callback(unsigne
 }
 
 /*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
  */
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
@@ -172,13 +172,17 @@ static void native_stop_other_cpus(int w
 	 * 2) Wait for all other CPUs to report that they reached the
 	 *    HLT loop in stop_this_cpu()
 	 *
-	 * 3) If #2 timed out send an NMI to the CPUs which did not
-	 *    yet report
+	 * 3) If the system uses INIT/STARTUP for CPU bringup, then
+	 *    send all present CPUs an INIT vector, which brings them
+	 *    completely out of the way.
 	 *
-	 * 4) Wait for all other CPUs to report that they reached the
+	 * 4) If #3 is not possible and #2 timed out send an NMI to the
+	 *    CPUs which did not yet report
+	 *
+	 * 5) Wait for all other CPUs to report that they reached the
 	 *    HLT loop in stop_this_cpu()
 	 *
-	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * #4 can obviously race against a CPU reaching the HLT loop late.
 	 * That CPU will have reported already and the "have all CPUs
 	 * reached HLT" condition will be true despite the fact that the
 	 * other CPU is still handling the NMI. Again, there is no
@@ -194,7 +198,7 @@ static void native_stop_other_cpus(int w
 		/*
 		 * Don't wait longer than a second for IPI completion. The
 		 * wait request is not checked here because that would
-		 * prevent an NMI shutdown attempt in case that not all
+		 * prevent an NMI/INIT shutdown in case that not all
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
@@ -202,7 +206,27 @@ static void native_stop_other_cpus(int w
 			udelay(1);
 	}
 
-	/* if the REBOOT_VECTOR didn't work, try with the NMI */
+	/*
+	 * Park all other CPUs in INIT including "offline" CPUs, if
+	 * possible. That's a safe place where they can't resume execution
+	 * of HLT and then execute the HLT loop from overwritten text or
+	 * page tables.
+	 *
+	 * The only downside is a broadcast MCE, but up to the point where
+	 * the kexec() kernel brought all APs online again an MCE will just
+	 * make HLT resume and handle the MCE. The machine crashs and burns
+	 * due to overwritten text, page tables and data. So there is a
+	 * choice between fire and frying pan. The result is pretty much
+	 * the same. Chose frying pan until x86 provides a sane mechanism
+	 * to park a CPU.
+	 */
+	if (smp_park_other_cpus_in_init())
+		goto done;
+
+	/*
+	 * If park with INIT was not possible and the REBOOT_VECTOR didn't
+	 * take all secondary CPUs offline, try with the NMI.
+	 */
 	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
@@ -234,6 +258,7 @@ static void native_stop_other_cpus(int w
 			udelay(1);
 	}
 
+done:
 	local_irq_save(flags);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1465,6 +1465,25 @@ void arch_thaw_secondary_cpus_end(void)
 	cache_aps_init();
 }
 
+bool smp_park_other_cpus_in_init(void)
+{
+	unsigned int cpu, this_cpu = smp_processor_id();
+	unsigned int apicid;
+
+	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
+		return false;
+
+	for_each_present_cpu(cpu) {
+		if (cpu == this_cpu)
+			continue;
+		apicid = apic->cpu_present_to_apicid(cpu);
+		if (apicid == BAD_APICID)
+			continue;
+		send_init_sequence(apicid);
+	}
+	return true;
+}
+
 /*
  * Early setup to make printk work.
  */


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible
  2023-06-15 20:34 ` [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
@ 2023-06-20 10:27   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
  2023-07-03  3:44   ` [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait Baokun Li
  2 siblings, 0 replies; 37+ messages in thread
From: Borislav Petkov @ 2023-06-20 10:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
	Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
	Ashok Raj

On Thu, Jun 15, 2023 at 10:34:00PM +0200, Thomas Gleixner wrote:
> @@ -202,7 +206,27 @@ static void native_stop_other_cpus(int w
>  			udelay(1);
>  	}
>  
> -	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> +	/*
> +	 * Park all other CPUs in INIT including "offline" CPUs, if
> +	 * possible. That's a safe place where they can't resume execution
> +	 * of HLT and then execute the HLT loop from overwritten text or
> +	 * page tables.
> +	 *
> +	 * The only downside is a broadcast MCE, but up to the point where
> +	 * the kexec() kernel brought all APs online again an MCE will just
> +	 * make HLT resume and handle the MCE. The machine crashs and burns

"crashes"

With that

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: x86/core] x86/smp: Put CPUs into INIT on shutdown if possible
  2023-06-15 20:34 ` [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
  2023-06-20 10:27   ` Borislav Petkov
@ 2023-06-20 13:00   ` tip-bot2 for Thomas Gleixner
  2023-07-03  3:44   ` [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait Baokun Li
  2 siblings, 0 replies; 37+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-06-20 13:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Ashok Raj, Borislav Petkov (AMD), x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     45e34c8af58f23db4474e2bfe79183efec09a18b
Gitweb:        https://git.kernel.org/tip/45e34c8af58f23db4474e2bfe79183efec09a18b
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 15 Jun 2023 22:34:00 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 20 Jun 2023 14:51:47 +02:00

x86/smp: Put CPUs into INIT on shutdown if possible

Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.

So chose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which are not
using INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de

---
 arch/x86/include/asm/smp.h |  2 ++-
 arch/x86/kernel/smp.c      | 39 ++++++++++++++++++++++++++++++-------
 arch/x86/kernel/smpboot.c  | 19 ++++++++++++++++++-
 3 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index d4ce5cb..5906aa9 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -139,6 +139,8 @@ void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
 void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
+bool smp_park_other_cpus_in_init(void);
+
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 174d623..0076932 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -131,7 +131,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 }
 
 /*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
  */
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
@@ -172,13 +172,17 @@ static void native_stop_other_cpus(int wait)
 	 * 2) Wait for all other CPUs to report that they reached the
 	 *    HLT loop in stop_this_cpu()
 	 *
-	 * 3) If #2 timed out send an NMI to the CPUs which did not
-	 *    yet report
+	 * 3) If the system uses INIT/STARTUP for CPU bringup, then
+	 *    send all present CPUs an INIT vector, which brings them
+	 *    completely out of the way.
 	 *
-	 * 4) Wait for all other CPUs to report that they reached the
+	 * 4) If #3 is not possible and #2 timed out send an NMI to the
+	 *    CPUs which did not yet report
+	 *
+	 * 5) Wait for all other CPUs to report that they reached the
 	 *    HLT loop in stop_this_cpu()
 	 *
-	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * #4 can obviously race against a CPU reaching the HLT loop late.
 	 * That CPU will have reported already and the "have all CPUs
 	 * reached HLT" condition will be true despite the fact that the
 	 * other CPU is still handling the NMI. Again, there is no
@@ -194,7 +198,7 @@ static void native_stop_other_cpus(int wait)
 		/*
 		 * Don't wait longer than a second for IPI completion. The
 		 * wait request is not checked here because that would
-		 * prevent an NMI shutdown attempt in case that not all
+		 * prevent an NMI/INIT shutdown in case that not all
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
@@ -202,7 +206,27 @@ static void native_stop_other_cpus(int wait)
 			udelay(1);
 	}
 
-	/* if the REBOOT_VECTOR didn't work, try with the NMI */
+	/*
+	 * Park all other CPUs in INIT including "offline" CPUs, if
+	 * possible. That's a safe place where they can't resume execution
+	 * of HLT and then execute the HLT loop from overwritten text or
+	 * page tables.
+	 *
+	 * The only downside is a broadcast MCE, but up to the point where
+	 * the kexec() kernel brought all APs online again an MCE will just
+	 * make HLT resume and handle the MCE. The machine crashes and burns
+	 * due to overwritten text, page tables and data. So there is a
+	 * choice between fire and frying pan. The result is pretty much
+	 * the same. Chose frying pan until x86 provides a sane mechanism
+	 * to park a CPU.
+	 */
+	if (smp_park_other_cpus_in_init())
+		goto done;
+
+	/*
+	 * If park with INIT was not possible and the REBOOT_VECTOR didn't
+	 * take all secondary CPUs offline, try with the NMI.
+	 */
 	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
@@ -225,6 +249,7 @@ static void native_stop_other_cpus(int wait)
 			udelay(1);
 	}
 
+done:
 	local_irq_save(flags);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b403ead..4ee4339 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1465,6 +1465,25 @@ void arch_thaw_secondary_cpus_end(void)
 	cache_aps_init();
 }
 
+bool smp_park_other_cpus_in_init(void)
+{
+	unsigned int cpu, this_cpu = smp_processor_id();
+	unsigned int apicid;
+
+	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
+		return false;
+
+	for_each_present_cpu(cpu) {
+		if (cpu == this_cpu)
+			continue;
+		apicid = apic->cpu_present_to_apicid(cpu);
+		if (apicid == BAD_APICID)
+			continue;
+		send_init_sequence(apicid);
+	}
+	return true;
+}
+
 /*
  * Early setup to make printk work.
  */

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait
  2023-06-15 20:34 ` [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
  2023-06-20 10:27   ` Borislav Petkov
  2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
@ 2023-07-03  3:44   ` Baokun Li
  2023-07-05  8:59     ` Thomas Gleixner
  2 siblings, 1 reply; 37+ messages in thread
From: Baokun Li @ 2023-07-03  3:44 UTC (permalink / raw)
  To: tglx
  Cc: arjan, ashok.raj, ashok.raj, ebiederm, linux-kernel,
	mario.limonciello, thomas.lendacky, tony.luck, tonyb, x86,
	yangerkun

When I manually trigger panic in a qume x86 VM with

        `echo c > /proc/sysrq-trigger`,

  I find that the VM will probably reboot directly, but the 
PANIC_TIMEOUT is 0.
This prevents us from exporting the vmcore via panic, and even if we succeed
in panic exporting the vmcore, the processes in the vmcore are mostly
stop_this_cpu(). By dichotomizing we found the patch that introduced the
behavior change

    45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible"),

can anyone help to see what is happening?

Thanks!
-- 
With Best Regards,
Baokun Li
.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait
  2023-07-03  3:44   ` [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait Baokun Li
@ 2023-07-05  8:59     ` Thomas Gleixner
  2023-07-06  6:44       ` Baokun Li
  2023-07-07 13:49       ` [tip: x86/core] x86/smp: Don't send INIT to boot CPU tip-bot2 for Thomas Gleixner
  0 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2023-07-05  8:59 UTC (permalink / raw)
  To: Baokun Li
  Cc: arjan, ashok.raj, ashok.raj, ebiederm, linux-kernel,
	mario.limonciello, thomas.lendacky, tony.luck, tonyb, x86,
	yangerkun, Baoquan He, kexec

On Mon, Jul 03 2023 at 11:44, Baokun Li wrote:

> When I manually trigger panic in a qume x86 VM with
>
>         `echo c > /proc/sysrq-trigger`,
>
>   I find that the VM will probably reboot directly, but the 
> PANIC_TIMEOUT is 0.
> This prevents us from exporting the vmcore via panic, and even if we succeed
> in panic exporting the vmcore, the processes in the vmcore are mostly
> stop_this_cpu(). By dichotomizing we found the patch that introduced the
> behavior change
>
>     45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible"),

Bah, I missed that this is used by crash too. So if this happens to be
invoked on an AP, i.e. not on CPU 0, then the INIT will reset the
machine. Fix below.

Thanks,

        tglx
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ed2d51960a7d..e1aa2cd7734b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1348,6 +1348,14 @@ bool smp_park_other_cpus_in_init(void)
 	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
 		return false;
 
+	/*
+	 * If this is a crash stop which does not execute on the boot CPU,
+	 * then this cannot use the INIT mechanism because INIT to the boot
+	 * CPU will reset the machine.
+	 */
+	if (this_cpu)
+		return false;
+
 	for_each_present_cpu(cpu) {
 		if (cpu == this_cpu)
 			continue;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait
  2023-07-05  8:59     ` Thomas Gleixner
@ 2023-07-06  6:44       ` Baokun Li
  2023-07-07 10:18         ` Thomas Gleixner
  2023-07-07 13:49       ` [tip: x86/core] x86/smp: Don't send INIT to boot CPU tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 37+ messages in thread
From: Baokun Li @ 2023-07-06  6:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: arjan, ashok.raj, ashok.raj, ebiederm, linux-kernel,
	mario.limonciello, thomas.lendacky, tony.luck, tonyb, x86,
	yangerkun, Baoquan He, kexec, Baokun Li

On 2023/7/5 16:59, Thomas Gleixner wrote:
> On Mon, Jul 03 2023 at 11:44, Baokun Li wrote:
>
>> When I manually trigger panic in a qume x86 VM with
>>
>>          `echo c > /proc/sysrq-trigger`,
>>
>>    I find that the VM will probably reboot directly, but the
>> PANIC_TIMEOUT is 0.
>> This prevents us from exporting the vmcore via panic, and even if we succeed
>> in panic exporting the vmcore, the processes in the vmcore are mostly
>> stop_this_cpu(). By dichotomizing we found the patch that introduced the
>> behavior change
>>
>>      45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible"),
> Bah, I missed that this is used by crash too. So if this happens to be
> invoked on an AP, i.e. not on CPU 0, then the INIT will reset the
> machine. Fix below.
>
> Thanks,
>
>          tglx
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index ed2d51960a7d..e1aa2cd7734b 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1348,6 +1348,14 @@ bool smp_park_other_cpus_in_init(void)
>   	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
>   		return false;
>   
> +	/*
> +	 * If this is a crash stop which does not execute on the boot CPU,
> +	 * then this cannot use the INIT mechanism because INIT to the boot
> +	 * CPU will reset the machine.
> +	 */
> +	if (this_cpu)
> +		return false;
> +
>   	for_each_present_cpu(cpu) {
>   		if (cpu == this_cpu)
>   			continue;
This patch does fix the problem of rebooting at panic, but the exported 
stack
stays at stop_this_cpu() like below, instead of showing what the 
corresponding
process is doing as before.

PID: 681      TASK: ffff9ac2429d3080  CPU: 2    COMMAND: "fsstress"
  #0 [ffffb00200184fd0] stop_this_cpu at ffffffff89a4ffd8
  #1 [ffffb00200184fe8] __sysvec_reboot at ffffffff89a94213
  #2 [ffffb00200184ff0] sysvec_reboot at ffffffff8aee7491
--- <IRQ stack> ---
     RIP: 0000000000000010  RSP: 0000000000000018  RFLAGS: ffffb00200f8bd08
     RAX: ffff9ac256fda9d8  RBX: 0000000009973a85  RCX: ffff9ac256fda078
     RDX: ffff9ac24416e300  RSI: ffff9ac256fda9e0  RDI: ffffffffffffffff
     RBP: ffff9ac2443a5f88   R8: 0000000000000000   R9: ffff9ac2422eeea0
     R10: ffff9ac256fda9d8  R11: 0000000000549921  R12: ffff9ac2422eeea0
     R13: ffff9ac251cd23c8  R14: ffff9ac24269a800  R15: ffff9ac251cd2150
     ORIG_RAX: ffffffff8a1719e4  CS: 0206  SS: ffffffff8a1719c8
bt: WARNING: possibly bogus exception frame

Do you know how this happened? I would be grateful if you could fix it.

Thanks!
-- 
With Best Regards,
Baokun Li
.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait
  2023-07-06  6:44       ` Baokun Li
@ 2023-07-07 10:18         ` Thomas Gleixner
  2023-07-07 12:40           ` Baokun Li
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2023-07-07 10:18 UTC (permalink / raw)
  To: Baokun Li
  Cc: arjan, ashok.raj, ashok.raj, ebiederm, linux-kernel,
	mario.limonciello, thomas.lendacky, tony.luck, tonyb, x86,
	yangerkun, Baoquan He, kexec, Baokun Li

On Thu, Jul 06 2023 at 14:44, Baokun Li wrote:
> On 2023/7/5 16:59, Thomas Gleixner wrote:
>> +	/*
>> +	 * If this is a crash stop which does not execute on the boot CPU,
>> +	 * then this cannot use the INIT mechanism because INIT to the boot
>> +	 * CPU will reset the machine.
>> +	 */
>> +	if (this_cpu)
>> +		return false;

> This patch does fix the problem of rebooting at panic, but the
> exported stack stays at stop_this_cpu() like below, instead of showing
> what the corresponding process is doing as before.
>
> PID: 681      TASK: ffff9ac2429d3080  CPU: 2    COMMAND: "fsstress"
>   #0 [ffffb00200184fd0] stop_this_cpu at ffffffff89a4ffd8
>   #1 [ffffb00200184fe8] __sysvec_reboot at ffffffff89a94213
>   #2 [ffffb00200184ff0] sysvec_reboot at ffffffff8aee7491
> --- <IRQ stack> ---
>      RIP: 0000000000000010  RSP: 0000000000000018  RFLAGS: ffffb00200f8bd08
>      RAX: ffff9ac256fda9d8  RBX: 0000000009973a85  RCX: ffff9ac256fda078
>      RDX: ffff9ac24416e300  RSI: ffff9ac256fda9e0  RDI: ffffffffffffffff
>      RBP: ffff9ac2443a5f88   R8: 0000000000000000   R9: ffff9ac2422eeea0
>      R10: ffff9ac256fda9d8  R11: 0000000000549921  R12: ffff9ac2422eeea0
>      R13: ffff9ac251cd23c8  R14: ffff9ac24269a800  R15: ffff9ac251cd2150
>      ORIG_RAX: ffffffff8a1719e4  CS: 0206  SS: ffffffff8a1719c8
> bt: WARNING: possibly bogus exception frame
>
> Do you know how this happened? I would be grateful if you could fix it.

No, I don't. But there is clearly a hint:

> bt: WARNING: possibly bogus exception frame

So the exception frame seems to be corrupted. I have no idea why.

The question is, whether this goes away when you revert that commit or not.
I can't oracle that out from your report.

Can you please revert 45e34c8af58f on top of Linus tree and verify that
it makes the issue go away?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait
  2023-07-07 10:18         ` Thomas Gleixner
@ 2023-07-07 12:40           ` Baokun Li
  0 siblings, 0 replies; 37+ messages in thread
From: Baokun Li @ 2023-07-07 12:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: arjan, ashok.raj, ashok.raj, ebiederm, linux-kernel,
	mario.limonciello, thomas.lendacky, tony.luck, tonyb, x86,
	yangerkun, Baoquan He, kexec, Baokun Li

On 2023/7/7 18:18, Thomas Gleixner wrote:
> On Thu, Jul 06 2023 at 14:44, Baokun Li wrote:
>> On 2023/7/5 16:59, Thomas Gleixner wrote:
>>> +	/*
>>> +	 * If this is a crash stop which does not execute on the boot CPU,
>>> +	 * then this cannot use the INIT mechanism because INIT to the boot
>>> +	 * CPU will reset the machine.
>>> +	 */
>>> +	if (this_cpu)
>>> +		return false;

This does solve the problem of x86 VMs not waiting when they panic, so

Reported-and-tested-by: Baokun Li <libaokun1@huawei.com>

>> This patch does fix the problem of rebooting at panic, but the
>> exported stack stays at stop_this_cpu() like below, instead of showing
>> what the corresponding process is doing as before.
>>
>> PID: 681      TASK: ffff9ac2429d3080  CPU: 2    COMMAND: "fsstress"
>>    #0 [ffffb00200184fd0] stop_this_cpu at ffffffff89a4ffd8
>>    #1 [ffffb00200184fe8] __sysvec_reboot at ffffffff89a94213
>>    #2 [ffffb00200184ff0] sysvec_reboot at ffffffff8aee7491
>> --- <IRQ stack> ---
>>       RIP: 0000000000000010  RSP: 0000000000000018  RFLAGS: ffffb00200f8bd08
>>       RAX: ffff9ac256fda9d8  RBX: 0000000009973a85  RCX: ffff9ac256fda078
>>       RDX: ffff9ac24416e300  RSI: ffff9ac256fda9e0  RDI: ffffffffffffffff
>>       RBP: ffff9ac2443a5f88   R8: 0000000000000000   R9: ffff9ac2422eeea0
>>       R10: ffff9ac256fda9d8  R11: 0000000000549921  R12: ffff9ac2422eeea0
>>       R13: ffff9ac251cd23c8  R14: ffff9ac24269a800  R15: ffff9ac251cd2150
>>       ORIG_RAX: ffffffff8a1719e4  CS: 0206  SS: ffffffff8a1719c8
>> bt: WARNING: possibly bogus exception frame
>>
>> Do you know how this happened? I would be grateful if you could fix it.
> No, I don't. But there is clearly a hint:
>
>> bt: WARNING: possibly bogus exception frame
> So the exception frame seems to be corrupted. I have no idea why.
>
> The question is, whether this goes away when you revert that commit or not.
> I can't oracle that out from your report.
>
> Can you please revert 45e34c8af58f on top of Linus tree and verify that
> it makes the issue go away?
>
> Thanks,
>
>          tglx
Yes, the stop_this_cpu() issue persisted after I reverted 45e34c8af58f 
and it
has nothing to do with your patch, I will try to bisect to find out 
which patch
introduced the issue.

Thank you very much for helping locate and rectify the problem that the x86
VM panic does not wait!

Cheers!
-- 
With Best Regards,
Baokun Li
.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: x86/core] x86/smp: Don't send INIT to boot CPU
  2023-07-05  8:59     ` Thomas Gleixner
  2023-07-06  6:44       ` Baokun Li
@ 2023-07-07 13:49       ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-07-07 13:49 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Baokun Li, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     b1472a60a584694875a05cf8bcba8bdf0dc1cd3a
Gitweb:        https://git.kernel.org/tip/b1472a60a584694875a05cf8bcba8bdf0dc1cd3a
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 05 Jul 2023 10:59:23 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 07 Jul 2023 15:42:31 +02:00

x86/smp: Don't send INIT to boot CPU

Parking CPUs in INIT works well, except for the crash case when the CPU
which invokes smp_park_other_cpus_in_init() is not the boot CPU. Sending
INIT to the boot CPU resets the whole machine.

Prevent this by validating that this runs on the boot CPU. If not fall back
and let CPUs hang in HLT.

Fixes: 45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible")
Reported-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/87ttui91jo.ffs@tglx
---
 arch/x86/kernel/smpboot.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 4ee4339..7417d9b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1473,6 +1473,14 @@ bool smp_park_other_cpus_in_init(void)
 	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
 		return false;
 
+	/*
+	 * If this is a crash stop which does not execute on the boot CPU,
+	 * then this cannot use the INIT mechanism because INIT to the boot
+	 * CPU will reset the machine.
+	 */
+	if (this_cpu)
+		return false;
+
 	for_each_present_cpu(cpu) {
 		if (cpu == this_cpu)
 			continue;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-07-07 13:49 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-15 20:33 [patch v3 0/7] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
2023-06-15 20:33 ` [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
2023-06-16  1:58   ` Ashok Raj
2023-06-16  7:53     ` Thomas Gleixner
2023-06-16 14:13       ` Ashok Raj
2023-06-16 18:01         ` Thomas Gleixner
2023-06-16 20:57           ` Ashok Raj
2023-06-19 17:51             ` Ashok Raj
2023-06-20  8:09           ` Borislav Petkov
2023-06-16 16:36     ` Tony Battersby
2023-06-15 20:33 ` [patch v3 2/7] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
2023-06-19 17:02   ` Limonciello, Mario
2023-06-19 17:15     ` Thomas Gleixner
2023-06-20  8:20   ` Borislav Petkov
2023-06-15 20:33 ` [patch v3 3/7] x86/smp: Remove pointless wmb()s from native_stop_other_cpus() Thomas Gleixner
2023-06-20  8:47   ` Borislav Petkov
2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
2023-06-15 20:33 ` [patch v3 4/7] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
2023-06-20  9:01   ` Borislav Petkov
2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
2023-06-15 20:33 ` [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
2023-06-20  9:23   ` Borislav Petkov
2023-06-20 12:25     ` Thomas Gleixner
2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
2023-06-15 20:33 ` [patch v3 6/7] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
2023-06-20  9:29   ` Borislav Petkov
2023-06-20 12:30     ` Thomas Gleixner
2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
2023-06-15 20:34 ` [patch v3 7/7] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
2023-06-20 10:27   ` Borislav Petkov
2023-06-20 13:00   ` [tip: x86/core] " tip-bot2 for Thomas Gleixner
2023-07-03  3:44   ` [BUG REPORT] Triggering a panic in an x86 virtual machine does not wait Baokun Li
2023-07-05  8:59     ` Thomas Gleixner
2023-07-06  6:44       ` Baokun Li
2023-07-07 10:18         ` Thomas Gleixner
2023-07-07 12:40           ` Baokun Li
2023-07-07 13:49       ` [tip: x86/core] x86/smp: Don't send INIT to boot CPU tip-bot2 for Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).