* [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles
@ 2023-06-13 12:17 Thomas Gleixner
2023-06-13 12:17 ` [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
` (7 more replies)
0 siblings, 8 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:17 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman
This is the second version of the kexec() vs. mwait_play_dead()
series. Version 1 can be found here:
https://lore.kernel.org/r/20230603193439.502645149@linutronix.de
Apart from picking up the correction of the original patch 5, this version
also integrates a fix for intermittent reboot hangs reported by Tony:
https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
which touches the same area. While halfway independent, the changes are
included here as they would otherwise conflict.
So the two issues are:
1) stop_other_cpus() continues after observing num_online_cpus() == 1.
This is problematic because the to-be-stopped CPUs clear their online
bit first and only then eventually invoke WBINVD, which can take a long
time. There seems to be an interaction between the WBINVD and the
reboot mechanics as this intermittently results in hangs.
2) The kexec() kernel can overwrite the memory locations which "offline"
CPUs are monitoring. Such a write brings them out of MWAIT and they
resume execution on overwritten text, page tables, data and stacks,
resulting in triple faults.
Cure them by:
#1 Synchronizing stop_other_cpus() with an atomic variable which is
decremented in stop_this_cpu() _after_ WBINVD completes.
#2 Bringing offline CPUs out of MWAIT and moving them into HLT before
starting the kexec() kernel. Optionally an INIT IPI is sent so they
go back into the wait-for-startup state.
The series is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/kexec
Thanks,
tglx
---
include/asm/cpu.h | 2
include/asm/smp.h | 4 +
kernel/process.c | 16 +++++
kernel/smp.c | 79 ++++++++++++++++++----------
kernel/smpboot.c | 149 ++++++++++++++++++++++++++++++++++++++++--------------
5 files changed, 183 insertions(+), 67 deletions(-)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
@ 2023-06-13 12:17 ` Thomas Gleixner
2023-06-14 19:42 ` Ashok Raj
2023-06-13 12:17 ` [patch V2 2/8] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
` (6 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:17 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
CPU0 CPU1
stop_other_cpus()
send_IPIs(REBOOT); stop_this_cpu()
while (num_online_cpus() > 1); set_online(false);
proceed... -> hang
wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds,
which causes the system to hang.
Make this more robust by adding a counter which is set to the number of
online CPUs before sending the IPIs and decremented in stop_this_cpu()
after the WBINVD completed. Check for that counter in stop_other_cpus()
instead of watching num_online_cpus().
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
arch/x86/include/asm/cpu.h | 2 ++
arch/x86/kernel/process.c | 10 ++++++++++
arch/x86/kernel/smp.c | 15 ++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
+extern atomic_t stop_cpus_count;
+
#endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
}
#endif
+atomic_t stop_cpus_count;
+
void __noreturn stop_this_cpu(void *dummy)
{
local_irq_disable();
@@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
*/
if (cpuid_eax(0x8000001f) & BIT(0))
native_wbinvd();
+
+ /*
+ * native_stop_other_cpus() will write to @stop_cpus_count after
+ * observing that it went down to zero, which will invalidate the
+ * cacheline on this CPU.
+ */
+ atomic_dec(&stop_cpus_count);
+
for (;;) {
/*
* Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
#include <asm/mmu_context.h>
#include <asm/proto.h>
#include <asm/apic.h>
+#include <asm/cpu.h>
#include <asm/idtentry.h>
#include <asm/nmi.h>
#include <asm/mce.h>
@@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
return;
+ atomic_set(&stop_cpus_count, num_online_cpus() - 1);
+
/* sync above data before sending IRQ */
wmb();
@@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
* CPUs reach shutdown state.
*/
timeout = USEC_PER_SEC;
- while (num_online_cpus() > 1 && timeout--)
+ while (atomic_read(&stop_cpus_count) > 0 && timeout--)
udelay(1);
}
/* if the REBOOT_VECTOR didn't work, try with the NMI */
- if (num_online_cpus() > 1) {
+ if (atomic_read(&stop_cpus_count) > 0) {
/*
* If NMI IPI is enabled, try to register the stop handler
* and send the IPI. In any case try to wait for the other
@@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
* one or more CPUs do not reach shutdown state.
*/
timeout = USEC_PER_MSEC * 10;
- while (num_online_cpus() > 1 && (wait || timeout--))
+ while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
udelay(1);
}
@@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
local_irq_restore(flags);
+
+ /*
+ * Ensure that the cache line is invalidated on the other CPUs. See
+ * comment vs. SME in stop_this_cpu().
+ */
+ atomic_set(&stop_cpus_count, INT_MAX);
}
/*
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 2/8] x86/smp: Dont access non-existing CPUID leaf
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
2023-06-13 12:17 ` [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
@ 2023-06-13 12:17 ` Thomas Gleixner
2023-06-13 12:17 ` [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
` (5 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:17 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman
From: Tony Battersby <tonyb@cybernetics.com>
stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. CPUs
return the content of the highest supported leaf when a non-existing leaf
is read. So the result of the test is a lottery, except on AMD CPUs which
support that leaf.
While harmless, it's incorrect and causes the conditional wbinvd() to be
issued where it is not required.
Check whether the leaf is supported before reading it.
[ tglx: Adjusted changelog ]
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
arch/x86/kernel/process.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -763,13 +763,15 @@ atomic_t stop_cpus_count;
void __noreturn stop_this_cpu(void *dummy)
{
+ struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
+
local_irq_disable();
/*
* Remove this CPU:
*/
set_cpu_online(smp_processor_id(), false);
disable_local_APIC();
- mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+ mcheck_cpu_clear(c);
/*
* Use wbinvd on processors that support SME. This provides support
@@ -783,7 +785,7 @@ void __noreturn stop_this_cpu(void *dumm
* Test the CPUID bit directly because the machine might've cleared
* X86_FEATURE_SME due to cmdline options.
*/
- if (cpuid_eax(0x8000001f) & BIT(0))
+ if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
native_wbinvd();
/*
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus()
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
2023-06-13 12:17 ` [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
2023-06-13 12:17 ` [patch V2 2/8] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
@ 2023-06-13 12:17 ` Thomas Gleixner
2023-06-15 8:58 ` Peter Zijlstra
2023-06-13 12:17 ` [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
` (4 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:17 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj
The wmb() after the successful atomic_cmpxchg() is complete voodoo, along
with the comment stating "sync above data before sending IRQ".
There is no "above" data except for the atomic_t stopping_cpu which has
just been acquired. The reboot IPI handler does not check any data and
unconditionally disables the CPU.
Remove this cargo cult barrier.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
arch/x86/kernel/smp.c | 3 ---
1 file changed, 3 deletions(-)
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -174,9 +174,6 @@ static void native_stop_other_cpus(int w
atomic_set(&stop_cpus_count, num_online_cpus() - 1);
- /* sync above data before sending IRQ */
- wmb();
-
apic_send_IPI_allbutself(REBOOT_VECTOR);
/*
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
` (2 preceding siblings ...)
2023-06-13 12:17 ` [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
@ 2023-06-13 12:17 ` Thomas Gleixner
2023-06-15 9:02 ` Peter Zijlstra
2023-06-13 12:18 ` [patch V2 5/8] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
` (3 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:17 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj
There is no reason to acquire the stopping_cpu atomic_t only when there is
more than one online CPU.
Make it unconditional to prepare for fixing the kexec() problem when there
are present but "offline" CPUs which play dead in mwait_play_dead().
They need to be brought out of mwait before kexec() as kexec() can
overwrite text, pagetables, stacks and the monitored cacheline of the
original kernel. The latter causes mwait to resume execution which
obviously causes havoc on the kexec kernel which results usually in triple
faults.
Move the acquire out of the num_online_cpus() > 1 condition so the upcoming
'kick mwait' fixup is properly protected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
arch/x86/kernel/smp.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -153,6 +153,12 @@ static void native_stop_other_cpus(int w
if (reboot_force)
return;
+ /* Only proceed if this is the first CPU to reach this code */
+ if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
+ return;
+
+ atomic_set(&stop_cpus_count, num_online_cpus() - 1);
+
/*
* Use an own vector here because smp_call_function
* does lots of things not suitable in a panic situation.
@@ -167,13 +173,7 @@ static void native_stop_other_cpus(int w
* code. By syncing, we give the cpus up to one second to
* finish their work before we force them off with the NMI.
*/
- if (num_online_cpus() > 1) {
- /* did someone beat us here? */
- if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
- return;
-
- atomic_set(&stop_cpus_count, num_online_cpus() - 1);
-
+ if (atomic_read(&stop_cpus_count) > 0) {
apic_send_IPI_allbutself(REBOOT_VECTOR);
/*
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 5/8] x86/smp: Use dedicated cache-line for mwait_play_dead()
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
` (3 preceding siblings ...)
2023-06-13 12:17 ` [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
@ 2023-06-13 12:18 ` Thomas Gleixner
2023-06-13 12:18 ` [patch V2 6/8] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
` (2 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:18 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj
Monitoring idletask::thread_info::flags in mwait_play_dead() was an
obvious choice, as all that is needed is a cache line which is not written
to by other CPUs.
But there is a use case where a "dead" CPU needs to be brought out of that
mwait(): kexec().
The CPU needs to be brought out of mwait before kexec() as kexec() can
overwrite text, pagetables, stacks and the monitored cacheline of the
original kernel. The latter causes mwait to resume execution which
obviously causes havoc on the kexec kernel which results usually in triple
faults.
Use a dedicated per CPU storage to prepare for that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
arch/x86/kernel/smpboot.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
EXPORT_PER_CPU_SYMBOL(cpu_info);
+struct mwait_cpu_dead {
+ unsigned int control;
+ unsigned int status;
+};
+
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
+
/* Logical package management. We might want to allocate that dynamically */
unsigned int __max_logical_packages __read_mostly;
EXPORT_SYMBOL(__max_logical_packages);
@@ -1758,10 +1769,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0);
*/
static inline void mwait_play_dead(void)
{
+ struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
unsigned int eax, ebx, ecx, edx;
unsigned int highest_cstate = 0;
unsigned int highest_subcstate = 0;
- void *mwait_ptr;
int i;
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
@@ -1796,13 +1807,6 @@ static inline void mwait_play_dead(void)
(highest_subcstate - 1);
}
- /*
- * This should be a memory location in a cache line which is
- * unlikely to be touched by other processors. The actual
- * content is immaterial as it is not actually modified in any way.
- */
- mwait_ptr = ¤t_thread_info()->flags;
-
wbinvd();
while (1) {
@@ -1814,9 +1818,9 @@ static inline void mwait_play_dead(void)
* case where we return around the loop.
*/
mb();
- clflush(mwait_ptr);
+ clflush(md);
mb();
- __monitor(mwait_ptr, 0, 0);
+ __monitor(md, 0, 0);
mb();
__mwait(eax, 0);
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 6/8] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
` (4 preceding siblings ...)
2023-06-13 12:18 ` [patch V2 5/8] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
@ 2023-06-13 12:18 ` Thomas Gleixner
2023-06-13 12:18 ` [patch V2 7/8] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
2023-06-13 12:18 ` [patch V2 8/8] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
7 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:18 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj
TLDR: It's a mess.
When kexec() is executed on a system with "offline" CPUs, which are parked
in mwait_play_dead(), it can end up in a triple fault during the bootup of
the kexec kernel or cause hard-to-diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and the CPU
ends up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernel's data.
Cure this by bringing the offline CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offline CPUs to issue
HLT, which does not have the MWAIT problem.
That does not help if a stray NMI, MCE or SMI hits the offline CPUs, as
those make them come out of HLT.
A follow up change will put them into INIT, which protects at least against
NMI and SMI.
Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
arch/x86/include/asm/smp.h | 2 +
arch/x86/kernel/smp.c | 23 ++++++++---------
arch/x86/kernel/smpboot.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 12 deletions(-)
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
int wbinvd_on_all_cpus(void);
void cond_wakeup_cpu0(void);
+void smp_kick_mwait_play_dead(void);
+
void native_smp_send_reschedule(int cpu);
void native_send_call_func_ipi(const struct cpumask *mask);
void native_send_call_func_single_ipi(int cpu);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/kexec.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -157,21 +158,19 @@ static void native_stop_other_cpus(int w
if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
return;
- atomic_set(&stop_cpus_count, num_online_cpus() - 1);
+ /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+ if (kexec_in_progress)
+ smp_kick_mwait_play_dead();
- /*
- * Use an own vector here because smp_call_function
- * does lots of things not suitable in a panic situation.
- */
+ atomic_set(&stop_cpus_count, num_online_cpus() - 1);
/*
- * We start by using the REBOOT_VECTOR irq.
- * The irq is treated as a sync point to allow critical
- * regions of code on other cpus to release their spin locks
- * and re-enable irqs. Jumping straight to an NMI might
- * accidentally cause deadlocks with further shutdown/panic
- * code. By syncing, we give the cpus up to one second to
- * finish their work before we force them off with the NMI.
+ * Start by using the REBOOT_VECTOR. That acts as a sync point to
+ * allow critical regions of code on other cpus to leave their
+ * critical regions. Jumping straight to an NMI might accidentally
+ * cause deadlocks with further shutdown code. This gives the CPUs
+ * up to one second to finish their work before forcing them off
+ * with the NMI.
*/
if (atomic_read(&stop_cpus_count) > 0) {
apic_send_IPI_allbutself(REBOOT_VECTOR);
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@
#include <linux/tboot.h>
#include <linux/gfp.h>
#include <linux/cpuidle.h>
+#include <linux/kexec.h>
#include <linux/numa.h>
#include <linux/pgtable.h>
#include <linux/overflow.h>
@@ -106,6 +107,9 @@ struct mwait_cpu_dead {
unsigned int status;
};
+#define CPUDEAD_MWAIT_WAIT 0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEAD
+
/*
* Cache line aligned data for mwait_play_dead(). Separate on purpose so
* that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@ static void smp_callin(void)
{
int cpuid;
+ /* Mop up eventual mwait_play_dead() wreckage */
+ this_cpu_write(mwait_cpu_dead.status, 0);
+ this_cpu_write(mwait_cpu_dead.control, 0);
+
/*
* If waken up by an INIT in an 82489DX configuration
* cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
(highest_subcstate - 1);
}
+ /* Set up state for the kexec() hack below */
+ md->status = CPUDEAD_MWAIT_WAIT;
+ md->control = CPUDEAD_MWAIT_WAIT;
+
wbinvd();
while (1) {
@@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
mb();
__mwait(eax, 0);
+ if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+ /*
+ * Kexec is about to happen. Don't go back into mwait() as
+ * the kexec kernel might overwrite text and data including
+ * page tables and stack. So mwait() would resume when the
+ * monitor cache line is written to and then the CPU goes
+ * south due to overwritten text, page tables and stack.
+ *
+ * Note: This does _NOT_ protect against a stray MCE, NMI,
+ * SMI. They will resume execution at the instruction
+ * following the HLT instruction and run into the problem
+ * which this is trying to prevent.
+ */
+ WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+ while(1)
+ native_halt();
+ }
+
cond_wakeup_cpu0();
}
}
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+ u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+ struct mwait_cpu_dead *md;
+ unsigned int cpu, i;
+
+ for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+ md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+ /* Does it sit in mwait_play_dead() ? */
+ if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+ continue;
+
+ /* Wait maximal 5ms */
+ for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+ /* Bring it out of mwait */
+ WRITE_ONCE(md->control, newstate);
+ udelay(5);
+ }
+
+ if (READ_ONCE(md->status) != newstate)
+ pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
+ }
+}
+
void __noreturn hlt_play_dead(void)
{
if (__this_cpu_read(cpu_info.x86) >= 4)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 7/8] x86/smp: Split sending INIT IPI out into a helper function
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
` (5 preceding siblings ...)
2023-06-13 12:18 ` [patch V2 6/8] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
@ 2023-06-13 12:18 ` Thomas Gleixner
2023-06-13 12:18 ` [patch V2 8/8] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
7 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:18 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj
Putting CPUs into INIT is a safer way to park them during kexec().
Split the INIT assert/deassert sequence out so it can be reused.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
V2: Fix rebase screwup
---
arch/x86/kernel/smpboot.c | 49 ++++++++++++++++++----------------------------
1 file changed, 20 insertions(+), 29 deletions(-)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -853,47 +853,38 @@ wakeup_secondary_cpu_via_nmi(int apicid,
return (send_status | accept_status);
}
-static int
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+static void send_init_sequence(int phys_apicid)
{
- unsigned long send_status = 0, accept_status = 0;
- int maxlvt, num_starts, j;
-
- maxlvt = lapic_get_maxlvt();
+ int maxlvt = lapic_get_maxlvt();
- /*
- * Be paranoid about clearing APIC errors.
- */
+ /* Be paranoid about clearing APIC errors. */
if (APIC_INTEGRATED(boot_cpu_apic_version)) {
- if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
+ /* Due to the Pentium erratum 3AP. */
+ if (maxlvt > 3)
apic_write(APIC_ESR, 0);
apic_read(APIC_ESR);
}
- pr_debug("Asserting INIT\n");
-
- /*
- * Turn INIT on target chip
- */
- /*
- * Send IPI
- */
- apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
- phys_apicid);
-
- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ /* Assert INIT on the target CPU */
+ apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+ safe_apic_wait_icr_idle();
udelay(init_udelay);
- pr_debug("Deasserting INIT\n");
-
- /* Target chip */
- /* Send IPI */
+ /* Deassert INIT on the target CPU */
apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+ safe_apic_wait_icr_idle();
+}
+
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+{
+ unsigned long send_status = 0, accept_status = 0;
+ int num_starts, j, maxlvt = lapic_get_maxlvt();
- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ send_init_sequence(phys_apicid);
mb();
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch V2 8/8] x86/smp: Put CPUs into INIT on shutdown if possible
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
` (6 preceding siblings ...)
2023-06-13 12:18 ` [patch V2 7/8] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
@ 2023-06-13 12:18 ` Thomas Gleixner
7 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-13 12:18 UTC (permalink / raw)
To: LKML
Cc: x86, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj,
Tony Luck, Arjan van de Veen, Eric Biederman, Ashok Raj
Parking CPUs in a HLT loop is not completely safe vs. kexec(), as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.
Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.
A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.
So choose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which do not
use INIT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
arch/x86/include/asm/smp.h | 2 ++
arch/x86/kernel/smp.c | 38 +++++++++++++++++++++++++++++---------
arch/x86/kernel/smpboot.c | 19 +++++++++++++++++++
3 files changed, 50 insertions(+), 9 deletions(-)
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -139,6 +139,8 @@ void native_send_call_func_ipi(const str
void native_send_call_func_single_ipi(int cpu);
void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
+bool smp_park_nonboot_cpus_in_init(void);
+
void smp_store_boot_cpu_info(void);
void smp_store_cpu_info(int id);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -131,7 +131,7 @@ static int smp_stop_nmi_callback(unsigne
}
/*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
*/
DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
{
@@ -148,8 +148,7 @@ static int register_stop_handler(void)
static void native_stop_other_cpus(int wait)
{
- unsigned long flags;
- unsigned long timeout;
+ unsigned long flags, timeout;
if (reboot_force)
return;
@@ -167,10 +166,10 @@ static void native_stop_other_cpus(int w
/*
* Start by using the REBOOT_VECTOR. That acts as a sync point to
* allow critical regions of code on other cpus to leave their
- * critical regions. Jumping straight to an NMI might accidentally
- * cause deadlocks with further shutdown code. This gives the CPUs
- * up to one second to finish their work before forcing them off
- * with the NMI.
+ * critical regions. Jumping straight to NMI or INIT might
+ * accidentally cause deadlocks with further shutdown code. This
+ * gives the CPUs up to one second to finish their work before
+ * forcing them off with the NMI or INIT.
*/
if (atomic_read(&stop_cpus_count) > 0) {
apic_send_IPI_allbutself(REBOOT_VECTOR);
@@ -178,7 +177,7 @@ static void native_stop_other_cpus(int w
/*
* Don't wait longer than a second for IPI completion. The
* wait request is not checked here because that would
- * prevent an NMI shutdown attempt in case that not all
+ * prevent an NMI/INIT shutdown in case that not all
* CPUs reach shutdown state.
*/
timeout = USEC_PER_SEC;
@@ -186,7 +185,27 @@ static void native_stop_other_cpus(int w
udelay(1);
}
- /* if the REBOOT_VECTOR didn't work, try with the NMI */
+ /*
+ * Park all nonboot CPUs in INIT including offline CPUs, if
+ * possible. That's a safe place where they can't resume execution
+ * of HLT and then execute the HLT loop from overwritten text or
+ * page tables.
+ *
+ * The only downside is a broadcast MCE, but up to the point where
+ * the kexec() kernel brought all APs online again an MCE will just
+ * make HLT resume and handle the MCE. The machine crashs and burns
+ * due to overwritten text, page tables and data. So there is a
+ * choice between fire and frying pan. The result is pretty much
+ * the same. Chose frying pan until x86 provides a sane mechanism
+ * to park a CPU.
+ */
+ if (smp_park_nonboot_cpus_in_init())
+ goto done;
+
+ /*
+ * If park with INIT was not possible and the REBOOT_VECTOR didn't
+ * take all secondary CPUs offline, try with the NMI.
+ */
if (atomic_read(&stop_cpus_count) > 0) {
/*
* If NMI IPI is enabled, try to register the stop handler
@@ -211,6 +230,7 @@ static void native_stop_other_cpus(int w
udelay(1);
}
+done:
local_irq_save(flags);
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1465,6 +1465,25 @@ void arch_thaw_secondary_cpus_end(void)
cache_aps_init();
}
+bool smp_park_nonboot_cpus_in_init(void)
+{
+ unsigned int cpu, this_cpu = smp_processor_id();
+ unsigned int apicid;
+
+ if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
+ return false;
+
+ for_each_present_cpu(cpu) {
+ if (cpu == this_cpu)
+ continue;
+ apicid = apic->cpu_present_to_apicid(cpu);
+ if (apicid == BAD_APICID)
+ continue;
+ send_init_sequence(apicid);
+ }
+ return true;
+}
+
/*
* Early setup to make printk work.
*/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust
2023-06-13 12:17 ` [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
@ 2023-06-14 19:42 ` Ashok Raj
2023-06-14 19:53 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Ashok Raj @ 2023-06-14 19:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
Hi Thomas,
On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
>
> That wbindv() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
>
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
>
> CPU0 CPU1
>
> stop_other_cpus()
> send_IPIs(REBOOT); stop_this_cpu()
> while (num_online_cpus() > 1); set_online(false);
> proceed... -> hang
> wbinvd()
>
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.
Is this situation similar to what happened with the unexpected wakeups from
mwait_play_dead()?
i.e. the wbinvd() takes a while, and when CPU0 moves ahead, the previous
kernel marches past the wbinvd() instruction since we didn't wait to ensure
this has indeed completed?
native_machine_halt()
{
machine_shutdown()->stop_other_cpus()
stop_this_cpu();<---- Unbalanced atomic_dec()?
}
But the last stop_this_cpu() in native_machine_halt() would
make the count go negative? Maybe that's OK since no one is waiting for it
to go zero at that point?
>
> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
>
> Make this more robust by adding a counter which is set to the number of
> online CPUs before sending the IPIs and decremented in stop_this_cpu()
> after the WBINVD completed. Check for that counter in stop_other_cpus()
> instead of watching num_online_cpus().
>
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@cybernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
> arch/x86/include/asm/cpu.h | 2 ++
> arch/x86/kernel/process.c | 10 ++++++++++
> arch/x86/kernel/smp.c | 15 ++++++++++++---
> 3 files changed, 24 insertions(+), 3 deletions(-)
>
>
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
> int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
> int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
>
> +extern atomic_t stop_cpus_count;
> +
> #endif /* _ASM_X86_CPU_H */
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
> }
> #endif
>
> +atomic_t stop_cpus_count;
> +
> void __noreturn stop_this_cpu(void *dummy)
> {
> local_irq_disable();
> @@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
> */
> if (cpuid_eax(0x8000001f) & BIT(0))
> native_wbinvd();
> +
> + /*
> + * native_stop_other_cpus() will write to @stop_cpus_count after
> + * observing that it went down to zero, which will invalidate the
> + * cacheline on this CPU.
> + */
> + atomic_dec(&stop_cpus_count);
> +
> for (;;) {
> /*
> * Use native_halt() so that memory contents don't change
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
> #include <asm/mmu_context.h>
> #include <asm/proto.h>
> #include <asm/apic.h>
> +#include <asm/cpu.h>
> #include <asm/idtentry.h>
> #include <asm/nmi.h>
> #include <asm/mce.h>
> @@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
> if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
> return;
>
> + atomic_set(&stop_cpus_count, num_online_cpus() - 1);
> +
> /* sync above data before sending IRQ */
> wmb();
>
> @@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
> * CPUs reach shutdown state.
> */
> timeout = USEC_PER_SEC;
> - while (num_online_cpus() > 1 && timeout--)
> + while (atomic_read(&stop_cpus_count) > 0 && timeout--)
> udelay(1);
> }
>
> /* if the REBOOT_VECTOR didn't work, try with the NMI */
> - if (num_online_cpus() > 1) {
> + if (atomic_read(&stop_cpus_count) > 0) {
> /*
> * If NMI IPI is enabled, try to register the stop handler
> * and send the IPI. In any case try to wait for the other
> @@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
> * one or more CPUs do not reach shutdown state.
> */
> timeout = USEC_PER_MSEC * 10;
> - while (num_online_cpus() > 1 && (wait || timeout--))
> + while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
> udelay(1);
> }
If we go down the INIT path, life is less complicated..
After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
Won't this throw off the atomic_dec() count, since CPUs already in hlt() will
also take the NMI, correct? I'm not sure if this is problematic.
Or should we reinitialize stop_cpus_count before the NMI hurrah?
>
> @@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
> disable_local_APIC();
> mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
> local_irq_restore(flags);
> +
> + /*
> + * Ensure that the cache line is invalidated on the other CPUs. See
> + * comment vs. SME in stop_this_cpu().
> + */
> + atomic_set(&stop_cpus_count, INT_MAX);
Didn't understand why INT_MAX here?
* Re: [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust
2023-06-14 19:42 ` Ashok Raj
@ 2023-06-14 19:53 ` Thomas Gleixner
2023-06-14 20:47 ` Ashok Raj
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-14 19:53 UTC (permalink / raw)
To: Ashok Raj
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
On Wed, Jun 14 2023 at 12:42, Ashok Raj wrote:
> On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
>>
>> WBINVD is an expensive operation and if multiple CPUs issue it at the same
>> time the resulting delays are even larger.
>
> Is this situation similar to what happened with the unexpected wakeups from
> mwait_play_dead()?
>
> i.e. the wbinvd() takes a while, and when CPU0 moves ahead, the previous
> kernel marches past the wbinvd() instruction since we didn't wait to ensure
> this has indeed completed?
This is about reboot and I don't know how wbinvd() interacts with
that. But yes, this could be an issue vs. kexec() too.
> native_machine_halt()
> {
> machine_shutdown()->stop_other_cpus()
> stop_this_cpu();<---- Unbalanced atomic_dec()?
> }
>
> But the last stop_this_cpu() in native_machine_halt() would
> make the count go negative? Maybe that's OK since no one is waiting for it
> to go zero at that point?
Correct. It's the CPU which stopped the others.
>> timeout = USEC_PER_MSEC * 10;
>> - while (num_online_cpus() > 1 && (wait || timeout--))
>> + while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
>> udelay(1);
>> }
>
> If we go down the INIT path, life is less complicated..
>
> After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
> Won't this throw off the atomic_dec() count, since CPUs already in hlt() will
> also take the NMI, correct? I'm not sure if this is problematic.
>
> Or should we reinitialize stop_cpus_count before the NMI hurrah
Bah. Didn't think about HLT. Let me go back to the drawing board. Good catch!
>> + /*
>> + * Ensure that the cache line is invalidated on the other CPUs. See
>> + * comment vs. SME in stop_this_cpu().
>> + */
>> + atomic_set(&stop_cpus_count, INT_MAX);
>
> Didn't understand why INT_MAX here?
Any random number will do. The only purpose is to ensure that there is
no dirty cache line on the other (stopped) CPUs.
Now let me look into this NMI cruft.
Thanks,
tglx
* Re: [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust
2023-06-14 19:53 ` Thomas Gleixner
@ 2023-06-14 20:47 ` Ashok Raj
2023-06-14 22:40 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Ashok Raj @ 2023-06-14 20:47 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
> > If we go down the INIT path, life is less complicated..
> >
> > After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
> > Won't this throw off the atomic_dec() count, since CPUs already in hlt() will
> > also take the NMI, correct? I'm not sure if this is problematic.
> >
> > Or should we reinitialize stop_cpus_count before the NMI hurrah
>
> Bah. Didn't think about HLT. Let me go back to the drawing board. Good catch!
>
> >> + /*
> >> + * Ensure that the cache line is invalidated on the other CPUs. See
> >> + * comment vs. SME in stop_this_cpu().
> >> + */
> >> + atomic_set(&stop_cpus_count, INT_MAX);
> >
> > Didn't understand why INT_MAX here?
>
> Any random number will do. The only purpose is to ensure that there is
> no dirty cache line on the other (stopped) CPUs.
>
> Now let me look into this NMI cruft.
>
Maybe if each CPU going down can set their mask, we can simply hit NMI to
only the problematic ones?
The simple count doesn't capture the CPUs in trouble.
* Re: [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust
2023-06-14 20:47 ` Ashok Raj
@ 2023-06-14 22:40 ` Thomas Gleixner
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-14 22:40 UTC (permalink / raw)
To: Ashok Raj
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
On Wed, Jun 14 2023 at 13:47, Ashok Raj wrote:
> On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
>>
>> Now let me look into this NMI cruft.
>>
>
> Maybe if each CPU going down can set their mask, we can simply hit NMI to
> only the problematic ones?
>
> The simple count doesn't capture the CPUs in trouble.
Even a mask is not cutting it. If CPUs did not react on the reboot
vector then there is no guarantee that they are not going to do so
concurrently to the NMI IPI:
CPU0 CPU1
IPI(BROADCAST, REBOOT);
wait() // timeout
stop_this_cpu()
if (!all_stopped()) {
for_each_cpu(cpu, mask) {
mark_stopped(); <- all_stopped() == true now
IPI(cpu, NMI);
} --> NMI()
// no wait() because all_stopped() == true
proceed_and_hope() ....
On bare metal this is likely to "work" by chance, but in a guest all
bets are off.
I'm not surprised at all.
The approach of piling hardware and firmware legacy on top of hardware
and firmware legacy in the hope that we can "fix" that in software was
wrong from the very beginning.
What's surprising is that this worked for a really long time. Though
with increasing complexity the thereby produced debris is starting to
rear its ugly head.
I'm sure the marketing departments of _all_ x86 vendors will come up
with a brilliant slogan for that. Something like:
"We are committed to ensure that you are able to experience the
failures of the past forever with increasingly improved performance
and new exciting features which are fully backwards failure
compatible."
TBH, the (OS) software industry has proliferated that by joining the
'features first' choir without much thought and push back. See
arch/x86/kernel/cpu/* for prime examples.
Ranted enough. I'm going to sleep now and look at this mess tomorrow
morning with brain awake again. Though that will not change the
underlying problem, which is unfixable.
Thanks,
tglx
* Re: [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus()
2023-06-13 12:17 ` [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
@ 2023-06-15 8:58 ` Peter Zijlstra
2023-06-15 10:57 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2023-06-15 8:58 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
On Tue, Jun 13, 2023 at 02:17:58PM +0200, Thomas Gleixner wrote:
> The wmb() after the successful atomic_cmpxchg() is complete voodoo along
> with the comment stating "sync above data before sending IRQ".
>
> There is no "above" data except for the atomic_t stopping_cpu which has
> just been acquired. The reboot IPI handler does not check any data and
> unconditionally disables the CPU.
>
> Remove this cargo cult barrier.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
> ---
> arch/x86/kernel/smp.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -174,9 +174,6 @@ static void native_stop_other_cpus(int w
>
> atomic_set(&stop_cpus_count, num_online_cpus() - 1);
>
> - /* sync above data before sending IRQ */
> - wmb();
> -
> apic_send_IPI_allbutself(REBOOT_VECTOR);
There's a second one a little below. That too should go.
More to the point, the apic_send_*() functions should be the ones that
ensure this if required etc.. See for example weak_wrmsr_fence() for
x2apic.
* Re: [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally
2023-06-13 12:17 ` [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
@ 2023-06-15 9:02 ` Peter Zijlstra
0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2023-06-15 9:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
On Tue, Jun 13, 2023 at 02:17:59PM +0200, Thomas Gleixner wrote:
> There is no reason to acquire the stopping_cpu atomic_t only when there is
> more than one online CPU.
>
> Make it unconditional to prepare for fixing the kexec() problem when there
> are present but "offline" CPUs which play dead in mwait_play_dead().
>
> They need to be brought out of mwait before kexec() as kexec() can
> overwrite text, pagetables, stacks and the monitored cacheline of the
> original kernel. The latter causes mwait to resume execution which
> obviously causes havoc on the kexec kernel which results usually in triple
> faults.
>
> Move the acquire out of the num_online_cpus() > 1 condition so the upcoming
> 'kick mwait' fixup is properly protected.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
> ---
> arch/x86/kernel/smp.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -153,6 +153,12 @@ static void native_stop_other_cpus(int w
> if (reboot_force)
> return;
>
> + /* Only proceed if this is the first CPU to reach this code */
> + if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
> + return;
> +
> + atomic_set(&stop_cpus_count, num_online_cpus() - 1);
> +
if (({ int old = -1; !atomic_try_cmpxchg(&stopping_cpu, &old, safe_smp_processor_id()); }))
return;
Doesn't really roll off the tongue, does it :/
Also, I don't think anybody cares about performance at this point, so
ignore I wrote this email.
/me presses send anyway.
* Re: [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus()
2023-06-15 8:58 ` Peter Zijlstra
@ 2023-06-15 10:57 ` Thomas Gleixner
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2023-06-15 10:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, x86, Mario Limonciello, Tom Lendacky, Tony Battersby,
Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman,
Ashok Raj
On Thu, Jun 15 2023 at 10:58, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 02:17:58PM +0200, Thomas Gleixner wrote:
>> The wmb() after the successful atomic_cmpxchg() is complete voodoo along
>> with the comment stating "sync above data before sending IRQ".
>>
>> There is no "above" data except for the atomic_t stopping_cpu which has
>> just been acquired. The reboot IPI handler does not check any data and
>> unconditionally disables the CPU.
>>
>> Remove this cargo cult barrier.
>>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
>> ---
>> arch/x86/kernel/smp.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -174,9 +174,6 @@ static void native_stop_other_cpus(int w
>>
>> atomic_set(&stop_cpus_count, num_online_cpus() - 1);
>>
>> - /* sync above data before sending IRQ */
>> - wmb();
>> -
>> apic_send_IPI_allbutself(REBOOT_VECTOR);
>
> There's a second one a little below. That too should go.
Duh, yes.
> More to the point, the apic_send_*() functions should be the ones that
> ensure this if required etc.. See for example weak_wrmsr_fence() for
> x2apic.
Correct.
end of thread, other threads:[~2023-06-15 10:57 UTC | newest]
Thread overview: 16+ messages
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
2023-06-13 12:17 ` [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
2023-06-14 19:42 ` Ashok Raj
2023-06-14 19:53 ` Thomas Gleixner
2023-06-14 20:47 ` Ashok Raj
2023-06-14 22:40 ` Thomas Gleixner
2023-06-13 12:17 ` [patch V2 2/8] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
2023-06-13 12:17 ` [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
2023-06-15 8:58 ` Peter Zijlstra
2023-06-15 10:57 ` Thomas Gleixner
2023-06-13 12:17 ` [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
2023-06-15 9:02 ` Peter Zijlstra
2023-06-13 12:18 ` [patch V2 5/8] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
2023-06-13 12:18 ` [patch V2 6/8] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
2023-06-13 12:18 ` [patch V2 7/8] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
2023-06-13 12:18 ` [patch V2 8/8] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner