stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Tony Battersby <tonyb@cybernetics.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Borislav Petkov (AMD)" <bp@alien8.de>,
	Ashok Raj <ashok.raj@intel.com>
Subject: [PATCH 6.1 07/30] x86/smp: Make stop_other_cpus() more robust
Date: Thu, 29 Jun 2023 20:43:26 +0200	[thread overview]
Message-ID: <20230629184151.965693617@linuxfoundation.org> (raw)
In-Reply-To: <20230629184151.651069086@linuxfoundation.org>

From: Thomas Gleixner <tglx@linutronix.de>

commit 1f5e7eb7868e42227ac426c96d437117e6e06e8e upstream.

Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.

That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.

That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:

CPU0					CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);			stop_this_cpu()
   while (num_online_cpus() > 1);         set_online(false);
   proceed... -> hang
				          wbinvd()

WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.

This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.

Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().

The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window.

Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/include/asm/cpu.h |    2 +
 arch/x86/kernel/process.c  |   23 +++++++++++++++-
 arch/x86/kernel/smp.c      |   62 +++++++++++++++++++++++++++++----------------
 3 files changed, 64 insertions(+), 23 deletions(-)

--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -96,4 +96,6 @@ static inline bool intel_cpu_signatures_
 
 extern u64 x86_read_arch_cap_msr(void);
 
+extern struct cpumask cpus_stop_mask;
+
 #endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -744,13 +744,23 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+struct cpumask cpus_stop_mask;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
+	unsigned int cpu = smp_processor_id();
+
 	local_irq_disable();
+
 	/*
-	 * Remove this CPU:
+	 * Remove this CPU from the online mask and disable it
+	 * unconditionally. This might be redundant in case that the reboot
+	 * vector was handled late and stop_other_cpus() sent an NMI.
+	 *
+	 * According to SDM and APM NMIs can be accepted even after soft
+	 * disabling the local APIC.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	set_cpu_online(cpu, false);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
@@ -768,6 +778,15 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	if (cpuid_eax(0x8000001f) & BIT(0))
 		native_wbinvd();
+
+	/*
+	 * This brings a cache line back and dirties it, but
+	 * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+	 * observed that all CPUs reported stop. This write will invalidate
+	 * the related cache line on this CPU.
+	 */
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -146,31 +147,43 @@ static int register_stop_handler(void)
 
 static void native_stop_other_cpus(int wait)
 {
-	unsigned long flags;
-	unsigned long timeout;
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags, timeout;
 
 	if (reboot_force)
 		return;
 
-	/*
-	 * Use an own vector here because smp_call_function
-	 * does lots of things not suitable in a panic situation.
-	 */
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+		return;
 
 	/*
-	 * We start by using the REBOOT_VECTOR irq.
-	 * The irq is treated as a sync point to allow critical
-	 * regions of code on other cpus to release their spin locks
-	 * and re-enable irqs.  Jumping straight to an NMI might
-	 * accidentally cause deadlocks with further shutdown/panic
-	 * code.  By syncing, we give the cpus up to one second to
-	 * finish their work before we force them off with the NMI.
+	 * 1) Send an IPI on the reboot vector to all other CPUs.
+	 *
+	 *    The other CPUs should react on it after leaving critical
+	 *    sections and re-enabling interrupts. They might still hold
+	 *    locks, but there is nothing which can be done about that.
+	 *
+	 * 2) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * 3) If #2 timed out send an NMI to the CPUs which did not
+	 *    yet report
+	 *
+	 * 4) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * That CPU will have reported already and the "have all CPUs
+	 * reached HLT" condition will be true despite the fact that the
+	 * other CPU is still handling the NMI. Again, there is no
+	 * protection against that as "disabled" APICs still respond to
+	 * NMIs.
 	 */
-	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
+	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/* sync above data before sending IRQ */
 		wmb();
 
@@ -183,12 +196,12 @@ static void native_stop_other_cpus(int w
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (!cpumask_empty(&cpus_stop_mask) && timeout--)
 			udelay(1);
 	}
 
 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
@@ -200,7 +213,8 @@ static void native_stop_other_cpus(int w
 
 			pr_emerg("Shutting down cpus with NMI\n");
 
-			apic_send_IPI_allbutself(NMI_VECTOR);
+			for_each_cpu(cpu, &cpus_stop_mask)
+				apic->send_IPI(cpu, NMI_VECTOR);
 		}
 		/*
 		 * Don't wait longer than 10 ms if the caller didn't
@@ -208,7 +222,7 @@ static void native_stop_other_cpus(int w
 		 * one or more CPUs do not reach shutdown state.
 		 */
 		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
 			udelay(1);
 	}
 
@@ -216,6 +230,12 @@ static void native_stop_other_cpus(int w
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cpus_stop_mask cache lines are invalidated on
+	 * the other CPUs. See comment vs. SME in stop_this_cpu().
+	 */
+	cpumask_clear(&cpus_stop_mask);
 }
 
 /*



  parent reply	other threads:[~2023-06-29 18:45 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-29 18:43 [PATCH 6.1 00/30] 6.1.37-rc1 review Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 01/30] mm/mmap: Fix error path in do_vmi_align_munmap() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 02/30] mm/mmap: Fix error return " Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 03/30] mptcp: ensure listener is unhashed before updating the sk status Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 04/30] mm, hwpoison: try to recover from copy-on write faults Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 05/30] mm, hwpoison: when copy-on-write hits poison, take page offline Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 06/30] x86/microcode/AMD: Load late on both threads too Greg Kroah-Hartman
2023-06-29 18:43 ` Greg Kroah-Hartman [this message]
2023-06-29 18:43 ` [PATCH 6.1 08/30] x86/smp: Dont access non-existing CPUID leaf Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 09/30] x86/smp: Remove pointless wmb()s from native_stop_other_cpus() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 10/30] x86/smp: Use dedicated cache-line for mwait_play_dead() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 11/30] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 12/30] can: isotp: isotp_sendmsg(): fix return error fix on TX path Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 13/30] maple_tree: fix potential out-of-bounds access in mas_wr_end_piv() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 14/30] mm: introduce new lock_mm_and_find_vma() page fault helper Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 15/30] mm: make the page fault mmap locking killable Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 16/30] arm64/mm: Convert to using lock_mm_and_find_vma() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 17/30] powerpc/mm: " Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 18/30] mips/mm: " Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 19/30] riscv/mm: " Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 20/30] arm/mm: " Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 21/30] mm/fault: convert remaining simple cases to lock_mm_and_find_vma() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 22/30] powerpc/mm: convert coprocessor fault " Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 23/30] mm: make find_extend_vma() fail if write lock not held Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 24/30] execve: expand new process stack manually ahead of time Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 25/30] mm: always expand the stack with the mmap write lock held Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 26/30] fbdev: fix potential OOB read in fast_imageblit() Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 27/30] HID: hidraw: fix data race on device refcount Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 28/30] HID: wacom: Use ktime_t rather than int when dealing with timestamps Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 29/30] HID: logitech-hidpp: add HIDPP_QUIRK_DELAYED_INIT for the T651 Greg Kroah-Hartman
2023-06-29 18:43 ` [PATCH 6.1 30/30] Revert "thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe" Greg Kroah-Hartman
2023-06-29 21:56 ` [PATCH 6.1 00/30] 6.1.37-rc1 review ogasawara takeshi
2023-06-29 22:25 ` Daniel Díaz
2023-06-30  5:18   ` Greg Kroah-Hartman
2023-06-30  5:21     ` Daniel Díaz
2023-06-30  5:30       ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230629184151.965693617@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=ashok.raj@intel.com \
    --cc=bp@alien8.de \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tonyb@cybernetics.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).