public inbox for linux-edac@vger.kernel.org
* [RFC PATCH] x86/mce/inject: Fix printk deadlock causing mce_timed_out panic
@ 2021-05-09  5:32 Lv Ying
  2021-05-08 14:33 ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: Lv Ying @ 2021-05-09  5:32 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-edac, lvying6, fanwentao

Injecting an SRAO broadcast error with mce-inject on a 4-core CPU caused the
following mce_timed_out panic:
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

Two of the CPUs have the same backtrace:

Call Trace:
panic
mce_panic
mce_timed_out
do_machine_check
raise_exception
mce_raise_notify
nmi_handle
do_nmi
--- <NMI exception stack> ---

Another CPU's backtrace:

Call Trace:
panic
mce_panic
mce_timed_out
do_machine_check
raise_exception
mce_raise_notify
nmi_handle
do_nmi
--- <NMI exception stack> ---
...
console_unlock
vprintk_emit
printk

The last CPU's backtrace:

Call Trace:
crash_nmi_callback
nmi_handle
do_nmi
--- <NMI exception stack> ---
vprintk_emit
printk
raise_local
mce_inject_raise
notifier_call_chain
mce_chrdev_write

So the last CPU never reaches mce_timed_out, which makes the other CPUs time
out and call mce_panic. The last CPU is stuck for the following reason:

	CPU A				CPU B
	 |				 |
	printk				 |
	 |				 |
	hold console_sem		 |
	 |				 |
	broadcast NMI		<-	send NMI IPI
	 |				 |
	 |				 |
	mce_timed_out			printk
	wait for all CPUs		 |
					cannot take console_sem
					set console_waiter to true
					 |
					while (console_waiter)
					  cpu_relax();

CPU A would set console_waiter to false by calling
console_lock_spinning_disable_and_check(). However, that function is never
called, because CPU A's NMI handler is stuck in mce_timed_out.
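
For illustration only, the handoff can be modelled in user space. The sketch
below is not kernel code: struct console_model, owner_step() and waiter_spin()
are hypothetical names, and the spin is bounded (an assumption of the model)
so that it terminates where the real loop would not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the printk owner/waiter handoff. */
struct console_model {
	bool waiter;        /* models console_waiter */
	bool owner_in_nmi;  /* CPU A is stuck in the NMI handler (mce_timed_out) */
};

/*
 * CPU A: models console_lock_spinning_disable_and_check(), which clears
 * console_waiter. It cannot run while CPU A is stuck in the NMI handler.
 */
static void owner_step(struct console_model *c)
{
	if (c->owner_in_nmi)
		return;
	c->waiter = false;
}

/*
 * CPU B: models the "while (console_waiter) cpu_relax();" loop. Returns
 * true if the owner cleared the flag within 'budget' iterations. The bound
 * is an assumption of this model; the real loop has none.
 */
static bool waiter_spin(struct console_model *c, int budget)
{
	assert(budget > 0);
	c->waiter = true;
	for (int i = 0; i < budget; i++) {
		owner_step(c);	/* give the owner a chance to make progress */
		if (!c->waiter)
			return true;
	}
	return false;
}
```

With owner_in_nmi set to false, waiter_spin() returns true on the first
iteration. With owner_in_nmi set to true, the flag is never cleared and the
call returns false once the budget runs out, which corresponds to CPU B
spinning forever.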

So, after CPU B sends the NMI IPI to all the other CPUs, and before CPU B
itself calls raise_exception and reaches mce_timed_out, no printk should be
called. Just remove the "Triggering MCE exception" pr_info; the
pr_info("MCE exception done on CPU %d\n", cpu) after raise_exception is
enough to show that the CPU has handled the MCE exception.

Signed-off-by: Lv Ying <lvying6@huawei.com>
---
 arch/x86/kernel/cpu/mce/inject.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 4e86d97f9653..e84a5ffd08f4 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -194,7 +194,6 @@ static int raise_local(void)
 	int cpu = m->extcpu;
 
 	if (m->inject_flags & MCJ_EXCEPTION) {
-		pr_info("Triggering MCE exception on CPU %d\n", cpu);
 		switch (context) {
 		case MCJ_CTX_IRQ:
 			/*
-- 
2.23.0



end of thread, other threads:[~2021-05-10 10:34 UTC | newest]

Thread overview: 6+ messages
2021-05-09  5:32 [RFC PATCH] x86/mce/inject: Fix printk deadlock causing mce_timed_out panic Lv Ying
2021-05-08 14:33 ` Borislav Petkov
2021-05-11  1:33   ` Lv Ying
2021-05-10  9:36     ` Borislav Petkov
2021-05-10  9:44       ` Borislav Petkov
2021-05-11  2:44         ` Lv Ying
