Date: Tue, 3 Mar 2015 19:53:24 +0100
From: Borislav Petkov
To: Naoya Horiguchi
Cc: Tony Luck, Prarit Bhargava, Vivek Goyal, "linux-kernel@vger.kernel.org", Junichi Nomura, Kiyoshi Ueda
Subject: Re: [PATCH v3 1/2] x86: mce: kexec: switch MCE handler for kexec/kdump
Message-ID: <20150303185324.GF25768@pd.tnic>
References: <1425373306-26187-1-git-send-email-n-horiguchi@ah.jp.nec.com>
In-Reply-To: <1425373306-26187-1-git-send-email-n-horiguchi@ah.jp.nec.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 03, 2015 at 09:01:49AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize MCE device yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
>
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
>
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
>
> Signed-off-by: Naoya Horiguchi
> Cc: [2.6.32+]
> ---
> ChangeLog v2 -> v3
> - go to "switch MCE handler" approach
>
> ChangeLog v1 -> v2
> - clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
>   global flag to ignore MCE events.
> - fixed the description of the problem
> ---
>  arch/x86/include/asm/mce.h       |  6 +++++
>  arch/x86/kernel/cpu/mcheck/mce.c | 47 ++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/crash.c          |  3 +++
>  3 files changed, 56 insertions(+)
>
> diff --git v3.19.orig/arch/x86/include/asm/mce.h v3.19/arch/x86/include/asm/mce.h
> index 51b26e895933..8010d4b77183 100644
> --- v3.19.orig/arch/x86/include/asm/mce.h
> +++ v3.19/arch/x86/include/asm/mce.h
> @@ -114,6 +114,9 @@ struct mca_config {
>  	int monarch_timeout;
>  	int panic_timeout;
>  	u32 rip_msr;
> +#ifdef CONFIG_KEXEC
> +	int kdump_cpu;
> +#endif

This CONFIG_KEXEC-ifdeffery is too ugly to live.

Please put everything in arch/x86/kernel/crash.c. AFAICT, you don't need
to touch anything in arch/x86/kernel/cpu/mcheck/ for your purposes.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--