Reply-To: xlpang@redhat.com
Subject: Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made
References: <1487571037-10821-1-git-send-email-xlpang@redhat.com> <20170220110941.vwcm3je3e4kkei6o@pd.tnic>
To: Borislav Petkov , Xunlei Pang
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, kexec@lists.infradead.org, Tony Luck , Ingo Molnar , Dave Young , Prarit Bhargava , Junichi Nomura , Kiyoshi Ueda , Naoya Horiguchi
From: Xunlei Pang
Message-ID: <58AAEF34.9000303@redhat.com>
Date: Mon, 20 Feb 2017 21:29:24 +0800
In-Reply-To: <20170220110941.vwcm3je3e4kkei6o@pd.tnic>

On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>  	 */
>>  	int lmce = 1;
>>
>> -	/* If this CPU is offline, just bail out. */
>> -	if (cpu_is_offline(smp_processor_id())) {
>> +	/* If nmi shootdown happened or this CPU is offline, just bail out. */
>> +	if (cpus_shotdown() ||
> I don't like "cpus_shotdown" - it doesn't hint at all that this is
> special-handling crash/kdump.
>
> And more importantly, I want it to be obvious that we do let the
> crashing CPU into the MCE handler.

Ok, I will export crashing_cpu and use it directly in the MCE handler.

>
> Why?
>
> If we didn't, you will not handle *any* MCE, even a fatal one, during
> dumping memory so if that dump is corrupted from the MCE, you won't
> know. And I don't want to be the one staring at the corrupted dump and
> wondering why I'm seeing what I'm seeing.
>
> IOW, if we get a fatal MCE during dumping then we should go and die.
> This is much better than silently corrupting the dump and not even
> saying anything about it.
>

My thought is that it doesn't matter once the kdump kernel boots, as a new
MCE handler will be installed then. If we get a fatal MCE while kdump is
running, the new handler will handle it correctly on the CPUs running the
kdump kernel.

There is a small window between the crash and the kdump kernel boot, so if
an SRAO arrives within this window, it will also cause the MCE
synchronization problem on the crashing CPU unless the crashing CPU bails
out as well.

Regards,
Xunlei
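
To make the check under discussion concrete, here is a minimal userspace sketch of the behaviour Borislav asks for: every CPU except the crashing one leaves the MCE handler early, while the crashing CPU is still let in. It is only a model, not the actual patch. The helper names mce_should_bail_out() and model_cpu_is_offline(), the cpu_offline[] array, the 4-CPU limit, and the convention that crashing_cpu == -1 means "no crash in progress" are all assumptions made for illustration.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* hypothetical machine size for this model */

/*
 * Stand-in for the kernel's crashing_cpu. Assumption: -1 means that
 * no crash/NMI shootdown is in progress.
 */
int crashing_cpu = -1;

/* Hypothetical replacement for the kernel's per-CPU online state. */
bool cpu_offline[MODEL_NR_CPUS];

/* Models cpu_is_offline() for this userspace sketch only. */
bool model_cpu_is_offline(int cpu)
{
	return cpu_offline[cpu];
}

/*
 * Returns true when @cpu should leave the MCE handler without joining
 * the rendezvous: either it is offline, or a crash is in progress and
 * @cpu is not the crashing CPU. The crashing CPU itself is let in, so
 * a fatal MCE during dumping is still noticed.
 */
bool mce_should_bail_out(int cpu)
{
	if (model_cpu_is_offline(cpu))
		return true;

	return crashing_cpu != -1 && cpu != crashing_cpu;
}
```

With crashing_cpu set to 2, for example, CPUs 0, 1 and 3 would bail out and only CPU 2 would enter the handler. The SRAO window mentioned above is not covered by this model; closing it would need the crashing CPU to bail out as well.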