From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751195AbdBUBYN (ORCPT ); Mon, 20 Feb 2017 20:24:13 -0500
Received: from mx1.redhat.com ([209.132.183.28]:49450 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750925AbdBUBYM (ORCPT ); Mon, 20 Feb 2017 20:24:12 -0500
Reply-To: xlpang@redhat.com
Subject: Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made
References: <1487571037-10821-1-git-send-email-xlpang@redhat.com> <20170220110941.vwcm3je3e4kkei6o@pd.tnic> <58AAEF34.9000303@redhat.com>
To: Borislav Petkov , Xunlei Pang
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, kexec@lists.infradead.org, Tony Luck , Ingo Molnar , Dave Young , Prarit Bhargava , Junichi Nomura , Kiyoshi Ueda , Naoya Horiguchi
From: Xunlei Pang
Message-ID: <58AB9745.6060507@redhat.com>
Date: Tue, 21 Feb 2017 09:26:29 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <58AAEF34.9000303@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Tue, 21 Feb 2017 01:24:12 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 02/20/2017 at 09:29 PM, Xunlei Pang wrote:
> On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
>> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>>  	 */
>>>  	int lmce = 1;
>>>
>>> -	/* If this CPU is offline, just bail out. */
>>> -	if (cpu_is_offline(smp_processor_id())) {
>>> +	/* If nmi shootdown happened or this CPU is offline, just bail out. */
>>> +	if (cpus_shotdown() ||
>> I don't like "cpus_shotdown" - it doesn't hint at all that this is
>> special-handling crash/kdump.
>>
>> And more importantly, I want it to be obvious that we do let the
>> crashing CPU into the MCE handler.
> Ok, I will export crashing_cpu and use it directly in mce handler.

Forgot to mention: one reason I introduced cpus_shotdown() is that
"crashing_cpu" is only defined with CONFIG_SMP=y, so we have to export it
unconditionally if we don't want to add conditional code (i.e. wrapped in
#ifdef CONFIG_SMP) to mce.c.

Regards,
Xunlei

>
>> Why?
>>
>> If we didn't, you will not handle *any* MCE, even a fatal one, during
>> dumping memory so if that dump is corrupted from the MCE, you won't
>> know. And I don't want to be the one staring at the corrupted dump and
>> wondering why I'm seeing what I'm seeing.
>>
>> IOW, if we get a fatal MCE during dumping then we should go and die.
>> This is much better than silently corrupting the dump and not even
>> saying anything about it.
>>
> My thought is that it doesn't matter after kdump boots as new mce handler
> will be installed. If we get a fatal MCE during kdumping, the new handler will
> handle the cpus running kdump kernel correctly.
>
> There is a small window between crash and kdump kernel boot, so if a SRAO comes
> within this window it will also cause the mce synchronization problem on the crashing
> cpu if we don't bail out the crashing cpu.
>
> Regards,
> Xunlei
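
To make the check discussed above concrete, here is a minimal user-space
sketch of the variant that uses crashing_cpu directly. It is not the kernel
code and not the final patch: the -1 "no crash yet" sentinel and the helper
names stopped_by_crash_shootdown() and mce_should_bail_out() are assumptions
made for illustration only. It shows the behaviour Borislav asks for: CPUs
stopped by the shootdown bail out, while the crashing CPU still enters the
handler.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the exported variable: -1 is assumed to mean
 * that no CPU has entered the crash path yet.
 */
static int crashing_cpu = -1;

/*
 * True when a crash shootdown has started and this_cpu is NOT the CPU
 * driving it, i.e. this_cpu is one of the CPUs that were shot down.
 */
static bool stopped_by_crash_shootdown(int this_cpu)
{
	assert(this_cpu >= 0);
	return crashing_cpu != -1 && crashing_cpu != this_cpu;
}

/*
 * Model of the early-exit test at the top of the machine-check handler:
 * offline CPUs and shot-down CPUs skip the rendezvous, while the
 * crashing CPU itself is still let through.
 */
static bool mce_should_bail_out(int this_cpu, bool this_cpu_offline)
{
	return this_cpu_offline || stopped_by_crash_shootdown(this_cpu);
}
```

With crashing_cpu set to 2 in this model, mce_should_bail_out(1, false) is
true and mce_should_bail_out(2, false) is false. Because the variable in
this sketch exists unconditionally, the caller needs no #ifdef CONFIG_SMP.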