From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753444AbdBTN1F (ORCPT <rfc822;w@1wt.eu>);
        Mon, 20 Feb 2017 08:27:05 -0500
Received: from mx1.redhat.com ([209.132.183.28]:36756 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752821AbdBTN1D (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 20 Feb 2017 08:27:03 -0500
Reply-To: xlpang@redhat.com
Subject: Re: [PATCH v2] x86/mce: Don't participate in rendezvous process once
 nmi_shootdown_cpus() was made
References: <1487571037-10821-1-git-send-email-xlpang@redhat.com>
 <20170220110941.vwcm3je3e4kkei6o@pd.tnic>
To: Borislav Petkov <bp@alien8.de>, Xunlei Pang <xlpang@redhat.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
        kexec@lists.infradead.org, Tony Luck <tony.luck@intel.com>,
        Ingo Molnar <mingo@redhat.com>, Dave Young <dyoung@redhat.com>,
        Prarit Bhargava <prarit@redhat.com>,
        Junichi Nomura <j-nomura@ce.jp.nec.com>,
        Kiyoshi Ueda <k-ueda@ct.jp.nec.com>,
        Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
From: Xunlei Pang <xpang@redhat.com>
Message-ID: <58AAEF34.9000303@redhat.com>
Date: Mon, 20 Feb 2017 21:29:24 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <20170220110941.vwcm3je3e4kkei6o@pd.tnic>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Mon, 20 Feb 2017 13:27:04 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 02/20/2017 at 07:09 PM, Borislav Petkov wrote:
> On Mon, Feb 20, 2017 at 02:10:37PM +0800, Xunlei Pang wrote:
>> @@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>>  	 */
>>  	int lmce = 1;
>>  
>> -	/* If this CPU is offline, just bail out. */
>> -	if (cpu_is_offline(smp_processor_id())) {
>> +	/* If nmi shootdown happened or this CPU is offline, just bail out. */
>> +	if (cpus_shotdown() ||
> I don't like "cpus_shotdown" - it doesn't hint at all that this is
> special-handling crash/kdump.
>
> And more importantly, I want it to be obvious that we do let the
> crashing CPU into the MCE handler.

OK, I will export crashing_cpu and use it directly in the MCE handler.
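
Roughly, the check could look like the sketch below. This is a userspace
model and not the actual patch: the helper name mce_should_bail_out() is
hypothetical, and it assumes crashing_cpu holds -1 until a crash is in
progress. It lets the crashing CPU into the handler, as you asked, while the
CPUs that were shot down bail out.

```c
#include <stdbool.h>

/*
 * Assumption: crashing_cpu is exported by the crash code and stays at -1
 * until a crash is in progress, after which it holds the crashing CPU's id.
 */
static int crashing_cpu = -1;

/*
 * Hypothetical helper: decide whether a CPU that took a machine check
 * should skip the rendezvous. Offline CPUs bail out as before; after a
 * shootdown, every CPU except the crashing one bails out too.
 */
static bool mce_should_bail_out(int cpu, bool cpu_offline)
{
	if (cpu_offline)
		return true;

	return crashing_cpu != -1 && crashing_cpu != cpu;
}
```

In do_machine_check() the same condition would replace the existing
cpu_is_offline() test; the rest of the bail-out path is not modelled here.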

>
> Why?
>
> If we didn't, you will not handle *any* MCE, even a fatal one, during
> dumping memory so if that dump is corrupted from the MCE, you won't
> know. And I don't want to be the one staring at the corrupted dump and
> wondering why I'm seeing what I'm seeing.
>
> IOW, if we get a fatal MCE during dumping then we should go and die.
> This is much better than silently corrupting the dump and not even
> saying anything about it.
>

My thinking is that it no longer matters once the kdump kernel has booted,
because a new MCE handler will be installed. If we get a fatal MCE while the
dump is being taken, the new handler will handle it correctly on the CPUs
running the kdump kernel.

There is a small window between the crash and the kdump kernel booting, though.
If an SRAO arrives within this window, it will cause the same MCE
synchronization problem on the crashing CPU unless the crashing CPU bails out
as well.

Regards,
Xunlei