From: Prarit Bhargava <prarit@redhat.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>,
	Vivek Goyal <vgoyal@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Junichi Nomura <j-nomura@ce.jp.nec.com>,
	Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Subject: Re: [PATCH v2 1/2] x86: mce: kexec: turn off MCE in kexec
Date: Fri, 27 Feb 2015 08:14:47 -0500
Message-ID: <54F06DC7.9060900@redhat.com>
In-Reply-To: <54F06718.3020401@gmail.com>



On 02/27/2015 07:46 AM, Naoya Horiguchi wrote:
> Hi Prarit,
> 
> On Fri, Feb 27, 2015 at 06:09:52AM -0500, Prarit Bhargava wrote:
> ...
>> > @@ -157,6 +160,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>> >      /* The kernel is broken so disable interrupts */
>> >      local_irq_disable();
>> >
>> > +    /*
>> > +     * We can't expect MCE handling to work any more, so turn it off.
>> > +     */
>> > +    cpu_emergency_mce_disable();
>>
>> What if the system is actually having problems with MCE errors -- which are
>> leading to system panics of some sort.  Do you *really* want the system to
>> continue on at that point?
> 
> Yes, when running the above code, the system doesn't run any business logic,
> so no worry about consuming broken data caused by HW errors.
> And what we really want to get is any kind of information to find out what
> caused the 1st panic, which are likely to be contained in kdump data.
> So I think it's justified to improve the success rate of kdump by continuing
> the operation here.

I looked into it a bit further -- IIUC (according to the Intel spec), disabling
MCE this way will result in a power cycle of the system if an MCE is detected.  So
I guess it isn't a worry for Intel.  If anyone from AMD can hazard a guess as to
what happens in their case, it would be appreciated.
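To make the behaviour under discussion concrete, here is a minimal userspace model. It assumes that cpu_emergency_mce_disable() works by clearing CR4.MCE (bit 6) on the crashing CPU; the patch body is not quoted in this message, and model_mce_disable() / model_raise_mce() are hypothetical names, not kernel functions. The "power cycle" outcome simply encodes the reading of the Intel spec given above and says nothing about AMD.

```c
/* CR4.MCE (machine-check enable) is bit 6 of the x86 CR4 register. */
#define X86_CR4_MCE (1UL << 6)

/* Hypothetical userspace model; these names are illustrative only and
 * are not kernel APIs. */
enum model_mce_outcome {
    MODEL_MCE_HANDLER_RUNS,  /* #MC is delivered to the kernel's handler */
    MODEL_MCE_POWER_CYCLE    /* assumed outcome with CR4.MCE clear, per the
                                Intel spec reading in the text */
};

/* Assumed effect of cpu_emergency_mce_disable(): clear CR4.MCE only,
 * leaving every other CR4 bit untouched. */
static unsigned long model_mce_disable(unsigned long cr4)
{
    return cr4 & ~X86_CR4_MCE;
}

/* What a machine check on this CPU leads to, given its CR4 value. */
static enum model_mce_outcome model_raise_mce(unsigned long cr4)
{
    return (cr4 & X86_CR4_MCE) ? MODEL_MCE_HANDLER_RUNS
                               : MODEL_MCE_POWER_CYCLE;
}
```

In this model only the bit is flipped; no handler state changes, which is why the handler remains installed for any CPU that still has CR4.MCE set.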

I still don't like this approach all that much, as a corrected non-fatal error is
something I would want to know about as an admin, but that risk is mitigated by
the BMC and system monitoring hardware.

> But the MCE handler is still enabled after that, so
> if MCE happens and broadcasts around CPUs after the main thread starts the
> 2nd kernel (which might not start MCE yet, or might decide not to start MCE),
> MCE handler runs only on the other CPUs (not on the main thread), leading to
> kernel panic with MCE synchronization.
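The failure mode in the quoted paragraph can be sketched as a rendezvous with a timeout. This is a simplified, hypothetical model (struct model_cpu and model_mce_sync_times_out() are illustrative names, not the kernel's MCE code): every CPU that still runs the 1st kernel's handler checks in, and the waiting CPUs give up once the timeout expires without all online CPUs having arrived.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state for this model only. */
struct model_cpu {
    /* false for the CPU that has already jumped into the 2nd kernel,
     * which never enters the 1st kernel's MCE handler */
    bool runs_mce_handler;
};

/* Models the broadcast-MCE rendezvous: each CPU whose handler runs checks
 * in, and the waiting CPUs spin until all ncpus have arrived or a timeout
 * expires.  No CPU can check in late in this model, so the wait times out
 * exactly when at least one CPU never runs the handler.  Returns true in
 * that case, i.e. when the "MCE synchronization" panic would follow. */
static bool model_mce_sync_times_out(const struct model_cpu *cpus, size_t ncpus)
{
    size_t arrived = 0;

    for (size_t i = 0; i < ncpus; i++) {
        if (cpus[i].runs_mce_handler)
            arrived++;
    }
    return arrived < ncpus;
}
```

Marking the CPU that started the 2nd kernel with runs_mce_handler = false is what makes the rendezvous time out in this model, even though every other CPU handles the broadcast normally.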

Not having looked at the code (and relying on your description): is there no
way to disable the MCE handler?

P.
