From: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Tony Luck <tony.luck@intel.com>, Vivek Goyal <vgoyal@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Junichi Nomura <j-nomura@ce.jp.nec.com>,
	Kiyoshi Ueda <k-ueda@ct.jp.nec.com>,
	nao.horiguchi@gmail.com
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together
Date: Mon, 23 Feb 2015 22:01:50 +0900
Message-ID: <54EB24BE.5050006@gmail.com>
In-Reply-To: <20150223092739.GA22757@pd.tnic>

# I'm resending this; sorry if you receive it twice.

On Mon, Feb 23, 2015 at 10:27:39AM +0100, Borislav Petkov wrote:
> On Mon, Feb 23, 2015 at 09:12:29AM +0000, Naoya Horiguchi wrote:
> > kexec disables (or "shoots down") all CPUs other than the crashing CPU before
> > entering the 2nd kernel. This is done via NMI, and the crashing CPU waits for
> > completion by spinning for at most 1 second.
> > However, there is a race window if this NMI handling doesn't complete within
> > that 1 second on some CPU, which leaves a fragile situation where only a
> > portion of the online CPUs respond to MCE interrupts. If an MCE happens during
> > this race window, MCE synchronization always times out and results in a kernel
> > panic. So the user-visible effect of this bug is kdump failure.
> >
> > Note that this race window has existed since the current MCE handler was
> > implemented around 2.6.32, and commit 716079f66eac ("mce: Panic when a core
> > has reached a timeout") recently made it more visible by changing the default
> > behavior on a synchronization timeout from "ignore" to "panic".
>
> Let me guess: you could raise the tolerance level to 3 temporarily from
> native_machine_crash_shutdown() and not touch the #MC handler at all,
> right?

Yes, that could be a valid fix for the kdump failure itself, but I don't think
it is the best solution from the viewpoint of messaging to userspace. What end
users would see is one of these timeout messages:
  - "Timeout: Not all CPUs entered broadcast exception handler",
  - "Timeout: Subject CPUs unable to finish machine check processing",
  - "Timeout: Monarch CPU unable to finish machine check processing", or
  - "Timeout: Monarch CPU did not finish machine check processing".
These are informative for developers like us, but confusing for end users.
Assuming that what end users want to know is whether kdump is reliable or not,
a message like "Machine Check ignored because crash dump is running." sounds a
bit better to me.

But yes, I agree that using mca_cfg->tolerant is a nice idea, so I'd like to
define another value to show that kdump is running. Does that make sense to you?
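
To make the idea concrete, here is a rough user-space sketch, not a patch.
TOLERANT_KDUMP, on_sync_timeout() and timeout_message() are hypothetical names
made up for illustration; none of them exist in the kernel. The rule that a
synchronization timeout panics when tolerant <= 1 is an assumption about the
behavior after commit 716079f66eac, modeled here only to show where a new value
could fit.

```c
#include <stddef.h>

/*
 * Hypothetical model of the proposal; this is NOT kernel code.
 * TOLERANT_KDUMP is an invented extra tolerance level meaning
 * "a crash dump is in progress".
 */
#define TOLERANT_KDUMP 4

enum timeout_action {
	ACTION_PANIC,        /* synchronization timeout ends in panic */
	ACTION_CONTINUE,     /* timeout is tolerated, handler goes on */
	ACTION_IGNORE_KDUMP, /* MCE is ignored because kdump is running */
};

/* Decide what to do when MCE synchronization times out. */
enum timeout_action on_sync_timeout(int tolerant)
{
	if (tolerant == TOLERANT_KDUMP)
		return ACTION_IGNORE_KDUMP;
	if (tolerant <= 1)	/* assumed panic threshold */
		return ACTION_PANIC;
	return ACTION_CONTINUE;
}

/* Pick the message the end user would see for a given timeout. */
const char *timeout_message(int tolerant, const char *timeout_msg)
{
	if (on_sync_timeout(tolerant) == ACTION_IGNORE_KDUMP)
		return "Machine Check ignored because crash dump is running.";
	return timeout_msg;
}
```

In this sketch, on_sync_timeout(3) corresponds to your suggestion of raising
the tolerance level to 3: the timeout no longer panics, but the user still sees
one of the "Timeout: ..." messages. With the extra value, timeout_message()
returns the kdump-specific text instead.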

Thanks,
Naoya Horiguchi
