public inbox for linux-kernel@vger.kernel.org
From: Don Zickus <dzickus@redhat.com>
To: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Robert Richter <robert.richter@amd.com>,
	"peterz@infradead.org" <peterz@infradead.org>
Subject: Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error
Date: Thu, 21 Oct 2010 22:56:57 -0400
Message-ID: <20101022025657.GA10556@redhat.com>
In-Reply-To: <1287713110.2862.54.camel@yhuang-dev>

On Fri, Oct 22, 2010 at 10:05:10AM +0800, Huang Ying wrote:
> > > > Well, do you have an alternative way to handle broken hardware?  Broken
> > > > hardware has generated NMIs, sometimes if I am lucky SERRs.  The ones that
> > > > generate SERRs can be filtered through a different path, but what about
> > > > the ones that don't?
> > > > 
> > > 
> > > Don, AFAIK you're saying the same thing as Ying: an unknown NMI is 
> > > a hardware error.
> > > 
> > > The reason the hardware does that is that it wants to tell us:
> > > 
> > > "I lost track of an error. There is corrupted data somewhere in the system.
> > > Please stop, don't do anything that could consume that data. S.O.S."
> > > 
> > > The correct answer for that is panic.
> > 
> > After re-reading Huang's patch, I am starting to understand what you mean
> > by broken hardware.  Basically you are trying to distinguish between
> > legacy systems that were 'broken' in the sense they would randomly send
> > unknown NMIs for no good reason, hence the 'Dazed and confused' messages,
> > and hardware errors on more modern systems that say, 'Hardware error,
> > panicking, check your BIOS for more info' (or whatever).
> 
> Yes.
> 
> > So Huang's patch was sort of acting like a switch.  On legacy systems use
> > 'Dazed and confused' for unknown NMIs.  Whereas on whitelisted modern
> > systems use a more relevant 'Check BIOS for error' message.  Is that
> > right?
> 
> In fact we want to panic and say 'check BIOS for error, contact your
> hardware vendor' on all systems. But as you said, there is some
> 'broken hardware' that randomly sends unknown NMIs for no good reason.
> So a white list is used for them. And in fact not all pre-Nehalem
> machines are 'broken'.

Ok, I think I finally understand what you guys are trying to do, and I
can't see a problem with it.  Though I think the patch could probably use
some cleanup to make it clearer.  Off the top of my head: perhaps a
function call that sets the variable unknown_nmi_as_hwerr instead of
setting it explicitly, and maybe structuring unknown_nmi() as an if-then
modern-message; else legacy-message; to make it obvious what the
code is trying to achieve.
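
As a rough illustration of that structure (a userspace sketch, not the
actual patch): the flag name unknown_nmi_as_hwerr comes from the thread,
but the setter set_unknown_nmi_as_hwerr(), the helper
unknown_nmi_message(), and the wording of the hardware-error message are
assumptions made up for this example.  The real handler would printk or
panic rather than return a string.

```c
#include <stdbool.h>

/* Flag named in the thread; kept static so only the setter can change it. */
static bool unknown_nmi_as_hwerr;

/*
 * Hypothetical setter: platform code for systems known to raise unknown
 * NMIs only on real hardware errors would call this instead of writing
 * the variable directly.
 */
void set_unknown_nmi_as_hwerr(bool enable)
{
	unknown_nmi_as_hwerr = enable;
}

/*
 * Hypothetical stand-in for the body of unknown_nmi(): it only picks the
 * message, so the if/else split between the two cases is easy to see.
 */
const char *unknown_nmi_message(void)
{
	if (unknown_nmi_as_hwerr)
		/* modern message (placeholder wording) */
		return "NMI for hardware error, check BIOS for error";
	else
		/* legacy message */
		return "Dazed and confused, but trying to continue";
}
```

With the flag left at its default the legacy message is chosen; after
set_unknown_nmi_as_hwerr(true) the hardware-error message is chosen.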

And yeah, I know not all pre-Nehalem machines are broken.  I am usually
being sarcastic when I mention that, only because after being at IDF last
year I got the impression that pre-Nehalem machines were considered the
dark ages.
:-)

I am actually curious to know how many x86_64 machines would be considered
broken?

> 
> > That's why you guys are complaining that registering a die_notifier would
> > be silly?
> 
> I think whether to go with a die_notifier or unknown_nmi_error() depends
> on whether it is general or specific to some hardware. Do you agree with
> that?

Well, I am hoping the only general case would be the one you want to use
now.  Everything else would be specific and require a die_notifier.  I
mean, how many different ways do we want to have a printk/panic in
unknown_nmi()?
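
To make the split concrete, here is a small userspace model of the idea
(a sketch under assumptions, not kernel code): hardware-specific handlers
register on a chain and get first chance at the NMI, and only what nobody
claims falls through to the single general unknown-NMI path.  In the
kernel the chain would be the die notifier chain; every name below
(register_specific_handler(), dispatch_nmi(), example_vendor_handler(),
the 0x80 reason value) is hypothetical.

```c
#include <stddef.h>

#define MAX_HANDLERS 8

enum nmi_result {
	NMI_NOT_MINE,		/* handler does not recognise this NMI */
	NMI_CLAIMED,		/* a hardware-specific handler dealt with it */
	NMI_FELL_THROUGH	/* nobody claimed it: general unknown-NMI path */
};

typedef enum nmi_result (*nmi_handler_fn)(unsigned char reason);

static nmi_handler_fn handlers[MAX_HANDLERS];
static size_t nr_handlers;

/* Add a hardware-specific handler; returns 0 on success, -1 on failure. */
int register_specific_handler(nmi_handler_fn fn)
{
	if (fn == NULL || nr_handlers >= MAX_HANDLERS)
		return -1;
	handlers[nr_handlers++] = fn;
	return 0;
}

/* Offer the NMI to each specific handler in registration order. */
enum nmi_result dispatch_nmi(unsigned char reason)
{
	for (size_t i = 0; i < nr_handlers; i++) {
		if (handlers[i](reason) == NMI_CLAIMED)
			return NMI_CLAIMED;
	}
	/* The one general case: the caller would printk/panic here. */
	return NMI_FELL_THROUGH;
}

/* Made-up vendor handler that claims only reason value 0x80. */
enum nmi_result example_vendor_handler(unsigned char reason)
{
	return (reason == 0x80) ? NMI_CLAIMED : NMI_NOT_MINE;
}
```

The point of the layout is that unknown_nmi() itself keeps exactly one
printk/panic path, and anything vendor-specific lives in its own handler.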

Cheers,
Don
