From: Don Zickus <dzickus@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Robert Richter <robert.richter@amd.com>,
	"peterz@infradead.org" <peterz@infradead.org>
Subject: Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error
Date: Thu, 21 Oct 2010 21:49:55 -0400
Message-ID: <20101022014955.GP19090@redhat.com>
In-Reply-To: <20101021154543.GD21695@basil.fritz.box>

On Thu, Oct 21, 2010 at 05:45:44PM +0200, Andi Kleen wrote:
> On Thu, Oct 21, 2010 at 10:10:02AM -0400, Don Zickus wrote:
> > On Thu, Oct 21, 2010 at 01:17:31PM +0800, Huang Ying wrote:
> > > > > But there are some general rules for unknown NMIs. We think an unknown
> > > > > NMI is a hardware error notification on all systems except those with
> > > > > broken hardware, software bugs, or stone age machines. Do you agree?
> > > > 
> > > > Nope.  In my experiences, most of our customers are still running
> > > > pre-Nehalem boxes, therefore most unknown NMIs are from broken hardware or
> > > > bad firmware (at least in the bugzillas I deal with).
> > > 
> > > It seems that we have different point of view for reason of unknown NMI.
> > > Should broken hardware be seen as hardware error?
> > 
> > Well, do you have an alternative way to handle broken hardware?  Broken
> > hardware has generated NMIs, sometimes if I am lucky SERRs.  The ones that
> > generate SERRs can be filtered through a different path, but what about
> > the ones that don't?
> > 
> 
> Don, AFAIK you're saying the same thing as Ying: an unknown NMI is 
> a hardware error.
> 
> The reason the hardware does that is that it wants to tell us:
> 
> "I lost track of an error. There is corrupted data somewhere in the system.
> Please stop, don't do anything that could consume that data. S.O.S."
> 
> The correct answer for that is panic.

After re-reading Huang's patch, I am starting to understand what you mean
by broken hardware.  Basically you are trying to distinguish between
legacy systems that were 'broken' in the sense that they would randomly
send unknown NMIs for no good reason (hence the 'Dazed and confused'
messages) and hardware errors on more modern systems that say, 'Hardware
error, panicking, check your BIOS for more info' (or whatever).

So Huang's patch was sort of acting like a switch.  On legacy systems use
'Dazed and confused' for unknown NMIs.  Whereas on whitelisted modern
systems use a more relevant 'Check BIOS for error' message.  Is that
right?
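
The switch described above could be sketched roughly as follows.  This is
an illustrative userspace sketch, not code from Huang's patch: the
whitelist entry, the function names, the enum, and the wording of the
hardware-error message are all assumptions made up for the example.  Only
the 'Dazed and confused' text mirrors the existing kernel message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical whitelist of platforms known to report hardware errors
 * through an otherwise unknown NMI.  The entry is a placeholder, not a
 * real DMI string. */
static const char *const hw_error_nmi_whitelist[] = {
	"example-modern-platform",
};

static bool platform_is_whitelisted(const char *platform)
{
	size_t n = sizeof(hw_error_nmi_whitelist) /
		   sizeof(hw_error_nmi_whitelist[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(platform, hw_error_nmi_whitelist[i]) == 0)
			return true;
	}
	return false;
}

/* What to do with an NMI that no handler claimed (names are assumed). */
enum unknown_nmi_action { NMI_ACTION_WARN, NMI_ACTION_PANIC };

struct unknown_nmi_decision {
	enum unknown_nmi_action action;
	const char *message;
};

/* Legacy systems keep the old warning and continue; whitelisted systems
 * treat the unknown NMI as a hardware error and stop. */
static struct unknown_nmi_decision classify_unknown_nmi(const char *platform)
{
	struct unknown_nmi_decision d;

	if (platform_is_whitelisted(platform)) {
		d.action = NMI_ACTION_PANIC;
		/* Assumed wording, for illustration only. */
		d.message = "Hardware error NMI, check BIOS for error";
	} else {
		d.action = NMI_ACTION_WARN;
		d.message = "Dazed and confused, but trying to continue";
	}
	return d;
}
```

Calling classify_unknown_nmi() with a platform name that is not on the
list yields the legacy warning; a listed name yields the panic path.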

That's why you guys are complaining that registering a die_notifier would
be silly?

> 
> The only problem we're trying to solve here is to distinguish
> other cases, like when software uses NMIs for something else
> or there may be other cases

Ok.

> 
> > > 
> > > As far as I know, Windows treats unknown NMIs as hardware errors.
> > > Although we are programming for Linux, not Windows, much hardware is
> > > built for Windows.
> > 
> > I was told Windows treats _any_ NMI as hardware errors, not just unknown
> > ones. :-)
> 
> I don't know details about Windows' code here, but I assume that 
> they have a way to hook into the NMI too, otherwise drivers like the HP
> watchdog wouldn't work on Windows. But yes we know they shut down.
> 
> Essentially the hardware (and the BIOS) is designed 
> under the assumption that the OS behaves like Windows for this.
> 
> In my long experience in making Linux work on all kinds
> of hardware I learned very firmly that trying to drive PC hardware 
> in a different way than Windows does is usually a bad idea.

Of course.

> 
> The approach of Ying's patch kit was really to make
> the behaviour more like Windows.

Ok.

> 
> 
> > Probably.  I guess I don't fully understand your definition of hardware
> > error notification so I can't tell if we are arguing or agreeing (but
> > using different words).
> > 
> > How do you envision the code looking like with hardware error
> > notification?
> > 
> > I just wanted to keep the code in traps.c simple and clean and not
> > constantly add new #ifdefs every time Intel came up with an interesting
> > way to determine a hardware error condition.
> > 
> > For example, I am not the biggest fan of seeing stuff like edac or mce
> > inside the code and would prefer them to use notifiers.  But that is just
> > my opinion.
> 
> mce is only for testing anyways, "real" mce doesn't need vector 0.
> So basically it's just a fancy "smp_call_nmi()" thingy.
> 
> EDAC shouldn't need it either; if it does, it's probably some
> broken legacy behaviour.
> 
> > If you have a framework that you wanted to propose that could encapsulate
> > an ever growing class of hardware error notifications, I would be
> > interested
> 
> The trend in error reporting is away from NMIs and towards machine checks.
> My hunch is that NMI reporting is more or less a legacy problem, so we have
> to deal with what is there today.

That's good.

> 
> I don't think you need to worry about a lot more hardware NMI sources.

Well, until those machines dominate the marketplace, I'm stuck supporting
those pre-Nehalem boxes with customers that committed to 10 years with
last year's technology.  ;-)

> 
> The only thing that we need to worry about is more software NMI sources,
> these unfortunately like to multiply.

Sure.

Thanks,
Don
