Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error


From: Huang Ying <ying.huang@intel.com>
To: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>, "H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Andi Kleen <andi@firstfloor.org>,
	Robert Richter <robert.richter@amd.com>,
	"peterz@infradead.org" <peterz@infradead.org>
Subject: Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error
Date: Thu, 21 Oct 2010 13:17:31 +0800
Message-ID: <1287638251.19320.190.camel@yhuang-dev>
In-Reply-To: <20101021023135.GA12086@redhat.com>

On Thu, 2010-10-21 at 10:31 +0800, Don Zickus wrote:
> On Thu, Oct 21, 2010 at 09:14:03AM +0800, Huang Ying wrote:
> > > > DIE_NMI_IPI case. I think the code added is for general unknown NMI
> > > > processing instead of a device driver. What we do is not to add special
> > > > processing for some devices, but treat unknown NMI as hardware error
> > > > notification in general and use a white list to deal with broken
> > > > hardware and stone age machine. Do you agree?
> > > > 
> > > > If so, it should not be turned into a notifier block unless you want to
> > > > turn all general unknown NMI processing code into a notifier block.
> > > 
> > > Well, yes I actually do, mainly to keep the code simpler.  But also, after
> > > having a conversation with someone yesterday, realized that unknown NMIs
> > > are dealt with on a platform level and not a chipset level.
> > 
> > But there are some general rules for unknown NMI. We think unknown NMI is
> > hardware error notification on all systems except systems with broken
> > hardware or software bugs, stone age machines. Do you agree with that?
> 
> Nope.  In my experiences, most of our customers are still running
> pre-Nehalem boxes, therefore most unknown NMIs are from broken hardware or
> bad firmware (at least in the bugzillas I deal with).

It seems that we have different views on the cause of unknown NMIs.
Should broken hardware be treated as a hardware error?

As far as I know, Windows treats unknown NMIs as hardware errors. Although
we are programming for Linux, not Windows, much hardware is built for
Windows.

> I would be excited if I was getting some sort of hardware error
> notification, because then I would know where the NMI might be coming
> from.  Instead, I have customers pull out cards out of their machine or
> instrument their kernel to see which device is causing the problem.  Slow
> and painful.

I hope new machines will have better hardware error reporting. :)

> > > The reason I say that is some companies, like HP, have a special driver
> > > hpwdt that they want to run in the case of an unknown NMI.  They don't
> > > care about HEST or the other stuff, they want their BIOS call to take care
> > > of it.  So now that hack has to be put into notifier somewhere.
> > 
> > Yes. I found that during NMI handler development. It sits in a notifier
> > chain and in a driver. hpwdt uses unknown NMI for watchdog timeout
> > notification, it is a platform feature and should be implemented in a
> 
> Actually no it doesn't, the name HP watchdog is deceiving.  The intent HP
> has with that handler is any unknown NMI needs to be trapped by that
> driver so it can do an SMI call, which tries to source the NMI and save
> its result in NVRAM.  Then it jumps back to the kernel for a reboot.
> 
> I have been dealing with HP for 3 years with that driver, I have gotten
> quite familiar with the NMI part of it. :-)

It seems that I was fooled by the name :). Originally I had thought
that if nobody writes the misc file (to notify the firmware) in time, an
NMI would be triggered as a watchdog.

> > driver. But we want to implement a general default unknown NMI
> > processing logic, not do that for some specific platform or chipset.
> > 
> > > I can only imagine Dell trying to do something similar as a value add.
> > > 
> > > To me it just makes sense to setup all the HEST stuff as default notifier
> > > blocks and then have platform specific drivers register on top of them
> > > (using the priority scheme).  This to me gives everyone flexibility on how
> > > to handle the unknown NMIs.
> > 
> > Yes. HEST code will be in a driver and will register a notifier block to
> > do its work.
> > 
> > > Thoughts?
> > 
> > But the code in this patch is not for HEST. (HEST is only used to
> > implement the white list). I think the code is for a general standard
> > feature. I don't want to add HEST processing here.
> > 
> > Do you think it should be a general rule to treat all unknown NMI as
> > hardware error notification except some broken hardware and stone age
> > machines?
> 
> I guess my impression of what unknown NMIs should do might be a little
> different than yours (not saying my view is a correct one, just the view I
> have when I answer your questions).

Yes, I think so too. The replies below reflect my understanding, which
may not be correct either. :)

> (after spending more time thinking about this while looking at nmi
> priorities)
> 
> I thought anything that registers with a notifier and cases off of
> DIE_NMI, should be a driver/subsystem that expects and _can properly
> handle_ an NMI.  The expectation is that it can successfully detect the
> NMI is its own and return a NOTIFY_STOP if it is (after processing it).
> [I excluded DIE_NMI_IPI because of PeterZ's comments]

I think a notifier registered on DIE_NMI can panic too. Why prevent that?

> Whereas DIE_NMIUNKNOWN would be for drivers/subsystem that can probably
> detect the NMI is its own but can't do anything but panic or drivers that
> don't know but want to handle the panic in their own special way (ie
> hpwdt, or sgi's x2apic_uv_x.c where they like to use nmi_buttons to debug
> stalls or hangs but don't want to panic).

I think drivers that want to handle the unknown NMI in their own special
way are the expected users of DIE_NMIUNKNOWN, while drivers that can
detect that the NMI is their own and will then panic should be registered
on DIE_NMI.

> And if no one wants to attempt to handle it after that, then call
> unknown_nmi_error() (minus the notify_die(DIE_NMIUNKNOWN)).

I think unknown_nmi_error() (minus the notify_die(DIE_NMIUNKNOWN)) is the
general default operation for unknown NMIs. DIE_NMIUNKNOWN is for driver
processing that runs after the NMI has been determined to be unknown and
before the general default operation.
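
To make that ordering concrete, here is a minimal user-space sketch. It is
not kernel code and not part of this patch. Only the names DIE_NMI,
DIE_NMIUNKNOWN, NOTIFY_STOP and NOTIFY_DONE come from the kernel; their
numeric values here, the handler chain, the two example handlers, the reason
value 0x01 and the handle_nmi() helper are assumptions made up for
illustration. It models three stages: handlers that can recognize their own
NMI (DIE_NMI), handlers that want special treatment of an unknown NMI
(DIE_NMIUNKNOWN), and finally the general default operation.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes named after the kernel notifier API; values are illustrative. */
enum { NOTIFY_DONE = 0, NOTIFY_STOP = 1 };

/* Event types named after the die-chain events discussed in this thread. */
enum die_event { DIE_NMI, DIE_NMIUNKNOWN };

typedef int (*nmi_handler_fn)(enum die_event ev, unsigned char reason);

struct handler {
	nmi_handler_fn fn;
	int priority;		/* higher priority runs first */
};

#define MAX_HANDLERS 8
static struct handler chain[MAX_HANDLERS];
static size_t nr_handlers;

/* Insert a handler, keeping the chain sorted by descending priority. */
static void register_handler(nmi_handler_fn fn, int priority)
{
	size_t i;

	assert(nr_handlers < MAX_HANDLERS);
	i = nr_handlers++;
	while (i > 0 && chain[i - 1].priority < priority) {
		chain[i] = chain[i - 1];
		i--;
	}
	chain[i].fn = fn;
	chain[i].priority = priority;
}

/* Walk the chain; the first handler returning NOTIFY_STOP ends the walk. */
static int notify(enum die_event ev, unsigned char reason)
{
	for (size_t i = 0; i < nr_handlers; i++)
		if (chain[i].fn(ev, reason) == NOTIFY_STOP)
			return NOTIFY_STOP;
	return NOTIFY_DONE;
}

/* Hypothetical device driver: claims the NMI only if it recognizes it. */
static int device_handler(enum die_event ev, unsigned char reason)
{
	return (ev == DIE_NMI && reason == 0x01) ? NOTIFY_STOP : NOTIFY_DONE;
}

/* Hypothetical platform driver: wants every NMI that nobody else claimed. */
static int platform_handler(enum die_event ev, unsigned char reason)
{
	(void)reason;
	return ev == DIE_NMIUNKNOWN ? NOTIFY_STOP : NOTIFY_DONE;
}

enum outcome { HANDLED_BY_OWNER, HANDLED_BY_PLATFORM, DEFAULT_UNKNOWN };

/* Three-stage dispatch: owner, then unknown-NMI users, then the default. */
static enum outcome handle_nmi(unsigned char reason)
{
	if (notify(DIE_NMI, reason) == NOTIFY_STOP)
		return HANDLED_BY_OWNER;
	if (notify(DIE_NMIUNKNOWN, reason) == NOTIFY_STOP)
		return HANDLED_BY_PLATFORM;
	return DEFAULT_UNKNOWN;	/* stands in for unknown_nmi_error() */
}
```

With only device_handler registered, handle_nmi(0x42) falls through to the
default stage. Once platform_handler is registered as well, the same call
stops at the DIE_NMIUNKNOWN stage and the default is never reached.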

> So to me hardware error notification, would just detect what chipset it is
> on and if it is something that matches its whitelist, register and use
> DIE_NMIUNKNOWN.  unknown_nmi_error() would just continue to be this
> general and vague thing that on more modern systems will likely never be
> called.

The difference between us is that I think it should be a general rule to
treat unknown NMIs as hardware error notifications, while you think this
should be in a driver for some special hardware. That is, is it general
or special?

> Anyway that is how I viewed everything or how I wouldn't mind seeing it
> implemented.  Then again, my view could be completely wrong.  :-)
> 
> I'll just rely on majority consensus on somebody's view.

That is fair. Thanks.

Best Regards,
Huang Ying

