public inbox for linux-kernel@vger.kernel.org
From: Huang Ying <ying.huang@intel.com>
To: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>, "H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Andi Kleen <andi@firstfloor.org>,
	Robert Richter <robert.richter@amd.com>,
	"peterz@infradead.org" <peterz@infradead.org>
Subject: Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error
Date: Thu, 21 Oct 2010 09:14:03 +0800
Message-ID: <1287623643.19320.40.camel@yhuang-dev>
In-Reply-To: <20101020141558.GB19090@redhat.com>

On Wed, 2010-10-20 at 22:15 +0800, Don Zickus wrote:
> On Wed, Oct 20, 2010 at 02:12:37PM +0800, Huang Ying wrote:
> > Hi, Don,
> > 
> > On Tue, 2010-10-12 at 05:20 +0800, Don Zickus wrote:
> > > > @@ -366,6 +368,15 @@ unknown_nmi_error(unsigned char reason,
> > > >  	if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
> > > >  			NOTIFY_STOP)
> > > >  		return;
> > > > +	/*
> > > > +	 * On some platforms, hardware errors may be notified via
> > > > +	 * unknown NMI
> > > > +	 */
> > > > +	if (unknown_nmi_as_hwerr)
> > > > +		panic(
> > > > +		"NMI for hardware error without error record: Not continuing\n"
> > > > +		"Please check BIOS/BMC log for further information.");
> > > > +
> > > >  #ifdef CONFIG_MCA
> > > >  	/*
> > > >  	 * Might actually be able to figure out what the guilty party
> > > 
> > > The only quirk I have left is the above piece, which is basically a
> > > philosophy difference with Robert and myself.  Where we believe it should
> > > be on the die_chain and Andi and yourself would like to see it explicitly
> > > called out.
> > 
> > After some more thought, I found this is different from DIE_NMI and
> > DIE_NMI_IPI case. I think the code added is for general unknown NMI
> > processing instead of a device driver. What we do is not to add special
> > processing for some devices, but treat unknown NMI as hardware error
> > notification in general and use a white list to deal with broken
> > hardware and stone age machine. Do you agree?
> > 
> > If so, it should not be turned into a notifier block unless you want to
> > turn all general unknown NMI processing code into a notifier block.
> 
> Well, yes I actually do, mainly to keep the code simpler.  But also, after
> having a conversation with someone yesterday, realized that unknown NMIs
> are dealt with on a platform level and not a chipset level.

But there are some general rules for unknown NMIs. We think an unknown
NMI is a hardware error notification on all systems, except stone age
machines and systems with broken hardware or software bugs. Do you agree
with that?

> The reason I say that is some companies, like HP, have a special driver
> hpwdt that they want to run in the case of an unknown NMI.  They don't
> care about HEST or the other stuff, they want their BIOS call to take care
> of it.  So now that hack has to be put into notifier somewhere.

Yes. I found that during NMI handler development. It sits in a notifier
chain and in a driver. hpwdt uses the unknown NMI for watchdog timeout
notification. That is a platform feature and should be implemented in a
driver. But we want to implement general default unknown NMI processing
logic, not logic for some specific platform or chipset.

> I can only imagine Dell trying to do something similar as a value add.
> 
> To me it just makes sense to setup all the HEST stuff as default notifier
> blocks and then have platform specific drivers register on top of them
> (using the priority scheme).  This to me gives everyone flexibility on how
> to handle the unknown NMIs.

Yes. HEST code will be in a driver and will register a notifier block to
do its work.
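
As a rough illustration of the priority scheme discussed above, here is
a minimal userspace sketch of a priority-ordered notifier chain. It is
not kernel code: every model_* name is hypothetical, and the two
handlers only stand in for a platform driver (something like hpwdt) and
a default handler. It assumes higher priority runs first and that a
handler returning "stop" ends the walk.

```c
#include <stddef.h>

/* Userspace model only; these names mimic, but are not, the kernel API. */
enum { MODEL_NOTIFY_DONE = 0, MODEL_NOTIFY_STOP = 1 };

struct model_notifier {
	int (*call)(unsigned char reason);
	int priority;			/* higher priority runs first */
	struct model_notifier *next;
};

static struct model_notifier *model_chain;

/* Insert so that the list stays sorted by descending priority. */
static void model_register(struct model_notifier *nb)
{
	struct model_notifier **p = &model_chain;

	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

/* Walk the chain; stop as soon as one handler claims the event. */
static int model_call_chain(unsigned char reason)
{
	struct model_notifier *nb;

	for (nb = model_chain; nb; nb = nb->next)
		if (nb->call(reason) == MODEL_NOTIFY_STOP)
			return MODEL_NOTIFY_STOP;
	return MODEL_NOTIFY_DONE;
}

static int model_calls_seen;	/* counts runs of the default handler */

/* Hypothetical platform driver: claims reason 0x01 and nothing else. */
static int model_platform_handler(unsigned char reason)
{
	return reason == 0x01 ? MODEL_NOTIFY_STOP : MODEL_NOTIFY_DONE;
}

/* Hypothetical default handler: records the call, never claims it. */
static int model_default_handler(unsigned char reason)
{
	(void)reason;
	model_calls_seen++;
	return MODEL_NOTIFY_DONE;
}
```

Registering model_platform_handler at a higher priority than
model_default_handler means the default handler only sees the events
the platform driver declines.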

> Thoughts?

But the code in this patch is not for HEST (HEST is only used to
implement the white list). I think the code is for a general, standard
feature. I don't want to add HEST processing here.

Do you think it should be a general rule to treat every unknown NMI as
a hardware error notification, except on broken hardware and stone age
machines?

If so, this patch adds that general logic to the general NMI handling
code. It is not for specific hardware. The HEST and PCI ID table are
used to implement a white list to deal with the broken hardware and
stone age machines.
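
To make the white list idea concrete, here is a small userspace sketch,
not the code from the patch. It assumes the white list names the
platforms on which an unknown NMI is treated as a hardware error, and
that a platform qualifies either by providing HEST or by matching a PCI
ID table entry. How the two sources are really combined is an
assumption, the IDs are invented placeholders, and all model_* names
are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_pci_id {
	unsigned short vendor;
	unsigned short device;
};

/* Placeholder IDs, invented for illustration only. */
static const struct model_pci_id model_whitelist[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
};

/*
 * Assumption: a platform is white-listed if firmware provides HEST
 * or if its PCI ID appears in the table above.
 */
static bool model_unknown_nmi_as_hwerr(bool has_hest, struct model_pci_id id)
{
	size_t i;

	if (has_hest)
		return true;
	for (i = 0; i < sizeof(model_whitelist) / sizeof(model_whitelist[0]); i++)
		if (model_whitelist[i].vendor == id.vendor &&
		    model_whitelist[i].device == id.device)
			return true;
	return false;
}

enum model_action { MODEL_LOG_AND_CONTINUE, MODEL_PANIC };

/* White-listed platforms stop; everything else keeps the old behavior. */
static enum model_action model_unknown_nmi(bool as_hwerr)
{
	return as_hwerr ? MODEL_PANIC : MODEL_LOG_AND_CONTINUE;
}
```

In this model, a machine that is not on the white list (for example, an
old or broken one) falls through to the log-and-continue path.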

Best Regards,
Huang Ying


