From: Don Zickus <dzickus@redhat.com>
To: huang ying <huang.ying.caritas@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>,
	Huang Ying <ying.huang@intel.com>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants
Date: Thu, 23 Sep 2010 10:16:51 -0400
Message-ID: <20100923141651.GL26290@redhat.com>
In-Reply-To: <AANLkTi=gYa5aEhrQAgpgfopoyW_1v0TJsnQ7bSbTV52i@mail.gmail.com>

On Thu, Sep 23, 2010 at 05:29:57PM +0800, huang ying wrote:
> Hi, Don,
> 
> On Thu, Sep 23, 2010 at 12:07 AM, Don Zickus <dzickus@redhat.com> wrote:
> > On Wed, Sep 22, 2010 at 12:19:16AM +0200, Andi Kleen wrote:
> >>
> >> >
> >> > I guess adding either another knob to override the hardware error option
> >> > or tying it in with the panic_on_unknown_error option might make me more
> >> > comfortable.  That way enterprise customers can always just enable it by
> >> > default and desktop users (for now) could have it off.
> >>
> >> Anything that needs explicit enabling is a bad idea, that
> >> would lead to a lot of users running in "corrupt my data" mode.
> >
> > I know.  But as I said earlier in my emails, I am trying to figure out how
> > to deal with the fallout from unknown nmis turning into panics.  Today
> > people see unknown nmis.  They may or may not be corrupting data.  They
> > usually file a bug.  Currently it is hard for me to diagnosis the problem.
> > Usually the old 'upgrade your bios/firmware' does the trick.  Sometimes it
> > doesn't and people feel like their machines still run fine.  So they
> > ignore it (for good or for bad).
> >
> > Turning unknown nmis into panics would break their current setup without
> > much gain.  So I was trying to propose something temporarily until we
> > could get a better infrastructure to isolate the source and provide better
> > info on what to do.
> >
> > I agree with you that long term unknown nmis should be panics.  I just get
> > nervous about doing that now from a support perspective.
> 
> In fact, we use white list policy here. Only systems with HEST or
> identified by chipset host bridge PCI ID will panic for unknown NMI.
> So I think systems you worried about will not have this enabled.
> 
> >> The code currently uses the presence of a HEST error table
> >> to detect a server. HEST should be only available on servers.
> >>
> >> On servers at least panic should be default.
> >
> > Ok.  That's fine. But then what.  What does a developer do with that
> > panic?  There's no useful info.  That is sorta my problem.  Then again I
> > do not know much about HEST.
> 
> On some system, there is some hardware error log in BMC/BIOS. The
> hardware error log can be gotten via IPMI or BIOS menu. Otherwise, can
> we get some useful info for unknown NMI? If we can, can we collect the
> info, then print it on console and save it into flash via ERST (part
> of APEI too) before panic?

Ok.  Does the BIOS/BMC automatically do this?  Can we just print a message
on panic telling the user to check the BIOS/BMC logs for more info?
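
To make the question concrete, here is a rough sketch of the policy being
discussed, written as plain userspace C rather than kernel code.  It only
models the decision (HEST present or host bridge on the white list means
panic, otherwise just log) and the message that would point the user at the
BIOS/BMC logs.  Every name below (struct platform_info, bridge_whitelist,
unknown_nmi_action(), unknown_nmi_message()) and both PCI IDs are invented
for illustration; none of this is an existing kernel interface or what the
patch series actually implements.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical platform description; not a real kernel structure. */
struct platform_info {
	bool has_hest;           /* firmware exposes an ACPI HEST table */
	uint16_t bridge_vendor;  /* host bridge PCI vendor ID */
	uint16_t bridge_device;  /* host bridge PCI device ID */
};

struct bridge_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder white list; these IDs are made up for the example. */
const struct bridge_id bridge_whitelist[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
};

enum nmi_action { NMI_ACTION_LOG, NMI_ACTION_PANIC };

/* Return true if the host bridge matches an entry on the white list. */
bool bridge_is_whitelisted(const struct platform_info *p)
{
	size_t n = sizeof(bridge_whitelist) / sizeof(bridge_whitelist[0]);

	for (size_t i = 0; i < n; i++) {
		if (bridge_whitelist[i].vendor == p->bridge_vendor &&
		    bridge_whitelist[i].device == p->bridge_device)
			return true;
	}
	return false;
}

/* White list policy: panic only on HEST or white-listed platforms. */
enum nmi_action unknown_nmi_action(const struct platform_info *p)
{
	if (p->has_hest || bridge_is_whitelisted(p))
		return NMI_ACTION_PANIC;
	return NMI_ACTION_LOG;
}

/* Message to print for an unknown NMI, depending on the chosen action. */
const char *unknown_nmi_message(enum nmi_action action)
{
	if (action == NMI_ACTION_PANIC)
		return "Unknown NMI on a hardware-error-capable platform; "
		       "check your BIOS/BMC logs for more info";
	return "Unknown NMI received; continuing";
}

/* Print the message for this platform and report the action taken. */
enum nmi_action handle_unknown_nmi(const struct platform_info *p)
{
	enum nmi_action action = unknown_nmi_action(p);

	fprintf(stderr, "%s\n", unknown_nmi_message(action));
	return action;
}
```

With this shape, a desktop without HEST and with an unlisted host bridge
keeps today's "log and continue" behaviour, and the panic path at least
tells the user where to look next.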

I would love to add code to gather more useful info for unknown NMIs, but
is HEST expected to do some of this?  I guess what I am trying to figure
out is this: if we are going to add intelligence to detect a HEST-enabled
machine and panic when an unknown NMI comes along (presumably from HEST??),
can we leverage HEST at all to understand why the NMI happened, or to
point the user to the BIOS/BMC to get more info?  In other words, what
value do we get from HEST other than "we detect it's there, let's panic"?

> 
> HEST is defined in ACPI spec 4.0 and later version in section named
> APEI (ACPI Platform Error Interface). It is used to describe the error
> sources of system. It should be available only on server platform.

Ok.  Does the kernel (or the BIOS) have the intelligence to use it yet?

Cheers,
Don

