From: Don Zickus <dzickus@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>, Ingo Molnar <mingo@elte.hu>,
"H. Peter Anvin" <hpa@zytor.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with unknown NMI
Date: Mon, 13 Sep 2010 13:53:46 -0400
Message-ID: <20100913175346.GC26290@redhat.com>
In-Reply-To: <20100913185721.59ad9b4d@basil.nowhere.org>
On Mon, Sep 13, 2010 at 06:57:21PM +0200, Andi Kleen wrote:
> On Mon, 13 Sep 2010 11:47:50 -0400
> Don Zickus <dzickus@redhat.com> wrote:
>
> >
> > >
> > > Then there isn't necessarily something to "debug": data corruption
> > > can happen without any bugs being around (and in fact
> > > that's the common case, assuming production systems)
> > >
> > > So I'm not sure what you're debugging here. Are you being the
> > > support technician for the system through bugzilla? That sounds
> > > inefficient.
> >
> > The problem I repeatedly deal with for RHEL systems is a customer
> > sees an unknown NMI printed on their screen and sometimes the machine
> > falls apart shortly after, sometimes it doesn't. Obviously they are
> > going to file a bug asking why. A chunk of the problems are bad
> > hardware/firmware. But the problem is which one.
>
> NMIs are usually hardware.
>
> BTW one big issue here is that we don't display anything
> on the screen so the system seems hung. KMS solves this,
> but unfortunately not for the video chipsets
> often used in servers.
No, most of our customers see the messages on the console or the serial
port. I haven't seen KMS hide the info yet.
>
> Part of it is solved by serializing the error
> and defaulting to reboot after panic (currently NMI doesn't do that,
> MCE already does, NMI should too imho)
>
> >
> > Replacing a slot card is easy, replacing a motherboard is not. So I
> > usually try to determine which device is failing by walking the pci
> > bus and looking for the serr bits or some of the pci-e status bits.
>
> You don't necessarily need to replace anything, it could
> be just unlucky data corruption (e.g. a big enough cosmic ray
> that flipped enough bits that the normal error correction
> couldn't fix it anymore)
No, these are easily reproducible NMIs. So far they have been the result
of bad firmware (either features that are marked as supported but are not,
or register settings that changed between updates), NICs that had issues,
or bad motherboards.

None of these issues went away because of a reboot.
>
> >
> > It is inefficient, but I haven't had time to figure out a way to
> > clean it up. And just for the record, I usually see an unknown NMI
> > report every other week.
>
> At least ignoring the data corruption is not the way to handle
> it. I don't think you'll do your customers a favor this way.
I never said I ignore them. We try to resolve them quickly.
>
> > > Anyways for hardware support we could probably dump some
> > > more information at panic or better through error
> > > serialization, but the word "debug" is IMHO a very wrong
> > > way to think about that.
> >
> > Well, I can use 'diagnose' or 'determine' if you want. But at the end
> > of the day, we have customers that see scary software messages and
> > expect us to give them reasonable direction to fix their problems.
>
> Usually these problems shouldn't be handled by kernel hackers,
> it's something for a hardware technician. If kernel
> hackers have to handle it something is very wrong.
>
> IMHO the software should give the customer enough information
> to fix it (or rather let their hardware technician work it out).
Yes, I agree, but the hardware folks usually like it when we give them a
better clue than 'hardware is broken'. Something like 'the network stopped
working' or 'your storage controller's firmware went bad' is usually more
helpful.

And the thing is, the hardware usually leaves a breadcrumb trail of where
things went wrong. It is just a matter of poking different chips to find
out which one generated the error and reporting that.
>
> BTW one issue is that the screen is not big enough for all
> the information that is really useful for this. So I suspect
> to have it really useful you need to accept that some information
> will only be available through serialization (e.g. if you
> wanted to dump parts of the PCI config space)
Honestly, I don't think you need much screen real estate. It would be
nice if, when an unknown NMI comes in, the kernel just poked around the
hardware registers and displayed a summary of what it found. For example:

  The following devices had error bits set in the status registers:
    PCI device x:y.z - STATUS_BIT1 | STATUS_BIT2
    HW device xyz - STATUS_BIT3
    ...
This at least gives the users some hardware they can remove/replace to see
if the problem goes away.
Right now I feel like it is one giant guessing game.
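
A minimal sketch of the summary step described above, written as plain
user-space C rather than as kernel code. The bit masks follow the layout of
the 16-bit PCI status register (config space offset 0x06); the macro names,
describe_status() and report_device() are made up for illustration, and how
the status word is actually read at NMI time is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Error bits of the PCI status register (config space offset 0x06).
 * The values follow the PCI spec; the names are illustrative only. */
#define ST_MASTER_DATA_PARITY 0x0100
#define ST_SIG_TARGET_ABORT   0x0800
#define ST_REC_TARGET_ABORT   0x1000
#define ST_REC_MASTER_ABORT   0x2000
#define ST_SIG_SYSTEM_ERROR   0x4000 /* device signaled SERR# */
#define ST_DETECTED_PARITY    0x8000

struct status_flag {
	uint16_t mask;
	const char *name;
};

static const struct status_flag status_flags[] = {
	{ ST_MASTER_DATA_PARITY, "MASTER_DATA_PARITY" },
	{ ST_SIG_TARGET_ABORT,   "SIG_TARGET_ABORT" },
	{ ST_REC_TARGET_ABORT,   "REC_TARGET_ABORT" },
	{ ST_REC_MASTER_ABORT,   "REC_MASTER_ABORT" },
	{ ST_SIG_SYSTEM_ERROR,   "SIG_SYSTEM_ERROR" },
	{ ST_DETECTED_PARITY,    "DETECTED_PARITY" },
};

/* Write the names of the set error bits into buf as "A | B".
 * Returns the number of error bits set, or -1 on a bad buffer. */
int describe_status(uint16_t status, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (buf == NULL || len == 0)
		return -1;
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(status_flags) / sizeof(status_flags[0]); i++) {
		int n;

		if (!(status & status_flags[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     count ? " | " : "", status_flags[i].name);
		if (n < 0)
			break;
		/* On truncation keep the string terminated and stop appending. */
		used = ((size_t)n >= len - used) ? len - 1 : used + (size_t)n;
		count++;
	}
	return count;
}

/* Print one summary line if the device has any error bit set.
 * Returns 1 if a line was printed, 0 otherwise. */
int report_device(FILE *out, const char *name, uint16_t status)
{
	char flags[128];

	if (describe_status(status, flags, sizeof(flags)) <= 0)
		return 0;
	fprintf(out, "PCI device %s - %s\n", name, flags);
	return 1;
}
```

Calling report_device(stdout, "x:y.z", 0x4000) prints
"PCI device x:y.z - SIG_SYSTEM_ERROR"; a device with no error bits set
prints nothing, which keeps the summary short enough for a console.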
But I guess if we accept the fact that an unknown NMI will panic the box,
then we can probably be a little more liberal about breaking spinlocks and
poking around the hardware to display some useful info.
Just some thoughts.
Cheers,
Don