From: Andi Kleen <andi@firstfloor.org>
To: Don Zickus <dzickus@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with unknown NMI
Date: Mon, 13 Sep 2010 18:57:21 +0200
Message-ID: <20100913185721.59ad9b4d@basil.nowhere.org>
In-Reply-To: <20100913154750.GA26290@redhat.com>

On Mon, 13 Sep 2010 11:47:50 -0400
Don Zickus <dzickus@redhat.com> wrote:

> 
> > 
> > Then there isn't necessarily something to "debug": data corruption
> > can happen without any bugs being around (and in fact
> > that's the common case, assuming production systems)
> > 
> > So I'm not sure what you're debugging here. Are you being the
> > support technician for the system through bugzilla? That sounds
> > inefficient.
> 
> The problem I repeatedly deal with for RHEL systems is a customer
> sees an unknown NMI printed on their screen and sometimes the machine
> falls apart shortly after, sometimes it doesn't.  Obviously they are
> going to file a bug asking why.  A chunk of the problems are bad
> hardware/firmware.  But the problem is which one.

NMIs are usually hardware.

BTW one big issue here is that we don't display anything
on the screen so the system seems hung. KMS solves this,
but unfortunately not for the video chipsets 
often used in servers.

Part of it is solved by serializing the error
and defaulting to reboot after panic (currently NMI doesn't do that;
MCE already does, and NMI should too IMHO).

> 
> Replacing a slot card is easy, replacing a motherboard is not.  So I
> usually try to determine which device is failing by walking the pci
> bus and looking for the serr bits or some of the pci-e status bits.

You don't necessarily need to replace anything; it could
be just unlucky data corruption (e.g. a big enough cosmic ray
that flipped enough bits that the normal error correction
couldn't fix it anymore).

> 
> It is inefficient, but I haven't had time to figure out a way to
> clean it up.  And just for the record, I usually see an unknown NMI
> report every other week.

At least ignoring the data corruption is not the way to handle
it. I don't think you'll do your customers a favor this way.
 
> > Anyways for hardware support we could probably dump some
> > more information at panic or better through error
> > serialization, but the word "debug" is IMHO a very wrong
> > way to think about that.
> 
> Well, I can use 'diagnose' or 'determine' if you want.  But at the end
> of the day, we have customers that see scary software messages and
> expect us to give them reasonable direction to fix their problems.

Usually these problems shouldn't be handled by kernel hackers,
it's something for a hardware technician. If kernel
hackers have to handle it something is very wrong.

IMHO the software should give the customer enough information
to fix it (or rather to let their hardware technician work it out).

If it's not good enough for this we have to improve it. But
ignoring the errors is not the way to do that.

BTW one issue is that the screen is not big enough for all
the information that is really useful for this. So I suspect
that to make it really useful you need to accept that some information
will only be available through serialization (e.g. if you
wanted to dump parts of the PCI config space).
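
For illustration only, here is a rough user-space sketch of what looking
at the SERR and related bits in PCI config space can look like. It is
not the code from this patch series and not necessarily how anyone in
this thread does it. It assumes a Linux system that exposes each
device's config space as a binary `config` file under
/sys/bus/pci/devices. The names decode_pci_status() and walk_pci_bus()
are made up for this example. The bit positions are those of the
standard 16-bit PCI status register at offset 0x06.

```python
import glob
import os
import struct

# Status register offset and error bits in the standard PCI config header.
PCI_STATUS_OFFSET = 0x06
STATUS_BITS = {
    0x0100: "master data parity error",
    0x0800: "signaled target abort",
    0x1000: "received target abort",
    0x2000: "received master abort",
    0x4000: "signaled system error (SERR)",
    0x8000: "detected parity error",
}


def decode_pci_status(config: bytes) -> list:
    """Return the names of the error bits set in the PCI status register.

    Hypothetical helper: `config` is a raw dump of at least the first
    8 bytes of a device's config space.
    """
    if len(config) < PCI_STATUS_OFFSET + 2:
        raise ValueError("config space dump too short")
    # Config space is little-endian; the status register is 16 bits wide.
    (status,) = struct.unpack_from("<H", config, PCI_STATUS_OFFSET)
    return [name for mask, name in sorted(STATUS_BITS.items()) if status & mask]


def walk_pci_bus(root="/sys/bus/pci/devices"):
    """Yield (device address, error names) for devices with error bits set.

    Assumes the sysfs layout <root>/<address>/config; unreadable
    devices are skipped.
    """
    for path in sorted(glob.glob(os.path.join(root, "*", "config"))):
        try:
            with open(path, "rb") as f:
                config = f.read(64)  # standard header only
        except OSError:
            continue
        errors = decode_pci_status(config)
        if errors:
            yield os.path.basename(os.path.dirname(path)), errors


if __name__ == "__main__":
    for address, errors in walk_pci_bus():
        print(address, ", ".join(errors))
```

This only reads the legacy status register. The PCI-e status bits
mentioned above live in capability structures further into config space
and are not covered by this sketch.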

-Andi



-- 
ak@linux.intel.com -- Speaking for myself only.
