All of lore.kernel.org
 help / color / mirror / Atom feed
From: Don Zickus <dzickus@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with unknown NMI
Date: Mon, 13 Sep 2010 11:47:50 -0400	[thread overview]
Message-ID: <20100913154750.GA26290@redhat.com> (raw)
In-Reply-To: <20100913172438.37443bf7@basil.nowhere.org>

On Mon, Sep 13, 2010 at 05:24:38PM +0200, Andi Kleen wrote:
> 
> Don,
> 
> > Unfortunately, most of the bugzillas I deal with, unkown NMIs are the
> > result of SERRs.  While you can consider that hardware error
> > reporting, the easiest way for me to debug those problems currently
> > is to have reporters run 'lspci -vvv' after the NMI is displayed to
> > figure out who caused the NMI.
> > 
> > My fear is that panic'ing the box on unknown NMIs on those platforms
> > will hinder my ability to easily debug those NMIs.
> 
> 
> The reason the NMI is sent is that there is a "lost" 
> data corruption somewhere in the system and if you don't 
> stop it the system the corruption may make it to disk,
> become permanent etc. The hardware was designed
> under the assumption that  the system is stopped by software
> when this happens (same reason as why many machine
> checks cause panics)

Yeah, I know. I was being too ignorant perhaps.

> 
> Then there isn't necessarily something to "debug": data corruption
> can happen without any bugs being around (and in fact
> that's the common case, assuming production systems)
> 
> So I'm not sure what you're debugging here. Are you being the support
> technician for the system through bugzilla? That sounds
> inefficient.

The problem I repeatedly deal with for RHEL systems is a customer sees an
unknown NMI printed on their screen and sometimes the machine falls apart
shortly after, sometimes it doesn't.  Obviously they are going to file a
bug asking why.  A chunk of the problems are bad hardware/firmware.  But
the problem is which one.

Replacing a slot card is easy, replacing a motherboard is not.  So I
usually try to determine which device is failing by walking the pci bus
and looking for the serr bits or some of the pci-e status bits.

It is inefficient, but I haven't had time to figure out a way to clean it
up.  And just for the record, I usually see an unknown NMI report every
other week.

> 
> Anyways for hardware support we could probably dump some
> more information at panic or better through error
> serialization, but the word "debug" is IMHO an very wrong
> way to think about that.

Well, I can use 'diagnos' or 'determine' if you want.  But at the end of
the day, we have customers that see scary software messages and expect us
to give them reasonable direction to fix their problems.

> 
> If this is about driver debugging it's entirely reasonable
> to have a special setting (e.g. disable the panic), 
> but the defaults should be set in a way to avoid
> spreading data corruption,.

Ok.  I can accept that.

Cheers,
Don

  reply	other threads:[~2010-09-13 15:48 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-10  2:51 [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants Huang Ying
2010-09-10  2:51 ` [RFC 2/6] x86, NMI, Add touch_nmi_watchdog to io_check_error delay Huang Ying
2010-09-10  2:51 ` [RFC 3/6] x86, NMI, Rename memory parity error to PCI SERR error Huang Ying
2010-09-13  1:02   ` Robert Richter
2010-09-13  2:02     ` Huang Ying
2010-09-16  8:18       ` Robert Richter
2010-09-17  0:08         ` Huang Ying
2010-09-17  9:14           ` Robert Richter
2010-09-19  0:20             ` Huang Ying
2010-09-20  8:00               ` Robert Richter
2010-09-20 12:59                 ` Borislav Petkov
2010-09-21  0:22                   ` Huang Ying
2010-09-21  6:37                     ` Borislav Petkov
2010-09-21 14:08                       ` Doug Thompson
2010-09-21 23:04   ` Maciej W. Rozycki
2010-09-23  5:37     ` huang ying
2010-09-29  0:26       ` Maciej W. Rozycki
2010-09-10  2:51 ` [RFC 4/6] x86, NMI, Rewrite NMI handler Huang Ying
2010-09-10 15:56   ` Don Zickus
2010-09-10 16:03     ` Andi Kleen
2010-09-10 18:29       ` Don Zickus
2010-09-13  2:09         ` Huang Ying
2010-09-13 14:04           ` Don Zickus
2010-09-14  5:12             ` Huang Ying
2010-09-14 13:37               ` Don Zickus
2010-09-13  1:16   ` Robert Richter
2010-09-10  2:51 ` [RFC 5/6] x86, NMI, Add support to notify hardware error with unknown NMI Huang Ying
2010-09-10 16:02   ` Don Zickus
2010-09-10 16:19     ` Andi Kleen
2010-09-10 18:40       ` Don Zickus
2010-09-13  2:19         ` Huang Ying
2010-09-13 14:11           ` Don Zickus
2010-09-13 15:24             ` Andi Kleen
2010-09-13 15:47               ` Don Zickus [this message]
2010-09-13 16:57                 ` Andi Kleen
2010-09-13 17:53                   ` Don Zickus
2010-09-13 18:07                     ` Andi Kleen
2010-09-13 18:23                       ` Don Zickus
2010-09-13 18:36                         ` Andi Kleen
2010-09-13 19:36                           ` Don Zickus
2010-09-13 20:49                             ` Andi Kleen
2010-09-13 21:25                               ` Valdis.Kletnieks
2010-09-14  7:48                                 ` Andi Kleen
2010-09-14 17:54                                   ` Valdis.Kletnieks
2010-09-14 12:21                             ` Ingo Molnar
2010-09-14 13:45                               ` Don Zickus
2010-09-14 19:34                               ` Cyrill Gorcunov
2010-09-15  9:29                                 ` Ingo Molnar
2010-09-10  2:51 ` [RFC 6/6] x86, NMI, Remove do_nmi_callback logic Huang Ying
2010-09-10 16:13   ` Don Zickus
2010-09-13  2:27     ` Huang Ying
2010-09-13  6:24       ` Ingo Molnar
2010-09-10 20:37 ` [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants Peter Zijlstra
2010-09-10 22:58   ` H. Peter Anvin
2010-09-11  8:50   ` Andi Kleen
2010-09-13  1:30     ` Robert Richter
2010-09-21 21:48 ` Don Zickus
2010-09-21 22:19   ` Andi Kleen
2010-09-22 16:07     ` Don Zickus
2010-09-23  9:29       ` huang ying
2010-09-23 14:16         ` Don Zickus
2010-09-24 11:50           ` huang ying
2010-09-24 14:29             ` Don Zickus
2010-09-23  9:51   ` huang ying
  -- strict thread matches above, loose matches on Subject: below --
2010-09-14 14:31 [RFC 5/6] x86, NMI, Add support to notify hardware error with unknown NMI Andi Kleen
2010-09-14 15:17 ` Don Zickus
2010-09-14 17:40   ` Andi Kleen
2010-09-14 17:48 ` Ingo Molnar
2010-09-15  5:06   ` Huang Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100913154750.GA26290@redhat.com \
    --to=dzickus@redhat.com \
    --cc=andi@firstfloor.org \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.