From: Linas Vepstas <linas@austin.ibm.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>,
	"Nguyen, Tom L" <tom.l.nguyen@intel.com>,
	Paul Mackerras <paulus@samba.org>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	Greg KH <greg@kroah.com>,
	Linux Kernel list <linux-kernel@vger.kernel.org>,
	ak@muc.de, linuxppc64-dev <linuxppc64-dev@ozlabs.org>,
	linux-pci@atrey.karlin.mff.cuni.cz
Subject: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)
Date: Fri, 18 Mar 2005 18:35:32 -0600
Message-ID: <20050319003532.GS498@austin.ibm.com>
In-Reply-To: <1111187582.1236.192.camel@gaston>

On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
> 
> Additionally, in "real life", very few errors are caused by known errata.
> If the drivers know about the errata, they usually already work around
> them. Afaik, most of the errors are caused by transient conditions on
> the bus or the device, like a bit being flipped, or thermal
> conditions...


Heh. Let me describe "real life" a bit more accurately.

We've been running with pci error detection enabled here for the last
two years.  Based on this experience, the ballpark figures are:

90% of all detected errors were device driver bugs coupled to 
    pci card hardware errata

9% poorly seated pci cards (remove/reseat will make problem go away)

1% transient/other.


We've seen *EVERY*, and I mean *EVERY*, device driver that we've put
under stress tests (e.g. peak i/o rates for > 72 hours, massive
tcp/nfs traffic, massive disk i/o traffic, etc.) trip on an EEH error
detect that was traced back to a device driver bug.  Not to blame the
drivers: a lot of these were related to pci card hardware/firmware
bugs.  For example, I think grepping for "split completion" and "NAPI"
in the patches/errata for e100 and e1000 for the last year will reveal
some of the stuff that was found.  As far as I know, for every bug
found, a patch made it into mainline.

As a rule, it seems that finding these device driver bugs was
very hard; we had some people work on these for months, and in 
the case of the e1000, we managed to get Intel engineers to fly
out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
Ditto for Emulex.  For ipr, we had in-house people.

So overall, PCI error detection did have the expected effect
(protecting the kernel from corruption, e.g. due to DMAs going
to wild addresses), but I don't think anybody expected that the
vast majority of errors would be software/hardware bugs, instead of
transient effects.

What's ironic in all of this is that by adding error recovery,
device driver bugs will be able to hide more effectively ... 
if there's a pci bus error due to a driver bug, the pci card
will get rebooted, the kernel will burp for 3 seconds, and 
things will keep going, and most sysadmins won't notice or 
won't care.

--linas

