From: Russ Anderson <rja@sgi.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	Borislav Petkov <bp@amd64.org>,
	"Luck, Tony" <tony.luck@intel.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	Mauro Carvalho Chehab <mchehab@redhat.com>,
	"Young, Brent" <brent.young@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Matt Domsch <Matt_Domsch@dell.com>,
	Doug Thompson <dougthompson@xmission.com>,
	Joe Perches <joe@perches.com>, Ingo Molnar <mingo@elte.hu>,
	"bluesmoke-devel@lists.sourceforge.net"
	<bluesmoke-devel@lists.sourceforge.net>,
	Linux Edac Mailing List <linux-edac@vger.kernel.org>,
	rja@sgi.com
Subject: Re: Hardware Error Kernel Mini-Summit
Date: Mon, 24 May 2010 11:21:24 -0500
Message-ID: <20100524162124.GB7145@sgi.com>
In-Reply-To: <20100519090323.GA18073@basil.fritz.box>

On Wed, May 19, 2010 at 11:03:24AM +0200, Andi Kleen wrote:
> Hi Eric,
> 
> > I'm not ready to believe the average person that is running linux
> > is too stupid to understand the difference between a hardware
> > error and a software error.
> 
> Experience disagrees with you (that is not sure about average,
> but at least there's a significant portion) 
> 
> Also again today there are other reasons for it.

I agree with Andi.  While there is a wide range of users, the
vast majority know little about the hardware they are running
on.  Even in commercial settings, where users/admins are better
educated, there is little time to do detailed error analysis.

The more errors are detected/analyzed/corrected/recovered, the
better it is for everyone.

 
> > > Really to do anything useful with them you need trends
> > > and automatic actions (like predictive page offlining)
> > 
> > Not at all, and I don't have a clue where you start thinking
> > predictive page offlining makes the least bit of sense.  Broken
> > or even weak bits are rarely the common reason for ECC errors.
> 
> There are various studies that disagree with you on that.

Having the infrastructure to automatically off-line pages
is a good thing.  The details of where to set the predictive
threshold likely will be hardware specific (different DIMM
types failing at different rates).  It needs to be adjustable.
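
To make "adjustable threshold" concrete, here is a minimal userspace
sketch in Python.  It is not kernel code and not an existing interface:
the class name, the 4 KiB page size, and the sliding-window policy are
all assumptions chosen for illustration.  It counts corrected errors per
page and reports when a page crosses a threshold that the caller can
tune per hardware type.

```python
from collections import defaultdict, deque

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class PageErrorTracker:
    """Hypothetical tracker: corrected errors per page in a sliding window."""

    def __init__(self, threshold, window_seconds):
        # Both values are meant to be tunable, e.g. per DIMM type.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # page frame number -> timestamps
        self.offlined = set()              # pages already flagged

    def record(self, phys_addr, now):
        """Record one corrected error.

        Returns True exactly once per page, at the moment the page
        reaches the threshold and should be taken offline.
        """
        pfn = phys_addr >> PAGE_SHIFT
        if pfn in self.offlined:
            return False
        stamps = self._events[pfn]
        stamps.append(now)
        # Drop events that have aged out of the window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.threshold:
            self.offlined.add(pfn)
            del self._events[pfn]
            return True
        return False
```

With threshold=3 and window_seconds=60, three errors on one page within
a minute flag it, while three errors spread over several minutes do
not.  What "take the page offline" means is left to the caller.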

> > > A log isn't really a good format for that
> > 
> > A log is a fine format for realizing you have a problem.  A
> 
> A low steady rate of corrected errors on a large system
> is expected.  In fact if you look at the memory error log.
> of a large system (towards TBs) it nearly always has some 
> memory related events.

Yes, there are certainly examples of that.  

> In this case a log is not really useful. What you need
> is useful thresholds and a good summary.

The larger the system the more important a good summary is.

> > - Errors that occur frequently. That is broken hardware of one time or
> >   another.  I want to know about that so I can schedule down time to replace
> >   my memory before I get an uncorrected ECC error.  Errors of this kind
> >   are likely happening frequently enough as to impact performance.
> 
> Same issue here: if something is truly broken it floods
> you with errors.
> 
> First this costs a lot of time to process and it does not 
> actually tell you anything useful because most errors in a flood
> are similar.
> 
> Basically you don't care if you have 100 or 1000 errors, 
> and you definitely don't want all the of the errors filling up
> your disk and using up your CPU.
> 
> Again a threshold with an action is much more useful here.

Yes, good points.
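
A rough sketch of "threshold with an action" plus a summary, again in
Python and again purely illustrative (the class, the source labels, and
the callback are hypothetical, not an existing tool's API).  A flood is
collapsed into one counter and one sample record per source, and the
action fires once when the count reaches the threshold, rather than
once per error.

```python
from collections import Counter


class FloodSummary:
    """Hypothetical collector: per-source counts instead of a full log."""

    def __init__(self, action_threshold, on_threshold):
        self.action_threshold = action_threshold
        self.on_threshold = on_threshold  # called with the source name
        self.counts = Counter()
        self.first_seen = {}              # one sample record per source

    def record(self, source, detail):
        self.counts[source] += 1
        if self.counts[source] == 1:
            # Keep only the first record; later ones in a flood are
            # assumed to be similar and are just counted.
            self.first_seen[source] = detail
        if self.counts[source] == self.action_threshold:
            # Equality check, so the action fires once per source.
            self.on_threshold(source)

    def report(self):
        """One line per source, most errors first."""
        return [
            f"{src}: {n} corrected errors (first: {self.first_seen[src]})"
            for src, n in self.counts.most_common()
        ]
```

Storage and CPU cost stay bounded by the number of sources, not the
number of errors, which is the point when a broken part floods the
system.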

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com
