public inbox for linux-ia64@vger.kernel.org
From: Ben Woodard <woodard@redhat.com>
To: linux-ia64@vger.kernel.org
Subject: RE: new utility for decoding salinfo records
Date: Tue, 11 Jan 2005 20:26:17 +0000
Message-ID: <1105475177.22104.113.camel@quince.llnl.gov>
In-Reply-To: <1105458388.22104.7.camel@quince.llnl.gov>

On Tue, 2005-01-11 at 11:49, Luck, Tony wrote:
> >  Ben>        salinfo_decode2 also has the capability to generate
> >  Ben> output that is designed to be easily parsed by a machine. This
> >  Ben> is useful when you want to automate monitoring of large numbers
> >  Ben> of machines. For example, instead of having scripts notify you
> >  Ben> every time an ignorable single bit memory error occurs, the
> >  Ben> monitoring scripts can easily ignore those errors and only
> >  Ben> point out higher priority error conditions.
> >
> >It seems a bit dangerous to me to encourage ignoring single-bit
> >errors.  Perhaps it would be better to suggest to summarize these
> >errors?
> 
> Ben's world view might be a little skewed by his test case :-)
> 
> http://www.californiadigital.com/thunder.shtml
> [web page is out of date in regard to position on the top500 list, it
> was pushed down to #5 in the latest list].
> 
> For this system you really wouldn't want to wake your system
> admins for every single bit error that was reported (though
> summarizing the errors in a weekly/monthly report would of course
> be a good thing).  I believe that salinfo_decode2 makes doing
> this easy too.
> 

Tony is correct about that; I really don't have much experience with
anything except Thunder. Working exclusively on a fairly unique machine
gives one a fairly unique perspective.

What we find here is that:

1) almost all nodes get some SBEs once in a while. Over time these
accumulate in the directory. We don't consider this to be a problem.
Gamma rays do happen and if you have a big enough target and you sample
for a long enough time, you are bound to catch a few.

2) A few nodes (0.48% or about .06% of the DIMMs) get around 39-173
SBEs/week. This does not seem to be a problem, and the rate doesn't
seem to get worse over time. We have decided as a policy to accept this
reasonably low rate of SBEs as "OK". In the worst case, we seem to get
about 1 SBE/hr.

3) If there is a real failure, it shows up really quickly: we see all
sorts of SBEs or MBEs. In that case we replace the DIMM immediately.
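
As a rough illustration, the three cases above could be sketched as a
small triage function. Everything named here is hypothetical: DimmWeek,
classify, the "ok"/"summarize"/"replace" labels, and the cutoffs. The
message gives no exact threshold for a "real failure", so treating any
MBE, or any week above the 173 SBEs/week top of the observed range, as
one is an assumption made only for this sketch.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 7 * 24  # 168


@dataclass
class DimmWeek:
    """Hypothetical weekly error counts for one DIMM."""
    sbe: int  # single-bit errors seen this week
    mbe: int  # multi-bit errors seen this week


def sbe_per_hour(sbe_per_week: int) -> float:
    """Convert a weekly SBE count to an average hourly rate."""
    return sbe_per_week / HOURS_PER_WEEK


def classify(week: DimmWeek, tolerated_sbe_per_week: int = 173) -> str:
    """Triage one DIMM-week.

    Assumption: any MBE, or more SBEs than the tolerated weekly
    count, is treated as a real failure. The default of 173 is the
    top of the range quoted in case 2, not a stated policy limit.
    """
    if week.mbe > 0 or week.sbe > tolerated_sbe_per_week:
        return "replace"    # case 3: swap the DIMM immediately
    if week.sbe > 0:
        return "summarize"  # cases 1 and 2: report, do not page anyone
    return "ok"


if __name__ == "__main__":
    # 173 SBEs/week works out to roughly the "about 1 SBE/hr" worst case.
    print(f"{sbe_per_hour(173):.2f} SBE/hr")
    print(classify(DimmWeek(sbe=40, mbe=0)))
    print(classify(DimmWeek(sbe=0, mbe=2)))
```

The tolerated count is a parameter so that a site with a different
tolerance can pass its own value rather than inherit this one.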

So does anyone with "normal world" experience have any suggestions on
how I should take into account the various perspectives? 

Do other people consider the isolated SBE a problem? 

Do other people consider 1 SBE/hr on a DIMM a real problem that needs to
be fixed?


> -Tony

