* new utility for decoding salinfo records
@ 2005-01-11 15:46 Ben Woodard
From: Ben Woodard @ 2005-01-11 15:46 UTC (permalink / raw)
To: linux-ia64
Excuse me if this ends up being a duplicate. I mailed this out last
night, but for some reason it hasn't come through. It is not in the
archives, nor have I seen it come back through to my mailbox.
Here is a new utility for looking into salinfo records. It does several
things differently than salinfo_decode. We have found that this helps
considerably in understanding problems on our Itanium servers. The
attached patch applies to salinfo-0.7 and does not modify
salinfo_decode's functioning in any way. In fact, the only files that
are modified are the Makefile and the spec file.
Here is the man page which tries to illustrate some of the features
which were designed into salinfo_decode2.
SALINFO_DECODE2(8) Decode Itanium SAL Error Records SALINFO_DECODE2(8)
NAME
salinfo_decode2 - decode Itanium SAL error records
SYNOPSIS
salinfo_decode2 [OPTION]... [FILE | DIRECTORY]...
DESCRIPTION
salinfo_decode2 decodes CMC/CPE/MCA/INIT records obtained from the SAL.
It will take a list of files or directories and print out the requested
information about the salinfo records that are contained within those
files. This is notably different than the salinfo_decode program which
processes only a single record at a time. Experience has shown that it
can be difficult to identify a hardware failure of the type found in
the salinfo logs because the failure results in many salinfo records
being created. salinfo_decode2 allows a system administrator to glance
at a directory full of errors or some subset of files and obtain an
overall impression of how meaningful the errors are. This is done by
turning down the verbosity and generalizing what is there. More
experienced administrators can turn up the verbosity and get
progressively more detailed information.
salinfo_decode2 also has the capability to generate output that is
designed to be easily parsed by a machine. This is useful when you want
to automate monitoring of large numbers of machines. For example,
instead of having scripts notify you every time an ignorable single bit
memory error occurs, the monitoring scripts can easily ignore those
errors and only point out higher priority error conditions.
If no files or directories are specified on the command line, stdin is
read and is assumed to be a SAL record.
salinfo_decode2 also has the advantage that a SAL record from an ia64
can be inspected and analyzed on a non-ia64, non-little endian machine.
For example, a system administrator using an ia32 workstation can
inspect SAL records from an ia64 cluster. The design of the original
salinfo_decode’s internal architecture precludes this kind of cross-
platform utilization.
OPTIONS
-h, --help
Print usage and exit
-V, --version
Print version information and exit
-c, --cmc
Only print cmc records
-p, --cpe
Only print cpe records
-m, --mca
Only print mca records
-i, --init
Only print init records
-d, --dimm-offset
Count DIMMs starting at 1, not 0. This is useful when the SAL
reports failures starting with 0 but the numbers silk-screened
on the motherboard begin with 1. This helps reduce system
administrator confusion when replacing a memory DIMM.
-o, --cpu-offset
Count CPUs starting at 1, not 0. This is useful when the SAL
reports failures starting with 0 but the numbers silk-screened
on the motherboard begin with 1. This helps reduce system
administrator confusion when replacing CPUs.
--tiger4
The same as -d & -o. The Intel Tiger 4 motherboard’s silkscreen
counts both CPUs and DIMMs beginning with 1 rather than 0 which
is what the SAL returns.
-f, --forgiving
Be forgiving of errors when opening files and reading data
-r, --recursive
When a database is a directory traverse its sub-directories
-v, --verbosity
Specify the verbosity used to print records. Verbosity can be
1-6. However, as the verbosity increases, so does the likelihood
that the printing of the detailed information has not been
implemented yet. Patches to remedy this situation are eagerly
accepted. The goal of the progressive levels of verbosity is
to facilitate understanding of records, not just to blurt out
every scrap of available information. Since verbosity 6 is
largely not implemented yet, if you need all of the available
information, use the original salinfo_decode.
-s, --scriptable
Output in a machine readable format. This is designed to
facilitate quick and easy shell scripting with the output. Refer
to the examples section for intended use.
EXAMPLES
Pointing salinfo_decode2 at a directory containing a few errors, with
the verbosity set very low, shows that all the errors are mainly
inconsequential:
$ ./salinfo_decode2 -v1 tigertest/
cpe with severity "corrected" occurred at 12:03:08 on Apr 1 2004
cpe with severity "corrected" occurred at 12:03:10 on Apr 1 2004
cpe with severity "corrected" occurred at 12:32:14 on Apr 1 2004
cpe with severity "corrected" occurred at 17:24:44 on Apr 1 2004
Here is an example of how different levels of verbosity present the
same SAL record differently:
$ ./salinfo_decode2 -v1 sample_data/tdev2-2004-04-01-12:03:08-cpu1-cpe0
cpe with severity "corrected" occurred at 12:03:08 on Apr 1 2004
$ ./salinfo_decode2 -v2 sample_data/tdev2-2004-04-01-12:03:08-cpu1-cpe0
record 612413502631444488 contains the following sections: (PCI component) (PCI component) (PCI component) (PCI component) (memory) (platform specific)
$ ./salinfo_decode2 -v3 sample_data/tdev2-2004-04-01-12:03:08-cpu1-cpe0
record 612413502631444488 contains the following sections:
PCI component with (vend/dev) 8086/500 at (Seg/Bus/Dev/Func) 0/255/24/0 reported a fault
PCI component with (vend/dev) 8086/501 at (Seg/Bus/Dev/Func) 0/255/24/1 reported a fault
PCI component with (vend/dev) 8086/502 at (Seg/Bus/Dev/Func) 0/255/24/2 reported a fault
PCI component with (vend/dev) 8086/503 at (Seg/Bus/Dev/Func) 0/255/24/3 reported a fault
Memory fault at (node/card/module/bank/device) 0/0/8/0/0
OEM component with id 0x44fc4766d807e40f reported a fault
Here is an example of how to use the scriptable interface to change the
formatting of the output and to select, out of many records, the one
that matches specific criteria.
$ ./salinfo_decode2 -v1 -s sample_data/ | while read line;do
> eval $line
> if [ "$severity" != "corrected" ];then
> echo $month/$day/$year
> fi
> done
4/1/2004
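As a further (hypothetical) illustration of the post-processing the -s output enables, records can be tallied rather than acted on one at a time. The field names (severity, month, day, year) come from the example above; the one-line-of-assignments-per-record format is an assumption inferred from the `eval $line` idiom shown:

```shell
# Tally scriptable output by severity instead of handling each record.
# The here-doc stands in for `salinfo_decode2 -v1 -s <dir>` output,
# since the exact record format is assumed here, not documented.
count_by_severity() {
  while read line; do
    eval "$line"           # sets severity/month/day/year, as in the example
    echo "$severity"
  done | sort | uniq -c
}

count_by_severity <<'EOF'
severity=corrected month=4 day=1 year=2004
severity=corrected month=4 day=1 year=2004
severity=fatal month=4 day=2 year=2004
EOF
```

With the sample input above, this prints a count of 2 for "corrected" and 1 for "fatal", which is the kind of summary a monitoring script would act on.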
BUGS
Many levels of verbosity for many types of errors are not yet
implemented. The project reached a state where it did what the users
needed it to do, and then I was asked to work on other things. Patches
are gratefully accepted.
AUTHOR
Ben Woodard <woodard@redhat.com>
SEE ALSO
salinfo_decode(8)
Linux Jan 6, 2005 SALINFO_DECODE2(8)
[-- Attachment #2: salinfo_decode2.patch.gz --]
[-- Type: application/x-gzip, Size: 29546 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: new utility for decoding salinfo records
From: David Mosberger @ 2005-01-11 19:03 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 11 Jan 2005 07:46:28 -0800, Ben Woodard <woodard@redhat.com> said:
Ben> salinfo_decode2 also has the capability to generate
Ben> output that is designed to be easily parsed by a machine. This
Ben> is useful when you want to automate monitoring of large numbers
Ben> of machines. For example, instead of having scripts notify you
Ben> every time an ignorable single bit memory error occurs, the
Ben> monitoring scripts can easily ignore those errors and only
Ben> point out higher priority error conditions.
It seems a bit dangerous to me to encourage ignoring single-bit
errors. Perhaps it would be better to suggest summarizing these
errors?
--david
* RE: new utility for decoding salinfo records
From: Luck, Tony @ 2005-01-11 19:49 UTC (permalink / raw)
To: linux-ia64
> Ben> salinfo_decode2 also has the capability to generate
> Ben> output that is designed to be easily parsed by a machine. This
> Ben> is useful when you want to automate monitoring of large numbers
> Ben> of machines. For example, instead of having scripts notify you
> Ben> every time an ignorable single bit memory error occurs, the
> Ben> monitoring scripts can easily ignore those errors and only
> Ben> point out higher priority error conditions.
>
>It seems a bit dangerous to me to encourage ignoring single-bit
>errors. Perhaps it would be better to suggest to summarize these
>errors?
Ben's world view might be a little skewed by his test case :-)
http://www.californiadigital.com/thunder.shtml
[The web page is out of date with regard to its position on the top500
list; it was pushed down to #5 in the latest list.]
For this system you really wouldn't want to wake your system
admins for every single bit error that was reported (though
summarizing the errors in a weekly/monthly report would of course
be a good thing). I believe that salinfo_decode2 makes doing
this easy too.
-Tony
* RE: new utility for decoding salinfo records
From: David Mosberger @ 2005-01-11 20:25 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 11 Jan 2005 11:49:52 -0800, "Luck, Tony" <tony.luck@intel.com> said:
Tony> For this system you really wouldn't want to wake your system
Tony> admins for every single bit error that was reported
Of course not.
Tony> (though summarizing the errors in a weekly/monthly report
Tony> would of course be a good thing). I believe that
Tony> salinfo_decode2 makes doing this easy too.
Yes. While individual single-bit errors aren't terribly interesting,
periodic summaries almost certainly would be. If only so you know
when to order replacement DIMMs... ;-)
--david
* RE: new utility for decoding salinfo records
From: Ben Woodard @ 2005-01-11 20:26 UTC (permalink / raw)
To: linux-ia64
On Tue, 2005-01-11 at 11:49, Luck, Tony wrote:
> > Ben> salinfo_decode2 also has the capability to generate
> > Ben> output that is designed to be easily parsed by a machine. This
> > Ben> is useful when you want to automate monitoring of large numbers
> > Ben> of machines. For example, instead of having scripts notify you
> > Ben> every time an ignorable single bit memory error occurs, the
> > Ben> monitoring scripts can easily ignore those errors and only
> > Ben> point out higher priority error conditions.
> >
> >It seems a bit dangerous to me to encourage ignoring single-bit
> >errors. Perhaps it would be better to suggest to summarize these
> >errors?
>
> Ben's world view might be a little skewed by his test case :-)
>
> http://www.californiadigital.com/thunder.shtml
> [web page is out of date in regard to position on the top500 list, it
> was pushed down to #5 in the latest list].
>
> For this system you really wouldn't want to wake your system
> admins for every single bit error that was reported (though
> summarizing the errors in a weekly/monthly report would of course
> be a good thing). I believe that salinfo_decode2 makes doing
> this easy too.
>
Tony is correct about that, I really don't have much experience with
anything except Thunder. Working exclusively on a fairly unique machine
gives one a fairly unique perspective.
What we find here is that:
1) Almost all nodes get some SBEs once in a while. Over time these
accumulate in the directory. We don't consider this to be a problem.
Gamma rays do happen, and if you have a big enough target and you
sample for a long enough time, you are bound to catch a few.
2) A few nodes (0.48%, or about 0.06% of the DIMMs) get around 39-173
SBEs/week. This does not seem to be a problem, and it doesn't seem to
get worse. We have decided as a policy to accept this reasonably low
rate of SBEs as "OK". In the worst case, we seem to get about
1 SBE/hr.
3) If there is a real failure, it shows up really quickly. We see all
sorts of SBEs or MBEs. In that case we replace the DIMM immediately.
So does anyone with "normal world" experience have any suggestions on
how I should take into account the various perspectives?
Do other people consider the isolated SBE a problem?
Do other people consider 1 SBE/hr on a DIMM a real problem that needs
to be fixed?
> -Tony
* RE: new utility for decoding salinfo records
From: Mark Goodwin @ 2005-01-11 20:53 UTC (permalink / raw)
To: linux-ia64
On Tue, 11 Jan 2005, Ben Woodard wrote:
> ...
> 3) If there is a real failure, it shows up really quickly. We have all
> sorts of SBEs or MBEs. In that case we replace the DIMM immediately.
>
> So does anyone with "normal world" experience have any suggestions on
> how I should take into account the various perspectives?
>
> Do other people consider the isolated SBE a problem?
considered normal, fully recoverable.
>
> Do other people consider 1SBE/hr on a DIMM a real problem that needs to
> be fixed?
this is a concern if the failing DIMM ends up with uncorrectable MBEs.
Do you have any evidence that a relatively high rate of SBEs on a
DIMM can be used to predict that MBEs are likely to start occurring?
Memory hot-unplug or a bad-page reserving strategy based on such
prediction may be interesting.
-- Mark
* RE: new utility for decoding salinfo records
From: Ben Woodard @ 2005-01-11 21:03 UTC (permalink / raw)
To: linux-ia64
On Tue, 2005-01-11 at 12:53, Mark Goodwin wrote:
> On Tue, 11 Jan 2005, Ben Woodard wrote:
> > ...
> > 3) If there is a real failure, it shows up really quickly. We have all
> > sorts of SBEs or MBEs. In that case we replace the DIMM immediately.
> >
> > So does anyone with "normal world" experience have any suggestions on
> > how I should take into account the various perspectives?
> >
> > Do other people consider the isolated SBE a problem?
>
> considered normal, fully recoverable.
>
> >
> > Do other people consider 1SBE/hr on a DIMM a real problem that needs to
> > be fixed?
>
> this is a concern if the failing DIMM ends up with uncorrectable MBEs.
> Do you have any evidence that a relatively high rate of SBEs on a
> DIMM can be used to predict that MBEs are likely to start occurring?
No, quite the contrary. We believed rates of SBEs in the neighborhood
of 1/hr would ultimately lead to MBEs, but further testing has shown
that we really don't see DIMMs with SBEs turning into MBEs.
We did replace plenty of DIMMs that had higher rates of SBEs, simply
because it takes computational time to handle an SBE and we feared it
would introduce additional time into tightly coupled scientific
codes.
> Memory hot-unplug or a bad-page reserving strategy based on such
> prediction may be interesting.
>
> -- Mark
* Re: new utility for decoding salinfo records
From: Ben Woodard @ 2005-01-11 21:12 UTC (permalink / raw)
To: linux-ia64
On Tue, 2005-01-11 at 11:03, David Mosberger wrote:
> >>>>> On Tue, 11 Jan 2005 07:46:28 -0800, Ben Woodard <woodard@redhat.com> said:
>
> Ben> salinfo_decode2 also has the capability to generate
> Ben> output that is designed to be easily parsed by a machine. This
> Ben> is useful when you want to automate monitoring of large numbers
> Ben> of machines. For example, instead of having scripts notify you
> Ben> every time an ignorable single bit memory error occurs, the
> Ben> monitoring scripts can easily ignore those errors and only
> Ben> point out higher priority error conditions.
>
> It seems a bit dangerous to me to encourage ignoring single-bit
> errors. Perhaps it would be better to suggest to summarize these
> errors?
>
> --david
Does this sound like better wording to you?
salinfo_decode2 also has the capability to generate output that is
designed to be easily parsed by a machine. This is useful when you want
to automate monitoring of large numbers of machines. For example,
instead of having scripts notify the sysadmin every time an isolated
single bit memory error occurs, the monitoring scripts can be designed
to ignore those isolated errors (but save them for later summary
reporting) and notify the sysadmin only if the rate exceeds a specified
threshold.
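The threshold idea in that wording could be sketched as follows. This is not part of salinfo_decode2; the per-DIMM count input format ("<dimm> <count>", one per line) is an assumption for illustration:

```shell
# A sketch of threshold-based alerting, not part of salinfo_decode2.
# sbe_alert reads per-DIMM SBE counts on stdin (an assumed format)
# and reports only the DIMMs whose count exceeds the threshold.
sbe_alert() {
  threshold=$1
  while read dimm count; do
    if [ "$count" -gt "$threshold" ]; then
      echo "ALERT: $dimm logged $count SBEs (threshold $threshold)"
    fi
  done
}

# Example: 24 SBEs/day (roughly 1/hr) as the cutoff.
printf '%s\n' "dimm0 2" "dimm8 41" | sbe_alert 24
# prints: ALERT: dimm8 logged 41 SBEs (threshold 24)
```

Isolated errors fall below the threshold and stay available for later summary reporting, while only sustained rates page the sysadmin.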
-ben
* Re: new utility for decoding salinfo records
From: Russ Anderson @ 2005-01-11 21:22 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
>
> Yes. While individual single-bit errors aren't terribly interesting,
> periodic summaries almost certainly would be. If only so you know
> when to order replacement DIMMs... ;-)
The only reason customers care about single bits (a recovered error)
is out of fear that they will soon lead to a multi-bit error (that
is not recoverable) that crashes the system. If the system recovers
from multi-bits without crashing, either by killing the app
that hit the multi-bit or (better) by backing up to the last
checkpoint (losing processing time, but not data), then the
customer won't even care about single bits.
Then the answer is you order the replacement DIMMs after they fail. :-)
Or maybe not even then. Hard drives have flaw tables that indicate
the parts of the disk to avoid. If memory DIMMs had flaw tables,
and the equivalent of badblocks, why would you replace a DIMM?
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@sgi.com
* RE: new utility for decoding salinfo records
From: Luck, Tony @ 2005-01-11 21:23 UTC (permalink / raw)
To: linux-ia64
>1) almost all nodes get some SBEs once in a while. Over time these
>accumulate in the directory. We don't consider this to be a problem.
>Gamma rays do happen and if you have a big enough target and
>you sample for a long enough time, you are bound to catch a few.
<nitpick>The bit flips are more likely the result of neutrons than
gamma rays ... though the neutrons in question were produced by "cosmic"
rays hitting the upper atmosphere.</nitpick>
>Do other people consider 1SBE/hr on a DIMM a real problem that needs to
>be fixed?
That's way too high for background neutrons. Possibly a DIMM that
got zapped by a static discharge sometime in its life? Or just a
manufacturing defect small enough to get past the normal testing
process?
Whether it is a problem depends on the likelihood of it cascading into
a multi-bit error ... for which I don't have any data.
-Tony
* Re: new utility for decoding salinfo records
From: David Mosberger @ 2005-01-11 21:25 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 11 Jan 2005 13:12:37 -0800, Ben Woodard <woodard@redhat.com> said:
Ben> Does this sound like better wording to you?
Yup.
>>>>> On Tue, 11 Jan 2005 13:03:22 -0800, Ben Woodard <woodard@redhat.com> said:
Ben> We believed rates of SBEs in the neighborhood of 1/hr would
Ben> ultimately lead to MBEs but further testing has shown that we
Ben> really don't see DIMMS with SBEs turing in MBEs.
That's very interesting.
--david
* RE: new utility for decoding salinfo records
From: David Mosberger @ 2005-01-11 21:36 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 11 Jan 2005 13:23:48 -0800, "Luck, Tony" <tony.luck@intel.com> said:
Tony> Whether it is a problem depends on the liklihood of it
Tony> cascading into a multi-bit error ... for which I don't have
Tony> any data.
While this is not an area I have experience with, it does seem to me
that, considering how many clusters (really: "machines" with large
amounts of memory) are out there, there is an amazing dearth of
solid data. The memory manufacturers presumably have it, but are
uninterested in sharing. On the other hand, I see no reason why
cluster operators (such as national labs) couldn't collect & share
such data more. It's difficult for systems folks to make good choices
without such data, especially since the effects often appear to be
counter-intuitive (like SBEs not turning into MBEs).
--david
* Re: new utility for decoding salinfo records
From: Matthias Fouquet-Lapar @ 2005-01-11 21:36 UTC (permalink / raw)
To: linux-ia64
> Ben> We believed rates of SBEs in the neighborhood of 1/hr would
> Ben> ultimately lead to MBEs but further testing has shown that we
> Ben> really don't see DIMMS with SBEs turing in MBEs.
>
> That's very interesting.
We have seen both: hard SBEs which never end up in a UCE, and bursts of
SBEs which will lead to UCEs. It is DRAM-vendor specific, and it
depends on which phase of the chip's life cycle the error occurs in
(infant mortality or not).
Another important data point is whether the error is "soft", i.e.
corrected after a scrub operation (probably caused by an alpha particle
hit), or "hard", i.e. still there after the memory location has been
re-written.
I think as long as it is possible to log all errors, the following toolchain
can be adopted. Depending on the system infrastructure it might be useful
to capture additional information such as :
- data pattern including ECC
- environmental conditions (voltage, temperature)
- DIMM serial numbers
The latter is becoming a real issue when dealing with systems which
have several terabytes of main memory, but as I said this really is
very platform specific.
Thanks
Matthias Fouquet-Lapar Core Platform Software mfl@sgi.com VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127
* Re: new utility for decoding salinfo records
From: Ben Woodard @ 2005-01-11 21:37 UTC (permalink / raw)
To: linux-ia64
On Tue, 2005-01-11 at 13:25, David Mosberger wrote:
> >>>>> On Tue, 11 Jan 2005 13:12:37 -0800, Ben Woodard <woodard@redhat.com> said:
>
> Ben> Does this sound like better wording to you?
>
> Yup.
>
OK then, I made the changes in my CVS. I won't send out an update to
the diff for a while yet, hoping to catch other things at the same
time.
Anyone have any comments on the program itself?
-ben
* Re: new utility for decoding salinfo records
From: David Mosberger @ 2005-01-11 21:42 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 11 Jan 2005 13:37:05 -0800, Ben Woodard <woodard@redhat.com> said:
Ben> Anyone have any comments on the program itself?
I'm not really qualified to comment, since I never worked on a large
cluster myself. On the face of it, it seems like a clear improvement
to me.
--david
* Re: new utility for decoding salinfo records
From: Russ Anderson @ 2005-01-11 21:58 UTC (permalink / raw)
To: linux-ia64
Ben Woodard wrote:
>
> So does anyone with "normal world" experience have any suggestions on
> how I should take into account the various perspectives?
>
> Do other people consider the isolated SBE a problem?
>
> Do other people consider 1SBE/hr on a DIMM a real problem that needs to
> be fixed?
Why would anyone consider a recovered error a problem? ECC corrected
the data so life is good.
The real question is whether the corrected error is an indication that
something bad - a crash due to an uncorrected error - is going to
happen. That is the bad thing we want to avoid.
The answer to the question of whether single bits turn into double bits
is: it depends. There are a number of underlying causes for SBEs and
different ways in which an SBE could degrade into an MBE. The DRAM
technology plays a big part. From experience, some DIMMs have SBEs that
never turn into MBEs. Other DIMMs get MBEs without preceding SBEs.
You really have to analyze the specific DIMMs, and look at the failure
characteristics of the technology, to get any specific data on which to
base a logical conclusion. And even then, slight changes in the
manufacturing process can skew those numbers.
What Linux really needs is better SBE logging infrastructure, to
keep track of specific DIMMs and the SBEs within them, and to
collect real data on which to draw meaningful conclusions.
The one solid answer I can give you is that the overall failure
rate that causes system crashes remains constant over time.
That's because if a specific memory technology makes the memory
subsystem more reliable, people will just buy more memory until
they reach the same noticeable error rate. ECC memory did not
eliminate memory errors, it allowed much larger memories with
the same overall memory failure rate.
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@sgi.com
* Re: new utility for decoding salinfo records
From: David Mosberger @ 2005-01-11 22:02 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 11 Jan 2005 22:36:52 +0100 (CET), Matthias Fouquet-Lapar <mfl@kernel.paris.sgi.com> said:
Matthias> systems which have several tera-bytes of main memory, but
Matthias> as I said this really is very platform specific
Probably so. Still, I think a very interesting systems paper could be
written that would spell out at least the basic trends/invariants in
memory error behavior. Hint, hint... ;-)
--david
* Re: new utility for decoding salinfo records
From: Matthias Fouquet-Lapar @ 2005-01-11 22:26 UTC (permalink / raw)
To: linux-ia64
> Matthias> systems which have several tera-bytes of main memory, but
> Matthias> as I said this really is very platform specific
>
> Probably so. Still, I think a very interesting systems paper could be
> written that would spell out at least the basic trends/invariants in
> memory error behavior. Hint, hint... ;-)
I'm actually working on such a paper. The real challenge, as you already
pointed out, is to collect some longer-term data. I hope to have
something ready in the summer time frame, as it simply takes time to run
experiments. Some testing can be done in environmental stress-test
chambers, but then the total sample size is lower. One tool I'm
currently looking at would try predictive error analysis based on the
data collected by salinfo.
Another idea I want to explore is allowing a signal to be sent to the
affected process (which isn't straightforward ...).
This obviously would only be interesting for on-line diagnostics.
It would allow the diagnostic to focus on a failing location and see if
an error is repeatable, if it's data dependent, etc. Maybe this feedback
mechanism can help to develop better testing strategies.
(I actually have a test system which has known problem DIMMs.)
Thanks
Matthias Fouquet-Lapar Core Platform Software mfl@sgi.com VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127
* Re: new utility for decoding salinfo records
From: Keith Owens @ 2005-01-12 4:10 UTC (permalink / raw)
To: linux-ia64
On Tue, 11 Jan 2005 07:46:28 -0800,
Ben Woodard <woodard@redhat.com> wrote:
>Here is a new utility for looking into salinfo records. It does several
>things differently than salinfo_decode. We have found that this helps
The design of salinfo_decode2 is completely unacceptable for SGI
hardware, and probably for HP as well. You have removed all processing
of the oemdata.
SGI hardware decodes the oemdata in SAL records using prom code. This
decode _must_ be done while the record is still in the prom's memory
space. The callback into the prom (via the kernel) must be done after
the main part of the record is printed and before the record is cleared
from SAL. For some error types such as CPE, the SGI oemdata provides
critical information about which DIMM is failing, including its node
and serial number.
AFAIK HP decode their oemdata via a user space program. Again this is
done after the main part of the record is printed.
To handle both SGI and HP requirements, the existing salinfo_decode
program calls the optional program salinfo_decode_oemdata. That call
is made at the right point in the read/decode/clear cycle to satisfy
all vendor requirements. Removing salinfo_decode_oemdata is not an
option.
The existing salinfo_decode program works fine, including decoding oem
data. I agree that we need a summary tool to merge data from multiple
records together, but there are better ways of doing that, we do not
need to remove the existing salinfo_decode functionality to get a
summary.
Leave salinfo_decode completely alone, especially the oem decoding.
To get a summary, add a new package that monitors the contents of
/var/log/salinfo/decoded, reads new records and summarizes the
contents. I am quite happy to add a trigger (pipe or socket) from
salinfo_decode to the summary program to indicate when new records
arrive.
Any summary program must be extensible so a vendor can report on data
that is extracted from their oemdata.
BTW, salinfo_decode2 will spin forever on a kernel < 2.6.9-rc4,
including all 2.4 kernels. Once again, salinfo_decode 0.7 gets this
right.
* RE: new utility for decoding salinfo records
2005-01-11 15:46 new utility for decoding salinfo records Ben Woodard
` (17 preceding siblings ...)
2005-01-12 4:10 ` Keith Owens
@ 2005-01-12 6:08 ` Luck, Tony
2005-01-12 6:43 ` Keith Owens
` (3 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: Luck, Tony @ 2005-01-12 6:08 UTC (permalink / raw)
To: linux-ia64
>The design of salinfo_decode2 is completely unacceptable for SGI
>hardware, and probably for HP as well. You have removed all processing
>of the oemdata.
>
>SGI hardware decodes the oemdata in SAL records using prom code. This
>decode _must_ be done while the record is still in the prom's memory
>space. The callback into the prom (via the kernel) must be done after
>the main part of the record is printed and before the record is cleared
>from SAL. For some error types such as CPE, the SGI oemdata provides
>critical information about which DIMM is failing, including its node
>and serial number.
I think that Ben was just plugging the "salinfo_decode2" program that is
included in his alternate salinfo package (though this would perhaps have
been clearer if he'd just posted the program, rather than the whole
package). The text of his e-mail only talked about salinfo_decode2.
The salinfo_decode2 program just takes the 'raw' images of error records
that have been saved by any daemon, and creates summary reports.
Thanks for clarifying why his daemon won't work for SGI.
-Tony
* Re: new utility for decoding salinfo records
2005-01-11 15:46 new utility for decoding salinfo records Ben Woodard
` (18 preceding siblings ...)
2005-01-12 6:08 ` Luck, Tony
@ 2005-01-12 6:43 ` Keith Owens
2005-01-12 9:34 ` Matthias Fouquet-Lapar
` (2 subsequent siblings)
22 siblings, 0 replies; 24+ messages in thread
From: Keith Owens @ 2005-01-12 6:43 UTC (permalink / raw)
To: linux-ia64
On Tue, 11 Jan 2005 22:08:56 -0800,
"Luck, Tony" <tony.luck@intel.com> wrote:
>>The design of salinfo_decode2 is completely unacceptable for SGI
>>hardware, and probably for HP as well. You have removed all processing
>>of the oemdata.
>>
>>SGI hardware decodes the oemdata in SAL records using prom code. This
>>decode _must_ be done while the record is still in the prom's memory
>>space. The callback into the prom (via the kernel) must be done after
>>the main part of the record is printed and before the record is cleared
>>from SAL. For some error types such as CPE, the SGI oemdata provides
>>critical information about which DIMM is failing, including its node
>>and serial number.
>
>I think that Ben was just plugging the "salinfo_decode2" program that is
>included in his alternate salinfo package (though this would perhaps
>have
>been more clear if he'd just posted the program, rather than the whole
>package). The text of his e-mail only talked about salinfo_decode2.
>
>The salinfo_decode2 program just takes the 'raw' images of error records
>that have been saved by any daemon, and creates summary reports.
I wish it were that simple. The salinfo_decode2 patch also adds a new
program called salinfo_daemon, which replaces salinfo_decode_all.
salinfo_daemon reads a record, clears the record then calls
salinfo_decode or salinfo_decode2.
salinfo_decode2 has absolutely no support for oem data. Even if
salinfo_daemon calls the existing salinfo_decode program, the record
has already been cleared from memory by salinfo_daemon. In either
case, it is unacceptable for SGI hardware.
I have seen the version of salinfo_decode2 that is shipping in RHEL4
beta and I can guarantee that it does not run on our hardware. If
salinfo_decode2 is really just a summary program then the patch is
complete overkill. Ship a separate summary program and database, in
its own package, making it completely separate from the existing (and
working) salinfo_decode. As the patch stands, it looks like an attempt
to take over salinfo_decode and to remove existing functionality.
* Re: new utility for decoding salinfo records
2005-01-11 15:46 new utility for decoding salinfo records Ben Woodard
` (19 preceding siblings ...)
2005-01-12 6:43 ` Keith Owens
@ 2005-01-12 9:34 ` Matthias Fouquet-Lapar
2005-01-12 16:57 ` Ben Woodard
2005-01-12 20:46 ` Keith Owens
22 siblings, 0 replies; 24+ messages in thread
From: Matthias Fouquet-Lapar @ 2005-01-12 9:34 UTC (permalink / raw)
To: linux-ia64
> salinfo_decode2 has absolutely no support for oem data. Even if
> salinfo_daemon calls the existing salinfo_decode program, the record
> has already been cleared from memory by salinfo_daemon. In either
> case, it is unacceptable for SGI hardware.
>
> I have seen the version of salinfo_decode2 that is shipping in RHEL4
> beta and I can guarantee that it does not run on our hardware. If
> salinfo_decode2 is really just a summary program then the patch is
> complete overkill. Ship a separate summary program and database, in
> its own package, making it completely separate from the existing (and
> working) salinfo_decode. As the patch stands, it looks like an attempt
> to take over salinfo_decode and to remove existing functionality.
I want to support Keith's summary. For example, we implemented a HW system
dump mechanism which saves the HW state of the system in case of an MCA
(the HW counterpart of a software system dump). We also collect the
associated salinfo for the event and _absolutely_ rely on the OEM data.
My initial understanding was that salinfo_decode2 added some filtering
options, which would be fine (we simply would not use them). But if critical
data is left out, then it simply does not work correctly on our HW.
What is broken in salinfo_decode that requires a re-implementation? All sorts
of filtering can be done in post-processing.
Thanks
Matthias Fouquet-Lapar Core Platform Software mfl@sgi.com VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127
* Re: new utility for decoding salinfo records
2005-01-11 15:46 new utility for decoding salinfo records Ben Woodard
` (20 preceding siblings ...)
2005-01-12 9:34 ` Matthias Fouquet-Lapar
@ 2005-01-12 16:57 ` Ben Woodard
2005-01-12 20:46 ` Keith Owens
22 siblings, 0 replies; 24+ messages in thread
From: Ben Woodard @ 2005-01-12 16:57 UTC (permalink / raw)
To: linux-ia64
Keith,
I beg to differ; it is obvious from your post that you didn't
even look at what I sent. You were so spring-loaded with your attack on
salinfod (something that I did not send along) that you failed to
actually look at what I produced. In my opinion, that is somewhat
unprofessional.
On Tue, 2005-01-11 at 20:10, Keith Owens wrote:
> On Tue, 11 Jan 2005 07:46:28 -0800,
> Ben Woodard <woodard@redhat.com> wrote:
> >Here is a new utility for looking into salinfo records. It does several
> >things differently than salinfo_decode. We have found that this helps
>
> The design of salinfo_decode2 is completely unacceptable for SGI
> hardware, and probably for HP as well. You have removed all processing
> of the oemdata.
>
> SGI hardware decodes the oemdata in SAL records using prom code. This
> decode _must_ be done while the record is still in the prom's memory
> space. The callback into the prom (via the kernel) must be done after
> the main part of the record is printed and before the record is cleared
> from SAL. For some error types such as CPE, the SGI oemdata provides
> critical information about which DIMM is failing, including its node
> and serial number.
salinfo_decode2 is a completely offline record processor. It does not
interfere with the read, decode, clear cycle. salinfo_decode2 simply
looks at the records that are left by the salinfo_decode2 daemon in the
raw directory.
salinfo_decode2 may not be able to examine the OEM data; in the man page
I point out that salinfo_decode2 has limitations which salinfo_decode is
able to work around.
>
> AFAIK HP decode their oemdata via a user space program. Again this is
> done after the main part of the record is printed.
That may also be true, but it does not negate the benefit of giving
system administrators, who often lack (and don't need) a detailed
understanding of the hardware, a tool that allows them to maintain and
monitor their machines effectively. The man page for salinfo_decode2
clearly states that if you need every possible piece of information, you
should look at the decoded output of salinfo_decode.
>
> To handle both SGI and HP requirements, the existing salinfo_decode
> program calls the optional program salinfo_decode_oemdata. That call
> is made at the right point in the read/decode/clear cycle to satisfy
> all vendor requirements. Removing salinfo_decode_oemdata is not an
> option.
>
> The existing salinfo_decode program works fine, including decoding oem
> data. I agree that we need a summary tool to merge data from multiple
> records together, but there are better ways of doing that, we do not
> need to remove the existing salinfo_decode functionality to get a
> summary.
>
If you had actually looked at what I sent, you would have seen that
there is absolutely no existing salinfo_decode functionality removed.
> Leave salinfo_decode completely alone, especially the oem decoding.
> To get a summary, add a new package that monitors the contents of
> /var/log/salinfo/decoded, reads new records and summarizes the
> contents. I am quite happy to add a trigger (pipe or socket) from
> salinfo_decode to the summary program to indicate when new records
> arrive.
The only difference between what I did and what you suggest is that I
chose to parse the raw records rather than the decoded records. I
believe that there are valid technical reasons for doing this.
>
> Any summary program must be extensible so a vendor can report on data
> that is extracted from their oemdata.
>
> BTW, salinfo_decode2 will spin forever on a kernel < 2.6.9-rc4,
> including all 2.4 kernels. Once again, salinfo_decode 0.7 gets this
> right.
That is distinctly not true. salinfo_decode2 is an offline reader and
doesn't interact at all with the /proc file system. What you are
thinking of is salinfod, which is not included in this patch.
-ben
* Re: new utility for decoding salinfo records
2005-01-11 15:46 new utility for decoding salinfo records Ben Woodard
` (21 preceding siblings ...)
2005-01-12 16:57 ` Ben Woodard
@ 2005-01-12 20:46 ` Keith Owens
22 siblings, 0 replies; 24+ messages in thread
From: Keith Owens @ 2005-01-12 20:46 UTC (permalink / raw)
To: linux-ia64
On Wed, 12 Jan 2005 08:57:36 -0800,
Ben Woodard <woodard@redhat.com> wrote:
>Keith,
>
>I beg to differ with you it is obvious from your post that you didn't
>even look at what I sent. You were so spring loaded with your attack on
>salinfod (something that I did not send along) that you failed to
>actually look at what I produced. In my opinion, that is somewhat
>unprofessional.
The patch you sent to the list adds salinfo_daemon.C, containing this line
+ strm << "salinfod version " << VERSION
If you include salinfod in your patch then of course I am going to
fight it.
>salinfo_decode2 is a completely offline record processor. It does not
>interfere with the read, decode, clear cycle. salinfo_decode2 simply
>looks at the records that are left by the salinfo_decode2 daemon in raw directory.
^^^^^^^^^^^^^^^
I am going to give you the benefit of the doubt and assume that you meant
salinfo_decode there, not your attempted replacement.
If salinfo_decode2 is completely offline then make it a separate
package called salinfo_summary instead of trying to add it to
salinfo_decode. The existing salinfo_decode is vendor neutral and it
is going to stay that way. Different vendors and/or distributions have
their own requirements for tracking problems, they can choose to use
salinfo_summary or they can ignore it and use their own tracking
system. When you lump the summary program in with the decode program
then you take away the ability for anybody to determine how they
summarize the SAL records.
>If you had actually looked at what I sent, you would have seen that
>there is absolutely no existing salinfo_decode functionality removed.
I did look; salinfo_daemon.C is in the patch that you sent, and it
removes the existing salinfo_decode functionality.
Remove salinfo_daemon.C, make the summary code a separate package from
salinfo_decode and I might believe that you are not trying to break
salinfo_decode. Given the salinfo mess in RHEL4 beta, I am extremely
suspicious of your current approach.
end of thread, other threads:[~2005-01-12 20:46 UTC | newest]
Thread overview: 24+ messages
2005-01-11 15:46 new utility for decoding salinfo records Ben Woodard
2005-01-11 19:03 ` David Mosberger
2005-01-11 19:49 ` Luck, Tony
2005-01-11 20:25 ` David Mosberger
2005-01-11 20:26 ` Ben Woodard
2005-01-11 20:53 ` Mark Goodwin
2005-01-11 21:03 ` Ben Woodard
2005-01-11 21:12 ` Ben Woodard
2005-01-11 21:22 ` Russ Anderson
2005-01-11 21:23 ` Luck, Tony
2005-01-11 21:25 ` David Mosberger
2005-01-11 21:36 ` David Mosberger
2005-01-11 21:36 ` Matthias Fouquet-Lapar
2005-01-11 21:37 ` Ben Woodard
2005-01-11 21:42 ` David Mosberger
2005-01-11 21:58 ` Russ Anderson
2005-01-11 22:02 ` David Mosberger
2005-01-11 22:26 ` Matthias Fouquet-Lapar
2005-01-12 4:10 ` Keith Owens
2005-01-12 6:08 ` Luck, Tony
2005-01-12 6:43 ` Keith Owens
2005-01-12 9:34 ` Matthias Fouquet-Lapar
2005-01-12 16:57 ` Ben Woodard
2005-01-12 20:46 ` Keith Owens