public inbox for linux-ia64@vger.kernel.org
From: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
To: linux-ia64@vger.kernel.org
Subject: Re: [PATCH] New way of storing MCA/INIT logs
Date: Fri, 07 Mar 2008 12:02:47 +0000	[thread overview]
Message-ID: <47D12EE7.3090906@bull.net> (raw)
In-Reply-To: <47CD8142.7050207@bull.net>

Russ Anderson wrote:

> Figure 2-1 does show SAL passing up CPEI records to OS, too.

Yes, as I also said:
"The SAL / PAL can be the origin of CPEIs / CMCIs if they succeed
in correcting MCAs. They stock the related information until the
OS calls SAL_GET_STATE_INFO()."

I just want to emphasize that, for CPEIs / CMCIs originated by the
platform / CPU HW, the SAL does not know of them before we call
SAL_GET_STATE_INFO(); therefore it cannot store any information about
them.

> See section 5.3.2 CMC and CPE Records
> 
>   Each processor or physical platform could have multiple valid corrected
>   machine check or corrected platform error records. The maximum number of
>   these records present in a system depends on the SAL implementation and
>   the storage space available on the system. There is no requirement for
>   these records to be logged into NVM. The SAL may use an implementation
>   specific error record replacement algorithm for overflow situations. The
>   OS needs to make an explicit call to the SAL procedure SAL_CLEAR_STATE_INFO
>   to clear the CMC and CPE records in order to free up the memory resources
>   that may be used for future records.

As far as I can understand, this is about events that are signaled not
by interrupts but by MCAs, which either the PAL or the SAL manages to
correct (=> CMCI, CPEI).

You have N >= 1 buffers for this kind of error.

> 5.4.1 Corrected Error Event Record
> 
>   In response to a CMC/CPE condition, SAL builds and maintains the error
>   record for OS retrieval.

It does not say that the SAL knows about CMCI / CPEI signaled errors
before we call SAL_GET_STATE_INFO().

Example: the Tiger box with i82870:

There is a register pair of FERRST / SERRST for each component, e.g.
the memory controller.

FERRST: first error status register
SERRST: second / subsequent error status register

Note that the FERRST captures the first error precisely, while the
SERRST is a mixture (OR logic) of all the other errors.
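
To make the difference between the two registers concrete, here is a
minimal toy model in C. It is only a sketch: the register names come
from the text above, but the struct, the helper names and the bit
values are hypothetical and do not describe the real i82870 layout.

```c
#include <stdint.h>

/* Toy model of one component's error status register pair.
 * The bit layout is hypothetical; only the first/subsequent split
 * follows the description in the text. */
struct err_regs {
    uint32_t ferrst;  /* first error: exactly one error once latched */
    uint32_t serrst;  /* OR of every error seen after the first one */
};

/* Record one error, given as a single-bit mask. */
void record_error(struct err_regs *r, uint32_t err_bit)
{
    if (r->ferrst == 0)
        r->ferrst = err_bit;   /* the first error is captured precisely */
    else
        r->serrst |= err_bit;  /* later errors are merged: order and
                                  repeat count are lost */
}

/* Reset the pair, as a firmware clear of both registers would. */
void clear_errors(struct err_regs *r)
{
    r->ferrst = 0;
    r->serrst = 0;
}
```

In this model, recording the bits 0x1, 0x4, 0x8 and 0x4 again leaves
0x1 in ferrst and 0xC in serrst: the first error is exact, the rest
cannot be told apart.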

In case of a corrected memory error, the OS receives a CPEI.
When the OS calls SAL_GET_STATE_INFO(), the SAL reads out the
FERRST / SERRST for each component.
If there are multiple errors, the SAL selects which one is to be
reported.
When the OS calls SAL_CLEAR_STATE_INFO(), the SAL resets the
register pairs whose contents were reported by SAL_GET_STATE_INFO().
If there are multiple errors, you can call SAL_GET_STATE_INFO()
repeatedly.
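
The get / clear sequence described above can be sketched as follows.
This is a simulation under stated assumptions, not kernel code: the
toy_* functions are hypothetical stand-ins for SAL_GET_STATE_INFO()
and SAL_CLEAR_STATE_INFO(), and the pending[] array stands in for the
per-component register pairs.

```c
#include <stdint.h>

#define NCOMP 3   /* hypothetical number of platform components */

/* Pending corrected-error status per component (stand-in for the
 * FERRST / SERRST pairs). */
static uint32_t pending[NCOMP];
static int last_reported = -1;  /* component reported by the last get */

/* Stand-in for SAL_GET_STATE_INFO(): the status is read only now, at
 * call time. Picks one component with an error and reports it.
 * Returns 1 if a record was produced, 0 if nothing is pending. */
int toy_get_state_info(int *comp, uint32_t *status)
{
    for (int i = 0; i < NCOMP; i++) {
        if (pending[i] != 0) {
            *comp = i;
            *status = pending[i];
            last_reported = i;
            return 1;
        }
    }
    last_reported = -1;
    return 0;
}

/* Stand-in for SAL_CLEAR_STATE_INFO(): resets only what was reported
 * by the preceding get call. */
void toy_clear_state_info(void)
{
    if (last_reported >= 0) {
        pending[last_reported] = 0;
        last_reported = -1;
    }
}

/* OS side: on a CPEI, fetch and clear records until none is left.
 * Returns the number of records consumed. */
int toy_handle_cpei(void)
{
    int comp, n = 0;
    uint32_t status;

    while (toy_get_state_info(&comp, &status)) {
        /* a real handler would decode and log the record here */
        toy_clear_state_info();
        n++;
    }
    return n;
}
```

With errors pending on two components, toy_handle_cpei() consumes two
records and leaves nothing pending, which is the "call it repeatedly"
case.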

> Here is a manufacturer advertising "over 7 years".
> 7 years is 61,320 hrs, 8 year is 70,080.

That seems to be way too low.
Would it not rather mean:
"99.999% probability that the product will operate for over 7 years without a failure"
instead of being an MTBF value?

Please have a look at e.g.: http://ramfinder.com/items/ex2gb0132f.html

By "without a failure" they mean: without uncorrectable errors.


Luck, Tony wrote:

> Russ's large systems change these.  Is 30,000 hours a plausible
> MTBF for a DIMM.  What if the system contains 8TB memory in 2GB
> DIMMs.  Now you have 4096 DIMM sticks in the system.  Redo your
> calculations for this large system.

Using the memory seen at http://ramfinder.com/items/ex2gb0132f.html

7 years * 100% / (100% - 99.999%) / 4096 = 170 years

i.e. an MTBF of > 1,000,000 hours, even with 4096 DIMMs.
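
The arithmetic above can be checked with a few lines of C. The figures
(7 years, 99.999%, 4096 DIMMs) are the ones used in this message; the
function names are hypothetical, and the model is only the
approximation used here: a small failure probability p over a period T
gives an MTBF of about T / p, and N independent parts divide the MTBF
by N.

```c
/* 365 days * 24 h, consistent with "7 years is 61,320 hrs" */
#define HOURS_PER_YEAR 8760.0

/* Approximate MTBF (in years) of one part that survives `years`
 * without failure with probability `p_ok` (p_ok close to 1). */
double part_mtbf_years(double years, double p_ok)
{
    return years / (1.0 - p_ok);
}

/* MTBF (in years) of a system built from `nparts` such parts,
 * assuming independent failures. */
double system_mtbf_years(double years, double p_ok, int nparts)
{
    return part_mtbf_years(years, p_ok) / nparts;
}

double years_to_hours(double years)
{
    return years * HOURS_PER_YEAR;
}
```

system_mtbf_years(7.0, 0.99999, 4096) evaluates to roughly 170.9
years, and years_to_hours() of that is about 1.5 million hours.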

> ... about 1 error per gigabyte per two months.

That can be an estimate of the single-bit error rate (CPEI).

> But that was a very old study ... newer DIMMs made on denser
> silicon processes will most likely be more vulnerable to
> neutron strikes.

Let's assume the flux of cosmic ray generated particles will hit
the same number of memory cells, unless a particle travels parallel
to the silicon die; in that case it can hit more cells before its
energy is used up.

This is why I think it is the "surface" of the memory exposed
to the flux of cosmic ray generated particles that matters,
and not the number of gigabytes.

Thanks,

Zoltan






