From: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
To: linux-ia64@vger.kernel.org
Subject: Re: [PATCH] New way of storing MCA/INIT logs
Date: Thu, 06 Mar 2008 13:14:48 +0000
Message-ID: <47CFEE48.90803@bull.net>
In-Reply-To: <47CD8142.7050207@bull.net>
Luck, Tony wrote:
Let's see this first:
> Obviously entering polling
> mode puts the responsibility onto SAL to keep track of all
> the error reports
Please have a look at
Figure 2-1. Itanium® Processor Family Firmware Machine Check Handling Model
in the Error Handling Guide.
This figure shows that the SAL (or the PAL) cannot see the platform
originated CPEIs, nor the CPU HW originated CMCIs.
When you call SAL_GET_STATE_INFO(), the SAL (and the PAL) will read out
the error status from some HW registers.
Therefore the SAL / PAL cannot store error reports.
Can the HW (platform or CPU) help to save error reports?
A typical "error register set" - whatever it is - saves the first
error and maintains a "cumulative error" status (usually reset
by SAL_CLEAR_STATE_INFO()).
CPEs / CMCs will be lost unless you (want to) "swallow" them
quickly enough.
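As an illustration only, the "first error plus cumulative status" behaviour
described above can be modelled roughly as follows. The structure, field and
function names are hypothetical and do not correspond to any real register
layout or SAL interface; the "lost" counter exists only in the model, to make
the loss visible.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an "error register set"; not a real HW layout. */
struct err_regs {
	bool          valid;       /* a first error has been latched */
	uint64_t      first_addr;  /* address of the first error since the last clear */
	uint32_t      cumulative;  /* OR of all error-type bits seen since the last clear */
	unsigned long lost;        /* errors whose details were not latched (model only) */
};

/* HW side: a new error arrives. Only the first one keeps its details. */
static void err_regs_report(struct err_regs *r, uint64_t addr, uint32_t type_bit)
{
	r->cumulative |= type_bit;
	if (r->valid) {
		r->lost++;         /* details of this error are gone */
		return;
	}
	r->valid = true;
	r->first_addr = addr;
}

/* Models the effect of SAL_CLEAR_STATE_INFO(): re-arm the register set. */
static void err_regs_clear(struct err_regs *r)
{
	r->valid = false;
	r->first_addr = 0;
	r->cumulative = 0;
	r->lost = 0;
}
```

In this model, every error reported between the first one and the next clear
leaves only its type bit behind, which is why the register set has to be read
and cleared quickly.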
The SAL / PAL can be the origin of CPEIs / CMCIs if they succeed
in correcting MCAs. They store the related information until the
OS calls SAL_GET_STATE_INFO().
How many such outstanding CPEIs / CMCIs there can be is an
implementation issue.
Surely there is only a limited number of buffers there.
I do not think they dare to implement a complicated buffer
handling mechanism in an MCA context.
> - but nobody ever complained that this
> might result in the loss of error information if the SAL runs
> out of space to keep the error records before the next poll
> from the OS. ["solving" problems by shifting the blame point?]
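I've got a Tiger-box-like machine with some memory installed that is
known to be bad. I scan the known bad addresses via /dev/mem:

	volatile unsigned char *p = /* bad ph. addr., mapped via /dev/mem */;
	unsigned long tmp = 0;

	for (;;) {
		tmp += *p;		/* read the bad location */
		ia64_fc((void *) p);	/* flush the line so the next read hits memory */
	}
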
It is a deterministically bad memory location.
You can guess how many errors / sec there are.
Obviously, we switch into polling mode.
(And we lose most of the events.)
In less than half of the cases I get logs like this:
Platform Memory Device Error Info Section
Mem Error Detail
Physical Address: 0x280059b81 Address Mask: 0xfffffffff80 Node: 0
Card: 0 Module: 3 Bank: 3 Device: 1 Row: 2050 Column: 1356
Platform Memory Device Error Info Section
Mem Error Detail
Node: 0
But in more than half of the cases, salinfo_decode gets lost:
Platform Memory Device Error Info Section
Mem Error Detail
Node: 0
Again we lose events.
The SAL spec. does not say a word about how many errors have to be
kept by the SAL. Therefore we cannot count on the SAL keeping them.
> Both the CMC and CPE interrupt paths have code to switch to
> polling mode in the presence of a burst of correctable errors.
> Can we tune this threshold w.r.t. the number of buffers we
> pre-allocate to save error records so that we (the OS) won't
> be responsible for losing errors?
We are condemned to lose error logs, both because of the limited
number of error buffers in the SAL / PAL / OS and because of the
limited services provided by the HW.
I hope we can agree that the probability of more than one
independent error coinciding is very, very low
(otherwise change the machine :-)).
Keeping the first error log that contains pertinent, new
information is very important.
Keeping the last one is important, because not handling an error
rapidly enough can worsen the situation.
The others are just for the statistics...
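A minimal sketch of that retention policy, assuming a fixed-size record
buffer: keep the first record, keep overwriting the last one, and only count
the rest. All names (err_store, err_store_add, LOG_MAX) are hypothetical and
this is not the code from the patch. It also ignores concurrency, which a
version running in MCA / INIT context could not do, and it treats "first" as
simply the first record seen, without checking whether it carries new
information.

```c
#include <stddef.h>
#include <string.h>

#define LOG_MAX 256	/* hypothetical record size limit for this sketch */

struct err_log {
	size_t        len;
	unsigned char data[LOG_MAX];
};

struct err_store {
	struct err_log first;	/* first record seen */
	struct err_log last;	/* most recent record after the first one */
	unsigned long  count;	/* total records seen, for the statistics */
};

static void log_copy(struct err_log *dst, const void *rec, size_t len)
{
	if (len > LOG_MAX)
		len = LOG_MAX;	/* oversized records are truncated in this sketch */
	memcpy(dst->data, rec, len);
	dst->len = len;
}

/* Record a new error: the first is kept, later ones overwrite "last". */
static void err_store_add(struct err_store *s, const void *rec, size_t len)
{
	if (s->count == 0)
		log_copy(&s->first, rec, len);
	else
		log_copy(&s->last, rec, len);
	s->count++;
}
```

With a zero-initialized err_store, adding three records leaves the first one
in "first", the third one in "last", and count at 3; the second record
survives only as part of the count.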
Thanks,
Zoltan