* Re: Preserving CMC/CPE records across reboot
2006-01-13 0:46 Preserving CMC/CPE records across reboot Keith Owens
@ 2006-01-13 15:05 ` Alex Williamson
2006-01-13 15:57 ` Jack Steiner
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Alex Williamson @ 2006-01-13 15:05 UTC (permalink / raw)
To: linux-ia64
On Fri, 2006-01-13 at 11:46 +1100, Keith Owens wrote:
>
> We should be able to keep the first few CMC/CPE records for each cpu in
> NVRAM and discard the later ones if we start getting a backlog. Then
> if the system hangs while processing a CMC/CPE, the data will still be
> available in NVRAM and will be processed on the next boot. If the
> reboot hangs again in salinfo processing then we have a solid error,
> either cpu or SAL, so switch the offending cpu out of the system.
>
> Any objections from other platforms?
Keith,
Sorry, it's been a while since I've looked at this code, but how do
we determine how many records can be stored in NVRAM? I would guess
that for CPEs at least, it's platform dependent. If it can be done w/o
losing records, it's probably ok, but I'm not sure I understand the
details. Thanks,
Alex
--
Alex Williamson HP Linux & Open Source Lab
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Preserving CMC/CPE records across reboot
From: Jack Steiner @ 2006-01-13 15:57 UTC (permalink / raw)
To: linux-ia64
On Fri, Jan 13, 2006 at 11:46:29AM +1100, Keith Owens wrote:
> CMC/CPE records (unlike MCA/INIT) are copied into kernel space and
> cleared from NVRAM as soon as they occur. That decision was made by
> Bjorn Helgaas some years ago. The idea is that if you do not have
> salinfo_decode or some equivalent program running then the correctable
> errors still need to be deleted from NVRAM. But if the system hangs
> while reading the CMC/CPE then we get no data at all.
>
> SGI just had an example of this. A cpu took a CMC, salinfo_decode
> started running and hung while processing the CMC record, the system
> had to be rebooted. Because the CMC record had been cleared from NVRAM
> before handing a copy to salinfo_decode, the contents were lost.
On SN, CMC/CPE records are never written to NVRAM. They are saved only
in memory. If the system hangs trying to log a CMC/CPE & the system is reset,
all CMC/CPE records are lost.
It is possible that some of this could be changed, but it currently works
this way. Also, writing error records to NVRAM is slow - something to
avoid on performance-critical paths. I suppose we could threshold the
error rate, which would limit the rate of writing to NVRAM.
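
A minimal sketch of what thresholding the NVRAM write rate could look like, modelled in Python rather than as kernel or PROM code. The class name `NvramWriteLimiter`, the fixed-window policy, and the parameters `max_writes` and `window_s` are all assumptions made for illustration; nothing in this thread specifies them.

```python
class NvramWriteLimiter:
    """Allow at most max_writes NVRAM writes per window_s seconds.

    Hypothetical policy: a simple fixed window, not taken from any
    existing SAL or kernel implementation.
    """

    def __init__(self, max_writes, window_s):
        self.max_writes = max_writes
        self.window_s = window_s
        self.window_start = None   # no window open until the first record
        self.writes_in_window = 0

    def allow(self, now):
        """Return True if a record arriving at time `now` may be written."""
        # Open a new window when none exists or the current one has expired.
        if self.window_start is None or now - self.window_start >= self.window_s:
            self.window_start = now
            self.writes_in_window = 0
        if self.writes_in_window < self.max_writes:
            self.writes_in_window += 1
            return True
        # Over the threshold: the record would stay in memory only.
        return False
```

With `NvramWriteLimiter(2, 10.0)`, the first two records in a ten-second window would go to NVRAM and any further ones in that window would be kept in memory only, so a burst of correctable errors costs at most two slow writes.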
>
> We should be able to keep the first few CMC/CPE records for each cpu in
> NVRAM and discard the later ones if we start getting a backlog. Then
> if the system hangs while processing a CMC/CPE, the data will still be
> available in NVRAM and will be processed on the next boot. If the
> reboot hangs again in salinfo processing then we have a solid error,
> either cpu or SAL, so switch the offending cpu out of the system.
>
> Any objections from other platforms?
>
* RE: Preserving CMC/CPE records across reboot
From: Luck, Tony @ 2006-01-13 20:23 UTC (permalink / raw)
To: linux-ia64
> Sorry, it's been a while since I've looked at this code, but how do
>we determine how many records can be stored in NVRAM? I would guess
>that for CPEs at least, it's platform dependent. If it can be done w/o
>losing records, it's probably ok, but I'm not sure I understand the
>details. Thanks,
Agreed. The SAL spec doesn't require any particular capacity for
NVRAM, nor does it provide a mechanism to discover it.
-Tony
* Re: Preserving CMC/CPE records across reboot
From: Keith Owens @ 2006-01-14 6:50 UTC (permalink / raw)
To: linux-ia64
Alex Williamson (on Fri, 13 Jan 2006 08:05:07 -0700) wrote:
>On Fri, 2006-01-13 at 11:46 +1100, Keith Owens wrote:
>>
>> We should be able to keep the first few CMC/CPE records for each cpu in
>> NVRAM and discard the later ones if we start getting a backlog. Then
>> if the system hangs while processing a CMC/CPE, the data will still be
>> available in NVRAM and will be processed on the next boot. If the
>> reboot hangs again in salinfo processing then we have a solid error,
>> either cpu or SAL, so switch the offending cpu out of the system.
>>
>> Any objections from other platforms?
>
> Sorry, it's been a while since I've looked at this code, but how do
>we determine how many records can be stored in NVRAM? I would guess
>that for CPEs at least, it's platform dependent. If it can be done w/o
>losing records, it's probably ok, but I'm not sure I understand the
>details. Thanks,
By counting the number of interrupts and subtracting the number of
'clear' events issued by user space. It would be messy but possible.
Jack Steiner has pointed out that the SGI prom never saves CMC/CPE
records anyway, which means that my idea would not solve the problem of
records being lost due to reboot. So I am dropping this idea.
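
For reference, the bookkeeping behind the dropped proposal (keep the first few records per cpu, track how many are outstanding by counting arrivals and subtracting user-space clears) could be modelled roughly as follows. This is a sketch under stated assumptions, not code from the kernel: the class `NvramRecordBudget` and the limit `keep_first` are hypothetical, and it assumes a discarded record is cleared immediately and therefore never counted as outstanding.

```python
class NvramRecordBudget:
    """Track outstanding CMC/CPE records per cpu (illustrative model only)."""

    def __init__(self, keep_first):
        self.keep_first = keep_first   # hypothetical per-cpu limit
        self.outstanding = {}          # cpu -> records kept and not yet cleared

    def on_record(self, cpu):
        """Called per CMC/CPE interrupt; True means keep the record in NVRAM."""
        count = self.outstanding.get(cpu, 0)
        if count < self.keep_first:
            self.outstanding[cpu] = count + 1
            return True
        # Backlog on this cpu: the later record is discarded.
        return False

    def on_clear(self, cpu):
        """Called when user space issues a 'clear' for one record of this cpu."""
        count = self.outstanding.get(cpu, 0)
        if count > 0:
            self.outstanding[cpu] = count - 1
```

Note that this only bounds the number of records per cpu; as pointed out earlier in the thread, the SAL spec gives no way to discover the actual NVRAM capacity, so any value of `keep_first` would be a platform-dependent guess.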