* [RFC] Change the effect of SAL_{GET,CLEAR}_STATE_INFO for MCA/INIT records
@ 2004-12-09 12:23 Keith Owens
2004-12-10 2:51 ` [RFC] Change the effect of SAL_{GET,CLEAR}_STATE_INFO for MCA/INIT Hidetoshi Seto
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Keith Owens @ 2004-12-09 12:23 UTC (permalink / raw)
To: linux-ia64
I propose a small change to the effect of SAL_{GET,CLEAR}_STATE_INFO.
This change will only apply when there are multiple MCA or INIT records
being stored for a single cpu. The change requirement has been
triggered by the recent kernel changes to recover from more MCA events.
Rationale
=========
The SAL specification only provides access to the top record in the SAL
record stack. This is fine for logging purposes (via salinfo_decode),
and is also fine when there is a single MCA or INIT record outstanding
on a cpu. It causes problems when there are multiple MCA or INIT
records stored on a cpu and we are trying to recover from some MCA
events.
The OS MCA and, to a lesser extent, the INIT handlers have to decide if
the MCA/INIT is recoverable or not. To make that decision, the OS
handlers need the current MCA/INIT record because that is the only
record that contains the data needed to make the decision. If there
are multiple MCA/INIT records stored for a cpu then SAL_GET_STATE_INFO
returns the top record every time, resulting in bad data and invalid
decisions about recovery.
The current kernel code does not clear the recoverable MCA/INIT record
immediately. Instead it is delayed until salinfo_timeout() runs, which
means that the record can stay around for a while, currently up to 5
minutes.
I have two outstanding patches that [1]run salinfo_timeout more
frequently and [2]clear all recoverable error records immediately.
With those patches, this problem is significantly reduced, but it is
not completely removed. There are two cases when the OS MCA/INIT
handlers still get invalid data :-
1) Take a fatal MCA. Reboot. Before salinfo_decode has a chance to
   run and delete the fatal MCA record, get a recoverable MCA. The OS
   MCA handler reads the data from the fatal MCA, sees it cannot
   recover and gives up. The system dies on reboot. This will be more
   and more of a problem as memory sizes get larger.
2) Sites that do not run salinfo_decode at all. salinfo_decode is not
   a critical system task; a site can choose not to run it. Any fatal
   MCA records will not be cleared, which prevents all recoverable MCAs
   from working. This will also occur when you boot in single user
   mode.
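Case 1 can be replayed with a small user-space model. This is only a
sketch: the types and function names (mca_record, record_stack,
get_state_info, os_can_recover, replay_case_1) are made up for
illustration and are not from the SAL specification or the kernel. It
also assumes that the "top" record is the oldest uncleared one, which
is what the scenario above implies.

```c
#include <stddef.h>

/* Hypothetical model; none of these names come from SAL or the kernel. */
enum severity { SEV_RECOVERABLE, SEV_FATAL };

struct mca_record {
    int id;
    enum severity sev;
};

#define MAX_RECORDS 8

/* Per-cpu record store. Assumption: rec[0] is the "top" record,
 * i.e. the oldest record that has not been cleared yet. */
struct record_stack {
    struct mca_record rec[MAX_RECORDS];
    size_t count;
};

/* SAL side: store a record for a new event. Returns NULL if full. */
static const struct mca_record *store_record(struct record_stack *s,
                                             int id, enum severity sev)
{
    if (s->count == MAX_RECORDS)
        return NULL;
    s->rec[s->count].id = id;
    s->rec[s->count].sev = sev;
    return &s->rec[s->count++];
}

/* Existing behaviour: always hand back the top record. */
static const struct mca_record *get_state_info(const struct record_stack *s)
{
    return s->count ? &s->rec[0] : NULL;
}

/* OS MCA handler: decide using whatever record it was given. */
static int os_can_recover(const struct mca_record *r)
{
    return r != NULL && r->sev == SEV_RECOVERABLE;
}

/* Case 1: an uncleared fatal record, then a recoverable MCA.
 * Returns the OS handler's decision (non-zero means "recover"). */
static int replay_case_1(void)
{
    struct record_stack cpu0 = { .count = 0 };

    store_record(&cpu0, 1, SEV_FATAL);       /* left over from before the reboot */
    store_record(&cpu0, 2, SEV_RECOVERABLE); /* the event being handled now */
    return os_can_recover(get_state_info(&cpu0));
}
```

In this model replay_case_1() reports "cannot recover", because the
handler is shown record 1 (the stale fatal one) instead of record 2.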
The problem of invalid data being passed to the OS MCA and INIT handler
has been caused by the combination of moving salinfo processing out to
user space together with the addition of recoverable MCA and INIT
events. Both features are good, but now we have to cope with some side
effects that were not anticipated in the SAL specification.
Change
======
Change SAL_{GET,CLEAR}_STATE_INFO to be context sensitive for MCA and
INIT records. If the OS MCA or INIT handler is in control then these
routines operate on the current record. Outside the OS MCA or INIT
handler, we revert to the old behaviour and operate on the record at
the top of the stack.
I believe that this change is in keeping with the spirit of the
existing SAL specification. It clearly intends that the MCA and INIT
handler should be able to access data about the current event. For
requests for these records outside the MCA or INIT handlers, it only
makes sense to process the top record on the stack.
Making these calls context sensitive is a small change to SAL. The SAL
MCA/INIT handlers have already constructed and stored the required
record before calling the OS handler. All SAL has to do is :-
* Save the address of the current record on this cpu.
* Call the OS handler.
* On return from the OS handler, clear the pointer to the current
  record.
* SAL_GET_STATE_INFO - if the current record pointer is set then return
  that record, otherwise do existing processing.
* SAL_CLEAR_STATE_INFO - if the current record pointer is set then
  clear that record and the current record pointer, otherwise do
  existing processing.
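The five steps above can be sketched in C as follows. This is a
user-space illustration of the proposed behaviour, not SAL firmware
code: cpu_sal_state, sal_deliver_event, example_os_handler and the
record layout are all hypothetical, and the model again assumes that
rec[0] is the "top" record.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record and per-cpu state; not the real SAL layout. */
struct mca_record {
    int id;
    int fatal;
};

#define MAX_RECORDS 8

struct cpu_sal_state {
    struct mca_record rec[MAX_RECORDS]; /* assumption: rec[0] is the "top" */
    size_t count;
    struct mca_record *current;         /* set only while the OS handler runs */
};

/* Drop record i and close the gap. */
static void remove_at(struct cpu_sal_state *s, size_t i)
{
    memmove(&s->rec[i], &s->rec[i + 1],
            (s->count - i - 1) * sizeof(s->rec[0]));
    s->count--;
}

/* Proposed SAL_GET_STATE_INFO: current record if set, else top record. */
static const struct mca_record *sal_get_state_info(const struct cpu_sal_state *s)
{
    if (s->current)
        return s->current;
    return s->count ? &s->rec[0] : NULL;
}

/* Proposed SAL_CLEAR_STATE_INFO: clear the current record and the
 * pointer if set, otherwise clear the top record as before. */
static void sal_clear_state_info(struct cpu_sal_state *s)
{
    if (s->current) {
        remove_at(s, (size_t)(s->current - s->rec));
        s->current = NULL;
    } else if (s->count) {
        remove_at(s, 0);
    }
}

typedef void (*os_handler_t)(struct cpu_sal_state *s);

/* SAL MCA/INIT path: store the record, save its address, call the OS
 * handler, then clear the pointer on return. Returns -1 if full. */
static int sal_deliver_event(struct cpu_sal_state *s, int id, int fatal,
                             os_handler_t os_handler)
{
    if (s->count == MAX_RECORDS)
        return -1;
    s->rec[s->count] = (struct mca_record){ .id = id, .fatal = fatal };
    s->current = &s->rec[s->count++];
    os_handler(s);
    s->current = NULL;
    return 0;
}

/* Example OS handler: remember which record it saw and clear it
 * immediately if the event is recoverable. */
static int last_seen_id = -1;

static void example_os_handler(struct cpu_sal_state *s)
{
    const struct mca_record *r = sal_get_state_info(s);

    last_seen_id = r ? r->id : -1;
    if (r && !r->fatal)
        sal_clear_state_info(s);
}
```

With a stale fatal record already stored, delivering a recoverable
event through sal_deliver_event() lets the handler see and clear only
the new record. The stale record stays at the top for salinfo_decode
to pick up later.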
This change is backwards compatible. A new kernel running on an old
SAL will get the top of stack record, as before. The worst case is
that it gets invalid data, which is the current situation. An old
kernel running on a new SAL will get the current record in the OS MCA
and INIT handlers, which is exactly what we want. This SAL change does
not require any kernel changes, although we should still apply patches
[1] and [2].
[1] http://marc.theaimsgroup.com/?l=linux-ia64&m=110187223423867&w=2
[2] http://marc.theaimsgroup.com/?l=linux-ia64&m=110222038404547&w=2
* Re: [RFC] Change the effect of SAL_{GET,CLEAR}_STATE_INFO for MCA/INIT
From: Hidetoshi Seto @ 2004-12-10 2:51 UTC (permalink / raw)
To: linux-ia64
Keith Owens wrote:
> Making these calls context sensitive is a small change to SAL. The SAL
> MCA/INIT handlers have already constructed and stored the required
> record before calling the OS handler. All SAL has to do is :-
>
> * Save the address of the current record on this cpu.
> * Call the OS handler.
> * On return from the OS handler, clear the pointer to the current
> record.
> * SAL_GET_STATE_INFO - if the current record pointer is set then return
> that record, otherwise do existing processing.
> * SAL_CLEAR_STATE_INFO - if the current record pointer is set then
> clear that record and the current record pointer, otherwise do
> existing processing.
That makes sense.
When I was coding my OS_MCA handler, I used to wish there were something
like SAL_{GET,CLEAR}_STATE_INFO_FROM_BOTTOM or some extra parameter for
SAL_{GET,CLEAR}_STATE_INFO to select which log it should deal with.
Your proposal sounds good.
It's simpler and easier.
> This change is backwards compatible.
Marvelous!
BTW, who is in charge of receiving such proposals?
Thanks,
H.Seto
* Re: [RFC] Change the effect of SAL_{GET,CLEAR}_STATE_INFO for MCA/INIT records
From: Keith Owens @ 2004-12-13 3:54 UTC (permalink / raw)
To: linux-ia64
On Fri, 10 Dec 2004 11:51:01 +0900,
Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> wrote:
>Keith Owens wrote:
>> Making these calls context sensitive is a small change to SAL. The SAL
>> MCA/INIT handlers have already constructed and stored the required
>> record before calling the OS handler. All SAL has to do is :-
>>
>> * Save the address of the current record on this cpu.
>> * Call the OS handler.
>> * On return from the OS handler, clear the pointer to the current
>> record.
>> * SAL_GET_STATE_INFO - if the current record pointer is set then return
>> that record, otherwise do existing processing.
>> * SAL_CLEAR_STATE_INFO - if the current record pointer is set then
>> clear that record and the current record pointer, otherwise do
>> existing processing.
>
>That makes sense.
>
>When I was coding my OS_MCA handler, I used to think if there were something
>like SAL_{GET,CLEAR}_STATE_INFO_FROM_BOTTOM or some extra parameter for
>SAL_{GET,CLEAR}_STATE_INFO to select which log it should deal.
>
>Your proposal sounds good.
>It's more simple, and more easy.
>
>> This change is backwards compatible.
>
>Marvelous!
>
>BTW, who is in charge of receptionist for such proposal?
Intel own the SAL specification, so they have to accept the change to
make it official. Obviously it helps if all the vendors agree that the
change is useful and causes no problems. I have had no feedback from
HP, Bull or Intel (yet).
* RE: [RFC] Change the effect of SAL_{GET,CLEAR}_STATE_INFO for MCA/INIT records
From: Luck, Tony @ 2004-12-14 1:00 UTC (permalink / raw)
To: linux-ia64
>The OS MCA and, to a lesser extent, the INIT handlers have to decide if
>the MCA/INIT is recoverable or not. To make that decision, the OS
>handlers need the current MCA/INIT record because that is the only
>record that contains the data needed to make the decision. If there
>are multiple MCA/INIT records stored for a cpu then SAL_GET_STATE_INFO
>returns the top record every time, resulting in bad data and invalid
>decisions about recovery.
Isn't all the information on whether the current MCA/INIT event is
recoverable encoded in the processor state parameter ... which is
passed to the OS by SAL in r18. What do you need that requires that
you get at the error record?
-Tony
* Re: [RFC] Change the effect of SAL_{GET,CLEAR}_STATE_INFO for MCA/INIT records
From: Keith Owens @ 2004-12-14 1:26 UTC (permalink / raw)
To: linux-ia64
On Mon, 13 Dec 2004 17:00:21 -0800,
"Luck, Tony" <tony.luck@intel.com> wrote:
>>The OS MCA and, to a lesser extent, the INIT handlers have to decide if
>>the MCA/INIT is recoverable or not. To make that decision, the OS
>>handlers need the current MCA/INIT record because that is the only
>>record that contains the data needed to make the decision. If there
>>are multiple MCA/INIT records stored for a cpu then SAL_GET_STATE_INFO
>>returns the top record every time, resulting in bad data and invalid
>>decisions about recovery.
>
>Isn't all the information on whether the current MCA/INIT event is
>recoverable encoded in the processor state parameter ... which is
>passed to the OS by SAL in r18. What do you need that requires that
>you get at the error record?
The PSP bits only tell us that the event might be recoverable. We need
more data from the MCA record to make the final decision.
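As a rough sketch of that two-stage decision (the struct and field
names here are invented for illustration and do not reflect the real
PSP bit layout or the SAL record format):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summaries; not the real PSP or MCA record layout. */
struct psp_hint {
    bool maybe_recoverable;      /* what the PSP bits can tell us */
};

struct mca_record_summary {
    bool details_allow_recovery; /* what only the record can tell us */
};

/* The PSP can rule recovery out, but cannot confirm it on its own.
 * The final answer depends on the record the handler was given. */
static bool mca_is_recoverable(struct psp_hint psp,
                               const struct mca_record_summary *rec)
{
    if (!psp.maybe_recoverable)
        return false;
    if (rec == NULL)
        return false;
    return rec->details_allow_recovery;
}
```

If rec is a stale record from an earlier fatal event, the second stage
says no even though the PSP hint says maybe, which is why the handler
needs the record for the current event.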