public inbox for linux-ia64@vger.kernel.org
* Handling nested MCA/INIT
@ 2005-10-17 15:03 Keith Owens
  2005-10-17 16:01 ` Bryan Sutula
From: Keith Owens @ 2005-10-17 15:03 UTC (permalink / raw)
  To: linux-ia64

How should we handle nested MCA/INIT events?  There is only one PAL
minstate area per cpu so any nested MCA/INIT will overwrite the current
data, making it impossible to recover.  The best we can do with a
nested event is get some information on why the handlers died then
reboot.

The current MCA/INIT handlers run with psr.mc = 1, so nested events
cannot be delivered.  This makes it impossible to use the NMI button to
find out why the MCA/INIT handler is hung.  I am thinking of changing
mca_asm.S to set psr.mc to 0 to allow nested events.  The handlers
would detect a nested event, gather minimal diagnostics, then reboot.
Then we may be able to diagnose hung MCA/INIT handlers; right now we
get no data for this case, which is extremely frustrating.
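
As a rough illustration of the "detect a nested event" step, here is a
minimal user-space sketch in C.  It is not the mca_asm.S change itself:
the per-cpu depth counter, NR_CPUS value, and the names mca_enter(),
mca_exit() and enum mca_action are all hypothetical, chosen only to
show the idea that a second entry on the same cpu skips recovery and
goes straight to diagnostics and reboot.

```c
#include <assert.h>

#define NR_CPUS 4   /* hypothetical value, for this sketch only */

/* What the (simulated) handler should do after entry. */
enum mca_action { MCA_HANDLE_NORMALLY, MCA_DIAG_AND_REBOOT };

/* Hypothetical per-cpu nesting depth; 0 means no handler is active. */
static int mca_depth[NR_CPUS];

/* Called on entry to the simulated MCA/INIT handler for one cpu. */
enum mca_action mca_enter(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (mca_depth[cpu]++ > 0) {
        /*
         * A handler was already running on this cpu, so the single
         * minstate area has been overwritten.  Recovery is not
         * possible; gather minimal diagnostics and reboot.
         */
        return MCA_DIAG_AND_REBOOT;
    }
    return MCA_HANDLE_NORMALLY;
}

/* Called when the simulated handler returns. */
void mca_exit(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(mca_depth[cpu] > 0);
    mca_depth[cpu]--;
}
```

In this sketch the first mca_enter(0) returns MCA_HANDLE_NORMALLY and a
second mca_enter(0) before mca_exit(0) returns MCA_DIAG_AND_REBOOT; an
event on a different cpu is unaffected.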

The only downside that I can see is if the handler is accessing memory
with a hard double bit error, we could get nested MCA events.  Since
the only thing we can do if the MCA handler gets an MCA is to reboot,
the nested event is not really a problem and allowing nested MCA may
still give us better diagnostics.



* Re: Handling nested MCA/INIT
  2005-10-17 15:03 Handling nested MCA/INIT Keith Owens
@ 2005-10-17 16:01 ` Bryan Sutula
From: Bryan Sutula @ 2005-10-17 16:01 UTC (permalink / raw)
  To: linux-ia64

On Tue, 2005-10-18 at 01:03 +1000, Keith Owens wrote:

> The current MCA/INIT handlers run with psr.mc = 1, so nested events
> cannot be delivered.  This makes it impossible to use the nmi button to
> find out why the MCA/INIT handler is hung.  I am thinking of changing
> mca_asm.S to set psr.mc to 0 to allow nested events.  The handlers
> would detect a nested event, gather minimal diagnostics then reboot.
> Then we may be able to diagnose hung MCA/INIT handlers, right now we
> get no data for this case, which is extremely frustrating.

Your suggestion seems better than a hang.  In a production environment,
it's pretty important to be able to reset the machine reliably.

> The only downside that I can see is if the handler is accessing memory
> with a hard double bit error, we could get nested MCA events.  Since
> the only thing we can do if the MCA handler gets an MCA is to reboot,
> the nested event is not really a problem and allowing nested MCA may
> still give us better diagnostics.

Another issue I see is the case where a second MCA occurs fairly soon
after the first.  With your proposed change, we may lose some of the
information on the first.  (E.g., the handler wasn't hung but just
"doin' its thing".)  Would there be a way to detect the difference
without complicating the code to the point where it would be unreliable?
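
One simple possibility, offered only as a sketch: record a timestamp
when the first handler starts and compare it when a nested event
arrives.  Everything below is an assumption for illustration, including
the tick source, the HUNG_THRESHOLD_TICKS value, and the names
struct mca_state, first_event() and classify_nested().  It only
classifies the nested event; it does nothing to preserve the first
event's data.

```c
#include <assert.h>
#include <stdint.h>

#define HUNG_THRESHOLD_TICKS 1000u  /* hypothetical tuning value */

enum nested_kind { NESTED_BACK_TO_BACK, NESTED_HANDLER_HUNG };

/* Hypothetical per-cpu bookkeeping for the first (outer) event. */
struct mca_state {
    int active;            /* non-zero while the first handler runs */
    uint64_t entry_ticks;  /* when the first handler started */
};

/* Record the start of the first handler. */
void first_event(struct mca_state *s, uint64_t now)
{
    s->active = 1;
    s->entry_ticks = now;
}

/*
 * Classify a nested event: if the first handler has been running for
 * at least the threshold, assume it is hung; otherwise assume a second
 * event simply arrived soon after the first.
 */
enum nested_kind classify_nested(const struct mca_state *s, uint64_t now)
{
    assert(s->active);
    if (now - s->entry_ticks >= HUNG_THRESHOLD_TICKS)
        return NESTED_HANDLER_HUNG;
    return NESTED_BACK_TO_BACK;
}
```

The check is a single subtraction and compare, which keeps the added
code small, but the threshold is a guess and a slow but healthy handler
would be misclassified as hung.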

-- 
Bryan Sutula <Bryan.Sutula@hp.com>

