public inbox for linux-ia64@vger.kernel.org
* Yet another MCA handler
@ 2004-01-14 10:42 Zoltan Menyhart
  2004-01-14 23:30 ` Luck, Tony
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Zoltan Menyhart @ 2004-01-14 10:42 UTC (permalink / raw)
  To: linux-ia64

This is the season of the MCA handlers :-)
Let me show you the one that Christian Cotte-Barrot and I wrote...

I'd like to take this opportunity to express our special thanks to Jenna
Hall, who gave us the initial version of the ".S" code and much help,
and also to Mani Ayyar, David Song and Tony Luck for the technical
consultations.

Our handler currently deals with the translation register errors only.
I was meant to write the recovery code for poisoned memory, too,
but I have no way to provoke this kind of error
( I do not really know what it looks like :-) )

The key features of our MCA handler are:

* Everything is CPU local ( an MCA data area is allocated and hooked
  to each "cpuinfo" structure )

* No locks

* No rendezvous
 - A rendezvous does not seem to work unless all the CPUs have been
   started up, i.e. it fails when you specify "maxcpus=<NUM>"
 - A failed rendezvous is a bad omen to start with
 - The correctable / recoverable MCAs are CPU-local matters
 - All the CPUs can handle MCAs simultaneously

* The translation registers are purged / reloaded unconditionally:
  cheaper than calling SAL_GET_STATE_INFO(MCA)

* Table driven TR purging / reload (except for the kernel stack mapping)

* All the TRs are purged before any reloading starts ( otherwise an erroneous
  TR could still conflict with a freshly reloaded one )

* SAL_CLEAR_STATE_INFO(MCA) is called only for MCAs which have been
  corrected (TR errors). For the others, recovery will be attempted by
  a fake page fault handler, by the device drivers and by the MCA daemon,
  therefore the SAL MCA log is not cleared here -- future extension :-)

* "Silent" MCA handler: no prints by default ( unless debugging )
  - Output uses locks...

* Somewhat more thorough error / status checking
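
As an illustration only (this is not the code from the patch), the
"purge everything first, then reload from a table" idea can be sketched
in user-space C. Everything here is hypothetical: the names
mca_cpu_area, tr_entry, tr_purge(), tr_insert() and mca_tr_recover(),
the slot count, and the simulated TR array standing in for the real
purge / insert instructions. A real handler purges by address range
rather than by slot, and the kernel stack mapping, which is not table
driven, is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TR 8   /* hypothetical number of simulated TR slots */

/* One row of the reload table: which slot gets which mapping. */
struct tr_entry {
    unsigned slot;
    uint64_t vaddr;
    uint64_t pte;
};

/* Simulated translation register. */
struct sim_tr {
    bool     valid;
    uint64_t vaddr;
    uint64_t pte;
};

/* CPU-local MCA data area (assumed to hang off the per-CPU "cpuinfo").
 * Only the owning CPU touches it, so no lock is needed. */
struct mca_cpu_area {
    struct sim_tr tr[NUM_TR];
    unsigned purges;
    unsigned reloads;
};

/* Invalidate one simulated TR slot. */
static void tr_purge(struct mca_cpu_area *a, unsigned slot)
{
    assert(slot < NUM_TR);
    a->tr[slot].valid = false;
    a->purges++;
}

/* Insert one mapping; it must not overlap any TR that is still valid. */
static void tr_insert(struct mca_cpu_area *a, const struct tr_entry *e)
{
    assert(e->slot < NUM_TR);
    for (unsigned i = 0; i < NUM_TR; i++)
        assert(!(a->tr[i].valid && a->tr[i].vaddr == e->vaddr));
    a->tr[e->slot] = (struct sim_tr){ .valid = true,
                                      .vaddr = e->vaddr,
                                      .pte   = e->pte };
    a->reloads++;
}

/* Unconditional recovery: purge ALL slots, then reload from the table.
 * Returns the number of table entries reloaded. */
static unsigned mca_tr_recover(struct mca_cpu_area *a,
                               const struct tr_entry *tab, size_t n)
{
    for (unsigned slot = 0; slot < NUM_TR; slot++)
        tr_purge(a, slot);
    for (size_t i = 0; i < n; i++)
        tr_insert(a, &tab[i]);
    return (unsigned)n;
}
```

If the two loops were merged into one purge-and-reload pass per slot, a
stale entry in a higher slot could still overlap a mapping inserted into
a lower slot; in this sketch that case would trip the assertion in
tr_insert().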

This patch is against the version 2.6.1 + kdb-v4.3-2.6.1-common-b0.bz2 +
kdb-v4.3-2.6.1-ia64-b0.bz2.

Testing:
- Obviously by use of an ITP
- In my next mail I'll include a patch that can insert an illegal
  translation in a TR provoking an MCA

Problems:
Neither "IA64_LOG_NEXT_BUFFER()" nor "salinfo_log_wakeup()" works :-(
I think some addresses are messed up. The system says it cannot
translate virtual address...

I'll send the patch in the next letter.
Should the list refuse it due to its length, please pick it up at our
anonymous FTP server: ftp://visibull.frec.bull.fr/pub/linux/mca/ 

Your remarks will be appreciated.

Zoltan Menyhart


* RE: Yet another MCA handler
  2004-01-14 10:42 Yet another MCA handler Zoltan Menyhart
@ 2004-01-14 23:30 ` Luck, Tony
  2004-01-14 23:32 ` Russ Anderson
  2004-01-15  9:09 ` Zoltan Menyhart
  2 siblings, 0 replies; 4+ messages in thread
From: Luck, Tony @ 2004-01-14 23:30 UTC (permalink / raw)
  To: linux-ia64

I've started looking at this ... I definitely like the idea of the
MCA data area hooked to each percpu area.

-Tony



* Re: Yet another MCA handler
  2004-01-14 10:42 Yet another MCA handler Zoltan Menyhart
  2004-01-14 23:30 ` Luck, Tony
@ 2004-01-14 23:32 ` Russ Anderson
  2004-01-15  9:09 ` Zoltan Menyhart
  2 siblings, 0 replies; 4+ messages in thread
From: Russ Anderson @ 2004-01-14 23:32 UTC (permalink / raw)
  To: linux-ia64

Zoltan Menyhart wrote:
> 
> I was meant to write the recovery code for poisoned memory, too,
> but I have no way to provoke this kind of error
> ( I do not really know what it looks like :-) )

If you get that far, take erratum 78 of the Intel Itanium 2 Processor
Specification Update (December 2003) into account:

78. L2 cache line with poison data results in unexpected fatal MCA

http://www.intel.com/design/itanium2/specupdt/251141.htm

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com


* Re: Yet another MCA handler
  2004-01-14 10:42 Yet another MCA handler Zoltan Menyhart
  2004-01-14 23:30 ` Luck, Tony
  2004-01-14 23:32 ` Russ Anderson
@ 2004-01-15  9:09 ` Zoltan Menyhart
  2 siblings, 0 replies; 4+ messages in thread
From: Zoltan Menyhart @ 2004-01-15  9:09 UTC (permalink / raw)
  To: linux-ia64

Russ Anderson wrote:
> 
> Zoltan Menyhart wrote:
> >
> > I was meant to write the recovery code for poisoned memory, too,
> > but I have no way to provoke this kind of error
> > ( I do not really know what it looks like :-) )
> 
> If you get that far, take the Intel Itanium 2 Processor
> Specification Update (December 2003) errata 78 into account
> 
> 78. L2 cache line with poison data results in unexpected fatal MCA
> 
> http://www.intel.com/design/itanium2/specupdt/251141.htm

Thanks for the hint, but as far as I can see, it's not what I'm looking for.
This is a fatal MCA. I need something classified as recoverable
(byte 10: ERR_SEVERITY in the SAL log record header).
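
To make the check concrete, here is a minimal sketch of reading that
byte from a raw record buffer. The offset (10) is the one mentioned
above; the numeric severity codes (0 = recoverable, 1 = fatal,
2 = corrected) are my understanding of the SAL specification and should
be verified against it. The function and constant names are
hypothetical, chosen only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of ERR_SEVERITY in the SAL log record header. */
#define SAL_REC_SEVERITY_OFFSET 10

/* Severity codes: assumed values, to be checked against the SAL spec. */
enum sal_err_severity {
    SAL_SEV_RECOVERABLE = 0,
    SAL_SEV_FATAL       = 1,
    SAL_SEV_CORRECTED   = 2
};

/* Return the severity byte, or -1 if the buffer is missing or too
 * short to contain it. */
static int sal_record_severity(const uint8_t *rec, size_t len)
{
    if (rec == NULL || len <= SAL_REC_SEVERITY_OFFSET)
        return -1;
    return rec[SAL_REC_SEVERITY_OFFSET];
}

/* Non-zero only for records classified as recoverable. */
static int sal_record_is_recoverable(const uint8_t *rec, size_t len)
{
    return sal_record_severity(rec, len) == SAL_SEV_RECOVERABLE;
}
```

A record produced by the erratum quoted above would carry the fatal
code here, which is why it does not help with testing the recoverable
path.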

Zoltan Menyhart

