From: Matthias Fouquet-Lapar <mfl@kernel.paris.sgi.com>
To: linux-ia64@vger.kernel.org
Subject: Re: [RFC] Better MCA recovery on IPF
Date: Mon, 27 Oct 2003 16:58:08 +0000
Message-ID: <marc-linux-ia64-106727468932261@msgid-missing>
In-Reply-To: <marc-linux-ia64-106724227826901@msgid-missing>
Hi,
My name is Matthias Fouquet-Lapar. I work in SGI's
SW platform group, mainly on CPU exception and error handling.
Like other members of this group, we're also looking into
changing the Linux error handling to suit the needs of
a reliable super-computer environment.
I think error handling needs to do more than recover
from errors by, for example, killing the affected
application. Increasing chip density will increase the
soft error rate, so it also becomes important to determine
whether an error is soft (caused, for example, by cosmic rays)
or a true HW component failure requiring a
replacement.
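To illustrate the soft-versus-hard distinction, here is a minimal sketch in C.
It assumes one possible heuristic that is not taken from this thread: a
location that keeps producing corrected errors is suspected to be failing HW,
while an isolated hit is treated as soft. All names (record_error, err_track,
HARD_THRESHOLD, TRACKED_ADDRS) and the threshold value are hypothetical, and a
real implementation would also need a time window and platform-specific
address decoding.

```c
#include <stdint.h>

/* Hypothetical names and thresholds, for illustration only. */
#define TRACKED_ADDRS   16   /* size of the small tracking table */
#define HARD_THRESHOLD  3    /* hits at one address before suspecting HW */

enum err_kind { ERR_SOFT, ERR_SUSPECT_HARD };

struct err_track {
    uint64_t addr;    /* physical address of the failing location */
    unsigned count;   /* corrected errors seen at this address so far */
};

static struct err_track table[TRACKED_ADDRS];
static unsigned used;

/*
 * Record one corrected error and classify the location.
 * Assumption: repeated errors at the same address point to a failing
 * component; a single hit is treated as a soft (transient) error.
 */
enum err_kind record_error(uint64_t addr)
{
    for (unsigned i = 0; i < used; i++) {
        if (table[i].addr == addr) {
            table[i].count++;
            return table[i].count >= HARD_THRESHOLD
                       ? ERR_SUSPECT_HARD : ERR_SOFT;
        }
    }

    /* First hit at this address: start tracking it if there is room. */
    if (used < TRACKED_ADDRS) {
        table[used].addr = addr;
        table[used].count = 1;
        used++;
    }
    /* When the table is full, new addresses are simply not tracked. */
    return ERR_SOFT;
}
```

With these values, the first two errors at an address are classified as soft
and the third one flags the location as a replacement candidate.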
There are also more complex error scenarios in multi-CPU
environments, for example when all CPUs access a cache
line which has an error.
Traditionally we verify our error handling by
error injection as well as by running tests with real, broken
HW components for regression testing.
Obviously a lot of the error handling will be very
platform-dependent, but I think we should be able to come up
with a common framework. What do you think?
Thanks
Matthias Fouquet-Lapar Core Platform Software mfl@sgi.com VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127
> I want to make contributions to the development of MCA Error Handling.
>
> According to IPF Error Handling Guide, OS should have capability to recover from
> error.
>
> There are three types of error, Corrected, Recoverable, and Fatal. They are
> reported to OS by MCA/CPEI/CMCI, and actions required to OS depend on the type
> of them. Relations between the type and the action are as follows;
>
> - Corrected:
>     Do nothing.
>
> - Recoverable:
>     Depends on the situation,
>       - Fix the error, continue interrupted thread.
>       - Terminate suffered threads.
>       - Just as Fatal, reboot.
>
> - Fatal:
>     Reboot system immediately.
>
> In all case, Linux should log error information based on SAL record.
> So, some programs in user land, like fault prediction logic or
> a daemon that reports error to remote site, could use these logs. And
> system administrator also could use these logs to keep their system
> healthy.
>
>
> I have strong expectations for Linux to realize such recovery features.
> However, Linux is deficient in recovery codes, especially on recoverable MCA,
> at this moment. (I know your good job, Tony.)
>
> I want to know what difficulty keep Linux as-is.
>
> What do you think of error recovery on Linux?
> What kind of functions, macros, structures should Linux have for recovery?
>
>
> Best regards,
>
> ------
>
> H.Seto <seto.hidetoshi@jp.fujitsu.com>
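
The mapping from error type to OS action described in the quoted message can
be sketched in C as follows. This is only an illustration under stated
assumptions: the enum and function names (mca_severity, mca_action,
mca_context, mca_choose_action) are hypothetical and not existing kernel
interfaces, and the two flags used to pick among the "Recoverable" outcomes
are an assumed criterion, since the quoted text only says the choice "depends
on the situation".

```c
#include <stdbool.h>

/* The three error types named in the quoted message. */
enum mca_severity { MCA_CORRECTED, MCA_RECOVERABLE, MCA_FATAL };

/* The OS actions listed in the quoted message. */
enum mca_action {
    ACT_NONE,          /* corrected: do nothing */
    ACT_CONTINUE,      /* error fixed, continue the interrupted thread */
    ACT_KILL_THREADS,  /* terminate the affected threads */
    ACT_REBOOT         /* fatal, or recoverable but treated as fatal */
};

/* Hypothetical description of a recoverable error (assumed fields). */
struct mca_context {
    bool repaired;      /* the error could be fixed in place */
    bool in_user_mode;  /* only user-level threads were affected */
};

/* ctx is only read for MCA_RECOVERABLE and may be NULL otherwise. */
enum mca_action mca_choose_action(enum mca_severity sev,
                                  const struct mca_context *ctx)
{
    switch (sev) {
    case MCA_CORRECTED:
        return ACT_NONE;
    case MCA_RECOVERABLE:
        if (ctx->repaired)
            return ACT_CONTINUE;
        if (ctx->in_user_mode)
            return ACT_KILL_THREADS;
        return ACT_REBOOT;   /* cannot isolate the damage: same as fatal */
    case MCA_FATAL:
    default:
        return ACT_REBOOT;
    }
}
```

Logging of the SAL record, which the quoted message asks for in all cases,
would happen before this decision and is left out of the sketch.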