[RFC] Better MCA recovery on IPF

public inbox for linux-ia64@vger.kernel.org

* [RFC] Better MCA recovery on IPF
@ 2003-10-27  8:07 Hidetoshi Seto
  2003-10-27 16:58 ` Matthias Fouquet-Lapar
                   ` (26 more replies)
  0 siblings, 27 replies; 28+ messages in thread
From: Hidetoshi Seto @ 2003-10-27  8:07 UTC (permalink / raw)
  To: linux-ia64

I want to make contributions to the development of MCA Error Handling.

According to the IPF Error Handling Guide, the OS should be capable of recovering
from errors.

There are three types of error: Corrected, Recoverable, and Fatal. They are
reported to the OS by MCA/CPEI/CMCI, and the action required of the OS depends
on the type. The relation between type and action is as follows:

 - Corrected:
     Do nothing.

 - Recoverable:
     Depends on the situation:
     - Fix the error and continue the interrupted thread.
     - Terminate the affected threads.
     - Treat it as Fatal and reboot.

 - Fatal:
     Reboot system immediately.

In all cases, Linux should log error information based on the SAL record,
so that user-land programs, such as fault prediction logic or a daemon
that reports errors to a remote site, can use these logs. System
administrators could also use these logs to keep their systems healthy.
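
To make the type-to-action relation above concrete, here is a minimal
sketch in C. All names (err_info, choose_action, the "fixed" and
"contained" flags) are hypothetical and only illustrate the decision;
they are not existing kernel or SAL interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Error severities as listed in the proposal. */
enum err_severity { ERR_CORRECTED, ERR_RECOVERABLE, ERR_FATAL };

/* Possible OS actions; the names are illustrative only. */
enum os_action { ACT_NONE, ACT_CONTINUE, ACT_KILL_THREADS, ACT_REBOOT };

/* Hypothetical summary of what the OS learned from the SAL record. */
struct err_info {
    enum err_severity severity;
    bool fixed;      /* error repaired, interrupted thread can resume */
    bool contained;  /* damage limited to identifiable threads */
};

/* Map an error to an action. Logging happens in every case and is
 * deliberately left out of this sketch. */
static enum os_action choose_action(const struct err_info *e)
{
    switch (e->severity) {
    case ERR_CORRECTED:
        return ACT_NONE;
    case ERR_RECOVERABLE:
        if (e->fixed)
            return ACT_CONTINUE;
        if (e->contained)
            return ACT_KILL_THREADS;
        return ACT_REBOOT;          /* handle just as Fatal */
    case ERR_FATAL:
    default:
        return ACT_REBOOT;
    }
}
```

The point of the sketch is only that "Recoverable" is the case with a
real decision in it; the other two are fixed.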

I have strong expectations that Linux will implement such recovery features.
However, Linux currently lacks recovery code, especially for recoverable MCAs.
(I know of your good work, Tony.)

I want to know what difficulties keep Linux as it is.

What do you think of error recovery on Linux?
What kind of functions, macros, structures should Linux have for recovery?

Best regards,

------

H.Seto <seto.hidetoshi@jp.fujitsu.com>


* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
@ 2003-10-27 16:58 ` Matthias Fouquet-Lapar
  2003-10-31  5:09 ` Hidetoshi Seto
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-10-27 16:58 UTC (permalink / raw)
  To: linux-ia64

Hi,

my name is Matthias Fouquet-Lapar, I'm working in SGI's 
SW platform group mainly on CPU exception and error handling.

As other members of this group, we're also looking into
changing the Linux error handling to suit the needs of
a reliable super-computer environment.

I think error handling needs to go beyond recovering from
errors and, for example, killing the affected application.
Increasing chip density will increase the soft error rate,
so it also becomes important to determine whether an error
is soft (caused, for example, by cosmic rays) or a true HW
component failure requiring a replacement.

There are also more complex error scenarios in multiple
CPU environments when for example all CPUs access a cache
line which has an error.

Traditionally we verify our error handling by error
injection as well as by running tests with real, broken
HW components for verification and regression testing.

Obviously a lot of the error handling will be very
platform dependent, but I think we should be able to come up
with a common frame set. What do you think?

Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127



* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
  2003-10-27 16:58 ` Matthias Fouquet-Lapar
@ 2003-10-31  5:09 ` Hidetoshi Seto
  2003-10-31 17:14 ` Grant Grundler
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Hidetoshi Seto @ 2003-10-31  5:09 UTC (permalink / raw)
  To: linux-ia64

Hi, Matthias.


> I think error handling needs to be extended to not only
> recover from errors and kill for example the concerned
> application. Increasing chip density will increase the
> soft error rate, so it also becomes important to determinate
> if a error is soft (caused for example by cosmic rays)
> or if it is a true HW component failure requiring a
> replacement.

Indeed, it is very important to identify where an
error comes from. I do not want to advise replacing
a component that is working correctly.

> Obviously a lot of the error handling will be very
> platform dependant, but I think we should be able to come up
> with a common frame set. What do you think ?

Of course, I agree with a common frame set.
For platforms based on IPF, I think it is
better to regard Intel's chipset as the de facto
standard.


Thanks.

------

H.Seto <seto.hidetoshi@jp.fujitsu.com>



* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
  2003-10-27 16:58 ` Matthias Fouquet-Lapar
  2003-10-31  5:09 ` Hidetoshi Seto
@ 2003-10-31 17:14 ` Grant Grundler
  2003-11-01  6:39 ` Matthias Fouquet-Lapar
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Grant Grundler @ 2003-10-31 17:14 UTC (permalink / raw)
  To: linux-ia64

On Fri, Oct 31, 2003 at 02:09:12PM +0900, Hidetoshi Seto wrote:
> In the case of platform premising IPF, I think it is
> better to regard the Intel's Chipset as the de facto
> standard.

hmm...given ia64 intel boxes I've played with have no error containment
and softfail on everything, I'm not sure that's a good choice.
Or has enough been published about the chipset to change those
behaviors?

thanks,
grant


* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (2 preceding siblings ...)
  2003-10-31 17:14 ` Grant Grundler
@ 2003-11-01  6:39 ` Matthias Fouquet-Lapar
  2003-11-01  8:38 ` Keith Owens
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-11-01  6:39 UTC (permalink / raw)
  To: linux-ia64

Hi,

> Of course, I agree with a common frame set.
> In the case of platform premising IPF, I think it is
> better to regard the Intel's Chipset as the de facto
> standard.

I think there should be an abstraction layer hiding the underlying
HW implementation. I think handling, for example, a memory error
by killing the affected user application should work on any chipset
and/or CPU architecture (if technically possible). We should not
restrict ourselves to specific platforms. I think the general trend
is that the error rate will go up because of:

	- faster off-chip frequencies
	- lower supply voltages decreasing signal/noise ratio
	- higher susceptibility to cosmic rays causing SEUs (Single Event Upsets)
	  due to smaller process. There are for example estimations that SEUs
	  will increase by a factor of 100 when going from a .13um process to .09um

Apart from burying a system under 50 feet of solid rock to avoid cosmic
rays, or improvements in HW design (chipkill will help), the only
alternative is to improve error handling and recovery.

Today an application can, for example, deal with
an unexpected event such as a div by 0. In my eyes it would be possible
for an application to also make provisions to handle memory (or cache)
errors up to a certain extent, as long as the offending VA is known.

In other words, I would prefer that application writers
have the option to recover within the application if it is possible, instead
of having the application killed (or even the OS in the current state).

Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127


* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (3 preceding siblings ...)
  2003-11-01  6:39 ` Matthias Fouquet-Lapar
@ 2003-11-01  8:38 ` Keith Owens
  2003-11-02 13:33 ` Matthias Fouquet-Lapar
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Keith Owens @ 2003-11-01  8:38 UTC (permalink / raw)
  To: linux-ia64

On Sat, 1 Nov 2003 07:39:52 +0100 (CET),
Matthias Fouquet-Lapar <mfl@sgi.com> wrote:
>> Of course, I agree with a common frame set.
>> In the case of platform premising IPF, I think it is
>> better to regard the Intel's Chipset as the de facto
>> standard.
>
>I think there should be an abstraction layer hiding the underlying
>HW implementation. I think handling for example a memory error 
>by killing the affected user application, should work on any chipset
>and/or CPU architecture (if technically possible).

We already have that interface, it is called a signal.  The kernel code
for handling these events has to be architecture dependent but, once
the data has been gathered and the decision made about which user
process to kill, we just send SEGV.

BTW, your email address includes your full hostname, instead of just
mfl@sgi.com.



* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (4 preceding siblings ...)
  2003-11-01  8:38 ` Keith Owens
@ 2003-11-02 13:33 ` Matthias Fouquet-Lapar
  2003-11-03 17:09 ` Russ Anderson
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-11-02 13:33 UTC (permalink / raw)
  To: linux-ia64

> We already have that interface, it is called a signal.  The kernel code
> for handling these events has to be architecture dependent but, once
> the data has been gathered and the decision made about which user
> process to kill, we just send SEGV.

:-) I think there is more to it. For example, we should avoid
re-allocating a page which has a hard error, so that the next
user does not stumble over it again. Maybe this information should also be
kept across a reboot for the same reason. Then you need some API to maintain
this information, for example when a DIMM is replaced and it is no longer
required to map out a page with an error.
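
As a rough idea of what such an API could look like, here is a minimal
sketch in C. The names (bad_page_list, bad_page_add, bad_page_clear_range)
and the fixed-size table are assumptions made for illustration only;
keeping the list across a reboot is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD_PAGES 64   /* arbitrary limit for this sketch */

/* Page frame numbers that must not be handed out again. */
struct bad_page_list {
    uint64_t pfn[MAX_BAD_PAGES];
    size_t count;
};

/* The allocator would call this before reusing a page. */
static bool bad_page_is_listed(const struct bad_page_list *l, uint64_t pfn)
{
    for (size_t i = 0; i < l->count; i++)
        if (l->pfn[i] == pfn)
            return true;
    return false;
}

/* Record a page with a hard error; returns false if the table is full. */
static bool bad_page_add(struct bad_page_list *l, uint64_t pfn)
{
    if (bad_page_is_listed(l, pfn))
        return true;
    if (l->count == MAX_BAD_PAGES)
        return false;
    l->pfn[l->count++] = pfn;
    return true;
}

/* Forget every bad page in [first, last], e.g. after the DIMM backing
 * that range has been replaced. Returns the number of entries removed. */
static size_t bad_page_clear_range(struct bad_page_list *l,
                                   uint64_t first, uint64_t last)
{
    size_t kept = 0, removed = 0;

    for (size_t i = 0; i < l->count; i++) {
        if (l->pfn[i] >= first && l->pfn[i] <= last)
            removed++;
        else
            l->pfn[kept++] = l->pfn[i];
    }
    l->count = kept;
    return removed;
}
```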


Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127



* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (5 preceding siblings ...)
  2003-11-02 13:33 ` Matthias Fouquet-Lapar
@ 2003-11-03 17:09 ` Russ Anderson
  2003-11-03 17:37 ` Matthias Fouquet-Lapar
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Russ Anderson @ 2003-11-03 17:09 UTC (permalink / raw)
  To: linux-ia64

Grant Grundler wrote:
On Fri, Oct 31, 2003 at 02:09:12PM +0900, Hidetoshi Seto wrote:
>> In the case of platform premising IPF, I think it is
>> better to regard the Intel's Chipset as the de facto
>> standard.
>
> hmm...given ia64 intel boxes I've played with have no error containment
> and softfail on everything, I'm not sure that's a good choice.
> Or has enough been published about the chipset to change those
> behaviors?

There are some errors on ia64 that are recoverable, with the right
SW (PAL,SAL,Linux) and chipset support.  

There are some errors on ia64 that are not recoverable, but hopefully
will be in newer cpu & chipset versions.

As Matthias points out, some of the recovery should be abstracted out
in linux to hide the underlying hardware implementation.  

For example, in the case of an application hitting a memory 
uncorrectable on a multi-processor system, the MCA will be handled 
by PAL and SAL.  If SAL can determine the failing HW physical address,
it could pass that information up to linux.  Linux could look at the
physical address and figure out which application has that address
mapped and kill the application, without crashing the system.  Linux
should also not allow that physical memory to be reused by any other
process.

Part of that recovery is platform specific (HW, PAL, SAL) but
part of it is platform independent (linux converting the physical
address, shooting the app, page handling).
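
The platform independent half could look roughly like the sketch below.
It is a toy model in plain C: the flat "mapping" table stands in for
whatever reverse-mapping information the kernel really has, and
find_victims and PAGE_SHIFT = 14 (16 KB pages) are assumptions made for
illustration, not existing interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 14   /* assumes 16 KB pages for this sketch */

/* Toy reverse map: one entry per (process, physical page) mapping. */
struct mapping {
    int pid;
    uint64_t pfn;
};

/* Collect the pids that map the page containing the failing physical
 * address reported by SAL. Stores at most max pids in out and returns
 * the number stored. */
static size_t find_victims(const struct mapping *map, size_t n,
                           uint64_t paddr, int *out, size_t max)
{
    uint64_t pfn = paddr >> PAGE_SHIFT;   /* physical address -> page frame */
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (map[i].pfn == pfn && found < max)
            out[found++] = map[i].pid;
    }
    return found;
}
```

The returned pids are the processes to signal; the page frame itself
would then be kept out of the allocator so no other process gets it.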

As for IPF being "the de facto standard", IPF is certainly the
platform I'm interested in (hence posting to linux-ia64), but others 
will have their own preference.  The platform independent parts of 
linux should have interfaces designed to work on any platform (duh).  
Actual implementation will likely be done on several different 
architectures.  

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com


* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (6 preceding siblings ...)
  2003-11-03 17:09 ` Russ Anderson
@ 2003-11-03 17:37 ` Matthias Fouquet-Lapar
  2003-11-03 17:51 ` Alberto Munoz
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-11-03 17:37 UTC (permalink / raw)
  To: linux-ia64

> For example, in the case of an application hitting a memory 
> uncorrectable on a multi-processor system, the MCA will be handled 
> by PAL and SAL.  If SAL can determine the failing HW physical address,
> it could pass that information up to linux.  Linux could look at the
> physical address and figure out which application has that address
> mapped and kill the application, without crashing the system.  Linux
> should also not allow that physical memory to be reused by any other
> process.

Hi,

I was just wondering whether a speculative load hitting a cache or
memory error causes an exception on IA64?

Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127



* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (7 preceding siblings ...)
  2003-11-03 17:37 ` Matthias Fouquet-Lapar
@ 2003-11-03 17:51 ` Alberto Munoz
  2003-11-03 17:53 ` Alberto Munoz
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Alberto Munoz @ 2003-11-03 17:51 UTC (permalink / raw)
  To: linux-ia64

When I was at HP (a good number of years ago), we (HP and Intel) spent a lot
of time trying to architect machine check behavior. Actually all of the
things you guys have been discussing were considered. Because I have not been
following up on this area in many years, I am not sure how much of the work
we did actually made it to official architecture documents, although I do
know that some of it did.

The main idea was that each layer of the machine check handling code will
either be able to transparently (to that layer) recover the error, or pass
the information up to the next layer (this information always included a flag
that would be set if the error was considered non-recoverable by the lower
layer, like for example a tag parity error on a dirty data cache line). The
layers we defined and the order in which they were executed when a machine
check abort occurred were PAL, SAL and the OS. I have seen some of this
information (although I have not checked how complete it is) in chapter 4 of
the SAL spec (Itanium Processor Family System Abstraction Layer
Specification) and section 13.3.i of the architecture spec (Intel Itanium
Architecture Software Developer's Manual, Volume 2: System Architecture). The
SAL_GET_STATE_INFO call was to be central to getting all this information to
the OS.
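
The layering described above can be sketched as a chain of handlers.
This is a minimal illustration in C, not the real PAL/SAL/OS interface:
mc_record, layer_fn, run_layers and the three example layers with their
made-up error kinds are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum layer_result { LAYER_RECOVERED, LAYER_PASS_UP };

/* Hypothetical error record handed from layer to layer. */
struct mc_record {
    int  kind;             /* made-up error code for this sketch */
    bool not_recoverable;  /* set by a lower layer that ruled recovery out */
};

typedef enum layer_result (*layer_fn)(struct mc_record *);

/* Run the layers in order (PAL, SAL, OS). Returns the index of the
 * layer that recovered the error, or -1 if no layer could. */
static int run_layers(const layer_fn *layers, size_t n, struct mc_record *rec)
{
    for (size_t i = 0; i < n; i++) {
        if (layers[i](rec) == LAYER_RECOVERED)
            return (int)i;
    }
    return -1;
}

static enum layer_result pal_layer(struct mc_record *r)
{
    if (r->kind == 1)
        return LAYER_RECOVERED;     /* handled transparently in firmware */
    if (r->kind == 3)
        r->not_recoverable = true;  /* e.g. tag parity error on a dirty line */
    return LAYER_PASS_UP;
}

static enum layer_result sal_layer(struct mc_record *r)
{
    (void)r;                        /* nothing SAL can fix in this sketch */
    return LAYER_PASS_UP;
}

static enum layer_result os_layer(struct mc_record *r)
{
    /* The OS only attempts recovery if no lower layer ruled it out. */
    return r->not_recoverable ? LAYER_PASS_UP : LAYER_RECOVERED;
}
```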

Bert Munoz



* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (8 preceding siblings ...)
  2003-11-03 17:51 ` Alberto Munoz
@ 2003-11-03 17:53 ` Alberto Munoz
  2003-11-03 18:23 ` Jack Steiner
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Alberto Munoz @ 2003-11-03 17:53 UTC (permalink / raw)
  To: linux-ia64



> Hi,
> 
> I just wondered if a speculative load hitting a cache or memory
> error does cause an exception on IA64 ? 
> 

As far as I know, the answer is yes. The processor has no way of
distinguishing speculative vs. non-speculative accesses on the buses or
caches.

Bert Munoz

> Thanks
> 
> Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com 
>  VNET 521-8213
> Principal Engineer      Silicon Graphics          Home Office 
> (+33) 1 3047 4127
> 




* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (9 preceding siblings ...)
  2003-11-03 17:53 ` Alberto Munoz
@ 2003-11-03 18:23 ` Jack Steiner
  2003-11-03 18:42 ` Alberto Munoz
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Jack Steiner @ 2003-11-03 18:23 UTC (permalink / raw)
  To: linux-ia64

On Mon, Nov 03, 2003 at 06:37:37PM +0100, Matthias Fouquet-Lapar wrote:
> > For example, in the case of an application hitting a memory 
> > uncorrectable on a multi-processor system, the MCA will be handled 
> > by PAL and SAL.  If SAL can determine the failing HW physical address,
> > it could pass that information up to linux.  Linux could look at the
> > physical address and figure out which application has that address
> > mapped and kill the application, without crashing the system.  Linux
> > should also not allow that physical memory to be reused by any other
> > process.
> 
> Hi,
> 
> I just wondered if a speculative load hitting a cache or memory
> error does cause an exception on IA64 ? 

I don't think a speculative load should cause a problem - at least until
code tries to consume the data by transferring it to a processor register.

As I understand the CPU architecture, an error that occurs reading data
will result in a poisoned cache line being delivered to the CPU cache.
The poisoned cache line can stay in the cache forever. No MCA error is
reported until the data is actually consumed by transferring the data from
cache to a CPU register.

This requires some support from the chipset. Some chipsets don't fully
support this error model.
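
A toy model of the behaviour as I have described it, in C. This only
illustrates the idea (mark on fill, report on consume); it makes no
claim about how any real CPU or chipset implements it, and cache_line,
line_fill, line_consume and mca_count are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    uint64_t data;
    bool poisoned;   /* set when the fill saw an uncorrectable error */
};

static int mca_count;   /* stands in for "an MCA was signalled" */

/* Fill the line from memory. A bad read only marks the line as
 * poisoned; nothing is reported at this point. */
static void line_fill(struct cache_line *l, uint64_t data, bool read_error)
{
    l->data = data;
    l->poisoned = read_error;
}

/* Move the line's data into a register. Only here, on consumption,
 * is the error reported. Returns false if the data was poisoned. */
static bool line_consume(const struct cache_line *l, uint64_t *reg)
{
    if (l->poisoned) {
        mca_count++;
        return false;
    }
    *reg = l->data;
    return true;
}
```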

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.


* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (10 preceding siblings ...)
  2003-11-03 18:23 ` Jack Steiner
@ 2003-11-03 18:42 ` Alberto Munoz
  2003-11-03 19:28 ` Jack Steiner
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Alberto Munoz @ 2003-11-03 18:42 UTC (permalink / raw)
  To: linux-ia64



> > Hi,
> > 
> > I just wondered if a speculative load hitting a cache or memory
> > error does cause an exception on IA64 ? 
> 
> I dont think a speculative load should cause a problem - at 
> least until 
> code tries to consume the data by transfering it to a 
> processor register.

If you are doing a read (which is what a speculative load will be
generating), the error will be generated by whatever part of the logic
detects it. You cannot possibly send poisoned data through a memory bus and a
system bus (at least not the Intel system buses I am familiar with) without
having some of the error checking logic (ECC or parity) complaining about it
(this means generating an MCA).

> As I understand the cpu architecture, an error that occurs 
> reading data
> will result in a poisoned cache line being delivered to the 
> cpu cache. 
> The poisoned cache line can stay in the cache forever. No MCA error is
> reported until the data is actually consumed by tranfering 
> the data from 
> cache to a cpu register. 

The problem is that the cache error checking logic has no way of knowing that
the data it is about to supply to some register is going to be used for a
speculative operation. The cache logic is pretty far away (in processor
terms) from the decoding logic.

Bert Munoz

> This requires some support from the chipset. Some chipsets dont fully
> support this error model.
> 
> 
> 
> 
> -- 
> Thanks
> 
> Jack Steiner (steiner@sgi.com)          651-683-5302
> Principal Engineer                      SGI - Silicon Graphics, Inc.
> 
> 




* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (11 preceding siblings ...)
  2003-11-03 18:42 ` Alberto Munoz
@ 2003-11-03 19:28 ` Jack Steiner
  2003-11-03 23:09 ` Alberto Munoz
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Jack Steiner @ 2003-11-03 19:28 UTC (permalink / raw)
  To: linux-ia64

On Mon, Nov 03, 2003 at 10:42:48AM -0800, Alberto Munoz wrote:
> 
> 
> > > Hi,
> > > 
> > > I just wondered if a speculative load hitting a cache or memory
> > > error does cause an exception on IA64 ? 
> > 
> > I dont think a speculative load should cause a problem - at 
> > least until 
> > code tries to consume the data by transfering it to a 
> > processor register.
> 
> If you are doing a read (which is what a speculative load will be
> generating), the error will be generated by whatever part of the logic that
> detects it. You cannot possible send poisoned data through a memory bus and a
> system bus (at least not the Intel system buses I am familiar with) without
> having some of the error checking logic (ECC or parity) complaining about it
> (this means generating an MCA).


As the poisoned data flows through the buses, errors may be reported, but these
errors are not reported to the OS as uncorrected/fatal MCA errors. Depending on
your chipset, errors are logged as platform errors.


There is a good paper by Tony Luck (Intel) that describes data poisoning as used
in IA64. You can find it on google or at:

	archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Luck-OLS2003.pdf

See the section on "data poisoning".

> 
> > As I understand the cpu architecture, an error that occurs 
> > reading data
> > will result in a poisoned cache line being delivered to the 
> > cpu cache. 
> > The poisoned cache line can stay in the cache forever. No MCA error is
> > reported until the data is actually consumed by tranfering 
> > the data from 
> > cache to a cpu register. 
> 
> The problem is that the cache error checking logic has no way of knowing that
> the data it is about to supply to some register is going to be used for a
> speculative operation. The cache logic is pretty far away (in processor
> terms) from the decoding logic.
> 
> Bert Munoz
> 
> > This requires some support from the chipset. Some chipsets dont fully
> > support this error model.
> > 
> > 
> > 
> > 

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (12 preceding siblings ...)
  2003-11-03 19:28 ` Jack Steiner
@ 2003-11-03 23:09 ` Alberto Munoz
  2003-11-05  4:11 ` Greg Banks
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Alberto Munoz @ 2003-11-03 23:09 UTC (permalink / raw)
  To: linux-ia64

Because I was really curious as to how much this field may have changed since
the last time I checked, I read fairly quickly through the paper you mention
below.

As stated in section 5, second paragraph, of the document you reference
below, poisoning does not apply to reads (except for delivering an MCA at any
read attempt of the poisoned data). The main value for poisoning is to avoid
delivering a machine check "out of context" when it would be caused by a
write operation. The problem is that an execution context (or thread, or
process) is allowed to retire write operations BEFORE the data has actually
been safely stored in memory. For example, you can complete a write operation
to the cache, and then have an error occur when the data is written from the
cache to main memory. Unfortunately, when this error occurs, chances are that
the original context that generated the write may no longer be executing. It
is also possible that the written data will never be used again, in which
case generating an MCA would be wasteful. Instead of generating an MCA, the
hardware marks the data as poisoned (in an implementation specific way that
allows the data to move through the memory hierarchy without generating
MCAs).

I still believe that a failed speculative read (for example of poisoned data)
will generate an MCA. Perhaps someone from Intel can confirm or deny?

Bert Munoz

> -----Original Message-----
> From: Jack Steiner [mailto:steiner@sgi.com]
> Sent: Monday, November 03, 2003 11:29 AM
> To: Alberto Munoz
> Cc: Matthias Fouquet-Lapar; Russ Anderson; linux-ia64@vger.kernel.org
> Subject: Re: [RFC] Better MCA recovery on IPF
> 
> 
> On Mon, Nov 03, 2003 at 10:42:48AM -0800, Alberto Munoz wrote:
> > 
> > 
> > > > Hi,
> > > > 
> > > > I just wondered if a speculative load hitting a cache or memory
> > > > error does cause an exception on IA64 ? 
> > > 
> > > I dont think a speculative load should cause a problem - at 
> > > least until 
> > > code tries to consume the data by transfering it to a 
> > > processor register.
> > 
> > If you are doing a read (which is what a speculative load will be
> > generating), the error will be generated by whatever part 
> of the logic that
> > detects it. You cannot possible send poisoned data through 
> a memory bus and a
> > system bus (at least not the Intel system buses I am 
> familiar with) without
> > having some of the error checking logic (ECC or parity) 
> complaining about it
> > (this means generating an MCA).
> 
> 
> As the poisoned data flows thru the BUSes, errors may be 
> reported but these errors
> are not reported to the OS as uncorrected/fatal MCA errors. 
> Depending on your
> chipset, errors are logged as platform errors. 
> 
> 
> There is a good paper by Tony Luck (Intel) that describes 
> data poisoning as used
> in IA64. You can find it on google or at:
> 
> 	archive.linuxsymposium.org/ols2003/Proceedings/ 
> All-Reprints/Reprint-Luck-OLS2003.pdf 
> 
> See the section on "data poisoning".
> 
> > 
> > > As I understand the cpu architecture, an error that occurs 
> > > reading data
> > > will result in a poisoned cache line being delivered to the 
> > > cpu cache. 
> > > The poisoned cache line can stay in the cache forever. No 
> MCA error is
> > > reported until the data is actually consumed by tranfering 
> > > the data from 
> > > cache to a cpu register. 
> > 
> > The problem is that the cache error checking logic has no way of knowing
> > that the data it is about to supply to some register is going to be used
> > for a speculative operation. The cache logic is pretty far away (in
> > processor terms) from the decoding logic.
> > 
> > Bert Munoz
> > 
> > > This requires some support from the chipset. Some chipsets don't fully
> > > support this error model.
> > > 
> > > 
> > > 
> > > 
> 
> -- 
> Thanks
> 
> Jack Steiner (steiner@sgi.com)          651-683-5302
> Principal Engineer                      SGI - Silicon Graphics, Inc.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (13 preceding siblings ...)
  2003-11-03 23:09 ` Alberto Munoz
@ 2003-11-05  4:11 ` Greg Banks
  2003-11-05 17:00 ` Luck, Tony
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Greg Banks @ 2003-11-05  4:11 UTC (permalink / raw)
  To: linux-ia64

Keith Owens wrote:
> 
> On Sat, 1 Nov 2003 07:39:52 +0100 (CET),
> Matthias Fouquet-Lapar <mfl@sgi.com> wrote:
> >I think there should be an abstraction layer hiding the underlying
> >HW implementation. I think handling for example a memory error
> >by killing the affected user application, should work on any chipset
> >and/or CPU architecture (if technically possible).
> 
> We already have that interface, it is called a signal.  The kernel code
> for handling these events has to be architecture dependent but, once
> the data has been gathered and the decision made about which user
> process to kill, we just send SEGV.

The problem with SEGV is that there exist applications which do
strange mmap/mprotect tricks and catch and retry SEGVs to implement
app-level paging-like behaviour.  Two examples which already
run on (some ports of) Linux are:

The Texas Persistent Store (open source)
http://www.iam.unibe.ch/~scg/Archive/Software/FreeDB/FreeDB.23.html

ObjectStore (commercial)
http://www.objectstore.net/products/objectstore/index.ssp

You'd have to use SIGBUS or some other signal, or add a new code
to the sigcontext to allow those apps to handle the difference.
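
As a rough illustration of the first option (a different signal), here is
a minimal user-space sketch. It is not taken from any kernel patch in this
thread. The names fault_handler, install_fault_handlers and
should_retry_fault are made up for the example, and which si_code a kernel
would actually set for a memory error is not specified here. The idea is
only that an application doing user-level paging keeps retrying on SIGSEGV
but treats SIGBUS as "the data is gone":

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Last signal number and si_code seen by the handler (illustrative only). */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;

/* SA_SIGINFO handler: records which signal arrived and its si_code. */
void fault_handler(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    assert(info != NULL);
    last_signo = signo;
    last_code = info->si_code;
}

/* Install the same handler for SIGSEGV and SIGBUS. Returns 0 on success. */
int install_fault_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/*
 * Assumed policy for a user-level pager: only SIGSEGV is part of its
 * normal mmap/mprotect protocol and may be retried; SIGBUS is taken to
 * mean the underlying data cannot be recovered.
 */
int should_retry_fault(int signo)
{
    return signo == SIGSEGV;
}
```

With a new code on the existing signal instead, the same handler would
look at last_code rather than at the signal number.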

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (14 preceding siblings ...)
  2003-11-05  4:11 ` Greg Banks
@ 2003-11-05 17:00 ` Luck, Tony
  2003-11-05 17:14 ` Alberto Munoz
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Luck, Tony @ 2003-11-05 17:00 UTC (permalink / raw)
  To: linux-ia64

> I still believe that a failed speculative read (for example 
> of poisoned data) will generate an MCA. Perhaps someone from
> Intel can confirm or deny?

It depends on exactly what you mean by "speculative read", and
even there it is not architecturally defined, so different
implementations may behave differently ("Ask not the elves for
advice for they will say both yes and no" - Tolkien).

"Speculative" reads from memory as a result of lfetch, or a code
fetch for a mispredicted branch that reference poisoned data may
not generate an MCA (since the processor can know that the poisoned
data will not be consumed).  Speculative reads from "ld.s" have
less scope to avoid the MCA.

There's a new bit coming for PAL_PROC_{GET,SET}_FEATURES which
will at least tell you (and may allow you to request, if the
implementation supports it) whether the processor will respond
to poison with CMCI, or upgrade to MCA ... watch the web for a
spec update to the SDV

-Tony

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (15 preceding siblings ...)
  2003-11-05 17:00 ` Luck, Tony
@ 2003-11-05 17:14 ` Alberto Munoz
  2003-11-05 17:30 ` Matthew Wilcox
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Alberto Munoz @ 2003-11-05 17:14 UTC (permalink / raw)
  To: linux-ia64

Hi Tony,

Thank you for the clarification.

> > I still believe that a failed speculative read (for example 
> > of poisoned data) will generate an MCA. Perhaps someone from
> > Intel can confirm or deny?
> 
> It depends on exactly what you mean by "speculative read", and
> even there it is not architecturally defined, so different
> implementations may behave differently ("Ask not the elves for
> advice for they will say both yes and no" - Tolkien).
> 
> "Speculative" reads from memory as a result of lfetch, or a code
> fetch for a mispredicted branch that reference poisoned data may
> not generate an MCA (since the processor can know that the poisoned
> data will not be consumed).  Speculative reads from "ld.s" have
> less scope to avoid the MCA.

I actually meant a read generated as a result of a ld.s. 

By the way, what do you mean by "less scope to avoid the MCA"? This statement
seems to imply that some existing implementations do it (avoid generating an
MCA when a ld.s accesses poisoned data) under some circumstances. Can you
elaborate?

What implementation of the Itanium processor supports avoiding MCAs from
lfetch or code fetch operations? I don't think Itanium 1 or 2 do this. How
about Madison?

> There's a new bit coming for PAL_PROC_{GET,SET}_FEATURES which
> will at least tell you (and may allow you to request, if the
> implementation supports it) whether the processor will respond
> to poison with CMCI, or upgrade to MCA ... watch the web for a
> spec update to the SDV

I assume this is only for the write path (not the read), since a CMCI on the
read path cannot possibly guarantee containment (that is, prevent unmarked
corrupt data from making it to a register and eventually to memory, disk or
the network). Unless the (architectural) position is that an OS sets this bit
at its own risk...

Thanks again,

Bert Munoz

> -Tony
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (16 preceding siblings ...)
  2003-11-05 17:14 ` Alberto Munoz
@ 2003-11-05 17:30 ` Matthew Wilcox
  2003-11-05 17:37 ` Alberto Munoz
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2003-11-05 17:30 UTC (permalink / raw)
  To: linux-ia64

On Wed, Nov 05, 2003 at 09:14:08AM -0800, Alberto Munoz wrote:
> What implementation of the Itanium processor supports avoiding MCAs from
> lfetch or code fetch operations? I don't think Itanium 1 or 2 do this. How
> about Madison?

Oops, misunderstanding here.  Madison and Deerfield are also Itanium 2.
It's like Coppermine and Tualatin are both Pentium 3.  When the difference
is only cache size, die size, clock frequency and so on, they're not
going to change the number.  For bigger changes, they might ;-)

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (17 preceding siblings ...)
  2003-11-05 17:30 ` Matthew Wilcox
@ 2003-11-05 17:37 ` Alberto Munoz
  2003-11-06 12:03 ` Hidetoshi Seto
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Alberto Munoz @ 2003-11-05 17:37 UTC (permalink / raw)
  To: linux-ia64



> -----Original Message-----
> From: Matthew Wilcox [mailto:willy@debian.org]
> Sent: Wednesday, November 05, 2003 9:31 AM
> To: Alberto Munoz
> Cc: Luck, Tony; Jack Steiner; Matthias Fouquet-Lapar; Russ Anderson;
> linux-ia64@vger.kernel.org
> Subject: Re: [RFC] Better MCA recovery on IPF
> 
> 
> On Wed, Nov 05, 2003 at 09:14:08AM -0800, Alberto Munoz wrote:
> > What implementation of the Itanium processor supports avoiding MCAs from
> > lfetch or code fetch operations? I don't think Itanium 1 or 2 do this. How
> > about Madison?
> 
> Oops, misunderstanding here.  Madison and Deerfield are also Itanium 2.
> It's like Coppermine and Tualatin are both Pentium 3.  When the difference
> is only cache size, die size, clock frequency and so on, they're not
> going to change the number.  For bigger changes, they might ;-)

Not really a misunderstanding. I just got a little loose with the names. I
should have said Merced and McKinley instead of Itanium 1 and 2.

In any case, it has been my experience that sometimes there are fairly major
changes in RAS features (typically addressing shortcomings of a predecessor)
between processors of the same vintage (McKinley, Madison and Deerfield).

Bert Munoz
> -- 
> "It's not Hollywood.  War is real, war is primarily not about defeat or
> victory, it is about death.  I've seen thousands and thousands of dead bodies.
> Do you think I want to have an academic debate on this subject?" -- Robert Fisk
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (18 preceding siblings ...)
  2003-11-05 17:37 ` Alberto Munoz
@ 2003-11-06 12:03 ` Hidetoshi Seto
  2003-11-06 14:23 ` Matthias Fouquet-Lapar
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Hidetoshi Seto @ 2003-11-06 12:03 UTC (permalink / raw)
  To: linux-ia64

Thanks, all.

I was pleased to hear so many thoughts from active developers.

I want to check some points about MCA recovery.
Roughly speaking, there are three levels:

 - Stop the whole system on MCA, even if the kernel is not affected.
 - Keep working, except for the affected application.
 - Keep working, and give the affected application a chance to recover
   by itself.

Right now, Linux is making an effort to step up from the first level to
the second.

There also seem to be some key procedures, such as:

 - Identify the process/thread(s) that must be killed
 - Identify damaged resources (for example, poisoned pages)

To identify the affected ones, the OS requires:

 - Physical address (and if possible, Virtual address) of the offending
   operation.
 - Interruption must be synchronized.
 - States in registers when interrupted.

Is this wrong?
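
To make the three requirements concrete, here is a small sketch in C of
how a handler might decide between the first and second levels. It is
only an illustration under assumptions: struct mca_report,
mca_choose_action and the enum values are hypothetical names, not
existing kernel interfaces, and a real handler would have to fill these
fields from the SAL error record and the saved processor state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Possible outcomes in this sketch (hypothetical names). */
enum mca_action {
    MCA_ACTION_REBOOT,     /* first level: stop the whole system */
    MCA_ACTION_KILL_TASK   /* second level: kill only the affected task */
};

/* The pieces of information listed above, gathered into one record. */
struct mca_report {
    bool     paddr_valid;  /* physical address of the offending operation known */
    uint64_t paddr;        /* the physical address, if paddr_valid */
    bool     synchronous;  /* MCA was raised in the context of the offender */
    bool     user_mode;    /* saved register state shows a user-level access */
};

/*
 * Kill only the affected task when all three requirements are met and
 * the kernel itself was not the consumer; otherwise fall back to reboot.
 */
enum mca_action mca_choose_action(const struct mca_report *r)
{
    assert(r != NULL);

    if (!r->paddr_valid || !r->synchronous || !r->user_mode)
        return MCA_ACTION_REBOOT;

    return MCA_ACTION_KILL_TASK;
}
```

The sketch is deliberately conservative: any missing piece of information
falls back to the first level.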


------

H.Seto <seto.hidetoshi@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (19 preceding siblings ...)
  2003-11-06 12:03 ` Hidetoshi Seto
@ 2003-11-06 14:23 ` Matthias Fouquet-Lapar
  2003-11-06 19:09 ` Luck, Tony
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-11-06 14:23 UTC (permalink / raw)
  To: linux-ia64

> I want to check some points about MCA recovery.
> Roughly speaking, there are three levels:

One of the complexities is recovery on a large-scale system, if for
example, multiple CPUs access a poisoned memory location at the same time.

Other "interesting" error scenarios arise if data is DEX with bad ECC in CPU 
A's cache and CPU B requests the line from CPU A. 

>  - Interruption must be synchronized.

I'm not sure what you mean by this.


Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (20 preceding siblings ...)
  2003-11-06 14:23 ` Matthias Fouquet-Lapar
@ 2003-11-06 19:09 ` Luck, Tony
  2003-11-07  9:58 ` Hidetoshi Seto
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Luck, Tony @ 2003-11-06 19:09 UTC (permalink / raw)
  To: linux-ia64

> By the way, what do you mean by "less scope to avoid the MCA"? This
> statement seems to imply that some existing implementations do it (avoid
> generating an MCA when a ld.s accesses poisoned data) under some
> circumstances. Can you elaborate?

I was just being vague.  In the case of "ld.s" the data is
targeted at a register ... and we don't have a way to indicate
that the contents of a register are poisoned (NaT would at least
stop someone consuming the poisoned data, but they wouldn't know
why) ... so an MCA is inevitable.  In the "lfetch" case we only
requested that the data be pulled into cache, which can keep track
of whether it is poisoned or not ... so it is architecturally possible
to avoid the MCA.

I don't have the details to hand on how existing implementations
actually handle this.

-Tony

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (21 preceding siblings ...)
  2003-11-06 19:09 ` Luck, Tony
@ 2003-11-07  9:58 ` Hidetoshi Seto
  2003-11-07 10:52 ` Matthias Fouquet-Lapar
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Hidetoshi Seto @ 2003-11-07  9:58 UTC (permalink / raw)
  To: linux-ia64

Hi.

> One of the complexities is recovery on a large-scale system, if for
> example, multiple CPUs access a poisoned memory location at the same time.
> 
> Other "interesting" error scenarios arise if data is DEX with bad ECC in CPU 
> A's cache and CPU B requests the line from CPU A. 

My concern about poisoning is that I'm not sure how to clear the poisoned
data. Perhaps not many people know the right timing and a guaranteed procedure.
I can estimate what the procedure includes, such as changing poisoned memory
to uncacheable, clearing suspect data in cache, and storing zeros to the
poisoned area.
Even for a single poisoned line in memory, is it necessary to pause all CPUs
on a large-scale system, as with a Global MCA?

> >  - Interruption must be synchronized.
> 
> I'm not sure what you mean by this.

What I meant, in my poor English, is a synchronous MCA.
With an asynchronous MCA from the platform, the executing process may
already have changed.

Thanks.

------

H.Seto <seto.hidetoshi@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (22 preceding siblings ...)
  2003-11-07  9:58 ` Hidetoshi Seto
@ 2003-11-07 10:52 ` Matthias Fouquet-Lapar
  2003-11-08  1:15 ` Luck, Tony
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-11-07 10:52 UTC (permalink / raw)
  To: linux-ia64

Hi,

> My concern about poisoning is that I'm not sure how to clear the poisoned
> data. Perhaps not many people know the right timing and a guaranteed procedure.
> I can estimate what the procedure includes, such as changing poisoned memory
> to uncacheable, clearing suspect data in cache, and storing zeros to the
> poisoned area.
> Even for a single poisoned line in memory, is it necessary to pause all CPUs
> on a large-scale system, as with a Global MCA?

I think before the poisoned location can be cleared, all objects having 
potential references must have been terminated (or suspended ?? but there
are a lot of problems with this).

Once the reference count of the corresponding page is 0, you should be able 
to lock the page and clear out the memory. However, you might have a hard error
in which case it probably would not be good to put the page back into
production. So either adding a flag indicating that the page is no longer
usable or attaching the page to some reaper thread might work.

( On our IRIX implementation I had also added a flag which would note that the 
  page had an increased number of SBEs, so it also would not get re-allocated.
  It's an interesting discussion whether a failure can degenerate and an SBE
  can turn into a UCE, but we might get everyone bored with that :-))
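
To make the idea concrete, here is a small self-contained sketch in C of
the "wait for the reference count to reach 0, then clear or retire"
logic. It is a user-space model only: struct sketch_page, sketch_put_page
and the flags are invented for this example and are not the kernel's
struct page or any existing kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Toy stand-in for a page descriptor (hypothetical, not struct page). */
struct sketch_page {
    int  refcount;     /* outstanding references to the page */
    bool poisoned;     /* an uncorrected error was seen in this page */
    bool hard_error;   /* the error is believed to be permanent */
    bool retired;      /* page must never be handed out again */
    unsigned char data[SKETCH_PAGE_SIZE];
};

enum put_result { PAGE_IN_USE, PAGE_FREED, PAGE_SCRUBBED, PAGE_RETIRED };

/*
 * Drop one reference. Only when the last reference is gone do we touch
 * a poisoned page: a hard error retires it, otherwise it is cleared and
 * may be reused.
 */
enum put_result sketch_put_page(struct sketch_page *pg)
{
    assert(pg != NULL && pg->refcount > 0);

    if (--pg->refcount > 0)
        return PAGE_IN_USE;

    if (!pg->poisoned)
        return PAGE_FREED;

    if (pg->hard_error) {
        pg->retired = true;
        return PAGE_RETIRED;
    }

    /* Assumed-soft error: overwrite the contents and clear the mark. */
    memset(pg->data, 0, sizeof(pg->data));
    pg->poisoned = false;
    return PAGE_SCRUBBED;
}
```

A reaper thread, as mentioned above, would simply be the consumer of the
pages that come back as PAGE_RETIRED.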

> What I meant, in my poor English, is a synchronous MCA.
> With an asynchronous MCA from the platform, the executing process may
> already have changed.

It's my french :)

Do you mean 

	synchronous MCA is caused within an execution context, for example
    a process is doing a load and hits an exception

whereas an asynchronous MCA could happen when a line is written back
to main memory and this could happen outside of the process's context?

Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (23 preceding siblings ...)
  2003-11-07 10:52 ` Matthias Fouquet-Lapar
@ 2003-11-08  1:15 ` Luck, Tony
  2003-11-08  7:36 ` Matthias Fouquet-Lapar
  2003-11-10 10:33 ` Hidetoshi Seto
  26 siblings, 0 replies; 28+ messages in thread
From: Luck, Tony @ 2003-11-08  1:15 UTC (permalink / raw)
  To: linux-ia64

> I can estimate what the procedure includes, such as changing 
> poisoned memory to uncacheable, clearing suspect data in cache, and storing 
> zeros to the poisoned area.

There is no way to tell if the error is soft/transient
and can be cleared by that sequence, or hard/permanent.

The safest option is to simply take the page with
the error out of service and not re-use it.

-Tony

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (24 preceding siblings ...)
  2003-11-08  1:15 ` Luck, Tony
@ 2003-11-08  7:36 ` Matthias Fouquet-Lapar
  2003-11-10 10:33 ` Hidetoshi Seto
  26 siblings, 0 replies; 28+ messages in thread
From: Matthias Fouquet-Lapar @ 2003-11-08  7:36 UTC (permalink / raw)
  To: linux-ia64

> > I can estimate what the procedure includes, such as changing 
> > poisoned memory to uncacheable, clearing suspect data in cache, and storing 
> > zeros to the poisoned area.
> 
> There is no way to tell if the error is soft/transient
> and can be cleared by that sequence, or hard/permanent.

I think there is. Depending on your chipset you can re-read the memory
uncached after all outstanding references have terminated. If you don't
get the same error, it is transient. 

Since I would expect the majority of errors to be transient, I think
this really is the right approach. Again, depending on the chipset architecture
you might want to do some uncached write/reads ("micro-diagnostics") to
see if the problem can be identified to confirm the nature of the problem.
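
A minimal sketch of that classification step, in C. The probe callback is
an assumption of the example: on real hardware it would be chipset
specific code doing an uncached read (or write/read) of the location and
checking whether an error was reported. The names mem_probe_fn,
classify_memory_error and the two simulated probes are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical probe: performs one uncached access to paddr and returns
 * true if the hardware reported an error for it.
 */
typedef bool (*mem_probe_fn)(uint64_t paddr, void *ctx);

enum mem_error_kind { MEM_ERROR_TRANSIENT, MEM_ERROR_HARD };

/*
 * Re-probe the location up to `attempts` times after all outstanding
 * references are gone. If the error shows up again, treat it as hard;
 * if every probe is clean, treat the original error as transient.
 */
enum mem_error_kind classify_memory_error(uint64_t paddr, mem_probe_fn probe,
                                          void *ctx, int attempts)
{
    assert(probe != NULL && attempts > 0);

    for (int i = 0; i < attempts; i++) {
        if (probe(paddr, ctx))
            return MEM_ERROR_HARD;
    }
    return MEM_ERROR_TRANSIENT;
}

/* Simulated probes standing in for real chipset access. */
bool sim_probe_clean(uint64_t paddr, void *ctx)
{
    (void)paddr;
    (void)ctx;
    return false;   /* never reports an error */
}

bool sim_probe_stuck(uint64_t paddr, void *ctx)
{
    (void)paddr;
    (void)ctx;
    return true;    /* always reports an error, like a stuck bit */
}
```

How many probes are enough, and whether a write/read pattern is needed,
depends on the chipset, so `attempts` is left to the caller.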

I used similar approaches on other architectures when figuring out if
a Single Bit Error was transient or hard. The goal was to stop triggering on
SBEs once you know that you have a hard SBE, because of the large overhead.

> The safest option is to simply take the page with
> the error out of service and not re-use it.

One problem might be that you are now missing a page of main memory, and it
might require an additional TLB entry if you use large memory segments.

- Matthias

> 
> -Tony

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Better MCA recovery on IPF
  2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
                   ` (25 preceding siblings ...)
  2003-11-08  7:36 ` Matthias Fouquet-Lapar
@ 2003-11-10 10:33 ` Hidetoshi Seto
  26 siblings, 0 replies; 28+ messages in thread
From: Hidetoshi Seto @ 2003-11-10 10:33 UTC (permalink / raw)
  To: linux-ia64

Hi.

> Do you mean 
> 
> synchronous MCA is caused within an execution context, for example
>     a process is doing a load and hits an exception
> 
> whereas an asynchronous MCA could happen when a line is written back
> to main memory and this could happen outside of the process's context?

That's right.

Thus, the platform should use data poisoning so that it is able to deliver
a synchronous MCA instead of an asynchronous MCA, even though poisoning
is not the best way.


Thanks.

------

H.Seto <seto.hidetoshi@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2003-11-10 10:33 UTC | newest]

Thread overview: 28+ messages
2003-10-27  8:07 [RFC] Better MCA recovery on IPF Hidetoshi Seto
2003-10-27 16:58 ` Matthias Fouquet-Lapar
2003-10-31  5:09 ` Hidetoshi Seto
2003-10-31 17:14 ` Grant Grundler
2003-11-01  6:39 ` Matthias Fouquet-Lapar
2003-11-01  8:38 ` Keith Owens
2003-11-02 13:33 ` Matthias Fouquet-Lapar
2003-11-03 17:09 ` Russ Anderson
2003-11-03 17:37 ` Matthias Fouquet-Lapar
2003-11-03 17:51 ` Alberto Munoz
2003-11-03 17:53 ` Alberto Munoz
2003-11-03 18:23 ` Jack Steiner
2003-11-03 18:42 ` Alberto Munoz
2003-11-03 19:28 ` Jack Steiner
2003-11-03 23:09 ` Alberto Munoz
2003-11-05  4:11 ` Greg Banks
2003-11-05 17:00 ` Luck, Tony
2003-11-05 17:14 ` Alberto Munoz
2003-11-05 17:30 ` Matthew Wilcox
2003-11-05 17:37 ` Alberto Munoz
2003-11-06 12:03 ` Hidetoshi Seto
2003-11-06 14:23 ` Matthias Fouquet-Lapar
2003-11-06 19:09 ` Luck, Tony
2003-11-07  9:58 ` Hidetoshi Seto
2003-11-07 10:52 ` Matthias Fouquet-Lapar
2003-11-08  1:15 ` Luck, Tony
2003-11-08  7:36 ` Matthias Fouquet-Lapar
2003-11-10 10:33 ` Hidetoshi Seto
