From: Russ Anderson <rja@sgi.com>
To: Zoltan.Menyhart@bull.net
Cc: linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Hot plug vs. reliability
Date: Thu, 27 May 2004 16:06:18 +0000
Message-ID: <200405271606.i4RG6IYC001896@ben.americas.sgi.com>
In-Reply-To: <40B5D68C.466FE969@nospam.org> from "Zoltan Menyhart" at May 27, 2004 01:52:44 PM

Zoltan Menyhart wrote:
> 
> We cannot remove safely failing memory / CPUs. In most of the cases
> it is too late. 

This is a key point.  To get the most value out of hot plug (as
a reliability feature) the system must be able to detect and
"ride through" component failures.  Conversely, if the system
crashes on the first component failure, the ability to hot remove
the broken component has little value.

For example, memory hot-plug has the most value if the system can
"ride through" a memory uncorrectable, isolate the bad memory 
(i.e. not re-use the page with the bad DIMM cells), shoot the application
that hit the uncorrectable (or better yet, have some checkpoint/restart
mechanism to avoid killing the application), migrate data off
the physical DIMM (etc) to get the system to the point that the
bad DIMM can be physically replaced, and re-integrate the new memory.
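
As a rough illustration of the isolation step only (not re-using a
page once it has taken an uncorrectable), here is a minimal user-space
sketch.  Everything in it is hypothetical: the toy_* names, the
eight-page pool and the bitmap layout are assumptions made for the
example, not an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES  8               /* hypothetical, tiny page pool */
#define TOY_NO_PAGE ((size_t)-1)    /* returned when nothing is free */

struct toy_pool {
    bool in_use[TOY_NPAGES];
    bool bad[TOY_NPAGES];   /* set once a page took an uncorrectable */
};

/*
 * Mark a page bad so the allocator never hands it out again.
 * Returns true if the page had an owner, i.e. the caller still has
 * to deal with the application that was using it.
 */
static bool toy_isolate_page(struct toy_pool *p, size_t pfn)
{
    assert(pfn < TOY_NPAGES);
    bool had_owner = p->in_use[pfn];

    p->bad[pfn] = true;
    p->in_use[pfn] = false;
    return had_owner;
}

/* Hand out the lowest-numbered page that is neither in use nor bad. */
static size_t toy_alloc_page(struct toy_pool *p)
{
    for (size_t i = 0; i < TOY_NPAGES; i++) {
        if (!p->in_use[i] && !p->bad[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return TOY_NO_PAGE;
}

/* Freeing never clears the bad flag, so a bad page stays isolated. */
static void toy_free_page(struct toy_pool *p, size_t pfn)
{
    assert(pfn < TOY_NPAGES);
    p->in_use[pfn] = false;
}
```

The sketch covers only the bookkeeping.  Handling the application,
migrating data off the DIMM and re-integrating new memory are the
harder parts and are not modelled here.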

My point is that a key part of the whole hot plug story is the
ability to detect and ride through the initial errors that would
prompt someone to want to replace the component.  Without that
part, the significant effort put into the remaining pieces delivers
much less value.

>                  We (in the OS) can see some corrected CPU, memory, I/O
> and platform errors. Yet the OS has not got and should not have the
> knowledge when a component is "enough bad". I think it is the firmware
> that has all the information about the details of the HW events.
> Do you know of some firmware services which can say something like:
> "hey, remove the component X otherwise your MTBF will drop by 95 %..." ?

The difficulty with predictive analysis is determining the exact
indicator of a potential failure.  Many times the first indication
is a fatal error that crashes the system (which is why error recovery
to "ride through" failures is so important).  Other errors, such 
as memory singlebits, may (or may not) increase the probability of
failure, but does it increase the probability enough to warrant
a service action?  (Service actions have costs, too.)
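
To make the question concrete, one common way to turn a stream of
corrected errors into a yes/no signal is a leaky bucket: each error
adds one count, counts drain away over time, and a service action is
suggested only when the level reaches a threshold.  The sketch below
is a hypothetical user-space example.  The ce_bucket names and the
threshold and drain period are assumptions, and choosing values that
actually track failure probability is exactly the hard part described
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical leaky bucket for corrected (single-bit) errors. */
struct ce_bucket {
    uint32_t level;        /* current count after draining */
    uint32_t threshold;    /* level at which service is suggested */
    uint64_t leak_period;  /* seconds for one count to drain away */
    uint64_t last_leak;    /* time up to which draining is accounted */
};

void ce_bucket_init(struct ce_bucket *b, uint32_t threshold,
                    uint64_t leak_period, uint64_t now)
{
    assert(threshold > 0 && leak_period > 0);
    b->level = 0;
    b->threshold = threshold;
    b->leak_period = leak_period;
    b->last_leak = now;
}

/*
 * Record one corrected error at time 'now' (seconds).
 * Returns true when the level has reached the threshold.
 */
bool ce_bucket_record(struct ce_bucket *b, uint64_t now)
{
    if (now > b->last_leak) {
        /* Drain one count per full leak_period that has elapsed. */
        uint64_t drained = (now - b->last_leak) / b->leak_period;

        if (drained >= b->level)
            b->level = 0;
        else
            b->level -= (uint32_t)drained;
        b->last_leak += drained * b->leak_period;
    }
    b->level++;
    return b->level >= b->threshold;
}
```

With a threshold of 3 and a 10 second drain period, three errors
within a few seconds trip the bucket, while the same three errors
spread over half a minute do not.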

A technical difficulty with predictive analysis is that each component 
has different failure characteristics, and those characteristics
can change with specific technologies.  For example, smaller die 
technologies can increase the soft failure rates.  And by the 
time the long term failure characteristics are fully understood 
the technology is obsolete.  :-(

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com
