public inbox for linux-ia64@vger.kernel.org
* Hot plug vs. reliability
@ 2004-05-27 11:52 Zoltan Menyhart
  2004-05-27 12:13 ` Richard B. Johnson
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Zoltan Menyhart @ 2004-05-27 11:52 UTC (permalink / raw)
  To: linux-ia64, linux-kernel

I've got some questions about how hot plugging can (or cannot)
ensure reliability:

When we produce machines, we run burn-in, stress, validation and
similar tests. In addition, every time a machine is switched on, a
power-on self test is executed.
When we hot plug (add, remove, swap) a component that has never been
seen before, how can we make sure that the modified machine achieves
the same MTBF as the original machine had, without passing any of the
tests I mentioned above?

There are cases where the design is not "worst case". You select
components "carefully". E.g. you use a quicker component after a
slower one to compensate for the excess delay, or you select parallel
components with similar irregularities (no problem if they are too
slow, as long as they are similarly slow). How can we match a hot
plug device with the existing ones?
Our engineers went to great lengths, e.g. to equalize the delays on
the "backplane" to make sure the signals reach the components with
the same delay.
A machine is built with CPU / memory boards from the same production
series. (And they are tested together.) What if a new series of
these boards is somewhat quicker?

A test can perform whatever "irregular" operations it wants. E.g.
the memory controller can be switched into a test mode that allows
reading / writing the memory without the intervention of the ECC
logic. One can fill the memory with some predefined pattern and
check whether the ECC logic does what it has to do. Can we do this
for memory hot plug without breaking a running OS?
Another example: we add a CPU board and we need to make sure that
the coherency dialog works correctly. Can we carry out these tests
without perturbing the already online CPUs?
How can we make sure that a freshly inserted I/O card can reach
all the memory it has to, and that it can interrupt any CPU it has
to? (Again, without breaking the OS.)
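
Going back to the ECC test-mode example above, this is roughly the kind
of test I mean (a sketch only; mc_set_test_mode(), the MC_ECC_* modes and
mc_read_syndrome() are invented names, every memory controller has its
own registers; the mapping of "p" is assumed to be uncached):

/*
 * Write good check bits, corrupt one data bit with the checker
 * bypassed, then read back through the normal path and see whether
 * the ECC logic notices.
 */
static int ecc_selftest(volatile unsigned long *p)
{
        unsigned long v, pattern = 0x5a5a5a5aUL;

        *p = pattern;                    /* normal write: data + good check bits */

        mc_set_test_mode(MC_ECC_BYPASS); /* check bits no longer updated         */
        *p = pattern ^ 1UL;              /* flip one data bit, check bits stale  */
        mc_set_test_mode(MC_ECC_NORMAL);

        v = *p;                          /* should be corrected back to pattern  */
        if (v != pattern)
                return -1;
        return mc_read_syndrome() ? 0 : -1;   /* did the ECC logic report it ?   */
}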

And now the most difficult tests: how can we make sure that no error
will go undetected? E.g. at power-on test we can deliberately
provoke "machine checks" to make sure that these kinds of errors are
safely detected. Can we really do this on a live operating system?
Resetting the machine several times (via the service processor) is
no problem then. Obviously, it is not a good idea on a running
system. What can we do in case of hangs and time-outs?

Do you know of any firmware services like "in-place testing"? I mean
the operating system invokes a specific firmware call and hands over
control of the machine temporarily (say for 1 millisecond) to the
firmware. The firmware executes a small part of the validation
test (without corrupting any data, without losing an interrupt, etc.)
and then returns to the OS. The latter resumes operation and calls
the firmware again somewhat later to run the next slice of the tests.
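
On the OS side I imagine something like this (a sketch only;
fw_inplace_test() and its FW_TEST_* return codes are imaginary, no such
service exists as far as I know):

/*
 * Hand the machine to the firmware for at most ~1 ms, with
 * interrupts masked, then resume.
 */
static void run_inplace_test_slice(void)
{
        unsigned long flags;
        long status;

        local_irq_save(flags);          /* nothing may preempt the slice      */
        status = fw_inplace_test(FW_TEST_NEXT_SLICE, 1000 /* usec budget */);
        local_irq_restore(flags);

        if (status == FW_TEST_FAILED)
                printk(KERN_ERR "in-place test: failure reported by FW\n");
}
/* called from a timer or a kernel thread, e.g. once per second per node */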

We cannot safely remove failing memory / CPUs. In most cases
it is too late. We (in the OS) can see some corrected CPU, memory, I/O
and platform errors. Yet the OS does not have, and should not have, the
knowledge of when a component is "bad enough". I think it is the firmware
that has all the information about the details of the HW events.
Do you know of any firmware services which can say something like:
"hey, remove component X, otherwise your MTBF will drop by 95 %..."?

Today HW components are sold without much testing. They say: O.K., got
a problem? Just send it back, we'll refund it. Thanks a lot. I have
just broken my system.
Should I keep a PCI test bed just to validate PCI cards before hot
plugging them in?

Do we really want hot plug at the price of compromising our MTBF?

Thanks,

Zoltán Menyhárt


* Re: Hot plug vs. reliability
  2004-05-27 11:52 Hot plug vs. reliability Zoltan Menyhart
@ 2004-05-27 12:13 ` Richard B. Johnson
  2004-05-27 14:54   ` Bill Davidsen
  2004-05-27 12:17 ` Matthias Fouquet-Lapar
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Richard B. Johnson @ 2004-05-27 12:13 UTC (permalink / raw)
  To: Zoltan.Menyhart; +Cc: linux-ia64, linux-kernel

On Thu, 27 May 2004, Zoltan Menyhart wrote:

> I've got some questions about how hot plugging can (or cannot)
> ensure reliability:
>
> When we produce machines, we run burn-in, stress, validation and
> similar tests. In addition, every time a machine is switched on, a
> power-on self test is executed.

The POST routine only verifies that some hardware "works" at the
instant it's tested. It has nothing to do with reliability.

> When we hot plug (add, remove, swap) a component that has never been
> seen before, how can we make sure that the modified machine achieves
> the same MTBF as the original machine had, without passing any of the
> tests I mentioned above?
>

If you want a highly-reliable machine of any type, the components
are normally burned-in to catch "infant mortality" problems. If
you "hot-plug" a component, that component should have undergone
the same kind of burn-in if you wish to maintain some degree
of reliability. Again a POST routine does not assure anything.
And, in fact, it's just normally initialization. If you look
at the stupid, ludicrous, "testing" done in the early IBM/PC
BIOS, you will understand that it was just some junk that
some committee decided had to be done, like moving values
around between CPU registers -- If the CPU didn't work, it
couldn't test itself -- if the CPU did work, it couldn't
test itself, etc... Just crap.

Now, memory testing has some validity because you generally
need to access it once to get all the bits into a "known"
state where the charge-pump (refresh) will keep it. However,
I doubt that much bad memory has actually been detected during
POST. It's much later, when programs or the kernel crash,
that bad memory is detected.
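
For what it's worth, the useful part of such a memory pass is hardly more
than a sweep like this (sketch only; "base" and "words" stand for a
physical range that nothing else is using yet):

/*
 * Write-then-verify sweep: its main value is putting every word (and
 * its ECC check bits) into a known state, not finding subtle faults.
 */
static int mem_sweep(volatile unsigned long *base, unsigned long words)
{
        unsigned long i, bad = 0;

        for (i = 0; i < words; i++)
                base[i] = i ^ 0xa5a5a5a5UL;      /* address-dependent pattern */
        for (i = 0; i < words; i++)
                if (base[i] != (i ^ 0xa5a5a5a5UL))
                        bad++;

        return bad ? -1 : 0;
}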

[SNIPPED...]

So the fact that POST hasn't been run when you hot-plug
a component isn't a problem. You cannot "test-in" reliability.
You need to design it in, test it to make sure it's been
built like it was designed, then burn it in to solve the
infant mortality problem.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
            Note 96.31% of all statistics are fiction.




* Re: Hot plug vs. reliability
  2004-05-27 11:52 Hot plug vs. reliability Zoltan Menyhart
  2004-05-27 12:13 ` Richard B. Johnson
@ 2004-05-27 12:17 ` Matthias Fouquet-Lapar
  2004-05-27 14:47   ` Zoltan Menyhart
  2004-05-27 15:02 ` Matthias Fouquet-Lapar
  2004-05-27 16:06 ` Russ Anderson
  3 siblings, 1 reply; 7+ messages in thread
From: Matthias Fouquet-Lapar @ 2004-05-27 12:17 UTC (permalink / raw)
  To: Zoltan.Menyhart; +Cc: linux-ia64, linux-kernel

My (limited) understanding of hotplug is that the major motivations
are administrative/reconfiguration rather than actually replacing
failing components "on-the-fly".

> When we produce machines, we run burn-in, stress, validation and
> similar tests. In addition, every time a machine is switched on, a
> power-on self test is executed.
> When we hot plug (add, remove, swap) a component that has never been
> seen before, how can we make sure that the modified machine achieves
> the same MTBF as the original machine had, without passing any of the
> tests I mentioned above?

Typically you'll have similar burn-in tests for the components. The same
is probably true for repaired components.

> There are cases where the design is not "worst case". You select
> components "carefully". E.g. you use a quicker component after a
> slower one to compensate for the excess delay, or you select parallel
> components with similar irregularities (no problem if they are too
> slow, as long as they are similarly slow). How can we match a hot
> plug device with the existing ones?

A fair number of devices cope with this; for example, electrical
driver impedance is adjusted by the device itself after a specific
number of operations. Totally invisible to SW.

> A test can perform whatever "irregular" operations it wants. E.g.
> the memory controller can be switched into a test mode that allows
> reading / writing the memory without the intervention of the ECC
> logic. One can fill the memory with some predefined pattern and
> check whether the ECC logic does what it has to do. Can we do this
> for memory hot plug without breaking a running OS?
> Another example: we add a CPU board and we need to make sure that
> the coherency dialog works correctly. Can we carry out these tests
> without perturbing the already online CPUs?
> How can we make sure that a freshly inserted I/O card can reach
> all the memory it has to, and that it can interrupt any CPU it has
> to? (Again, without breaking the OS.)

I think a lot (all ?) can be done with on-line diagnostics. Clearly adding
a defective CPU or node board which causes coherency traffic to fail
should not happen. 

> And now the most difficult tests: how can we make sure that no error
> will go undetected? E.g. at power-on test we can deliberately
> provoke "machine checks" to make sure that these kinds of errors are
> safely detected. Can we really do this on a live operating system?
> Resetting the machine several times (via the service processor) is
> no problem then. Obviously, it is not a good idea on a running
> system. What can we do in case of hangs and time-outs?
> 
> Do you know of any firmware services like "in-place testing"? I mean
> the operating system invokes a specific firmware call and hands over
> control of the machine temporarily (say for 1 millisecond) to the
> firmware. The firmware executes a small part of the validation
> test (without corrupting any data, without losing an interrupt, etc.)
> and then returns to the OS. The latter resumes operation and calls
> the firmware again somewhat later to run the next slice of the tests.
> 
> We cannot safely remove failing memory / CPUs. In most cases
> it is too late. We (in the OS) can see some corrected CPU, memory, I/O
> and platform errors. Yet the OS does not have, and should not have, the
> knowledge of when a component is "bad enough". I think it is the firmware
> that has all the information about the details of the HW events.
> Do you know of any firmware services which can say something like:
> "hey, remove component X, otherwise your MTBF will drop by 95 %..."?

That's a point where I totally disagree. I think the OS should at least
have the option to know about every failure in the system. The OS should
log these events, and I think in a fair number of cases recovery is
possible. It might impact the application, but the OS can recover. This
would not be possible from the firmware alone.

I'm currently looking at scenarios where the OS would provide hooks
for an application to implement "self-healing", i.e. the application
is notified about an uncorrectable memory error for example and can attempt
to work around it. 
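
As a very rough illustration of what the user-level side could look like
(purely hypothetical; no such notification interface exists today, and
SIGBUS delivery is just one possible design):

/*
 * The OS notifies the application of an uncorrectable memory error via
 * SIGBUS carrying the failing address; recover_region() stands for
 * whatever the application can do (recompute, reload from a checkpoint, ...).
 */
#include <signal.h>

extern void recover_region(void *addr);    /* application-specific */

static void mem_error_handler(int sig, siginfo_t *si, void *ctx)
{
        recover_region(si->si_addr);       /* page that lost its data */
}

static void install_mem_error_hook(void)
{
        struct sigaction sa;

        sa.sa_sigaction = mem_error_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);
}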

> Today HW components are sold without much testing. They say: O.K., got
> a problem? Just send it back, we'll refund it. Thanks a lot. I have
> just broken my system.

I think this really is vendor/platform specific. Many vendors will do extensive
testing of components shipped to customers in addition to root-cause failure
analysis of returned components.


Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127



* Re: Hot plug vs. reliability
  2004-05-27 12:17 ` Matthias Fouquet-Lapar
@ 2004-05-27 14:47   ` Zoltan Menyhart
  0 siblings, 0 replies; 7+ messages in thread
From: Zoltan Menyhart @ 2004-05-27 14:47 UTC (permalink / raw)
  To: Matthias Fouquet-Lapar; +Cc: linux-ia64, linux-kernel

Matthias Fouquet-Lapar wrote:
> 
> My (limited) understanding of hotplug is that the major motivations
> are administrative/reconfiguration rather than actually replacing
> failing components "on-the-fly".

I agree; in this case there is no loss of MTBF.
Yet let's call this activity run-time re-partitioning of the machine.
(Most people - me too - consider hot plugging to mean physically
plugging things in / out.)

> Typically you'll have similar burn-in tests for the components. The same
> is probably true for repaired components.

But the newcomers are tested in a different environment, with a
different tolerance range. I simply do not trust that :-)

> > There are cases where the design is not "worst case". You select
> > components "carefully". E.g. you use a quicker component after a
> > slower one to compensate for the excess delay, or you select parallel
> > components with similar irregularities (no problem if they are too
> > slow, as long as they are similarly slow). How can we match a hot
> > plug device with the existing ones?
> 
> A fair number of devices cope with this; for example, electrical
> driver impedance is adjusted by the device itself after a specific
> number of operations. Totally invisible to SW.

I do not think the timing / the delays are auto-adjusting. You select
a component X to work next to a component Y because you know that
X is "here" and Y is "there" in the tolerance range...

> > A test can perform whatever "irregular" operations it wants. E.g.
> > the memory controller can be switched into a test mode that allows
> > reading / writing the memory without the intervention of the ECC
> > logic. One can fill the memory with some predefined pattern and
> > check whether the ECC logic does what it has to do. Can we do this
> > for memory hot plug without breaking a running OS?
> > Another example: we add a CPU board and we need to make sure that
> > the coherency dialog works correctly. Can we carry out these tests
> > without perturbing the already online CPUs?
> > How can we make sure that a freshly inserted I/O card can reach
> > all the memory it has to, and that it can interrupt any CPU it has
> > to? (Again, without breaking the OS.)
> 
> I think a lot (all ?) can be done with on-line diagnostics. Clearly adding
> a defective CPU or node board which causes coherency traffic to fail
> should not happen.

Probably it is platform dependent.
I have seen our FW guys perform all kinds of black magic, e.g. pumping
data in / out of the LSIs through JTAG, "abusing" the "back doors".
Another example: we bought some Intel IA64 boards (CPUs + memory) and,
using an In-Target Probe, I saw things like switching back to i386 mode
(!!!), freely playing with cache attributes, or doing tricky
synchronization among the CPUs.
I simply have concerns...

> > We cannot safely remove failing memory / CPUs. In most cases
> > it is too late. We (in the OS) can see some corrected CPU, memory, I/O
> > and platform errors. Yet the OS does not have, and should not have, the
> > knowledge of when a component is "bad enough". I think it is the firmware
> > that has all the information about the details of the HW events.
> > Do you know of any firmware services which can say something like:
> > "hey, remove component X, otherwise your MTBF will drop by 95 %..."?
> 
> That's a point where I totally disagree. I think the OS should at least
> have the option to know about every failure in the system. The OS should
> log these events, and I think in a fair number of cases recovery is
> possible. It might impact the application, but the OS can recover. This
> would not be possible from the firmware alone.

Well, the OS can log events, why not?
Yet what do you do if you cannot boot and cannot read the log?
We've got an embedded computer (service processor) that logs everything,
and it has a private Ethernet port => you can read its log even if
the main machine is off.

I think the OS has to be platform independent. How can a platform-independent
OS know what intervention <n> errors of this or that type require?
We'll have the same binary of the OS (+ drivers) for a small desktop or
for a 32-CPU "mainframe". Only the firmware is different...

Most of our clients run a single (HPC) application on a machine.
For them, there is no point in saving the OS. I can understand that in
other environments, with many applications, it is important to save the OS.

> I'm currently looking at scenarios where the OS would provide hooks
> for an application to implement "self-healing", i.e. the application
> is notified about an uncorrectable memory error for example and can attempt
> to work around it.

Most of our clients just do not want to touch their 10-year-old rubbish
Fortran programs. If I get a hint of danger (today it does not come from
the FW), I could take a checkpoint and call for a service intervention...

> > Today HW components are sold without much testing. They say: O.K., got
> > a problem? Just send it back, we'll refund it. Thanks a lot. I have
> > just broken my system.
> 
> I think this really is vendor/platform specific. Many vendors will do extensive
> testing of components shipped to customers in addition to root-cause failure
> analysis of returned components.

Probably, we should change platform - just kidding ;-)

Thanks,

Zoltán


* Re: Hot plug vs. reliability
  2004-05-27 12:13 ` Richard B. Johnson
@ 2004-05-27 14:54   ` Bill Davidsen
  0 siblings, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2004-05-27 14:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: Zoltan.Menyhart, linux-ia64

Richard B. Johnson wrote:
> On Thu, 27 May 2004, Zoltan Menyhart wrote:
> 
> 
>>I've got some questions about how hot plugging can (or cannot)
>>ensure reliability:
>>
>>When we produce machines, we run burn-in, stress, validation and
>>similar tests. In addition, every time a machine is switched on, a
>>power-on self test is executed.
> 
> 
> The POST routine only verifies that some hardware "works" at the
> instant it's tested. It has nothing to do with reliability.
> 
> 
>>When we hot plug (add, remove, swap) a component that has never been
>>seen before, how can we make sure that the modified machine achieves
>>the same MTBF as the original machine had, without passing any of the
>>tests I mentioned above?
>>
> 
> 
> If you want a highly-reliable machine of any type, the components
> are normally burned-in to catch "infant mortality" problems. If
> you "hot-plug" a component, that component should have undergone
> the same kind of burn-in if you wish to maintain some degree
> of reliability. Again a POST routine does not assure anything.
> And, in fact, it's just normally initialization. If you look
> at the stupid, ludicrous, "testing" done in the early IBM/PC
> BIOS, you will understand that it was just some junk that
> some committee decided had to be done, like moving values
> around between CPU registers -- If the CPU didn't work, it
> couldn't test itself -- if the CPU did work, it couldn't
> test itself, etc... Just crap.
> 
> Now, memory testing has some validity because you generally
> need to access it once to get all the bits into a "known"
> state where the charge-pump (refresh) will keep it. However,
> I doubt that much bad memory has actually been detected during
> POST. It's much later, when programs or the kernel crash,
> that bad memory is detected.
> 
> [SNIPPED...]
> 
> So the fact that POST hasn't been run when you hot-plug
> a component isn't a problem. You cannot "test-in" reliability.
> You need to design it in, test it to make sure it's been
> built like it was designed, then burn it in to solve the
> infant mortality problem.

If reliability is your goal, testing at plug time is necessary but not
sufficient. It avoids kernel failures caused by trying to use devices
which are dysfunctional (the kernel copes far better with non-functional
than with broken). And some of the better drivers are far more robust at
init time than in normal operation, which is not a bad thing at all. The
init code can function as POST if it's written to do so.

Testing is a part of the reliability chain; as you note, it isn't a
substitute for all the other parts.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: Hot plug vs. reliability
  2004-05-27 11:52 Hot plug vs. reliability Zoltan Menyhart
  2004-05-27 12:13 ` Richard B. Johnson
  2004-05-27 12:17 ` Matthias Fouquet-Lapar
@ 2004-05-27 15:02 ` Matthias Fouquet-Lapar
  2004-05-27 16:06 ` Russ Anderson
  3 siblings, 0 replies; 7+ messages in thread
From: Matthias Fouquet-Lapar @ 2004-05-27 15:02 UTC (permalink / raw)
  To: Zoltan.Menyhart; +Cc: Matthias Fouquet-Lapar, linux-ia64, linux-kernel

> I agree; in this case there is no loss of MTBF.
> Yet let's call this activity run-time re-partitioning of the machine.
> (Most people - me too - consider hot plugging to mean physically
> plugging things in / out.)

You're right, it's confusing, and I made the same assumption you make:
physically moving parts. (And I worked on a system a couple of years
back where we actually had hot swap :-))

> But the newcomers are tested in a different environment, with a
> different tolerance range. I simply do not trust that :-)

Not really. It's up to the vendor and at least here at SGI we have pretty
tight rules and tolerances.

> I do not think the timing / the delays are auto-adjusting. You select
> a component X to work next to a component Y because you know that
> X is "here" and Y is "there" in the tolerance range...

They do (impedance matching). An example is the SRAMs used for CPUs with
external caches.  I've learned a lot about that :-)). You also have things
like auto-learning for echo-clock timings etc., but this is really very
platform and CPU specific.

> I think the OS has to be platform independent. How can a platform-independent
> OS know what intervention <n> errors of this or that type require?
> We'll have the same binary of the OS (+ drivers) for a small desktop or
> for a 32-CPU "mainframe". Only the firmware is different...

An OS is never platform independent; there is always a machine-dependent layer.
I'm not really concerned about the total number of errors in a system,
regardless of whether we have one, 32 or 512 CPUs. If we see a component
starting to fail, it should be isolated in order to avoid a catastrophic failure.

> Most of our clients just do not want to touch their 10-year-old rubbish
> Fortran programs. If I get a hint of danger (today it does not come from
> the FW), I could take a checkpoint and call for a service intervention...

That's a well-known problem (although I think 20 years or more is more
likely ...)
I think, however, that there are new applications coming up on large or
ultra-scale systems where more fault tolerance can be designed in at the OS,
library or even user level.

Amicalement

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127



* Re: Hot plug vs. reliability
  2004-05-27 11:52 Hot plug vs. reliability Zoltan Menyhart
                   ` (2 preceding siblings ...)
  2004-05-27 15:02 ` Matthias Fouquet-Lapar
@ 2004-05-27 16:06 ` Russ Anderson
  3 siblings, 0 replies; 7+ messages in thread
From: Russ Anderson @ 2004-05-27 16:06 UTC (permalink / raw)
  To: Zoltan.Menyhart; +Cc: linux-ia64, linux-kernel

Zoltan Menyhart wrote:
> 
> We cannot safely remove failing memory / CPUs. In most cases
> it is too late. 

This is a key point.  To get the most value out of hot plug (as
a reliability feature) the system must be able to detect and
"ride through" component failures.  Conversly, if the system
crashes on the first component failure, the ability to hot remove
the broken component has little value.

For example, memory hot plug has the most value if the system can
"ride through" an uncorrectable memory error, isolate the bad memory
(i.e. not re-use the page with the bad DIMM cells), shoot the application
that hit the uncorrectable error (or better yet, have some checkpoint/restart
mechanism to avoid killing the application), migrate data off
the physical DIMM (etc.) to get the system to the point where the
bad DIMM can be physically replaced, and then re-integrate the new memory.
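
In rough pseudo-C, the sequence above would look something like this
(every call is a placeholder for functionality that mostly does not
exist yet):

static void handle_memory_uncorrectable(unsigned long paddr,
                                        struct task_struct *victim)
{
        mark_page_bad(paddr);                   /* never re-use the page          */

        if (!checkpoint_restart(victim))        /* preferred: restart from a      */
                kill_task(victim);              /* checkpoint, else shoot the app */

        migrate_data_off_dimm(dimm_of(paddr));  /* drain the rest of the DIMM     */
        offline_dimm(dimm_of(paddr));           /* now safe to hot-remove/replace */
        /* after physical replacement: hot-add and re-integrate the new memory */
}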

My point is that a key part of the whole hot plug story is the
ability to detect and ride through the initial errors that would
prompt someone to want to replace the component.  Without that
part, the significant effort required to do the rest of the pieces has
much less value.

>                  We (in the OS) can see some corrected CPU, memory, I/O
> and platform errors. Yet the OS does not have, and should not have, the
> knowledge of when a component is "bad enough". I think it is the firmware
> that has all the information about the details of the HW events.
> Do you know of any firmware services which can say something like:
> "hey, remove component X, otherwise your MTBF will drop by 95 %..."?

The difficulty with predictive analysis is determining the exact
indicator of a potential failure.  Many times the first indication
is a fatal error that crashes the system (which is why error recovery
to "ride through" failures is so important).  Other errors, such
as memory single-bit errors, may (or may not) increase the probability of
failure, but do they increase the probability enough to warrant
a service action?  (Service actions have costs, too.)
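
For the single-bit case, about the best one can do today is a crude rate
threshold, something like the sketch below (the numbers are arbitrary,
and picking them is exactly the hard part):

#define CE_WINDOW_SECS  (24 * 60 * 60)          /* look at the last 24 hours   */
#define CE_LIMIT        24                      /* corrected errors tolerated  */

struct dimm_ce_state {
        unsigned long count;
        unsigned long window_start;             /* in seconds                  */
};

/* call on every corrected error; returns 1 if a service action is advised */
static int dimm_needs_service(struct dimm_ce_state *d, unsigned long now)
{
        if (now - d->window_start > CE_WINDOW_SECS) {
                d->window_start = now;          /* start a fresh window        */
                d->count = 0;
        }
        return ++d->count > CE_LIMIT;
}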

A technical difficulty with predictive analysis is that each component
has different failure characteristics, and the failure characteristics
can change with specific technologies.  For example, smaller die
technologies can increase the soft failure rates.  And by the
time the long-term failure characteristics are fully understood,
the technology is obsolete.  :-(

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

