public inbox for linux-kernel@vger.kernel.org
* [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
@ 2006-06-19 19:15 Andreas Mohr
  2006-06-19 19:39 ` John Richard Moser
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Andreas Mohr @ 2006-06-19 19:15 UTC
  To: linux-kernel

Hello all,

while looking for loop places to apply cpu_relax() to, I found the
following gems:

arch/i386/kernel/crash.c/crash_nmi_callback():

        /* Assume hlt works */
        halt();
        for(;;);

        return 1;
}

arch/i386/kernel/doublefault.c/doublefault_fn():

        for (;;) /* nothing */;
}

Let's assume that we have a less than moderate fan failure that causes
the CPU to heat up beyond the critical limit...
That might result in - you guessed it - crashes or doublefaults.
In which case we enter the corresponding handler and do... what?
Exactly, we accelerate the CPU's happy march into bit heaven by letting it
execute a busy-loop under a non-working fan.
Thanks, your users will be very happy, I think ;)
(especially since it was "just" a simple fan failure that could have been
entirely remedied by buying another fan for $3)


The same thing applies to
arch/i386/kernel/smp.c/stop_this_cpu(), although there it's less catastrophic,
since that path most likely runs under normal working conditions.

IMHO on any critical CPU failure we should:
- try to log it (might be difficult with a broken CPU, though)
- optionally somehow directly alert the user
- STOP the system, COMPLETELY (that way people WILL take notice, hopefully
  before it's too late and actual damage has occurred)
- make DAMN SURE that the (possibly already broken) CPU won't have a
  less than nice time once the system is stopped

Am I completely missing something here?

If this is an issue, then maybe we should consolidate those places into
one function that safely(!) halts a CPU, optionally disabling APIC etc.
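
A minimal sketch of what such a consolidated helper might look like, so the
idea is concrete. Everything named here (cpu_stop_ops, stop_cpu_safely and
the sim_* stand-ins) is hypothetical and not existing kernel code; the hooks
only stand in for the real primitives so that the control flow can be run in
userspace. The one point it tries to show is looping around halt instead of
spinning after it, since hlt can return on an NMI or SMI even with
interrupts masked.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical hooks, named for illustration only.  In a kernel these
 * would map to something like local_irq_disable(), an APIC shutdown
 * routine, and the hlt instruction.
 */
struct cpu_stop_ops {
	void (*irq_disable)(void);
	void (*apic_disable)(void);	/* optional, may be NULL */
	void (*halt)(void);
	int (*keep_going)(void);	/* simulation only; real code loops forever */
};

/*
 * Park the CPU: mask interrupts, optionally shut down the APIC, then
 * re-issue halt after every wakeup instead of spinning in for(;;);.
 */
void stop_cpu_safely(const struct cpu_stop_ops *ops, int disable_apic)
{
	assert(ops != NULL && ops->irq_disable && ops->halt && ops->keep_going);

	ops->irq_disable();
	if (disable_apic && ops->apic_disable)
		ops->apic_disable();

	while (ops->keep_going())
		ops->halt();
}

/* Userspace stand-ins (assumptions) that only record what was called. */
int sim_irqs_off, sim_apic_off, sim_halts, sim_rounds_left;

static void sim_irq_disable(void)  { sim_irqs_off = 1; }
static void sim_apic_disable(void) { sim_apic_off = 1; }
static void sim_halt(void)         { sim_halts++; }
/* Pretend the CPU gets woken up again a fixed number of times. */
static int sim_keep_going(void)    { return sim_rounds_left-- > 0; }

/* Runs the helper for `rounds` halt rounds; returns how often halt ran. */
int simulate_stop(int rounds, int disable_apic)
{
	const struct cpu_stop_ops ops = {
		sim_irq_disable, sim_apic_disable, sim_halt, sim_keep_going
	};

	sim_irqs_off = sim_apic_off = sim_halts = 0;
	sim_rounds_left = rounds;
	stop_cpu_safely(&ops, disable_apic);
	return sim_halts;
}
```

In real code the keep_going hook would not exist and the loop would simply
be for (;;) around the halt; it is only there so the simulation terminates.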

Oh, and once you have finished processing my mail here, you could optionally
also look at my report about almost unusably broken USB:
http://lkml.org/lkml/2006/6/19/54
(no replies yet despite advanced breakage)

Thanks!

Andreas Mohr

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
@ 2006-06-20  3:30 Ken Ryan
  0 siblings, 0 replies; 26+ messages in thread
From: Ken Ryan @ 2006-06-20  3:30 UTC
  To: linux-kernel

 > > You accelerate nothing. Bit heaven? A CPU without a fan will go into
 > > a cold, cold, shutdown, requiring a hardware reset to get it out of
 > > that latched, no internal clock running, mode.
 >
 > Some CPU may do this, others will go via the random-generator mode
 > into the self-deformation-mode instead.

A few years ago Tom's Hardware Guide made a cool video as part of an 
article on thermal emergencies.  The article is here:

http://www.tomshardware.com/2001/09/17/hot_spot/index.html

The test was pulling off the CPU fan and heatsink while playing Quake.
Granted, it's not entirely realistic; I don't imagine the heatsink would
come off during heavy gameplay (a more reasonable scenario THG mentions
is the fan/heatsink coming off during shipping).  However, considering the
preposterous little tabs AMD specs for their sockets, I think sudden
breakage is not out of the question.

The video shows a PIII coping (halting), a P4 gracefully slowing down, 
and two variants of Athlon self-destructing (smoke and running solder).

Evidently this set of tests convinced AMD to alter how they handled
overtemp on their processors.  The mobos in the test were built 
according to spec in terms of the thermal sensors and protection code in 
the BIOS.  It didn't help; the exposed die of the Athlon ramped up its 
temperature way faster than the sensor could react.

As for the ceramic package cracking, it is certainly possible.  The
ceramic is indeed designed for very high temperatures, but only if 
heated evenly.  Give the package a 200C temperature differential within 
a second or two and thermal expansion is going to do some damage...

I can certainly believe modern processors deal with sudden thermal rise 
better than the ones in the THG video.  However, not all of us can afford
to always have the latest 'n' greatest... :-(

		ken

Thread overview: 26+ messages
2006-06-19 19:15 [RFC/SERIOUS] grilling troubled CPUs for fun and profit? Andreas Mohr
2006-06-19 19:39 ` John Richard Moser
2006-06-19 20:00 ` linux-os (Dick Johnson)
2006-06-19 20:23   ` Dave Jones
2006-06-19 20:47     ` linux-os (Dick Johnson)
2006-06-19 20:59       ` Dave Jones
2006-06-19 22:25     ` Pavel Machek
2006-06-19 22:41       ` Dave Jones
2006-06-20 11:39         ` linux-os (Dick Johnson)
2006-06-21 17:16           ` Ian Romanick
2006-06-21 17:57             ` linux-os (Dick Johnson)
2006-06-22 17:47         ` Pavel Machek
2006-06-20  9:58       ` Jan Engelhardt
2006-06-22 18:16         ` Pavel Machek
2006-06-23 17:32           ` Jan Engelhardt
2006-06-24 19:54             ` Pavel Machek
2006-06-25 11:01               ` Jan Engelhardt
2006-06-20  9:54     ` Jan Engelhardt
2006-06-19 21:16   ` Claudio Martins
2006-06-19 22:16 ` Pavel Machek
2006-06-19 22:43   ` Dave Jones
2006-06-20  7:29     ` Andreas Mohr
     [not found] <6pxs2-1AR-5@gated-at.bofh.it>
     [not found] ` <6pyer-2Pt-1@gated-at.bofh.it>
2006-06-19 21:40   ` Bodo Eggert
2006-06-19 21:44     ` Dave Jones
     [not found] <fa.pC0NfRl4O1eOCqPOBXy8f+7gbqU@ifi.uio.no>
     [not found] ` <fa.so5wrYE6MzA2swzlOE1Xjw9iqvk@ifi.uio.no>
2006-06-19 23:32   ` Robert Hancock
  -- strict thread matches above, loose matches on Subject: below --
2006-06-20  3:30 Ken Ryan
