* [lm-sensors] Fw: Processes causing CPU to overheat
From: Jon Roland @ 2005-08-23 5:28 UTC (permalink / raw)
To: lm-sensors
-------- Original Message --------
Subject: Processes causing CPU to overheat
Date: Mon, 22 Aug 2005 22:11:41 +0000
From: Jon Roland <jroland@linux-migration.net>
Reply-To: jroland@linux-migration.net
Organization: Linux Migration Network
To: sensors@stimpy.netroedge.com
CC: Rob Ristroph <rgr@sdf.lonestar.org>, Les Nemeth <linuxles@yahoo.com>
One of my systems, with an Athlon 64 running Fedora Core 4, has developed an
interesting problem. With a temperature monitor I installed, the CPU runs at about
35 C normally, no matter what I am doing. But then, for no apparent reason, the
temperature rises rapidly to about 52 C and the system freezes, or sometimes
crashes. GKrellM shows a sudden rise in "nice usage" on the CPU and in both
processes and disk usage at the same time. Once the rise begins, I have only
seconds before the system freezes.
Attached are two frames from a log of top -b -c , the first from just before the
transition, the second from just after. Note the sudden appearance of ten perl
processes that consume about 10% of the CPU each.
I killed the first of those perl processes before the freezeup could occur, and
all the perl processes also died, the nice usage dropped to zero, and the CPU
temperature dropped back to the normal level of 35 C. I have since rebooted the
system, but the problem, which previously occurred after about a half-hour of a
session, has not returned, so far, after 10 hours.
We have been trying to figure out how 100% CPU usage could cause the CPU to
overheat. The only thing we can think of is that the processes might have somehow
turned off power to the CPU fan, or increased power to the CPU. Have you ever
encountered that kind of thing?
Where it gets interesting, and why I am contacting you, is that before I killed
that process, I could run sensors. Since then, when I run it, I get "No sensors
found!". I ran sensors-detect, and it seems to have run okay. Log attached. This
suggests to me that the process I killed has something to do with sensors.
Any comments or suggestions?
Please visit my website, http://www.constitution.org
-- Jon
http://www2.lm-sensors.nu/~lm78/support.html
----------------------------------------------------------------
Linux Migration Network 7793 Burnet Road #37, Austin, TX 78757
512/374-9585 www.linux-migration.net jroland@linux-migration.net
----------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: top_050822-00x.log
Type: text/x-log
Size: 20728 bytes
Desc: not available
Url : http://lists.lm-sensors.org/pipermail/lm-sensors/attachments/20050823/997d9906/top_050822-00x-0001.bin
-------------- next part --------------
[root@yara ~]# sensors
No sensors found!
[root@yara ~]# man sensors
Formatting page, please wait...
[root@yara ~]# sensors-detect
This program will help you determine which I2C/SMBus modules you need to
load to use lm_sensors most effectively. You need to have i2c and
lm_sensors installed before running this program.
Also, you need to be `root', or at least have access to the /dev/i2c-*
files, for most things.
If you have patched your kernel and have some drivers built in, you can
safely answer NO if asked to load some modules. In this case, things may
seem a bit confusing, but they will still work.
It is generally safe and recommended to accept the default answers to all
questions, unless you know what you're doing.
We can start with probing for (PCI) I2C or SMBus adapters.
You do not need any special privileges for this.
Do you want to probe now? (YES/no): y
Probing for PCI bus adapters...
Use driver `i2c-nforce2' for device 00:01.1: nVidia Corporation nForce3 250Gb SMBus (MCP)
Probe succesfully concluded.
We will now try to load each adapter module in turn.
Module `i2c-nforce2' already loaded.
If you have undetectable or unsupported adapters, you can have them
scanned by manually loading the modules before running this script.
To continue, we need module `i2c-dev' to be loaded.
If it is built-in into your kernel, you can safely skip this.
i2c-dev is not loaded. Do you want to load it now? (YES/no): y
Module loaded succesfully.
We are now going to do the adapter probings. Some adapters may hang halfway
through; we can't really help that. Also, some chips will be double detected;
we choose the one with the highest confidence value in that case.
If you found that the adapter hung after probing a certain address, you can
specify that address to remain unprobed. That often
includes address 0x69 (clock chip).
Next adapter: SMBus nForce2 adapter at 4c40
Do you want to scan it? (YES/no/selectively): y
Client found at address 0x08
Client found at address 0x4e
Probing for `National Semiconductor LM75'... Failed!
Probing for `Dallas Semiconductor DS1621'... Success!
(confidence 3, driver `ds1621')
Probing for `Analog Devices ADM1021'... Failed!
Probing for `Analog Devices ADM1021A/ADM1023'... Failed!
Probing for `Maxim MAX1617'... Failed!
Probing for `Maxim MAX1617A'... Failed!
Probing for `TI THMC10'... Failed!
Probing for `National Semiconductor LM84'... Failed!
Probing for `Genesys Logic GL523SM'... Failed!
Probing for `Onsemi MC1066'... Failed!
Probing for `Maxim MAX1619'... Failed!
Probing for `National Semiconductor LM82'... Failed!
Probing for `National Semiconductor LM83'... Failed!
Probing for `Maxim MAX6659'... Failed!
Probing for `Maxim MAX6633/MAX6634/MAX6635'... Failed!
Next adapter: SMBus nForce2 adapter at 4c00
Do you want to scan it? (YES/no/selectively): y
Client found at address 0x08
Client found at address 0x50
Probing for `SPD EEPROM'... Success!
(confidence 8, driver `eeprom')
Probing for `DDC monitor'... Failed!
Probing for `Maxim MAX6900'... Failed!
Client found at address 0x51
Probing for `SPD EEPROM'... Success!
(confidence 8, driver `eeprom')
Client found at address 0x52
Probing for `SPD EEPROM'... Success!
(confidence 8, driver `eeprom')
Client found at address 0x6a
Some chips are also accessible through the ISA bus. ISA probes are
typically a bit more dangerous, as we have to write to I/O ports to do
this. This is usually safe though.
Do you want to scan the ISA bus? (YES/no): y
Probing for `National Semiconductor LM78'
Trying address 0x0290... Failed!
Probing for `National Semiconductor LM78-J'
Trying address 0x0290... Failed!
Probing for `National Semiconductor LM79'
Trying address 0x0290... Failed!
Probing for `Winbond W83781D'
Trying address 0x0290... Failed!
Probing for `Winbond W83782D'
Trying address 0x0290... Failed!
Probing for `Winbond W83627HF'
Trying address 0x0290... Failed!
Probing for `Winbond W83627EHF'
Trying address 0x0290... Failed!
Probing for `Winbond W83697HF'
Trying address 0x0290... Failed!
Probing for `Silicon Integrated Systems SIS5595'
Trying general detect... Failed!
Probing for `VIA Technologies VT82C686 Integrated Sensors'
Trying general detect... Failed!
Probing for `VIA Technologies VT8231 Integrated Sensors'
Trying general detect... Failed!
Probing for `ITE IT8712F'
Trying address 0x0290... Success!
(confidence 8, driver `it87')
Probing for `ITE IT8705F / SiS 950'
Trying address 0x0290... Failed!
Probing for `IPMI BMC KCS'
Trying address 0x0ca0... Failed!
Probing for `IPMI BMC SMIC'
Trying address 0x0ca8... Failed!
Some Super I/O chips may also contain sensors. Super I/O probes are
typically a bit more dangerous, as we have to write to I/O ports to do
this. This is usually safe though.
Do you want to scan for Super I/O sensors? (YES/no): y
Probing for `ITE 8702F Super IO Sensors'
Failed! (0x8712)
Probing for `ITE 8705F Super IO Sensors'
Failed! (0x8712)
Probing for `ITE 8712F Super IO Sensors'
Success... found at address 0x0290
Probing for `Nat. Semi. PC87351 Super IO Fan Sensors'
Failed! (skipping family)
Probing for `SMSC 47B27x Super IO Fan Sensors'
Failed! (skipping family)
Probing for `VT1211 Super IO Sensors'
Failed! (skipping family)
Probing for `Winbond W83627EHF Super IO Sensors'
Failed! (skipping family)
Do you want to scan for secondary Super I/O sensors? (YES/no): y
Probing for `ITE 8702F Super IO Sensors'
Failed! (skipping family)
Probing for `Nat. Semi. PC87351 Super IO Fan Sensors'
Failed! (skipping family)
Probing for `SMSC 47B27x Super IO Fan Sensors'
Failed! (skipping family)
Probing for `VT1211 Super IO Sensors'
Failed! (skipping family)
Probing for `Winbond W83627EHF Super IO Sensors'
Failed! (skipping family)
Now follows a summary of the probes I have just done.
Just press ENTER to continue:
Driver `ds1621' (should be inserted):
Detects correctly:
* Bus `SMBus nForce2 adapter at 4c40'
Busdriver `i2c-nforce2', I2C address 0x4e
Chip `Dallas Semiconductor DS1621' (confidence: 3)
Driver `eeprom' (should be inserted):
Detects correctly:
* Bus `SMBus nForce2 adapter at 4c00'
Busdriver `i2c-nforce2', I2C address 0x50
Chip `SPD EEPROM' (confidence: 8)
* Bus `SMBus nForce2 adapter at 4c00'
Busdriver `i2c-nforce2', I2C address 0x51
Chip `SPD EEPROM' (confidence: 8)
* Bus `SMBus nForce2 adapter at 4c00'
Busdriver `i2c-nforce2', I2C address 0x52
Chip `SPD EEPROM' (confidence: 8)
Driver `it87' (should be inserted):
Detects correctly:
* ISA bus address 0x0290 (Busdriver `i2c-isa')
Chip `ITE 8712F Super IO Sensors' (confidence: 9)
I will now generate the commands needed to load the I2C modules.
Sometimes, a chip is available both through the ISA bus and an I2C bus.
ISA bus access is faster, but you need to load an additional driver module
for it. If you have the choice, do you want to use the ISA bus or the
I2C/SMBus (ISA/smbus)? y
To make the sensors modules behave correctly, add these lines to
/etc/modules.conf:
#----cut here----
# I2C module options
alias char-major-89 i2c-dev
#----cut here----
To load everything that is needed, add this to some /etc/rc* file:
#----cut here----
# I2C adapter drivers
modprobe i2c-nforce2
modprobe i2c-isa
# I2C chip drivers
modprobe ds1621
modprobe eeprom
modprobe it87
# sleep 2 # optional
/usr/bin/sensors -s # recommended
#----cut here----
WARNING! If you have some things built into your kernel, the list above
will contain too many modules. Skip the appropriate ones! You really should
try these commands right now to make sure everything is working properly.
Monitoring programs won't work until it's done.
Do you want to generate /etc/sysconfig/lm_sensors? (YES/no): y
Copy prog/init/lm_sensors.init to /etc/rc.d/init.d/lm_sensors
for initialization at boot time.
* [lm-sensors] Fw: Processes causing CPU to overheat
From: Jon Roland @ 2005-08-23 6:27 UTC (permalink / raw)
To: lm-sensors
Thanks for your response. No, it's not dust. Problem hasn't recurred, so I can't
open the case to see if the fan has stopped or slowed when the temp begins to
rise, but that is an obvious next move if it does.
Of course, mechanical problems would not correlate with CPU load, processes, and
disk load, stopping the CPU fan (or increasing power to the CPU) when the nice
usage (not system or user) pegs at 100%, and dropping back to normal when the
first perl process is killed, which also causes all the other perl processes to
end, and the nice usage to drop to zero. Nothing else causes the CPU temp to rise.
I have done this process kill twice now, so it is not a coincidence.
I don't know why my log file is truncated at 80 columns; the cut-off part would
reveal more information about those perl processes.
Phil Edelbrock wrote:
>
> Weird. Something I would check is to make sure that dust hasn't caused
> the CPU fan to slow (or stop). I had problems with a server this
> summer which had lots of dust, and a slightly mis-installed CPU
> heatsink. The result was rapidly rising temps (and hangs) when the
> computer was loaded.
>
> I think you need to look inside the server to see just what is going on
> there at the heatsink and CPU. Are things installed right? Is there
> dust or something causing the heatsink/fan to not work properly? Is
> the environment the server is in OK (temp and humidity should be low).
> Heatsinks sometimes have a piece of plastic which needs to be peeled
> off the heatsink compound before installation.
>
> BTW- I hope you know what's causing the strange loading (those Perl
> processes?). If that's unusual, then you could have some nasty spammer
> or somebody using your box to do evil things. BBTW- Your top output
> doesn't show much because it's cut off on the right side where
> things are just getting interesting. Oh, and I have no idea why sensor
> values disappear unless the driver or something is being unloaded(?).
>
>
> Phil
--
----------------------------------------------------------------
Linux Migration Network 7793 Burnet Road #37, Austin, TX 78757
512/374-9585 www.linux-migration.net jroland@linux-migration.net
----------------------------------------------------------------
* [lm-sensors] Fw: Processes causing CPU to overheat
From: Phil Edelbrock @ 2005-08-23 6:55 UTC (permalink / raw)
To: lm-sensors
On Aug 22, 2005, at 4:27 PM, Jon Roland wrote:
> Thanks for your response. No, it's not dust. Problem hasn't
> recurred, so I can't open the case to see if the fan has stopped or
> slowed when the temp begins to rise, but that is an obvious next
> move if it does.
Yeah, you should be able to peg that CPU to 100% usage forever
without an overheat. The ITE 8712F /does/ have a dedicated CPU fan
speed control, so it's possible something is mucking with it to make
it shut off or slow down that fan.
>
> Of course, mechanical problems would not correlate with CPU load,
> processes, and disk load, stopping the CPU fan (or increasing power
> to the CPU) when the nice usage (not system or user) pegs at 100%,
> and dropping back to normal when the first perl process is killed,
> which also causes all the other perl processes to end, and the nice
> usage to drop to zero. Nothing else causes the CPU temp to rise. I
> have done this process kill twice now, so it is not a coincidence.
Well, if you did have a slow CPU fan (say, all the time), it's likely
the CPU temp would remain low when it's not under load. It would
just make it rise a lot quicker when it got busy (as you may be
experiencing).
And, 'nice' processes will heat up a cpu as fast as any other
process. It's just a process queue priority thing. The CPU never
knows the difference between a 'nice' process and any other.
>
> I don't know why my log file is truncated at 80 columns to reveal
> more information about those perl processes.
You could try using ps instead. You can specify a columns value
there (e.g. ps aux --cols=500). It could be processes processing log
files or something.
Good luck!
Phil
* [lm-sensors] Fw: Processes causing CPU to overheat
From: Craig Sylla @ 2005-08-23 18:45 UTC (permalink / raw)
To: lm-sensors
From your description, this only happens at 100% CPU load. If that is
correct, this sounds like the system is just not cooling well enough.
It could be a bad/inadequate fan, an insufficient heatsink, bad case
airflow, etc.
There is a program called CPUburn that will give you instant 100% load
and push the CPU temp as hard as possible. I would suggest Googling
for that; I use it for stress-testing my watercooled box when I do
overclocking experiments. That will give you an on-demand way to
cause temperature rise. Then I'd check the system and see if it's
actually dropping the fan RPM or if the stock HSF set is just not up
to the task. I've seen off-the-shelf name-brand systems that were
actually not capable of running 100% load full-time due to poor case
design/layout. Mini-ITX fanless boards are a good example of this.
I use sensors just by loading the modules and checking the /sys files,
although that requires knowing the chips and what the driver code puts
into each file entry. I've never seen it leave a perl script running
even when using it normally.
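A rough sketch of what checking the /sys files from a shell script could look
like. It rests on assumptions that are not confirmed for this board: that the
driver exposes files named temp*_input, that they hold millidegrees Celsius,
and that they live under /sys/bus/i2c/devices. The function names are made up
for illustration; check the driver source for what your chip really reports.

```shell
#!/bin/sh
# Convert a raw sysfs reading (ASSUMED to be millidegrees C) to degrees C.
millideg_to_c() {
    awk -v raw="$1" 'BEGIN { printf "%.1f\n", raw / 1000 }'
}

# Print each temp*_input file found one level below a directory, followed by
# its converted value. The default directory is an assumption; pass another
# one as the first argument if your kernel puts the files elsewhere.
read_temps() {
    dir="${1:-/sys/bus/i2c/devices}"
    for f in "$dir"/*/temp*_input; do
        [ -r "$f" ] || continue          # skip an unmatched glob or unreadable file
        printf '%s %s\n' "$f" "$(millideg_to_c "$(cat "$f")")"
    done
}
```

Calling read_temps with no argument lists whatever matches under the assumed
default directory; if it prints nothing, the file names or location differ on
your system.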
Good luck!
Craig
* [lm-sensors] Processes causing CPU to overheat
From: Jon Roland @ 2005-08-23 20:27 UTC (permalink / raw)
To: lm-sensors
I am trying to write some scripts that would:
1. Sense when the CPU temperature rises above a preset threshold level, and if it
does, launch a program to remedy the condition.
2. The launched program would kill the first process named "perl -w".
3. It would then sleep for some period of time, say 30 seconds, to allow time for
the prelink process to start, which seems to always come after the perl processes,
and which itself threatens an overheat.
4. Kill the prelink process.
Some of the scripts I have written so far:
psk:
kill `ps -ef | grep "$1" | egrep -v grep | awk '{print $2}'`
psu:
ps -ef --cols=500 | grep "$1" | egrep -v grep 2>&1 | tee `date +%y%m%d%H%M`_ps.log
soh (pseudocode):
if [overheat trigger, returns true when CPU temp > 49 C]
then
psk "perl -w"
sleep 30
psk prelink
end
My problem is to figure out how to write that overheat trigger, presumably based
on the sensors command in some way. Is there something already written that would
do that? I have been searching for one and haven't found one yet.
It also appears there are two prelink processes, and it is the second one that
needs to be halted. Only the first perl process needs to be killed, and it is
distinguished by the "-w" flag.
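One possible way to pick out just the first "perl -w" process, or just the
second prelink process, instead of killing every match as psk does. This is
only a sketch: nth_pid and kill_nth are names invented here, and "first" and
"second" are assumed to mean the order in which ps lists the processes. The
pattern is passed to awk through the environment so that awk's own command
line never contains it and cannot match itself.

```shell
#!/bin/sh
# Print the PID of the Nth line (default: first) whose command contains a
# fixed string. Reads "PID COMMAND..." lines on stdin, as printed by
#   ps -eo pid=,args=
nth_pid() {
    pattern="$1"; n="${2:-1}"
    PAT="$pattern" awk -v n="$n" '
        {
            pid = $1
            line = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", line)    # drop the PID column
            if (index(line, ENVIRON["PAT"]) && ++count == n) { print pid; exit }
        }'
}

# Kill the Nth matching process if one exists; do nothing otherwise.
kill_nth() {
    pid=$(ps -eo pid=,args= | nth_pid "$1" "${2:-1}")
    [ -n "$pid" ] && kill "$pid"
}
```

With these, the two kills in soh would become something like
kill_nth "perl -w" 1 and, after the sleep, kill_nth prelink 2.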
Attached are some diagnostic log files. It is clear that the prelink process is
launched by anacron, but it is not clear what is launching the perl processes. I
have manually checked the CPU fan and it shows no sign of slowing or stopping when
the overheating condition arises.
-- Jon
----------------------------------------------------------------
Linux Migration Network 7793 Burnet Road #37, Austin, TX 78757
512/374-9585 www.linux-migration.net jroland@linux-migration.net
----------------------------------------------------------------
--
----------------------------------------------------------------
Starflight Corporation 7793 Burnet Road #37, Austin, TX 78757
512/374-9585 www.the-spa.com/jon.roland/ jon.roland@the-spa.com
----------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0508230323_sensors.log
Type: text/x-log
Size: 12967 bytes
Desc: not available
Url : http://lists.lm-sensors.org/pipermail/lm-sensors/attachments/20050823/3806f365/0508230323_sensors-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 050823_pst.log
Type: text/x-log
Size: 8169 bytes
Desc: not available
Url : http://lists.lm-sensors.org/pipermail/lm-sensors/attachments/20050823/3806f365/050823_pst-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 050823_psu.log
Type: text/x-log
Size: 1634 bytes
Desc: not available
Url : http://lists.lm-sensors.org/pipermail/lm-sensors/attachments/20050823/3806f365/050823_psu-0001.bin
* [lm-sensors] Processes causing CPU to overheat
From: Jon Roland @ 2005-08-24 17:59 UTC (permalink / raw)
To: lm-sensors
Yes, I restored sensors by running sensors-detect.
I doubt things have changed that much in going to FC4. If you could provide a
2.6.5 solution I can probably use or adapt it.
I am indeed working on a script that would extract the CPU temp from the output of
the sensors command and use it as a trigger. I was just hoping someone might have
already done something like that, preferably something a little more robust, and
that I could run in background like a watch xxx script. A cron job with only a
one-minute granularity does not seem to be fast enough for this problem, because
freezeup occurs in less than a minute once the processes begin that seem to cause it.
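A minimal sketch of the kind of trigger described above, under several
assumptions: that the sensors output has a line of the form
"label: +52.0°C ...", that the CPU sensor's label is "temp1" (it depends on
sensors.conf and can be overridden with a CPU_LABEL variable invented here),
and that polling every few seconds is acceptable. The function names are
hypothetical; nothing here ships with lm_sensors.

```shell
#!/bin/sh
# Read `sensors` output on stdin and print the first number on the line
# whose label matches, e.g. "temp1:  +52.0°C (low = ...)" gives 52.0.
cpu_temp() {
    label="${1:-temp1}"
    awk -v label="$label" -F: '
        $1 == label {
            v = $2
            gsub(/^[ \t+]+/, "", v)       # strip leading blanks and "+"
            sub(/[^0-9.].*$/, "", v)      # cut at the first non-numeric character
            print v
            exit
        }'
}

# Exit status 0 when the temperature is strictly above the limit.
is_overheated() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# watch_temp LIMIT INTERVAL COMMAND...
# Poll until the limit is exceeded, then run COMMAND once and return.
watch_temp() {
    limit="$1"; interval="$2"; shift 2
    while :; do
        t=$(sensors | cpu_temp "${CPU_LABEL:-temp1}")
        if [ -n "$t" ] && is_overheated "$t" "$limit"; then
            "$@"
            return 0
        fi
        sleep "$interval"
    done
}
```

Run in the background, for example as watch_temp 49 5 ./soh &, this would
check every 5 seconds and start the remedy script once the reading passes
49 C. Both the interval and the script path are placeholders.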
Many of the respondents are also saying I need a better heatsink/fan. Funds for
that are low right now.
Craig Sylla wrote:
> Unfortunately it varies from driver to driver. :/
>
> You would need to look at the source for the driver in question to see
> exactly what it does. Most of them actually provide a fairly useful
> value in the sys entry. Also you had mentioned earlier that sensors
> had died, have you been able to get them working again?
>
> I am also unsure of exactly what the newer methods are for this, as
> I'm working with kernel 2.6.5, which is somewhat dated now. FC4 is
> running 2.6.12 iirc, I'd rather not give you info that is wrong and
> waste your time.
>
> One possibility - if the 'sensors' command and your sensors.conf file
> are good/working/right you could just grep out the line for the temp
> and parse that for your temperature and alarm status. The command
> uses the config file to do the conversions for you and knows how to
> handle each driver correctly. You can also set thresholds in the
> config file.
>
> Craig
>
>
> On 8/23/05, Jon Roland <jon.roland@the-spa.com> wrote:
>
>>This is a tantalizing suggestion, but it is insufficient information. Could you be
>>more specific, or point me to some documentation that would help me make it work?
>>Thanks.
>>
>>Craig Sylla wrote:
>>
>>>The 'raw' driver data comes from the sys file system. You could read
>>>the temp directly (it will require some math conversion but not much).
>>> Or just check the 'alarm' value for a pass-fail type test.
>>
>>--
>>
>>----------------------------------------------------------------
>>Starflight Corporation 7793 Burnet Road #37, Austin, TX 78757
>>512/374-9585 www.the-spa.com/jon.roland/ jon.roland@the-spa.com
>>----------------------------------------------------------------
>>
>
>
--
----------------------------------------------------------------
Starflight Corporation 7793 Burnet Road #37, Austin, TX 78757
512/374-9585 www.the-spa.com/jon.roland/ jon.roland@the-spa.com
----------------------------------------------------------------
* [lm-sensors] Processes causing CPU to overheat
From: Jon Roland @ 2005-08-24 22:48 UTC (permalink / raw)
To: lm-sensors
I will explore solutions like better thermal compound and heatsink/fan
when I get some money from somewhere, like a job. In the meantime I am
forced to use methods that only require labor. The problem just caused
file damage that prevents me from logging in as an ordinary user (but I
can login as root). Adding a new user didn't allow me to login as that
user, so I conclude the file damage may have been in one of the etc/*rc
files. I had to rebuild all my email accounts as root, but using the
user mailboxes, to be able to send these messages. So far the problem
has not re-occurred running as root. In other words, I am in a race to
use software solutions to get to a point where I can contemplate hardware
solutions.
Don't know much about Iirc lm-sensors. All this is fairly new to me. It
is not clear how a daemon could be harnessed to provide a trigger in a
script unless there is a hook on it for such things.
While in the BIOS the CPU temperature shown by the Nexus monitor panel
rose to 48 C, while the BIOS showed 66 C. That is why I set the shutdown
threshold in the BIOS to 70 C, just above what I found with the BIOS
running for a while, but lower than the level that the pathological
condition I am reporting produces (52 C on the Nexus panel).
Don't know how I will find that sensor chip. I will take the system to a
meeting tonight with others who may be able to figure these things out
with me.
Craig Sylla wrote:
> One thing for the heatsink/fan would be just to remove it, clean
> everything off, and remount it using a good thermal compound (I
> recommend and use Arctic Silver). A poorly interfaced heatsink is a
> common problem. If it has the evil thermal stickytape or foam that's
> a really good reason for it not to work well, it could also just be
> poorly seated. The P4 socket 775 HSF's are notorious for this. The
> plain stock AMD HSF's are usually ok for 100% load, but you might just
> need to populate another case fan or two if there are spaces.
>
> You can't really read the sensors too often btw - the part won't
> sample more often than every few seconds and will really tie up the
> system while it's reading (kernel locks). Once/minute is pretty slow,
> but more than once every 5 seconds would really bog the machine. I
> haven't written a check program yet (it is a task I have to do, but
> other things have priority).
>
> Iirc lm-sensors comes with a daemon that can check the sensors
> periodically - is that useful for this?
>
> If the motherboard's BIOS has a screen that shows the temperature, you
> can also just leave it at that screen for a while and see what
> happens. The BIOS setup screen runs the cpu at basically full load
> (it's in a loop) and will warm it up nicely. Since Linux isn't
> running yet it would tell you if the system is just undercooled or if
> something is messing with the fan controller. You can also use the
> BIOS screens to verify that it's not overclocked.
>
> What type of chip is reading temperature (which module do you load)
> and what are your parameters for it to the sensors.conf? If I can get
> a spare bit of time I can look and see which sys object to read.
>
> Craig
--
----------------------------------------------------------------
Starflight Corporation 7793 Burnet Road #37, Austin, TX 78757
512/374-9585 www.the-spa.com/jon.roland/ jon.roland@the-spa.com
----------------------------------------------------------------