* Re: [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42)
2013-09-27 15:57 [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42) Olavo Luppi Silva
@ 2013-09-27 16:32 ` Guenter Roeck
2013-09-28 0:06 ` Guenter Roeck
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Guenter Roeck @ 2013-09-27 16:32 UTC (permalink / raw)
To: lm-sensors
On Fri, Sep 27, 2013 at 12:57:07PM -0300, Olavo Luppi Silva wrote:
> Hi dear lm-sensors developers,
>
> My name is Olavo, I am a newbie in this group and I am writing because I'm
> facing some problems that I suspect it could be a lm-sensors bug. If it's a
> bug I would be happy to help fixing it.
>
>
> SHORT STORY:
> The workstation suddenly shuts down, usually when performing intensive
> computation. Workaround: comment line jc42 at /etc/modules apparently
> solves the problem.
>
>
>
> LONG STORY:
> We have 3 Intel workstations with the specification described below,
> running Ubuntu Linux with lm-sensors installed. In June, one of the machines
> (raphson) started to shut down suddenly during intensive computations, with
> all processors in use for several hours. The shutdown events were becoming
> more and more frequent (a shutdown every 5 minutes) and raphson was
> taken to technical assistance. They detected a hardware problem and
> replaced the motherboard, which was under warranty.
>
> Raphson returned but the shutdown events were still present, roughly every
> 12h to 24h. Then I created a script to save sensor temperatures, which
> is pasted below, and monitored the workstation for many hours. Plotting the
> temperature of sensors jc42-i2c-8-1a, jc42-i2c-8-1b, etc, I noticed some
> spikes both down (0 degrees Celsius) and up (250 C).
> Then I disabled the jc42 sensor by commenting out the jc42 line in
> /etc/modules and it apparently solves the problem. Raphson has been running
> without interruption, performing intensive computations, for 3 weeks now.
>
> I also performed the same temperature monitoring on the two other machines:
> kalman and gauss. Kalman's temperature plots are ok, but Gauss's aren't. It
> presents the same spikes and sometimes produces the following error:
> ERROR: Can't get value of subfeature temp1_input:
> Kalman has been running intensive computations without interruption for 2 weeks.
> Gauss had been running intensive computations since last week but last
> night and this morning it shut down.
> Now I suspect the jc42 sensor is causing this problem.
>
Kind of unlikely. The sometimes wrong readings suggest that the i2c connection
to the memory chips may be flaky. Another question would be if you have
configured acpi_enforce_resources=lax in your boot command line to be able to
read the sensors. If so, there may be a conflict between the BIOS and the jc42
driver trying to access the sensors.
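Whether acpi_enforce_resources=lax is active can be read from the kernel command line. The following is a minimal sketch, assuming the command line is available as a string (on Linux it can be read from /proc/cmdline); the function name `kernel_param` is hypothetical and quoted parameters containing spaces are not handled.

```python
def kernel_param(cmdline, name):
    """Return the value of a kernel parameter.

    Returns the value string, "" if the parameter is present without a
    value, or None if the parameter is absent.
    """
    for token in cmdline.split():
        key, _sep, value = token.partition("=")
        if key == name:
            return value
    return None


# Sample command lines; on a real system use open("/proc/cmdline").read().
with_lax = "quiet splash acpi_enforce_resources=lax"
without = "quiet splash"

print(kernel_param(with_lax, "acpi_enforce_resources"))
print(kernel_param(without, "acpi_enforce_resources"))
```

A result of "lax" would mean the native driver is allowed to touch resources that ACPI also claims, which is the conflict scenario described above.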
A secondary question is whether the temperature limits are set correctly, what
their values are, and whether the temperature ever comes close to them. The only
"default" activity performed by the jc42 driver is to enable the sensors. If the
temperature limits are not set or not set correctly, and the alert output from
the sensor chip is connected to a board reset or NMI, you might well observe
shutdowns.
However, the occasional error in reading sensor information is a real concern.
Again, there is either a problem in the I2C connection between the sensor and
the i2c controller, or the sensor is accessed from multiple sources at the same
time (i.e. you configured acpi_enforce_resources=lax).
Please post any relevant dmesg output as well as output from the "sensors"
command. That might help us track down the problem.
Thanks,
Guenter
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
* Re: [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42)
2013-09-27 15:57 [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42) Olavo Luppi Silva
2013-09-27 16:32 ` Guenter Roeck
@ 2013-09-28 0:06 ` Guenter Roeck
2013-09-28 9:16 ` Jean Delvare
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Guenter Roeck @ 2013-09-28 0:06 UTC (permalink / raw)
To: lm-sensors
On 09/27/2013 01:38 PM, Olavo Luppi Silva wrote:
> Hi Guenter,
> Thanks for replying.
> I didn't configure acpi_enforce_resources=lax in my boot command line. I just followed these steps to install lm-sensors:
>
Hi,
Please don't top-post, and please don't drop the mailing list from your replies.
You would not see an error in the kernel log, but something like
ACPI Warning: 0x000000000000f040-0x000000000000f05f SystemIO conflicts with Region \_SB_.PCI0.SBUS.SMBI 1 (20130517/utaddress-251)
ACPI: This conflict may cause random problems and system instability
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Let's assume you don't see that. Next question is if your system supports IPMI.
If it does, there is a slight chance that the IPMI controller accesses the SMBus,
causing an access conflict.
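Since the conflict is logged as a warning, a search for "rror" in dmesg will not find it. A minimal sketch of a more targeted search, assuming the kernel log is available as text and that the message follows the wording of the example above; the function name `acpi_conflicts` is hypothetical.

```python
import re

# Matches the ACPI resource-conflict warning and captures the region name.
CONFLICT_RE = re.compile(r"ACPI Warning: .*conflicts with Region (\S+)")


def acpi_conflicts(log_text):
    """Return the ACPI region names mentioned in resource-conflict warnings."""
    regions = []
    for line in log_text.splitlines():
        match = CONFLICT_RE.search(line)
        if match:
            regions.append(match.group(1))
    return regions


sample = (
    "ACPI Warning: 0x000000000000f040-0x000000000000f05f SystemIO conflicts "
    r"with Region \_SB_.PCI0.SBUS.SMBI 1 (20130517/utaddress-251)" "\n"
    "ACPI: This conflict may cause random problems and system instability\n"
)
print(acpi_conflicts(sample))
```

An empty list only says that no such warning was logged; it does not rule out an IPMI controller accessing the bus behind the kernel's back.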
> 1) $ sudo apt-get install lm-sensors
> 2) $ sudo sensors-detect
> 3) Paste the output of sensors-detect at the end of /etc/modprobe
>
> I didn't make any manual settings to temperature limits. I'm pasting the output of sensors -u of all three machines. Raphson is hotter than the others because it was running a computation when I probed the temperature. We can observe that some fields have zero or negative temperatures.
>
> I don't know how to set "temp_crit" and "temp_crit_alarm", or even whether the temperatures indicated in these fields are correct. The processor datasheet with thermal specifications is at http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf but I can't understand what those tables and graphs mean.
The Xeon datasheet only covers the CPU temperatures.
temp_max and temp_crit cannot be set; they are hard-coded for each CPU type.
This applies to the "coretemp" values.
>
> Below you can find the attachment of sensors -u and dmesg | grep rror of the three machines.
>
> Regards,
> Olavo
>
>
>
> ====================
> OUTPUT OF sensors -u AT RAPHSON WORKSTATION
> =====================
>
> olavo@raphson:~$ sensors -u
> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0:
> temp2_input: 60.000
> temp2_max: 79.000
> temp2_crit: 89.000
> temp2_crit_alarm: 0.000
> Core 1:
> temp3_input: 59.000
> temp3_max: 79.000
> temp3_crit: 89.000
> temp3_crit_alarm: 0.000
> Core 2:
> temp4_input: 59.000
> temp4_max: 79.000
> temp4_crit: 89.000
> temp4_crit_alarm: 0.000
> Core 8:
> temp10_input: 57.000
> temp10_max: 79.000
> temp10_crit: 89.000
> Core 9:
> temp11_input: 61.000
> temp11_max: 79.000
> temp11_crit: 89.000
> Core 10:
> temp12_input: 60.000
> temp12_max: 79.000
> temp12_crit: 89.000
>
> coretemp-isa-0001
> Adapter: ISA adapter
> Core 0:
> temp2_input: 44.000
> temp2_max: 79.000
> temp2_crit: 89.000
> temp2_crit_alarm: 0.000
> Core 1:
> temp3_input: 48.000
> temp3_max: 79.000
> temp3_crit: 89.000
> temp3_crit_alarm: 0.000
> Core 2:
> temp4_input: 45.000
> temp4_max: 79.000
> temp4_crit: 89.000
> temp4_crit_alarm: 0.000
> Core 8:
> temp10_input: 44.000
> temp10_max: 79.000
> temp10_crit: 89.000
> Core 9:
> temp11_input: 49.000
> temp11_max: 79.000
> temp11_crit: 89.000
> Core 10:
> temp12_input: 43.000
> temp12_max: 79.000
> temp12_crit: 89.000
>
>
>
> ====================
> OUTPUT OF sensors -u AT GAUSS WORKSTATION
> =====================
> olavo@gauss:~$ sensors -u
> radeon-pci-0200
> Adapter: PCI adapter
> temp1:
> temp1_input: 79.500
>
That is a bit hot. Are you running a lot of graphics output on that ?
Does the graphics card have a fan, and is it running ?
> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0:
> temp2_input: 45.000
> temp2_max: 79.000
> temp2_crit: 89.000
> temp2_crit_alarm: 0.000
> Core 1:
> temp3_input: 45.000
> temp3_max: 79.000
> temp3_crit: 89.000
> temp3_crit_alarm: 0.000
> Core 2:
> temp4_input: 45.000
> temp4_max: 79.000
> temp4_crit: 89.000
> temp4_crit_alarm: 0.000
> Core 8:
> temp10_input: 45.000
> temp10_max: 79.000
> temp10_crit: 89.000
> Core 9:
> temp11_input: 48.000
> temp11_max: 79.000
> temp11_crit: 89.000
> Core 10:
> temp12_input: 45.000
> temp12_max: 79.000
> temp12_crit: 89.000
>
> coretemp-isa-0001
> Adapter: ISA adapter
> Core 0:
> temp2_input: 45.000
> temp2_max: 79.000
> temp2_crit: 89.000
> temp2_crit_alarm: 0.000
> Core 1:
> temp3_input: 43.000
> temp3_max: 79.000
> temp3_crit: 89.000
> temp3_crit_alarm: 0.000
> Core 2:
> temp4_input: 48.000
> temp4_max: 79.000
> temp4_crit: 89.000
> temp4_crit_alarm: 0.000
> Core 8:
> temp10_input: 40.000
> temp10_max: 79.000
> temp10_crit: 89.000
> Core 9:
> temp11_input: 41.000
> temp11_max: 79.000
> temp11_crit: 89.000
> Core 10:
> temp12_input: 40.000
> temp12_max: 79.000
> temp12_crit: 89.000
>
> jc42-i2c-8-18
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 50.500
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 78.250
> temp1_crit_hyst: 75.250
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
This shows that the maximum temperature is not configured, which also results in
the negative hysteresis temperature. Not necessarily a concern, though it is interesting
that there is no max_alarm. Maybe maximum temperature detection is disabled if temp1_max
is set to 0.
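The negative hysteresis value follows from simple arithmetic on the listing above. This is a sketch under the assumption, inferred from the numbers rather than from the driver source, that one hysteresis offset is shared by the max and crit limits; the helper names are hypothetical.

```python
def hyst_offset(limit, limit_hyst):
    """Hysteresis offset implied by a limit and its reported hysteresis value."""
    return limit - limit_hyst


def expected_hyst(limit, offset):
    """Hysteresis value expected for a limit, given the shared offset."""
    return limit - offset


# Values taken from the jc42-i2c-8-18 listing above.
temp1_input = 50.5
temp1_max = 0.0
temp1_crit = 78.25
temp1_crit_hyst = 75.25

offset = hyst_offset(temp1_crit, temp1_crit_hyst)
print("offset:", offset)
print("expected temp1_max_hyst:", expected_hyst(temp1_max, offset))
print("headroom to temp1_crit:", temp1_crit - temp1_input)
```

With temp1_max left at 0, the same 3 degree offset yields the reported temp1_max_hyst of -3, so the negative value is a side effect of the unset limit rather than a separate fault.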
What sensor chip does sensors-detect report ? Maybe I can find some information
about this in the chip datasheet(s).
Other than that, the RAM on this system is running a bit hot. It is interesting that
it is warmer than the CPUs. Does the RAM temperature ever get close to the critical
temperature ?
Another thing to check might be the critical DRAM temperatures on raphson. It seems
like you have several types of DRAM in the systems with different critical temperatures,
and the maximum temperature is sometimes set and sometimes not.
Unfortunately, you did not include the DRAM sensor output from raphson, which would be
the most important to look at. Can you provide that information ? Just unload the driver
after you obtained the data; that should prevent any reboots.
Another question regarding the reboots: When this happened, did you have any code
running which is accessing the temperature sensors ? If so, do you have a log
of those temperatures at the time the system was rebooting ?
Also, do you by any chance see anything in syslog after the reboot showing
a reboot reason ?
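If such a log exists, the implausible readings (the spikes to 0 and 250 mentioned in the original report) can be picked out automatically. A minimal sketch, assuming the log is a list of temperatures in degrees Celsius in time order; the function name `find_spikes` and the 20 degree step threshold are hypothetical choices, and the first sample is assumed to be plausible.

```python
def find_spikes(samples, max_step=20.0):
    """Return indices of samples that jump more than max_step degrees
    away from the last sample that was accepted as plausible."""
    spikes = []
    last_good = None
    for index, value in enumerate(samples):
        if last_good is not None and abs(value - last_good) > max_step:
            spikes.append(index)
            continue  # do not let a spike become the new reference
        last_good = value
    return spikes


# Illustrative data only, not taken from the affected machines.
log = [50.5, 51.0, 0.0, 51.0, 250.0, 52.0]
print(find_spikes(log))
```

Comparing the timestamps of the flagged samples with the reboot times would show whether a bad reading precedes each shutdown.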
Thanks,
Guenter
> jc42-i2c-8-19
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 48.500
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 78.500
> temp1_crit_hyst: 75.500
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
> jc42-i2c-8-1a
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 51.000
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 78.500
> temp1_crit_hyst: 75.500
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
> jc42-i2c-8-1b
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 50.500
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 78.500
> temp1_crit_hyst: 75.500
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
> jc42-i2c-8-1c
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 51.000
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 78.500
> temp1_crit_hyst: 75.500
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
> jc42-i2c-8-1d
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 52.000
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 78.250
> temp1_crit_hyst: 75.250
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
>
>
> ====================
> OUTPUT OF sensors -u AT KALMAN WORKSTATION
> =====================
>
> olavo@kalman:~$ sensors -u
> nouveau-pci-0300
> Adapter: PCI adapter
> temp1:
> temp1_input: 41.000
> temp1_max: 100.000
> temp1_crit: 110.000
>
> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0:
> temp2_input: 41.000
> temp2_max: 79.000
> temp2_crit: 89.000
> temp2_crit_alarm: 0.000
> Core 1:
> temp3_input: 36.000
> temp3_max: 79.000
> temp3_crit: 89.000
> temp3_crit_alarm: 0.000
> Core 2:
> temp4_input: 35.000
> temp4_max: 79.000
> temp4_crit: 89.000
> temp4_crit_alarm: 0.000
> Core 8:
> temp10_input: 38.000
> temp10_max: 79.000
> temp10_crit: 89.000
> Core 9:
> temp11_input: 39.000
> temp11_max: 79.000
> temp11_crit: 89.000
> Core 10:
> temp12_input: 41.000
> temp12_max: 79.000
> temp12_crit: 89.000
>
> coretemp-isa-0001
> Adapter: ISA adapter
> Core 0:
> temp2_input: 43.000
> temp2_max: 79.000
> temp2_crit: 89.000
> temp2_crit_alarm: 0.000
> Core 1:
> temp3_input: 42.000
> temp3_max: 79.000
> temp3_crit: 89.000
> temp3_crit_alarm: 0.000
> Core 2:
> temp4_input: 41.000
> temp4_max: 79.000
> temp4_crit: 89.000
> temp4_crit_alarm: 0.000
> Core 8:
> temp10_input: 40.000
> temp10_max: 79.000
> temp10_crit: 89.000
> Core 9:
> temp11_input: 41.000
> temp11_max: 79.000
> temp11_crit: 89.000
> Core 10:
> temp12_input: 39.000
> temp12_max: 79.000
> temp12_crit: 89.000
>
> jc42-i2c-6-18
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 51.875
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 74.000
> temp1_crit_hyst: 71.000
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
> jc42-i2c-6-1a
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 48.875
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 74.000
> temp1_crit_hyst: 71.000
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
> jc42-i2c-6-1c
> Adapter: SMBus I801 adapter at 3000
> temp1:
> temp1_input: 48.750
> temp1_max: 0.000
> temp1_max_hyst: -3.000
> temp1_min: 0.000
> temp1_crit: 74.000
> temp1_crit_hyst: 71.000
> temp1_max_alarm: 0.000
> temp1_min_alarm: 0.000
> temp1_crit_alarm: 0.000
>
>
> ====================
> $ less /etc/default/grub
>
> # If you change this file, run 'update-grub' afterwards to update
> # /boot/grub/grub.cfg.
> # For full documentation of the options in this file, see:
> # info -f grub -n 'Simple configuration'
>
> GRUB_DEFAULT=0
> GRUB_HIDDEN_TIMEOUT=0
> GRUB_HIDDEN_TIMEOUT_QUIET=true
> GRUB_TIMEOUT=10
> GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
> GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
> GRUB_CMDLINE_LINUX=""
>
> # Uncomment to enable BadRAM filtering, modify to suit your needs
> # This works with Linux (no patch required) and with any kernel that obtains
> # the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
> #GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"
>
> # Uncomment to disable graphical terminal (grub-pc only)
> #GRUB_TERMINAL=console
>
> # The resolution used on graphical terminal
> # note that you can use only modes which your graphic card supports via VBE
> # you can see them in real GRUB with the command `vbeinfo'
> #GRUB_GFXMODE=640x480
>
> # Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
> #GRUB_DISABLE_LINUX_UUID=true
>
> # Uncomment to disable generation of recovery mode menu entries
> #GRUB_DISABLE_RECOVERY="true"
>
> # Uncomment to get a beep at grub start
> #GRUB_INIT_TUNE="480 440 1"
>
>
>
>
> =================
> olavo@raphson:~$ dmesg | grep rror
>
> [ 2.792363] ACPI Error: Field [CPB3] at 96 exceeds Buffer [NULL] size 64 (bits) (20110623/dsopcode-236)
> [ 2.792369] ACPI Error: Method parse/execution failed [\_SB_._OSC] (Node ffff880a30462ed8), AE_AML_BUFFER_LIMIT (20110623/psparse-536)
> [ 2.924590] ERST: Error Record Serialization Table (ERST) support is initialized.
> [ 12.864478] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro
> [ 4275.337083] indicator-weath[2575]: segfault at 0 ip 00007f80e1f72bf1 sp 00007fff82e216c8 error 4 in libc-2.15.so[7f80e1e10000+1b5000]
>
> ==================
> olavo@gauss:~$ dmesg | grep rror
>
> [ 2.791955] ACPI Error: Field [CPB3] at 96 exceeds Buffer [NULL] size 64 (bits) (20110623/dsopcode-236)
> [ 2.791961] ACPI Error: Method parse/execution failed [\_SB_._OSC] (Node ffff880648462eb0), AE_AML_BUFFER_LIMIT (20110623/psparse-536)
> [ 2.928526] ERST: Error Record Serialization Table (ERST) support is initialized.
> [ 19.083332] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro
> [25194.306241] nr[14648]: segfault at 20 ip 0000000000416d26 sp 00007fffdaf03a20 error 4 in nr[400000+1627000]
>
>
>
> =================
> olavo@kalman:~$ dmesg | grep rror
>
> [ 2.773220] ACPI Error: Field [CPB3] at 96 exceeds Buffer [NULL] size 64 (bits) (20110623/dsopcode-236)
> [ 2.773225] ACPI Error: Method parse/execution failed [\_SB_._OSC] (Node ffff880647866eb0), AE_AML_BUFFER_LIMIT (20110623/psparse-536)
> [ 2.897234] ERST: Error Record Serialization Table (ERST) support is initialized.
> [ 8.774739] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro
>
>
>
* Re: [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42)
2013-09-27 15:57 [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42) Olavo Luppi Silva
2013-09-27 16:32 ` Guenter Roeck
2013-09-28 0:06 ` Guenter Roeck
@ 2013-09-28 9:16 ` Jean Delvare
2013-10-03 22:51 ` Olavo Luppi Silva
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Jean Delvare @ 2013-09-28 9:16 UTC (permalink / raw)
To: lm-sensors
Hi Olavo,
On Fri, 27 Sep 2013 12:57:07 -0300, Olavo Luppi Silva wrote:
> SHORT STORY:
> The workstation suddenly shuts down, usually when performing intensive
> computation. Workaround: comment line jc42 at /etc/modules apparently
> solves the problem.
On such workstations it is common to have IPMI, possibly even with a
BMC for remote management. In that case it is possible that IPMI and
lm-sensors conflict, as they try to access the same hardware without
synchronization. This would explain the SMBus errors.
If you (or the BMC) are using IPMI, you can't use lm-sensors. Use
"ipmitool sensor" (or a similar command for other IPMI tools) instead.
--
Jean Delvare
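The "ipmitool sensor" route suggested above can be scripted for logging. The following is a rough sketch, assuming the usual pipe-separated layout of `ipmitool sensor` output (name, reading, unit, status, then thresholds); the column layout, the sensor names in the sample, and the function names are assumptions, so check them against the output of the actual board.

```python
import subprocess


def parse_ipmitool_sensor(text):
    """Extract temperature readings (degrees C) from `ipmitool sensor` text.

    Lines with an unreadable value (e.g. "na") or another unit are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        name, value, unit = fields[0], fields[1], fields[2]
        if unit != "degrees C":
            continue
        try:
            readings[name] = float(value)
        except ValueError:
            continue
    return readings


def read_ipmitool():
    """Run `ipmitool sensor` (usually needs root) and parse its output."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    )
    return parse_ipmitool_sensor(result.stdout)


# Hypothetical sample output, for illustration only.
sample = (
    "DIMM Temp 1      | 50.000     | degrees C  | ok    | na | na | na | na | na | na\n"
    "DIMM Temp 2      | na         | degrees C  | na    | na | na | na | na | na | na\n"
    "BB +12.0V        | 12.096     | Volts      | ok    | na | na | na | na | na | na\n"
)
print(parse_ipmitool_sensor(sample))
```

Reading the sensors this way goes through the BMC, so it avoids a second, unsynchronized master on the SMBus.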
* Re: [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42)
2013-09-27 15:57 [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42) Olavo Luppi Silva
` (2 preceding siblings ...)
2013-09-28 9:16 ` Jean Delvare
@ 2013-10-03 22:51 ` Olavo Luppi Silva
2013-10-03 23:01 ` Guenter Roeck
2013-10-07 22:18 ` Guenter Roeck
5 siblings, 0 replies; 7+ messages in thread
From: Olavo Luppi Silva @ 2013-10-03 22:51 UTC (permalink / raw)
To: lm-sensors
2013/9/28 Jean Delvare <khali@linux-fr.org>
> Hi Olavo,
>
> On Fri, 27 Sep 2013 12:57:07 -0300, Olavo Luppi Silva wrote:
> > SHORT STORY:
> > The workstation suddenly shuts down, usually when performing intensive
> > computation. Workaround: comment line jc42 at /etc/modules apparently
> > solves the problem.
>
> On such workstations it is common to have IPMI, possibly even with a
> BMC for remote management. In that case it is possible that IPMI and
> lm-sensors conflict, as they try to access the same hardware without
> synchronization. This would explain the SMBus errors.
>
> If you (or the BMC) are using IPMI, you can't use lm-sensors. Use
> "ipmitool sensor" (or a similar command for other IPMI tools) instead.
>
Hi Jean,
Thanks for replying.
Sorry for my ignorance, but I have just googled BMC and IPMI and
discovered that they stand for Baseboard Management Controller and Intelligent
Platform Management Interface. :-). I am the administrator of the Raphson
workstation and other PhD students are the administrators of the Gauss and
Kalman workstations.
So I guess we don't have a BMC or IPMI installed on our systems.
Thanks
Olavo
--
> Jean Delvare
>
* Re: [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42)
2013-09-27 15:57 [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42) Olavo Luppi Silva
` (3 preceding siblings ...)
2013-10-03 22:51 ` Olavo Luppi Silva
@ 2013-10-03 23:01 ` Guenter Roeck
2013-10-07 22:18 ` Guenter Roeck
5 siblings, 0 replies; 7+ messages in thread
From: Guenter Roeck @ 2013-10-03 23:01 UTC (permalink / raw)
To: lm-sensors
On 10/03/2013 03:51 PM, Olavo Luppi Silva wrote:
> 2013/9/28 Jean Delvare <khali@linux-fr.org>
>
>> Hi Olavo,
>>
>> On Fri, 27 Sep 2013 12:57:07 -0300, Olavo Luppi Silva wrote:
>>> SHORT STORY:
>>> The workstation suddenly shuts down, usually when performing intensive
>>> computation. Workaround: comment line jc42 at /etc/modules apparently
>>> solves the problem.
>>
>> On such workstations it is common to have IPMI, possibly even with a
>> BMC for remote management. In that case it is possible that IPMI and
>> lm-sensors conflict, as they try to access the same hardware without
>> synchronization. This would explain the SMBus errors.
>>
>> If you (or the BMC) are using IPMI, you can't use lm-sensors. Use
>> "ipmitool sensor" (or a similar command for other IPMI tools) instead.
>>
>> Hi Jean,
> Thanks for replying.
> Sorry for my ignorance, but I have just googled BMC and IPMI and
> discovered that they stand for Baseboard Management Controller and Intelligent
> Platform Management Interface. :-). I am the administrator of the Raphson
> workstation and other PhD students are the administrators of the Gauss and
> Kalman workstations.
> So I guess we don't have a BMC or IPMI installed on our systems.
>
>
Nothing you can install; it is either there or it isn't. This is a board feature,
not a software feature.
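The board specification is the authoritative answer, but a quick software-side hint is whether the kernel exposes an IPMI device node. This is a sketch only: the list of paths reflects common Linux device node names and is an assumption, the function name is hypothetical, and a missing node merely means that no IPMI driver is loaded, not that the board lacks a BMC.

```python
import os

# Device node names commonly used by the Linux IPMI driver (assumed list).
IPMI_DEVICE_NODES = ("/dev/ipmi0", "/dev/ipmi/0", "/dev/ipmidev/0")


def ipmi_device_nodes(exists=os.path.exists):
    """Return the IPMI device nodes that are present.

    `exists` is injectable so the check can be exercised without hardware.
    """
    return [path for path in IPMI_DEVICE_NODES if exists(path)]


print(ipmi_device_nodes())
```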
Guenter
* Re: [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42)
2013-09-27 15:57 [lm-sensors] Sudden shutdown and wrong temperature reading (driver jc42) Olavo Luppi Silva
` (4 preceding siblings ...)
2013-10-03 23:01 ` Guenter Roeck
@ 2013-10-07 22:18 ` Guenter Roeck
5 siblings, 0 replies; 7+ messages in thread
From: Guenter Roeck @ 2013-10-07 22:18 UTC (permalink / raw)
To: lm-sensors
On 10/07/2013 02:27 PM, Olavo Luppi Silva wrote:
>
>
>
> 2013/10/3 Guenter Roeck <linux@roeck-us.net>
>
> On 10/03/2013 03:41 PM, Olavo Luppi Silva wrote:
>
>
>
>
> 2013/9/27 Guenter Roeck <linux@roeck-us.net>
>
> On 09/27/2013 01:38 PM, Olavo Luppi Silva wrote:
>
> Hi Guenter,
> Thanks for replying.
> I didn't configure acpi_enforce_resources=lax in my boot command line. I just followed these steps to install lm-sensors:
>
> Hi,
>
> please don't top-post, and please don't drop the mailing list from your replies.
>
>
> Hi Guenter, sorry for that. I was not aware of replying style of this mailing list and I clicked 'reply' instead of "reply all".
>
>
> you would not see an error, but something like
>
> ACPI Warning: 0x000000000000f040-0x000000000000f05f SystemIO conflicts with Region \_SB_.PCI0.SBUS.SMBI 1 (20130517/utaddress-251)
> ACPI: This conflict may cause random problems and system instability
> ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
>
> Unfortunately workstation Raphson died this week. It was running a long process using the MKL and ATLAS math libraries, and when I got to the office in the morning it was shut down. Pushing the power button: nothing happens. Unplugging and replugging the power cable: the fans work for a few seconds and then stop. It was taken to technical assistance.
>
> Let's assume you don't see that. Next question is if your system supports IPMI.
> If it does, there is a slight chance that the IPMI controller accesses the SMBus,
> causing an access conflict.
>
> IPMI is an Intelligent Platform Management Interface, right? How can I check if my system supports IPMI? Our workstations are using Ubuntu and Kubuntu 12.04 LTS. I don't remember installing such an interface.
>
>
>
> You'll find that information in the board specification. The output from sensors-detect below
> also shows you that the board supports IPMI.
>
> Furthermore, the Intel server board specification states that IPMI monitors the temperature
> and voltage sensors on the board. So if Raphson uses the same board, the most likely explanation
> for your problem is that IPMI and the jc42 driver try to access the DIMM temperature sensors
> at the same time. This would explain both the read errors and the occasional resets (if IPMI
> resets the board when it happens to read a bad/high temperature).
>
> Guenter
>
>
> Thank you for your clarification, Guenter.
> Kalman and Gauss have the same motherboard as Raphson. I'll uninstall lm-sensors and use ipmitool to monitor temperature instead, as Jean Delvare pointed out.
>
> Do you think that the conflict between IPMI and lm-sensors could also explain the last Raphson's failure? It worked for about one month at full load and suddenly shut down and never turned on again.
>
That is really quite unlikely.
Of course, it is theoretically possible that this could happen, for example
if there is a badly managed power controller chip involved. I don't think that
is the case here, though. After all, we are talking about access to temperature
sensor chips, not power controllers.
Guenter