* MCE error log
From: Zdenek Kabelac @ 2009-01-06 13:00 UTC
To: Linux Kernel Mailing List

Hi

I've noticed mcelog with weird content:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 57976afd
STATUS 88380100 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 53e61034
STATUS 88370100 MCGSTATUS 0

I'm running T61 - 2GB - in this directory
/sys/devices/system/machinecheck/machinecheck1
I could only see bank0ctl ... bank5ctl - so where is bank 128 ?
(as there are no time stamps, I could hardly guess how often this happens)

Is it kernel bug or chipset bug ?
Should I start to worry about the stability of my machine ?

Zdenek

* Re: MCE error log
From: Valdis.Kletnieks @ 2009-01-06 18:42 UTC
To: Zdenek Kabelac; +Cc: Linux Kernel Mailing List

On Tue, 06 Jan 2009 14:00:03 +0100, Zdenek Kabelac said:

> CPU 1 BANK 128 TSC 57976afd

> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?

I've had bank 128 reported before. Turned out it was for thermal events caused
by dust bunnies clogging a cooling vent. I never did find an official
statement that's what 128 is for, but I did find a bunch of hints....

What does lm_sensors say the CPU temp is sitting at?
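
A minimal sketch of the point made in this reply, assuming (as reported here, without an official reference) that bank 128 is a pseudo-bank used for thermal events and is therefore not one of the hardware banks listed as bank0ctl ... bank5ctl. The constant THERMAL_BANK and the helper classify() are illustrative names, not part of mcelog.

```python
import re

# Assumption from this thread: bank 128 marks a thermal event rather than
# one of the hardware banks visible under sysfs.
THERMAL_BANK = 128

# Matches the undecoded record line, e.g. "CPU 1 BANK 128 TSC 57976afd".
RECORD_RE = re.compile(r"CPU (\d+) BANK (\d+) TSC ([0-9a-f]+)")

def classify(line):
    """Return (cpu, bank, kind) for a 'CPU n BANK m TSC x' line, else None."""
    m = RECORD_RE.search(line)
    if m is None:
        return None
    cpu, bank = int(m.group(1)), int(m.group(2))
    kind = "thermal event" if bank == THERMAL_BANK else "hardware bank"
    return cpu, bank, kind

print(classify("CPU 1 BANK 128 TSC 57976afd"))
```

Run against the record from the original report, this labels the entry as a thermal event, which matches what a newer mcelog prints later in the thread.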

* Re: MCE error log
From: Vegard Nossum @ 2009-01-25 23:51 UTC
To: Valdis.Kletnieks
Cc: Zdenek Kabelac, Linux Kernel Mailing List, Maciej W. Rozycki

On Tue, Jan 6, 2009 at 7:42 PM, <Valdis.Kletnieks@vt.edu> wrote:
> On Tue, 06 Jan 2009 14:00:03 +0100, Zdenek Kabelac said:
>
>> CPU 1 BANK 128 TSC 57976afd
>
>> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?
>
> I've had bank 128 reported before. Turned out it was for thermal events caused
> by dust bunnies clogging a cooling vent. I never did find an official
> statement that's what 128 is for, but I did find a bunch of hints....
>
> What does lm_sensors say the CPU temp is sitting at?

I get this also:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC dc963a087
Processor core below trip temperature. Throttling disabled
STATUS 882c0100 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 THERMAL EVENT TSC dc970c0d0
Processor core below trip temperature. Throttling disabled
STATUS 882d0200 MCGSTATUS 0

and in the kernel log:

Machine check events logged
CPU0: Temperature/speed normal
CPU1: Temperature/speed normal

This has been happening since I installed an x86_64 kernel instead of a
32-bit one. Maybe this explains those weird (never fatal) APIC errors I always
used to get before (error 40, invalid vector received AFAIR)? In any
case, the APIC errors are not to be seen now, and the frequency of the
MCEs is about that of the APIC errors. What I can say is that they
seem to appear sooner when there are a lot of interrupts, e.g. disk or
network activity. What is the correlation?

Temperature seems completely normal whenever it happens:

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +58.0°C  (high = +100.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:      +59.0°C  (high = +100.0°C, crit = +100.0°C)

Anyway, the system works fine, so it's not much to worry about. But I am
curious...

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
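
One way to cross-check the sensors reading against the logged STATUS values, offered as a sketch only. It rests on assumptions that nobody in this thread confirms: that for thermal events STATUS holds the raw IA32_THERM_STATUS MSR value, that bits 22:16 of that value are a digital readout in degrees below the trip temperature, that bit 31 marks the readout as valid, and that the trip temperature equals the 100°C "high" limit printed by sensors. TRIP_C and degrees_below_trip() are illustrative names.

```python
TRIP_C = 100  # assumption: the "high" limit reported by sensors

def degrees_below_trip(status):
    """Return the digital readout (bits 22:16), or None if bit 31 is clear."""
    if not (status >> 31) & 1:
        return None  # readout not flagged as valid
    return (status >> 16) & 0x7F

# STATUS values copied from the two records in this message.
for status in (0x882C0100, 0x882D0200):
    below = degrees_below_trip(status)
    print(hex(status), "->", below, "below trip, about", TRIP_C - below, "C")
```

Under these assumptions the two records work out to roughly 56°C and 55°C, close to the 58-59°C that sensors shows, which would fit the observation that nothing is actually overheating when the events are logged.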

* Re: MCE error log
From: Andi Kleen @ 2009-01-06 18:49 UTC
To: Zdenek Kabelac; +Cc: Linux Kernel Mailing List

"Zdenek Kabelac" <zdenek.kabelac@gmail.com> writes:

> /sys/devices/system/machinecheck/machinecheck1
> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?

Update your mcelog. Newer versions decode it.

-Andi

--
ak@linux.intel.com

* Re: MCE error log
From: Zdenek Kabelac @ 2009-01-07 17:44 UTC
To: Andi Kleen; +Cc: Linux Kernel Mailing List

2009/1/6 Andi Kleen <andi@firstfloor.org>:
> "Zdenek Kabelac" <zdenek.kabelac@gmail.com> writes:
>
>> /sys/devices/system/machinecheck/machinecheck1
>> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?
>
> Update your mcelog. Newer versions decode it.

OK, I've replaced the binary with your latest code.

Here is the new trace:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 54ab7bf6
Processor core below trip temperature. Throttling disabled
STATUS 88380100 MCGSTATUS 0

So what does this message mean now?

Zdenek
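
A sketch of how the decoded text could relate to the STATUS value. It assumes that for a thermal event STATUS is the raw thermal status register value and that bit 0 is the "currently above trip temperature" flag; neither point is stated in this thread. The text for the clear case is copied from the trace above, while the text for the set case is a guess at the wording, and describe_thermal() is an illustrative name.

```python
def describe_thermal(status):
    """Map bit 0 of a thermal-event STATUS value to a human-readable line."""
    # Assumption: bit 0 set means the core is above its trip temperature now.
    if status & 1:
        return "Processor core above trip temperature. Throttling enabled"  # wording assumed
    return "Processor core below trip temperature. Throttling disabled"

# STATUS value from the trace above.
print(describe_thermal(0x88380100))
```

Read this way, the record would describe a moment when the core is not being throttled, that is, a return to normal rather than an ongoing fault.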