* Machine check exception? (2.4.6+SMP+VIA)
From: Vibol Hou @ 2001-07-07 20:32 UTC
To: Linux-Kernel

Hi,

I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
Chipset / MSI-6321 M/B + ) and the following message popped up, after which
the system hardlocked (no SysRQ input). What does this message mean?

CPU 1: Machine Check Exception: 0000000000000004
Bank 1: b200000000000115
Kernel panic: CPU context corrupt

Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: CPU 1: Machine Check Exception: 0000000000000004

Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: Bank 1: b200000000000115

Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: Kernel panic: CPU context corrupt

--
Vibol Hou
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: H. Peter Anvin @ 2001-07-07 20:54 UTC
To: linux-kernel

Followup to: <HDEBKHLDKIDOBMHPKDDKEEAIEMAA.vhou@khmer.cc>
By author: "Vibol Hou" <vhou@khmer.cc>
In newsgroup: linux.dev.kernel
>
> Hi,
>
> I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
> Chipset / MSI-6321 M/B + ) and the following message popped up after which
> the system hardlocked (no SysRQ input). What does this message mean?
>
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt
>
> Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
> delta kernel: CPU 1: Machine Check Exception: 0000000000000004
>
> Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
> delta kernel: Bank 1: b200000000000115
>
> Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
> delta kernel: Kernel panic: CPU context corrupt
>

It means your hardware is bad.

	-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Alan Cox @ 2001-07-07 21:41 UTC
To: Vibol Hou; +Cc: Linux-Kernel

> I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
> Chipset / MSI-6321 M/B + ) and the following message popped up after which
> the system hardlocked (no SysRQ input). What does this message mean?
>
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt

It means your processor flagged a fault. The b2....115 number decodes to info
about the fault cause if you grab the PIII manual. Stupid things like
overheating or wrong voltages can also trigger it.
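For readers without the manual to hand, the two values in the panic can be split into their flag bits roughly as follows. This sketch is not from the thread. It assumes the bit positions Intel documents for the machine-check global status register (RIPV, EIPV, MCIP in bits 0 to 2) and for a bank status register (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57, with the MCA error code in bits 15:0), and the function names are made up for illustration.

```python
# Flag bit positions, assumed from Intel's machine-check architecture docs.
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
BANK_STATUS_BITS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the bank's MISC register holds extra data
    58: "ADDRV",  # the bank's ADDR register holds an address
    57: "PCC",    # processor context corrupt
}


def set_flags(value, bit_names):
    """Return the names of all set bits, highest bit first."""
    return [name
            for bit, name in sorted(bit_names.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mcg_status(value):
    """Decode the value printed after 'Machine Check Exception:'."""
    return set_flags(value, MCG_STATUS_BITS)


def decode_bank_status(value):
    """Split the value printed after 'Bank N:' into flags and error codes."""
    return {
        "flags": set_flags(value, BANK_STATUS_BITS),
        "model_specific_code": (value >> 16) & 0xFFFF,
        "mca_error_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_mcg_status(0x0000000000000004))
    print(decode_bank_status(0xB200000000000115))
```

For the values reported above, the global status has only MCIP (machine check in progress) set, and the bank status has VAL, UC, EN and PCC set with MCA error code 0x0115. PCC is the "processor context corrupt" flag, which lines up with the text of the panic.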
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Chris Wedgwood @ 2001-07-08 7:28 UTC
To: Alan Cox; +Cc: Vibol Hou, Linux-Kernel

On Sat, Jul 07, 2001 at 10:41:23PM +0100, Alan Cox wrote:

    It means your processor flagged a fault. The b2....115 number
    decodes to info about the fault cause if you grab the PIII
    manual. Stupid things like overheating or wrong voltages can
    also trigger it.

Is there any reason why, with proper MCE checking for both K7 and PIII,
we can't automatically off-line processors when they start doing bad
things?

Sure, it's a pretty lousy thing to do, but if it buys you a few minutes
and allows userland to initiate some kind of remedy (pager("HELP");
system("shutdown"); sort of thing)...

Also, I'm pretty sure I was seeing overheating problems or something
on a K7 at one point, but never saw MCE; I take it this code only
exists fully in -ac kernels? I looked in Linus' tree and couldn't see
anything.

  --cw
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Alan Cox @ 2001-07-08 14:00 UTC
To: Chris Wedgwood; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

> Is there any reason why, with proper MCE checking for both K7 and PIII
> we can't automatically off-line processors when they start doing bad
> things?

Architectural limitations. It's entirely possible that the cache of the dying
processor contains exclusive copies of arbitrary data.

> Also, I'm pretty sure I was seeing overheating problems or something
> on a K7 at one point, but never saw MCE; I take it this code only
> exists fully in -ac kernels? I looked in Linus' tree and couldn't see
> anything.

Only -ac has K7 MCE enabled right now. Also, MCE is not guaranteed to catch
problems.
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Dave Jones @ 2001-07-08 15:33 UTC
To: Alan Cox; +Cc: Chris Wedgwood, Vibol Hou, Linux-Kernel

On Sun, 8 Jul 2001, Alan Cox wrote:

> Only -ac has K7 MCE enabled right now - also MCE is not guaranteed to catch
> problems.

Actually you merged that with Linus a few revisions back iirc.

regards,

Dave.

--
| Dave Jones. http://www.suse.de/~davej
| SuSE Labs
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Chris Wedgwood @ 2001-07-08 17:04 UTC
To: Dave Jones; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

On Sun, Jul 08, 2001 at 05:33:59PM +0200, Dave Jones wrote:

    Actually you merged that with Linus a few revisions back iirc.

I don't see it for K7/AMD:

  cw:tty5@tapu(kernel)$ pwd
  /home/cw/wk/linux/linux-2.4.7-pre2+O_DIRECT/arch/i386/kernel

  cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
  static void intel_machine_check(struct pt_regs * regs, long error_code)
  static void pentium_machine_check(struct pt_regs * regs, long error_code)
  static void winchip_machine_check(struct pt_regs * regs, long error_code)
  static void unexpected_machine_check(struct pt_regs * regs, long error_code)
  void do_machine_check(struct pt_regs * regs, long error_code)

  --cw
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Dave Jones @ 2001-07-08 17:09 UTC
To: Chris Wedgwood; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

On Mon, 9 Jul 2001, Chris Wedgwood wrote:

> Actually you merged that with Linus a few revisions back iirc.
> I don't see it for K7/AMD:
>
> cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
> static void intel_machine_check(struct pt_regs * regs, long error_code)

There is no K7-specific implementation. It's the same as the Intel MSRs.

From the comment in the file:

	case X86_VENDOR_AMD:
		/*
		 *	AMD K7 machine check is Intel like
		 */
		if(c->x86 == 6)
			intel_mcheck_init(c);
		break;

regards,

Dave.

--
| Dave Jones. http://www.suse.de/~davej
| SuSE Labs
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: Chris Wedgwood @ 2001-07-08 17:18 UTC
To: Dave Jones; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

On Sun, Jul 08, 2001 at 07:09:11PM +0200, Dave Jones wrote:

    There is no K7 specific implementation. It's the same as the
    Intel MSRs.

Ah thanks, missed that.

  --cw
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: H. Peter Anvin @ 2001-07-08 20:39 UTC
To: linux-kernel

Followup to: <Pine.LNX.4.30.0107081907440.28660-100000@Appserv.suse.de>
By author: Dave Jones <davej@suse.de>
In newsgroup: linux.dev.kernel
>
> On Mon, 9 Jul 2001, Chris Wedgwood wrote:
>
> > Actually you merged that with Linus a few revisions back iirc.
> > I don't see it for K7/AMD:
> >
> > cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
> > static void intel_machine_check(struct pt_regs * regs, long error_code)
>
> There is no K7 specific implementation. It's the same as the Intel MSRs.
>
> From the comment in the file:
>
> 	case X86_VENDOR_AMD:
> 		/*
> 		 *	AMD K7 machine check is Intel like
> 		 */
> 		if(c->x86 == 6)
> 			intel_mcheck_init(c);
> 		break;
>

Note that I released a patch to make bluesmoke a lot more generic
quite a while ago. Linus was in the "I don't want to even hear about
anything but critical bugfixes" mode at that point, so it didn't get
integrated. If anyone is interested, it is at:

http://www.kernel.org/pub/linux/kernel/people/hpa/bluesmoke-2.4.0-test11-pre5-3.diff.gz

Let me know if you want me to bring it forward.

	-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* RE: Machine check exception? (2.4.6+SMP+VIA)
From: Vibol Hou @ 2001-07-08 17:32 UTC
To: Chris Wedgwood, Alan Cox; +Cc: Vibol Hou, Linux-Kernel

Hrm,

First off, thanks for the direction Alan, Peter, and Chris.

So I've flipped through the Intel docs and read up on the MCA for P2/3
processors. I've decoded the info from the bank's MCi_STATUS register that
was given to me in the

Bank 1: b200000000000115

line. The 0115 MCA code indicates a DCACHEL1_RD error, so it seems the L1
data cache is bad, though this does not seem to be heat-related since
lm_sensors reports similar temperatures for both CPUs (within 0.3 degrees
Celsius of each other, at about 30 degrees C).

That probably explains why the system hardlocked quickly each time there was
heavy I/O and processing with SMP mode enabled and the full 1GB of memory
installed. Only after removing one of the memory sticks did the system begin
spitting out Oopses and MCEs.

I also wonder, however, if this could be due to the 2nd processor not
getting enough voltage. I don't know the S-SPEC of the processor, but I
think it's the same as the 1st. However, the voltage reading for CPU 2 is
0.05 V lower, at 1.65 V. Any processor gurus here?

Thanks,
Vibol

-----Original Message-----
From: Chris Wedgwood [mailto:cw@f00f.org]
Sent: Sunday, July 08, 2001 12:28 AM
To: Alan Cox
Cc: Vibol Hou; Linux-Kernel
Subject: Re: Machine check exception? (2.4.6+SMP+VIA)

On Sat, Jul 07, 2001 at 10:41:23PM +0100, Alan Cox wrote:

    It means your processor flagged a fault. The b2....115 number
    decodes to info about the fault cause if you grab the PIII
    manual. Stupid things like overheating or wrong voltages can
    also trigger it.

Is there any reason why, with proper MCE checking for both K7 and PIII,
we can't automatically off-line processors when they start doing bad
things?

Sure, it's a pretty lousy thing to do, but if it buys you a few minutes
and allows userland to initiate some kind of remedy (pager("HELP");
system("shutdown"); sort of thing)...

Also, I'm pretty sure I was seeing overheating problems or something
on a K7 at one point, but never saw MCE; I take it this code only
exists fully in -ac kernels? I looked in Linus' tree and couldn't see
anything.

  --cw
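The 0115 decoding Vibol describes can be reproduced with a few shifts and masks. The sketch below is an illustration rather than anything posted to the list. It assumes Intel's documented "memory hierarchy" compound error format, 0000 0001 RRRR TTLL, along with the usual TT, LL and RRRR mnemonic tables. The function name is hypothetical, and other error-code formats (TLB, bus and interconnect, simple codes) are deliberately left undecoded.

```python
# Sub-field mnemonics, assumed from Intel's compound error code tables.
TRANSACTION_TYPE = {0b00: "I", 0b01: "D", 0b10: "G"}           # TT
MEMORY_LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}  # LL
REQUEST = {                                                      # RRRR
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR",
    0b0011: "DRD", 0b0100: "DWR", 0b0101: "IRD",
    0b0110: "PREFETCH", 0b0111: "EVICT", 0b1000: "SNOOP",
}


def decode_memory_hierarchy_error(status):
    """Return a mnemonic such as 'DCACHEL1_RD_ERR', or None.

    Accepts either the full 64-bit bank status or just its low 16 bits.
    None means the code is not in the memory hierarchy format (or uses
    a reserved sub-field), so this sketch does not attempt to decode it.
    """
    code = status & 0xFFFF
    if (code >> 8) != 0b00000001:
        return None
    request = (code >> 4) & 0xF
    transaction = (code >> 2) & 0x3
    level = code & 0x3
    if request not in REQUEST or transaction not in TRANSACTION_TYPE:
        return None
    return "{}CACHE{}_{}_ERR".format(
        TRANSACTION_TYPE[transaction], MEMORY_LEVEL[level], REQUEST[request])


if __name__ == "__main__":
    print(decode_memory_hierarchy_error(0xB200000000000115))
```

Fed the reported value, this yields DCACHEL1_RD_ERR: a read (RD) of data (D) at cache level 1, which agrees with the reading given in the message above.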
* Re: Machine check exception? (2.4.6+SMP+VIA)
From: H. Peter Anvin @ 2001-07-08 20:40 UTC
To: linux-kernel

Followup to: <NDBBKKONDOBLNCIOPCGHIEKOIKAA.vhou@khmer.cc>
By author: "Vibol Hou" <vhou@khmer.cc>
In newsgroup: linux.dev.kernel
>
> I also wonder, however, if this could be due to the 2nd processor not
> getting enough voltage. I don't know the S-SPEC of the processor, but I
> think it's the same as the 1st. However, the voltage reading for CPU 2 is
> .05v lower at 1.65v. Any processor gurus here?
>

That sounds a bit suspicious indeed, and could certainly cause that
kind of error.

	-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
Thread overview: 12+ messages

2001-07-07 20:32 Machine check exception? (2.4.6+SMP+VIA) Vibol Hou
2001-07-07 20:54 ` H. Peter Anvin
2001-07-07 21:41 ` Alan Cox
2001-07-08  7:28   ` Chris Wedgwood
2001-07-08 14:00     ` Alan Cox
2001-07-08 15:33       ` Dave Jones
2001-07-08 17:04         ` Chris Wedgwood
2001-07-08 17:09           ` Dave Jones
2001-07-08 17:18             ` Chris Wedgwood
2001-07-08 20:39             ` H. Peter Anvin
2001-07-08 17:32     ` Vibol Hou
2001-07-08 20:40       ` H. Peter Anvin