* Re: Machine check expection panic
From: Andi Kleen @ 2003-08-07 1:00 UTC
To: Dave Jones; +Cc: richard.brunner, linux-kernel, kwijibo

Dave Jones <davej@redhat.com> writes:

> #
> diff -Nru a/arch/i386/kernel/cpu/mcheck/k7.c b/arch/i386/kernel/cpu/mcheck/k7.c
> --- a/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
> +++ b/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
> @@ -81,7 +81,7 @@
>  		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
>  	nr_mce_banks = l & 0xff;
>
> -	for (i=0; i<nr_mce_banks; i++) {
> +	for (i=1; i<nr_mce_banks; i++) {

The change looks rather suspicious to me.

Bank 0 is the data cache unit (DC).

Do you have an errata that says that the DC bank is bad on all Athlons?

Normally BIOS or microcode are supposed to turn off bad MCEs by
masking them in another register. Maybe the person's CPU has a
real problem that is just masked now, e.g. it could be overclocked
and stress the cache too much.

The original MCE was:

Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): f606200000000833 @ 4040
        External tag parity error
        Uncorrectable ECC error
        CPU state corrupt. Restart not possible
        Address in addr register valid
        Error enabled in control register
        Error not corrected.
        Error overflow
        Bus and interconnect error
                Participation: Local processor originated request
                Timeout: Request did not timeout
                Request: Generic error
                Transaction type : Instruction
                Memory/IO : Other

Tyan 2466 motherboard
2 Athlon MP 1200 processors

(1200?)

-Andi

* Re: Machine check expection panic
From: Dave Jones @ 2003-08-07 1:34 UTC
To: Andi Kleen; +Cc: richard.brunner, linux-kernel, kwijibo

On Thu, Aug 07, 2003 at 03:00:14AM +0200, Andi Kleen wrote:

 > The change looks rather suspicious to me.

It's been in 2.4 for months, it solved the same problem there
as many people are now seeing in 2.6. The "I don't get MCEs in 2.4
but I get them in 2.6" reports are numerous, and I don't buy the
"2.6 stresses hardware more" theory for a second.

 > Bank 0 is the data cache unit (DC)
 > Do you have an errata that says that the DC bank is bad on all Athlons?

Hmm, I thought this was actually documented, but I can't seem to find
it in any of the docs I have. There are however gaps between the errata
numbers in a few cases, so it's possible it was removed in a later
version of the revision guide. Richard?

 > Normally BIOS or microcode are supposed to turn off bad MCEs by
 > masking them in another register. Maybe the person's CPU has a
 > real problem that is just masked now, e.g. it could be overclocked
 > and stress the cache too much.

I recall seeing Athlon owners complain when I 'fixed' this problem
using an inverse of this patch in 2.4.19-pre3. For pre4, Marcelo backed
it out, and people were happy again. Whether it's documented or not,
there are boxes out there that don't like having that bank enabled.

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk

* Re: Machine check expection panic
From: kwijibo @ 2003-08-10 8:12 UTC
To: Andi Kleen; +Cc: Dave Jones, richard.brunner, linux-kernel

Andi Kleen wrote:

>Dave Jones <davej@redhat.com> writes:
>
>>#
>>diff -Nru a/arch/i386/kernel/cpu/mcheck/k7.c b/arch/i386/kernel/cpu/mcheck/k7.c
>>--- a/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
>>+++ b/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
>>@@ -81,7 +81,7 @@
>> 		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
>> 	nr_mce_banks = l & 0xff;
>>
>>-	for (i=0; i<nr_mce_banks; i++) {
>>+	for (i=1; i<nr_mce_banks; i++) {
>
>The change looks rather suspicious to me.
>
>Bank 0 is the data cache unit (DC)
>
>Do you have an errata that says that the DC bank is bad on all Athlons?
>
>Normally BIOS or microcode are supposed to turn off bad MCEs by
>masking them in another register. Maybe the person's CPU has a
>real problem that is just masked now, e.g. it could be overclocked
>and stress the cache too much.
>
The CPUs aren't overclocked and have worked fine for
me under much heavier loads than booting a kernel for
at least a year. Using the 2.4 kernel that is. Once
I remove the exception code from the kernel it boots
fine and runs fine under any load I put it under.

>
>The original MCE was:
>
>Status: (4) Machine Check in progress.
>Restart IP invalid.
>parsebank(0): f606200000000833 @ 4040
>        External tag parity error
>        Uncorrectable ECC error
>        CPU state corrupt. Restart not possible
>        Address in addr register valid
>        Error enabled in control register
>        Error not corrected.
>        Error overflow
>        Bus and interconnect error
>                Participation: Local processor originated request
>                Timeout: Request did not timeout
>                Request: Generic error
>                Transaction type : Instruction
>                Memory/IO : Other
>
>Tyan 2466 motherboard
>2 Athlon MP 1200 processors (1200?)
>
Should say 1.2 GHz processor I imagine. AMD and their wacky
naming schemes. This is before they had their wacky number scheme.

Steve

* Re: Machine check expection panic
From: Andi Kleen @ 2003-08-10 13:07 UTC
To: kwijibo; +Cc: Andi Kleen, Dave Jones, richard.brunner, linux-kernel

> The CPUs aren't overclocked and have worked fine for
> me under much heavier loads than booting a kernel for

It could be corrected ECC errors in the cache. If that
happens I would consider it a hardware problem
(now hidden with the disabled bank).

> at least a year. Using the 2.4 kernel that is. Once
> I remove the exception code from the kernel it boots
> fine and runs fine under any load I put it under.

I maintain that such a magic hack needs at least a big fat comment.

I still find the change very suspicious, there isn't any errata that
says that bank 0 is bad on Athlon.

Also disabling a whole bank just for some buggy CPUs is quite a sledgehammer,
it would probably be better to identify the bank 0 sub unit that causes it
and only turn that off.

-Andi

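[Editorial sketch, not part of the original thread.] The finer-grained alternative suggested above, masking one reporting source in bank 0 rather than skipping the whole bank, could look roughly like the following. This only computes the value that would be written to each bank's control register; it does not touch any MSR. The function name and the choice of bit 2 in the usage note are placeholders, since the thread never identifies which bank 0 sub unit is responsible.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the 64-bit value to program into the
 * control register of `bank`. Every bank gets all reporting bits
 * enabled, except that bank 0 has the bits in `bank0_disable_mask`
 * cleared. Which bits (if any) should be cleared is an open question
 * in the thread; the mask is purely an assumption of this sketch.
 */
uint64_t mce_bank_ctl_value(unsigned int bank, uint64_t bank0_disable_mask)
{
	uint64_t value = ~0ULL;	/* enable every error source by default */

	if (bank == 0)
		value &= ~bank0_disable_mask;	/* mask only the suspect source */

	return value;
}

/* wrmsr takes the value as two 32-bit halves; split it accordingly. */
uint32_t mce_ctl_low(uint64_t value)
{
	return (uint32_t)(value & 0xffffffffu);
}

uint32_t mce_ctl_high(uint64_t value)
{
	return (uint32_t)(value >> 32);
}
```

With a placeholder mask of `1ULL << 2`, bank 0 would be programmed with 0xfffffffb in the low half and 0xffffffff in the high half, while every other bank keeps the all-ones value used by the existing loop.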
* Re: Machine check expection panic
From: kwijibo @ 2003-08-10 21:04 UTC
To: Andi Kleen; +Cc: Dave Jones, richard.brunner, linux-kernel

Out of curiosity I decided to try this on some other Athlon
systems I have. I tried it on a dual Athlon MP 2400 (2GHz)
system with a Tyan 2462 motherboard. Also I tried it on a
single Athlon XP 1800 with an Asus A7V motherboard. They
both booted fine with the 2.6.0-test2 kernel and the machine
exception code in it. So I am thinking either it is something
with the older CPUs or the CPU is actually borked. Like I said
though I have been using those 1.2GHz processors for a long time
with no problems.

Steve

Andi Kleen wrote:

>>The CPU's aren't overclocked and have worked fine for
>>me under much heavier loads than booting a kernel for
>
>It could be corrected ECC errors in the cache. If that
>happens I would consider it a hardware problem
>(now hidden with the disabled bank).
>
>>at least a year. Using the 2.4 kernel that is. Once
>>I remove the exception code from the kernel it boots
>>fine and runs fine under any load I put it under.
>
>I maintain that such a magic hack needs at least a big fat comment.
>
>I still find the change very suspicious, there isn't any errata that
>says that bank 0 is bad on Athlon.
>
>Also disabling a whole bank just for some buggy CPUs is quite a sledgehammer,
>it would be probably better to identify the bank 0 sub unit that causes it
>and only turn that off.
>
>-Andi

* Re: Machine check expection panic
From: Petr Vandrovec @ 2003-08-11 10:15 UTC
To: ak; +Cc: kwijibo, Dave Jones, richard.brunner, linux-kernel

On Sun, Aug 10, 2003 at 03:04:01PM -0600, kwijibo@zianet.com wrote:
> Out of curiosity I decided to try this on some other Athlon
> systems I have. I tried it on a dual Athlon MP 2400(2GHz)
> system with a Tyan 2462 motherboard. Also I tried it on a
> single Athlon XP 1800 with a Asus A7V motherboard. They
> both booted fine with the 2.6.0-test2 kernel and the machine
> exception code in it. So I am thinking either it is something
> with the older CPU's or the CPU is actually borked. Like I said
> though I have been using those 1.2GHz processors for a long time
> with no problems.

Out of curiosity, I never got MCE on my system at home (last kernel
before one below was 2.6.0-test2, and it did not complain for
different kernels at least since November 2001), yet after recent MCE
changes I got during fsck:

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: f65980000000baff

System booted these 2.4.x kernels which are supposed to contain off by one
fix, without complaints (first number = no. of boots):

   8 Linux version 2.4.20-30 (root@ppc) (gcc version 3.3 (Debian))
   1 Linux version 2.4.20-4GB-athlon (root@Athlon.suse.de) (gcc version 3.3 20030226 (prerelease) (SuSE Linux))
   4 Linux version 2.4.20-sp (root@ppc) (gcc version 3.3 (Debian))
   1 Linux version 2.4.21-0.11mdk (quintela@bi.mandrakesoft.com) (gcc version 3.2.2 (Mandrake Linux 9.1 3.2.2-2mdk))
   2 Linux version 2.4.21-0.11mdksecure (quintela@bi.mandrakesoft.com) (gcc version 3.2.2 (Mandrake Linux 9.1 3.2.2-2mdk))
   2 Linux version 2.4.21-0.13mdk (quintela@bi.mandrakesoft.com) (gcc version 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk))
   4 Linux version 2.4.21-pre7 (root@ppc) (gcc version 3.2.3 20030331 (Debian prerelease))

Any idea what's going wrong?

				Best regards,
					Petr Vandrovec
					vandrove@vc.cvut.cz

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 4
model name	: AMD Athlon(tm) Processor
stepping	: 2
cpu MHz		: 1009.064
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips	: 1986.56

Linux version 2.6.0-test3-c1149 (root@ppc) (gcc version 3.3.1 (Debian)) #1 SMP Sun Aug 10 19:42:22 CEST 2003
Video mode to be used for restore is f00
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
 BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001ffec000 (usable)
 BIOS-e820: 000000001ffec000 - 000000001ffef000 (ACPI data)
 BIOS-e820: 000000001ffef000 - 000000001ffff000 (reserved)
 BIOS-e820: 000000001ffff000 - 0000000020000000 (ACPI NVS)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
511MB LOWMEM available.
On node 0 totalpages: 131052
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 126956 pages, LIFO batch:16
  HighMem zone: 0 pages, LIFO batch:1
ACPI: RSDP (v000 ASUS ) @ 0x000f6a90
ACPI: RSDT (v001 ASUS A7V 12336.12337) @ 0x1ffec000
ACPI: FADT (v001 ASUS A7V 12336.12337) @ 0x1ffec080
ACPI: BOOT (v001 ASUS A7V 12336.12337) @ 0x1ffec040
ACPI: DSDT (v001 ASUS A7V 00000.04096) @ 0x00000000
ACPI: BIOS passes blacklist
ACPI: MADT not present
Building zonelist for node : 0
Kernel command line: BOOT_IMAGE=Linux ro root=2105 video=matrox:vesa:0x117,fv:85 video=matroxfb:vesa:0x117,fv:85 nmi_watchdog=1 devfs=nomount
Local APIC disabled by BIOS -- reenabling.
Found and enabled local APIC!
Initializing CPU#0
PID hash table entries: 2048 (order 11: 16384 bytes)
Detected 1009.064 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1986.56 BogoMIPS
Memory: 514556k/524208k available (2195k kernel code, 8856k reserved, 672k data, 364k init, 0k highmem)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU: After generic identify, caps: 0183fbff c1c7fbff 00000000 00000000
CPU: After vendor identify, caps: 0183fbff c1c7fbff 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
CPU: After all inits, caps: 0183fbff c1c7fbff 00000000 00000020
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Enabling fast FPU save and restore... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
CPU0: AMD Athlon(tm) Processor stepping 02
per-CPU timeslice cutoff: 731.16 usecs.
task migration cache decay timeout: 1 msecs.
SMP motherboard not detected.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
testing NMI watchdog ... OK.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1008.0833 MHz.
..... host bus clock speed is 201.0766 MHz.
Starting migration thread for cpu 0
CPUS done 2
Initializing RT netlink socket
PCI: PCI BIOS revision 2.10 entry at 0xf1180, last bus=1
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
BIO: pool of 256 setup, 15Kb (60 bytes/bio)
biovec pool[0]: 1 bvecs: 256 entries (12 bytes)
biovec pool[1]: 4 bvecs: 256 entries (48 bytes)
biovec pool[2]: 16 bvecs: 256 entries (192 bytes)
biovec pool[3]: 64 bvecs: 256 entries (768 bytes)
biovec pool[4]: 128 bvecs: 256 entries (1536 bytes)
biovec pool[5]: 256 bvecs: 256 entries (3072 bytes)
ACPI: Subsystem revision 20030714
spurious 8259A interrupt: IRQ7.
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'
matroxfb: Matrox G450 detected
matroxfb: MTRR's turned on
matroxfb: 1024x768x16bpp (virtual: 1024x8190)
matroxfb: framebuffer at 0xCE000000, mapped to 0xe080f000, size 33554432
Console: switching to colour frame buffer device 128x48
fb0: MATROX frame buffer device
pty: 256 Unix98 ptys configured
SBF: Simple Boot Flag extension found and enabled.
SBF: Setting boot flags 0x1
Machine check exception polling timer started.
IA-32 Microcode Update Driver: v1.11 <tigran@veritas.com>
Journalled Block Device driver loaded
devfs: v1.22 (20021013) Richard Gooch (rgooch@atnf.csiro.au)
devfs: boot_options: 0x0
Initializing Cryptographic API
PCI: Disabling Via external APIC routing
ACPI: Power Button (FF) [PWRF]
ACPI: Processor [CPU0] (supports C1 C2, 16 throttling states)
Real Time Clock Driver v1.11
Non-volatile memory driver v1.2
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
parport0: PC-style at 0x378 (0x778) [PCSPP,TRISTATE]
parport0: cpp_daisy: aa5500ff(38)
parport0: assign_addrs: aa5500ff(38)
parport0: cpp_daisy: aa5500ff(38)
parport0: assign_addrs: aa5500ff(38)
parport_pc: Via 686A parallel port: io=0x378
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:04.1
VP_IDE: chipset revision 16
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
    ide0: BM-DMA at 0xd800-0xd807, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xd808-0xd80f, BIOS settings: hdc:DMA, hdd:pio
hda: PLEXTOR CD-R PX-W2410A, ATAPI CD/DVD-ROM drive
Using anticipatory scheduling elevator
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: TOSHIBA MK6409MAV, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
PDC20265: IDE controller at PCI slot 0000:00:11.0
PDC20265: chipset revision 2
PDC20265: 100% native mode on irq 10
PDC20265: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
    ide2: BM-DMA at 0x8400-0x8407, BIOS settings: hde:pio, hdf:pio
    ide3: BM-DMA at 0x8408-0x840f, BIOS settings: hdg:pio, hdh:pio
hde: WDC WD1200BB-00CAA1, ATA DISK drive
ide2 at 0x9800-0x9807,0x9402 on irq 10
hdh: WDC WD1200BB-00CAA1, ATA DISK drive
ide3 at 0x9000-0x9007,0x8802 on irq 10
hdc: max request size: 128KiB
hdc: 12685680 sectors (6495 MB), CHS=13424/15/63, UDMA(33)
 /dev/ide/host0/bus1/target0/lun0: p1
hde: max request size: 128KiB
hde: 234441648 sectors (120034 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
 /dev/ide/host2/bus0/target0/lun0: p1 p2 < p5 p6 p7 >
hdh: max request size: 128KiB
hdh: 234441648 sectors (120034 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
 /dev/ide/host2/bus1/target1/lun0: p1 p2 < p5 p6 >
hda: ATAPI 40X CD-ROM CD-R/RW drive, 4096kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
matroxfb_crtc2: secondary head of fb0 was registered as fb1
drivers/usb/host/uhci-hcd.c: USB Universal Host Controller Interface driver v2.1
uhci-hcd 0000:00:04.2: UHCI Host Controller
uhci-hcd 0000:00:04.2: irq 9, io base 0000d400
uhci-hcd 0000:00:04.2: new USB bus registered, assigned bus number 1
hub 1-0:0: USB hub found
hub 1-0:0: 2 ports detected
uhci-hcd 0000:00:04.3: UHCI Host Controller
uhci-hcd 0000:00:04.3: irq 9, io base 0000d000
uhci-hcd 0000:00:04.3: new USB bus registered, assigned bus number 2
hub 2-0:0: USB hub found
hub 2-0:0: 2 ports detected
mice: PS/2 mouse device common for all mice
input: PC Speaker
input: ImExPS/2 Generic Explorer Mouse on isa0060/serio1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
i2c /dev entries driver module version 2.7.0 (20021208)
oprofile: using NMI interrupt.
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 2048 buckets, 32Kbytes
TCP: Hash tables configured (established 16384 bind 21845)
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
BIOS EDD facility v0.09 2003-Jan-22, 3 devices found
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 364k freed
hub 2-0:0: debounce: port 2: delay 100ms stable 4 status 0x101
hub 2-0:0: new USB device on port 2, assigned address 2
hub 2-2:0: USB hub found
hub 2-2:0: 4 ports detected
Adding 1951856k swap on /dev/hde6. Priority:-1 extents:1
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: f65980000000baff
ne2k-pci.c:v1.02 10/19/2000 D. Becker/P. Gortmaker
  http://www.scyld.com/network/ne2k-pci.html
eth0: RealTek RTL-8029 found at 0xa000, IRQ 10, 00:C0:26:30:B0:2D.

* Re: Machine check expection panic
From: Bartlomiej Zolnierkiewicz @ 2003-08-11 11:34 UTC
To: Petr Vandrovec; +Cc: ak, kwijibo, Dave Jones, richard.brunner, linux-kernel

Just "me too".

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: 8000000000002140

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 8
model name	: AMD Athlon(tm) XP 1700+
stepping	: 1
cpu MHz		: 1467.033
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 2883.58

--bartlomiej

On Mon, 11 Aug 2003, Petr Vandrovec wrote:

> Out of curiosity, I never got MCE on my system at home (last kernel
> before one below was 2.6.0-test2, and it did not complain for
> different kernels at least since November 2001), yet after recent MCE
> changes I got during fsck:
>
> MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
> Bank 0: f65980000000baff

* Machine check expection panic
From: kwijibo @ 2003-08-06 22:35 UTC
To: linux-kernel

I decided to try out the new 2.6.0-test2 kernel today but
ran into a problem with booting it. I narrowed it down to
the machine check exception code. I get this panic from
the kernel on boot when I have it enabled:

CPU0: Machine Check Exception: 0000000000000004
Bank0: f606200000000833 at 0000000000004040
Kernel Panic: CPU context corrupt.

I disabled this option in the kernel and recompiled and everything
went smoothly. I figured maybe there could actually be
something wrong with the CPU but I can boot with RedHat's
2.4.20-19 kernel fine, which I *think* includes machine check exception
code. I have no beef with leaving the exception code out but I figured
someone on this list may want to know.

Little bit of hardware info:
Tyan 2466 motherboard
2 Athlon MP 1200 processors

Steve

* Re: Machine check expection panic
From: Matt Mackall @ 2003-08-06 23:05 UTC
To: kwijibo; +Cc: linux-kernel

On Wed, Aug 06, 2003 at 04:35:33PM -0600, kwijibo@zianet.com wrote:
> I decided to try out the new 2.6.0-test2 kernel today but
> ran into a problem with booting it. I narrowed it down to
> the machine check exception code. I get this panic from
> the kernel on boot when I have it enabled
>
> CPU0: Machine Check Exception: 0000000000000004
> Bank0: f606200000000833 at 0000000000004040
> Kernel Panic: CPU context corrupt.

$ parsemce -b 0 -e 0000000000000004 -s f606200000000833 -a 0000000000004040
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): f606200000000833 @ 4040
        External tag parity error
        Uncorrectable ECC error
        CPU state corrupt. Restart not possible
        Address in addr register valid
        Error enabled in control register
        Error not corrected.
        Error overflow
        Bus and interconnect error
                Participation: Local processor originated request
                Timeout: Request did not timeout
                Request: Generic error
                Transaction type : Instruction
                Memory/IO : Other

Looks like corruption in your L2 cache. Odds are it's heat-related.

> I disabled this option in the kernel and recompiled and everything
> went smooth. I figured maybe there could actually possibly be
> something wrong with the CPU but I can boot with RedHat's
> 2.4.20-19 kernel fine which I *think* includes machine check exception
> code. I have no beef with leaving the exception code out but I figured
> someone on this list may want to know.
>
> Little bit of hardware info:
> Tyan 2466 motherboard
> 2 Athlon MP 1200 processors

-- 
Matt Mackall : http://www.selenic.com : of or relating to the moon

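[Editorial sketch, not part of the original thread.] For readers who want to check the generic flags in the parsemce output by hand, here is a minimal C sketch that extracts the architectural flag bits (63 down to 57) from a bank status value. The bit positions follow the IA-32 machine check architecture layout; the macro and function names are made up for this illustration and are not taken from parsemce or the kernel. Model-specific fields such as the ECC and tag parity bits, and the low 16-bit error code, are deliberately left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits of a bank status register (IA-32 MCA layout).
 * Names are illustrative only. */
#define MCI_STATUS_VAL   (1ULL << 63)	/* register holds a valid error */
#define MCI_STATUS_OVER  (1ULL << 62)	/* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)	/* error was not corrected */
#define MCI_STATUS_EN    (1ULL << 60)	/* reporting enabled in control register */
#define MCI_STATUS_MISCV (1ULL << 59)	/* misc register contents valid */
#define MCI_STATUS_ADDRV (1ULL << 58)	/* address register contents valid */
#define MCI_STATUS_PCC   (1ULL << 57)	/* processor context corrupt */

struct mci_flag {
	uint64_t mask;
	const char *text;
};

static const struct mci_flag mci_flags[] = {
	{ MCI_STATUS_VAL,   "status valid" },
	{ MCI_STATUS_OVER,  "error overflow" },
	{ MCI_STATUS_UC,    "error not corrected" },
	{ MCI_STATUS_EN,    "error enabled in control register" },
	{ MCI_STATUS_MISCV, "misc register valid" },
	{ MCI_STATUS_ADDRV, "address in addr register valid" },
	{ MCI_STATUS_PCC,   "CPU state corrupt" },
};

#define MCI_FLAG_COUNT (sizeof(mci_flags) / sizeof(mci_flags[0]))

/* Return only the architectural flag bits (63..57) of a status value. */
uint64_t mci_status_flags(uint64_t status)
{
	uint64_t flags = 0;
	size_t i;

	for (i = 0; i < MCI_FLAG_COUNT; i++)
		flags |= status & mci_flags[i].mask;
	return flags;
}

/* Print one line per flag that is set and return how many were set. */
int mci_status_print(uint64_t status, FILE *out)
{
	int count = 0;
	size_t i;

	for (i = 0; i < MCI_FLAG_COUNT; i++) {
		if (status & mci_flags[i].mask) {
			fprintf(out, "        %s\n", mci_flags[i].text);
			count++;
		}
	}
	return count;
}
```

Calling `mci_status_print(0xf606200000000833ULL, stdout)` on the status from this report lists six flags (valid, overflow, not corrected, enabled, address valid, CPU state corrupt), which lines up with the generic lines in the parsemce output above.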
* Re: Machine check expection panic
From: Dave Jones @ 2003-08-07 0:27 UTC
To: kwijibo; +Cc: linux-kernel

On Wed, Aug 06, 2003 at 04:35:33PM -0600, kwijibo@zianet.com wrote:
 > I decided to try out the new 2.6.0-test2 kernel today but
 > ran into a problem with booting it. I narrowed it down to
 > the machine check exception code. I get this panic from
 > the kernel on boot when I have it enabled
 >
 > CPU0: Machine Check Exception: 0000000000000004
 > Bank0: f606200000000833 at 0000000000004040
 > Kernel Panic: CPU context corrupt.

Missing bugfix from the 2.4 kernel that never made it into 2.5.
Chances are you (and many other Athlon users) are hitting problems
because of this chunk. Already pushed to Linus/Andrew.

		Dave

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1055  -> 1.1056
#	arch/i386/kernel/cpu/mcheck/k7.c	1.4     -> 1.5
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/08/06	davej@redhat.com	1.1056
# stupid off by one
# --------------------------------------------
#
diff -Nru a/arch/i386/kernel/cpu/mcheck/k7.c b/arch/i386/kernel/cpu/mcheck/k7.c
--- a/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
+++ b/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
@@ -81,7 +81,7 @@
 		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
 	nr_mce_banks = l & 0xff;
 
-	for (i=0; i<nr_mce_banks; i++) {
+	for (i=1; i<nr_mce_banks; i++) {
 		wrmsr (MSR_IA32_MC0_CTL+4*i, 0xffffffff, 0xffffffff);
 		wrmsr (MSR_IA32_MC0_STATUS+4*i, 0x0, 0x0);
 	}

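[Editorial sketch, not part of the original thread.] The loop in the patch addresses each bank's registers as a base MSR number plus 4 times the bank index. The following user-space C sketch lists which MSR numbers the loop would touch for a given starting bank. It assumes the architectural numbering (0x400 for the bank 0 control register, 0x401 for its status register); the helper names are hypothetical and nothing here reads or writes a real MSR.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural MSR numbers for bank 0; each bank occupies 4 MSRs. */
#define MSR_IA32_MC0_CTL    0x400u
#define MSR_IA32_MC0_STATUS 0x401u

/* MSR number of the control register of a given bank. */
uint32_t mce_bank_ctl_msr(unsigned int bank)
{
	return MSR_IA32_MC0_CTL + 4u * bank;
}

/* MSR number of the status register of a given bank. */
uint32_t mce_bank_status_msr(unsigned int bank)
{
	return MSR_IA32_MC0_STATUS + 4u * bank;
}

/*
 * Print the MSRs the init loop would program when it starts at
 * first_bank, and return the number of banks it visits.
 */
unsigned int mce_list_init_msrs(unsigned int first_bank,
				unsigned int nr_banks, FILE *out)
{
	unsigned int i, visited = 0;

	for (i = first_bank; i < nr_banks; i++) {
		fprintf(out, "bank %u: CTL=0x%x STATUS=0x%x\n", i,
			(unsigned int)mce_bank_ctl_msr(i),
			(unsigned int)mce_bank_status_msr(i));
		visited++;
	}
	return visited;
}
```

With a starting index of 1, MSRs 0x400 and 0x401 never appear in the list: bank 0 keeps whatever control value the BIOS or reset state left there, and its status register is not cleared by this loop.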