* Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident
From: Andi Kleen @ 2004-03-04 11:39 UTC
To: Dave Jones; +Cc: linux-kernel

Dave Jones <davej@redhat.com> writes:

> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I would actually suggest switching over to the rewritten MCE handler
from x86-64 for i386 too. IMHO it is much better. It is race free,
does not panic the machine if not needed, is CPU independent, follows
the Intel and AMD recommendations, is run time configurable through
sysfs, logs to a separate device and does lots of other things much
better [of course I'm biased on that a bit]. The disadvantage is that
it isn't as well tested.

I haven't tried it on i386, but I wrote it to be easily portable
to 32-bit too. It does periodic MCEs too, but with a much lower
frequency and they could be turned off. I'm considering turning
them off for x86-64 too, because they seem to only log one-bit
ECC errors all the time. But with the new separate log device it's much
less of a problem.

The only thing you would lose is the support for P5 MCEs, but these
could be relatively easily re-added if that should be a problem.

Extended register logging for P4 is also not implemented right now,
but that hardly seems like a needed feature.

-Andi
* Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident
From: David Weinehall @ 2004-03-04 15:01 UTC
To: Andi Kleen; +Cc: Dave Jones, linux-kernel

On Thu, Mar 04, 2004 at 12:39:59PM +0100, Andi Kleen wrote:
> Dave Jones <davej@redhat.com> writes:
>
> > I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> > and fixing it up later.
>
> I would actually suggest switching over to the rewritten MCE handler
> from x86-64 for i386 too. IMHO it is much better. It is race free,
> does not panic the machine if not needed, is CPU independent, follows
> the Intel and AMD recommendations, is run time configurable through
> sysfs, logs to a separate device and does lots of other things much
> better [of course I'm biased on that a bit]. The disadvantage is that
> it isn't as well tested.

Well, the only way to solve that problem is to test it, right?
And what better way to test it than to switch i386 over to it too :-)

> I haven't tried it on i386, but I wrote it to be easily portable
> to 32-bit too. It does periodic MCEs too, but with a much lower
> frequency and they could be turned off. I'm considering turning
> them off for x86-64 too, because they seem to only log one-bit
> ECC errors all the time. But with the new separate log device it's much
> less of a problem.
>
> The only thing you would lose is the support for P5 MCEs, but these
> could be relatively easily re-added if that should be a problem.

Well, losing functionality would be bad.

> Extended register logging for P4 is also not implemented right now,
> but that hardly seems like a needed feature.

No opinion here.

Regards: David Weinehall
-- 
 /) David Weinehall <tao@acc.umu.se> /) Northern lights wander      (\
//  Maintainer of the v2.0 kernel   //  Dance across the winter sky //
\)  http://www.acc.umu.se/~tao/    (/   Full colour fire           (/
* Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident
From: Davi Leal @ 2004-03-02 18:00 UTC
To: linux-kernel

What about this message? Note that the system works. I have not had to
reboot. What does the message below mean?

Do not hesitate to ask for more information.

Regards,
Davi Leal

Message from syslogd@AMD at Tue Mar 2 11:27:00 2004 ...
AMD kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Message from syslogd@AMD at Tue Mar 2 11:27:00 2004 ...
AMD kernel: Bank 1: 9400400000000152

$ cat /proc/version
Linux version 2.6.2 (root@AMD) (gcc version 3.3.3 20040125 (prerelease) (Debian)) #1 Wed Feb 4 19:26:25 CET 2004

davi@AMD:/Compartida$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 10
model name      : AMD Athlon(tm) XP 2400+
stepping        : 0
cpu MHz         : 2010.002
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3956.73
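For readers who want to look inside the "Bank 1" value quoted in the report above, here is a minimal sketch of splitting such a 64-bit status word into flag bits. It assumes the flag layout of the x86 machine-check architecture (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57, error code in bits 15..0); the function name `decode_bank_status` and the dictionary keys are made up for this illustration and are not taken from the kernel. Model-specific bits are deliberately left undecoded.

```python
# Sketch only: field positions assume the x86 machine-check architecture
# status layout. Bits 56..32 are model specific and are not decoded here.
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # companion MISC register is valid
    "ADDRV": 58,  # companion ADDR register is valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_bank_status(status: int) -> dict:
    """Return the flag bits and the two 16-bit code fields of a status word."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["error_code"] = status & 0xFFFF                    # bits 15..0
    fields["model_specific_code"] = (status >> 16) & 0xFFFF   # bits 31..16
    return fields


if __name__ == "__main__":
    # Value copied from the syslog line in the report.
    decoded = decode_bank_status(int("9400400000000152", 16))
    for name, value in decoded.items():
        print(f"{name:20} {value}")
```

Under those assumptions the reported value has VAL, EN and ADDRV set and UC and PCC clear, which matches the kernel's wording of "non fatal, correctable". Whether the event points at real hardware trouble or at one of the false positives discussed in the replies cannot be told from these bits alone.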
* Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident
From: Dave Jones @ 2004-03-02 21:55 UTC
To: Davi Leal; +Cc: linux-kernel

On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
 > What about this message? Note that the system works. I have not had to
 > reboot. What does the message below mean?
 >

The original plan behind that option was to find hardware faults early,
but it seems to trigger a lot of false positives for various reasons.
Part of this problem is that MCEs can also be generated on some hardware
by doing something silly like reading from a reserved part of your
motherboard chipset.

There are also CPU errata that can cause them to falsely trigger in
some unusual cases, but I've not had time to go through the various
errata datasheets to blacklist affected CPUs, unfortunately.

I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
and fixing it up later.

		Dave
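The errata blacklist mentioned above was never written in this thread, so the following is only a rough sketch of what such a check could look like from user space. It builds a (vendor, family, model, stepping) key from `/proc/cpuinfo`-style text and looks it up in a set supplied by the caller. The function names and the blacklist format are hypothetical, and the single entry shown is a placeholder, not a real erratum.

```python
# Hypothetical sketch: nothing here reflects an actual kernel blacklist.
def cpu_signature(cpuinfo_text: str) -> tuple:
    """Build (vendor_id, family, model, stepping) for the first CPU listed."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # setdefault keeps the first processor's values on SMP systems
            fields.setdefault(key.strip(), value.strip())
    return (
        fields["vendor_id"],
        int(fields["cpu family"]),
        int(fields["model"]),
        int(fields["stepping"]),
    )


def mce_reporting_trusted(cpuinfo_text: str, blacklist: set) -> bool:
    """True when the CPU is not on the caller-supplied errata blacklist."""
    return cpu_signature(cpuinfo_text) not in blacklist


if __name__ == "__main__":
    # Placeholder entry for illustration only; not a real affected CPU.
    example_blacklist = {("ExampleVendor", 0, 0, 0)}
    with open("/proc/cpuinfo") as handle:
        text = handle.read()
    print(cpu_signature(text), mce_reporting_trusted(text, example_blacklist))
```

Keying on the full signature rather than the model name is a design choice made for this sketch, on the assumption that errata documents identify affected parts by family, model and stepping.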
* Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident
From: Philippe Elie @ 2004-03-03 8:58 UTC
To: Dave Jones, Davi Leal, linux-kernel

On Tue, 02 Mar 2004 at 21:55 +0000, Dave Jones wrote:

> On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
>  > What about this message? Note that the system works. I have not had to
>  > reboot. What does the message below mean?
>  >
>
> The original plan behind that option was to find hardware faults early,
> but it seems to trigger a lot of false positives for various reasons.
> Part of this problem is that MCEs can also be generated on some hardware
> by doing something silly like reading from a reserved part of your
> motherboard chipset.
>
> There are also CPU errata that can cause them to falsely trigger in
> some unusual cases, but I've not had time to go through the various
> errata datasheets to blacklist affected CPUs, unfortunately.
>
> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I'm unsure that's a good idea: it's broken only on broken HW. People
wanting a stable box try to buy sane HW and don't enable CONFIG_BROKEN,
so they would never see if their HW is starting to go out of spec.
Perhaps reword the option help and the error message to say it's
known to report false positives...

regards,
Phil
* Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident
From: Joseph Fannin @ 2004-03-04 4:02 UTC
To: linux-kernel; +Cc: Davi Leal

On Tue, Mar 02, 2004 at 09:55:54PM +0000, Dave Jones wrote:
> On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
>> What about this message? Note that the system works. I have not had to
>> reboot. What does the message below mean?
>>
>
> The original plan behind that option was to find hardware faults early,
> but it seems to trigger a lot of false positives for various reasons.
> Part of this problem is that MCEs can also be generated on some hardware
> by doing something silly like reading from a reserved part of your
> motherboard chipset.

The MCE stuff truly did find a hardware fault early for me; my
Athlon system was MCE'ing and I ignored it, and later I got sig11
errors and fs corruption, which I finally traced to a failing stick of
memory.

> There are also CPU errata that can cause them to falsely trigger in
> some unusual cases, but I've not had time to go through the various
> errata datasheets to blacklist affected CPUs, unfortunately.
>
> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I wouldn't be so quick to write off MCEs as bugs or errata,
especially if the exceptions have only just begun showing up. Running
CPUBurn, memtest86 and the like is still probably a good idea,
especially if you value the data on your file system.

-- 
Joseph Fannin
jhf@rivenstone.net

"Anyone who quotes me in their sig is an idiot." -- Rusty Russell.