* [2.6.1 MCE falseness?] Hardware reports non-fatal error
@ 2004-01-18 1:44 Niel Lambrechts
2004-01-18 2:03 ` Dave Jones
From: Niel Lambrechts @ 2004-01-18 1:44 UTC
To: linux-kernel
I get the following problem with 2.6.1 consistently after resuming from APM:
"ksyrium kernel: MCE: The hardware reports a non fatal, correctable
incident occurred on CPU 0.
Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
ksyrium kernel: Bank 1: f2000000000001c5"
It does not happen on any of the other kernels I use (vanilla 2.4.24,
SuSE 9 2.4.21-166), even though CONFIG_X86_MCE=y for both. The machine
is brand new, an IBM Thinkpad R50P, and it passes all of IBM's software
diagnostics.
I'd appreciate help with the parameters for parsemce to interpret the
problem; I'm not sure whether my usage is correct. ;)
# ./parsemce -b 1 -a 0 -e f2000000000001c5
Status: (f2000000000001c5) Machine Check in progress.
Restart IP valid.
Is this really a hardware problem (or maybe a bug in the BIOS), or are
false positives possible with the 2.6 MCE code?
-Niel
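The bank status word quoted above can also be picked apart by hand. This is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout (flag bits 63 down to 57, a model-specific code in bits 31:16, the MCA error code in bits 15:0). The names mci_status, mci_decode and the MCI_STATUS_* macros are hypothetical; they are not taken from parsemce or from the kernel source.

```c
#include <stdint.h>

/* Flag bits of a machine-check bank status word (assumed architectural layout). */
#define MCI_STATUS_VAL   (1ULL << 63) /* the bank holds a valid error record */
#define MCI_STATUS_OVER  (1ULL << 62) /* an earlier record was overwritten */
#define MCI_STATUS_UC    (1ULL << 61) /* the error was not corrected */
#define MCI_STATUS_EN    (1ULL << 60) /* reporting was enabled for this error */
#define MCI_STATUS_MISCV (1ULL << 59) /* the bank's MISC register is valid */
#define MCI_STATUS_ADDRV (1ULL << 58) /* the bank's ADDR register is valid */
#define MCI_STATUS_PCC   (1ULL << 57) /* processor context may be corrupt */

/* Hypothetical container for the decoded fields. */
struct mci_status {
	int valid, overflow, uncorrected, enabled, miscv, addrv, pcc;
	unsigned int model_code; /* bits 31:16, model-specific error code */
	unsigned int mca_code;   /* bits 15:0, MCA error code */
};

/* Split a 64-bit status word into its flags and error codes. */
struct mci_status mci_decode(uint64_t s)
{
	struct mci_status d;

	d.valid       = (s & MCI_STATUS_VAL) != 0;
	d.overflow    = (s & MCI_STATUS_OVER) != 0;
	d.uncorrected = (s & MCI_STATUS_UC) != 0;
	d.enabled     = (s & MCI_STATUS_EN) != 0;
	d.miscv       = (s & MCI_STATUS_MISCV) != 0;
	d.addrv       = (s & MCI_STATUS_ADDRV) != 0;
	d.pcc         = (s & MCI_STATUS_PCC) != 0;
	d.model_code  = (unsigned int)((s >> 16) & 0xffff);
	d.mca_code    = (unsigned int)(s & 0xffff);
	return d;
}
```

Under those assumptions, mci_decode(0xf2000000000001c5ULL) yields VAL, OVER, UC, EN and PCC set, MISCV and ADDRV clear, and an MCA error code of 0x01c5. Whether those bits can be trusted immediately after an APM resume is the open question here.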
* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
2004-01-18 1:44 [2.6.1 MCE falseness?] Hardware reports non-fatal error Niel Lambrechts
@ 2004-01-18 2:03 ` Dave Jones
2004-01-21 12:13 ` Stephen Rothwell
From: Dave Jones @ 2004-01-18 2:03 UTC
To: Niel Lambrechts; +Cc: linux-kernel
On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:
> I get the following problem with 2.6.1 consistently after resuming from APM:
> "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
> incident occurred on CPU 0.
>
> Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
> ksyrium kernel: Bank 1: f2000000000001c5"
As it only happens when you resume from APM, I'm inclined to believe
it's a BIOS bug. With the output of dmidecode, we could blacklist this
box so that it skips the non-fatal checking.
> It does not happen on any of the other kernels I use (vanilla 2.4.24,
> SuSE 9 2.4.21-166), even though CONFIG_X86_MCE=y for both. The machine
> is brand new, an IBM Thinkpad R50P, and it passes all of IBM's software
> diagnostics.
None of the other kernels you mention have this; it's a new feature of 2.6.
Dave
* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
@ 2004-01-18 13:30 Pedro Larroy
2004-01-18 14:23 ` glee
From: Pedro Larroy @ 2004-01-18 13:30 UTC
To: linux-kernel
I have also been getting apparently false MCEs since 2.5.xx. I even had
kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx and 2.6.1,
I just get them from time to time, but none are fatal. Most of them are
on CPU 0:
request_module: failed /sbin/modprobe -- char-major-6-0. error = 256
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: e606200000000833
request_module: failed /sbin/modprobe --
The box is a dual Athlon MP with the AMD 760MP chipset.
nebula:/home/piotr# ./parsemce b 1 -a 0 -e e606200000000833
Status: (e606200000000833) Error IP valid
Restart IP valid.
nebula:/home/piotr#
--
Pedro Larroy Tovar | piotr%member.fsf.org
Software patents are a threat to innovation in Europe please check:
http://www.eurolinux.org/
* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
2004-01-18 13:30 Pedro Larroy
@ 2004-01-18 14:23 ` glee
2004-01-18 20:04 ` Dave Jones
From: glee @ 2004-01-18 14:23 UTC
To: Pedro Larroy; +Cc: linux-kernel
On Sun, Jan 18, 2004 at 02:30:48PM +0100, Pedro Larroy wrote:
> I have also been getting apparently false MCEs since 2.5.xx. I even had
> kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx and 2.6.1,
> I just get them from time to time, but none are fatal. Most of them are
> on CPU 0:
>
I get them too, so I applied this patch.
- g.
[-- Attachment #2: mce-fix.patch --]
[-- Type: text/plain, Size: 1061 bytes --]
--- linux-2.6.0-test9/arch/i386/kernel/cpu/mcheck/non-fatal.c.orig 2003-11-02 13:31:43.000000000 +0800
+++ linux-2.6.0-test9/arch/i386/kernel/cpu/mcheck/non-fatal.c 2003-11-02 21:50:36.000000000 +0800
@@ -21,6 +21,7 @@
 static struct timer_list mce_timer;
 static int timerset;
+static int startbank;
 #define MCE_RATE	15*HZ	/* timer rate is 15s */
@@ -30,7 +31,7 @@
 	int i;
 	preempt_disable();
-	for (i=0; i<nr_mce_banks; i++) {
+	for (i=startbank; i<nr_mce_banks; i++) {
 		rdmsr (MSR_IA32_MC0_STATUS+i*4, low, high);
 		if (high & (1<<31)) {
@@ -68,6 +69,19 @@
 static int __init init_nonfatal_mce_checker(void)
 {
+	/*
+	   Certain Athlons would cause spurious MCE non-fatal
+	   exception check to be reported, if we poke at bank 0.
+	   Avoid this if the running machine is an Athlon.
+
+	   -- geoff.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+	    boot_cpu_data.x86 == 6)
+		startbank = 1;
+	else
+		startbank = 0;
+
 	if (timerset == 0) {
 		/* Set the timer to check for non-fatal
 		   errors every MCE_RATE seconds */
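The effect of the changed loop bound can be sketched in user space. This is a minimal sketch, not the kernel code: the status array stands in for the rdmsr() reads, count_reported_banks and STATUS_VALID are hypothetical names, and the bank count used in the example is an assumption.

```c
#include <stdint.h>

/*
 * Bit 63 of a bank status word marks a valid error record. The patch
 * tests the same bit as (high & (1<<31)) on the upper 32-bit half.
 */
#define STATUS_VALID (1ULL << 63)

/*
 * Hypothetical stand-in for the polling loop: walk the banks from
 * 'startbank' upward and count those that would have been logged.
 */
int count_reported_banks(const uint64_t *status, int nr_banks, int startbank)
{
	int i, reported = 0;

	for (i = startbank; i < nr_banks; i++) {
		if (status[i] & STATUS_VALID)
			reported++;
	}
	return reported;
}
```

With four assumed banks holding { 0xe606200000000833, 0, 0, 0 }, the Bank 0 value reported earlier in this thread, a call with startbank 0 counts one report and a call with startbank 1 counts none, since bank 0 is never examined.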
* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
2004-01-18 14:23 ` glee
@ 2004-01-18 20:04 ` Dave Jones
From: Dave Jones @ 2004-01-18 20:04 UTC
To: glee; +Cc: Pedro Larroy, linux-kernel
On Sun, Jan 18, 2004 at 10:23:38PM +0800, glee@gnupilgrims.org wrote:
>
> I get them too, so I applied this patch.
Gah, that still didn't get applied?
Dave
--
Dave Jones http://www.codemonkey.org.uk
* RE: [2.6.1 MCE falseness?] Hardware reports non-fatal error
@ 2004-01-20 19:43 Niel Lambrechts
From: Niel Lambrechts @ 2004-01-20 19:43 UTC
To: linux-kernel
I tried the patch posted earlier in this thread, with a modification for
my CPU type, but I still get the problem:
"Jan 20 21:30:23 ksyrium kernel: MCE: The hardware reports a non fatal,
correctable incident occurred on CPU 0.
Jan 20 21:30:23 ksyrium kernel: MCE: startbank = 1, vendor : 0, x86 = 6,
model = 9, mask = 5.
Jan 20 21:30:23 ksyrium kernel: Bank 1: f200000000000185"
As you can see, I added a little extra debugging info. Here is the
relevant portion of the code:
"	if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
	     boot_cpu_data.x86 == 6) ||
	    (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	     boot_cpu_data.x86 == 6 &&
	     boot_cpu_data.x86_model == 9 &&
	     boot_cpu_data.x86_mask == 5))
		startbank = 1;"
Comments would be appreciated.
-Niel
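One possible reading of the debug line above: the condition matched and startbank became 1, but the report is for Bank 1, which a loop starting at bank 1 still visits. A minimal sketch of that reasoning follows. It assumes the kernel's vendor numbering (Intel 0, AMD 2, consistent with "vendor : 0" in the log); cpu_id, pick_startbank and bank_is_polled are hypothetical names, and the bank count in the example is an assumption.

```c
/* Vendor IDs, assumed to match the kernel's numbering of that era. */
#define X86_VENDOR_INTEL 0
#define X86_VENDOR_AMD   2

/* Hypothetical stand-in for the boot_cpu_data fields used in the test. */
struct cpu_id {
	int vendor;
	int family; /* boot_cpu_data.x86 */
	int model;  /* boot_cpu_data.x86_model */
	int mask;   /* boot_cpu_data.x86_mask (stepping) */
};

/* Mirror of the modified condition: skip bank 0 on the listed CPUs. */
int pick_startbank(const struct cpu_id *c)
{
	if ((c->vendor == X86_VENDOR_AMD && c->family == 6) ||
	    (c->vendor == X86_VENDOR_INTEL && c->family == 6 &&
	     c->model == 9 && c->mask == 5))
		return 1;
	return 0;
}

/* A bank is examined only if it lies in [startbank, nr_banks). */
int bank_is_polled(int bank, int startbank, int nr_banks)
{
	return bank >= startbank && bank < nr_banks;
}
```

For the CPU in the log (vendor 0, family 6, model 9, mask 5), pick_startbank returns 1. With an assumed five banks, bank_is_polled(0, 1, 5) is 0 but bank_is_polled(1, 1, 5) is 1, so skipping bank 0 cannot silence an event recorded in bank 1.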
* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
2004-01-18 2:03 ` Dave Jones
@ 2004-01-21 12:13 ` Stephen Rothwell
From: Stephen Rothwell @ 2004-01-21 12:13 UTC
To: Dave Jones; +Cc: antispam, linux-kernel
On Sun, 18 Jan 2004 02:03:01 +0000 Dave Jones <davej@redhat.com> wrote:
>
> On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:
>
> > I get the following problem with 2.6.1 consistently after resuming from APM:
> > "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
> > incident occurred on CPU 0.
> >
> > Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
> > ksyrium kernel: Bank 1: f2000000000001c5"
>
> As it only happens when you resume from APM, I'm inclined to believe
> it's a BIOS bug. With the output of dmidecode, we could blacklist this
> box so that it skips the non-fatal checking.
My Thinkpad T22 produces a similar warning on resume using APM:
kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
kernel: Bank 1: f200000000000104
dmidecode output starts with:
# dmidecode 2.3
SMBIOS 2.3 present.
46 structures occupying 1585 bytes.
Table at 0x1FFF0000.
Handle 0x0000
DMI type 0, 20 bytes.
BIOS Information
Vendor: IBM
Version: 16ET31WW (1.11 )
Release Date: 03/20/2003
.
.
Handle 0x0001
DMI type 1, 25 bytes.
System Information
Manufacturer: IBM
Product Name: 26475EA
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
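The blacklisting idea raised earlier in the thread could look roughly like the following. This is a minimal user-space sketch, assuming a match on the DMI vendor and product-name strings shown in the dmidecode output above. The kernel has its own DMI scanning code, which this sketch neither uses nor imitates; mce_dmi_quirk, mce_nonfatal_blacklist and mce_nonfatal_blacklisted are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical blacklist entry: exact DMI vendor and product name. */
struct mce_dmi_quirk {
	const char *vendor;
	const char *product;
};

/* Machines on which the non-fatal poller would be skipped. */
static const struct mce_dmi_quirk mce_nonfatal_blacklist[] = {
	{ "IBM", "26475EA" }, /* strings taken from the dmidecode output above */
};

/* Return 1 if the given DMI strings match a blacklist entry, else 0. */
int mce_nonfatal_blacklisted(const char *vendor, const char *product)
{
	size_t i;
	size_t n = sizeof(mce_nonfatal_blacklist) /
		   sizeof(mce_nonfatal_blacklist[0]);

	for (i = 0; i < n; i++) {
		if (strcmp(vendor, mce_nonfatal_blacklist[i].vendor) == 0 &&
		    strcmp(product, mce_nonfatal_blacklist[i].product) == 0)
			return 1;
	}
	return 0;
}
```

An init routine could call mce_nonfatal_blacklisted() with the strings read from DMI and return early, without arming the timer, when it reports a match. Matching on the exact product name is a deliberate simplification; a real table might need to cover a whole model range.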