[2.6.1 MCE falseness?] Hardware reports non-fatal error

public inbox for linux-kernel@vger.kernel.org

* [2.6.1 MCE falseness?] Hardware reports non-fatal error
@ 2004-01-18  1:44 Niel Lambrechts
  2004-01-18  2:03 ` Dave Jones
  0 siblings, 1 reply; 7+ messages in thread
From: Niel Lambrechts @ 2004-01-18  1:44 UTC
  To: linux-kernel

I consistently get the following problem with 2.6.1 after resuming from APM:

"ksyrium kernel: MCE: The hardware reports a non fatal, correctable
incident occurred on CPU 0.

Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
ksyrium kernel: Bank 1: f2000000000001c5"

It does not happen on any other kernels I use (vanilla 2.4.24, SuSE 9
2.4.21-166) - even though CONFIG_X86_MCE=y for both. The equipment is
brand-new - an IBM Thinkpad R50P - and it passes all of IBM's software
diagnostics.

I'd appreciate help with the parsemce parameters needed to interpret the
problem; I'm not sure my usage is correct. ;)

# ./parsemce -b 1 -a 0 -e f2000000000001c5
Status: (f2000000000001c5) Machine Check in progress.
Restart IP valid.
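
For reference, the value from the log can also be split into its fields by
hand. The following is a minimal sketch, assuming the architectural Intel
layout of the per-bank IA32_MCi_STATUS register (flag bits 63..57, error codes
in the low 32 bits) and of the global IA32_MCG_STATUS register (bits 0..2).
The helper names `decode_mci_status` and `mcg_flags` and the struct are
hypothetical and are not part of parsemce or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits in the top byte of an IA32_MCi_STATUS (per-bank) value. */
#define MCI_STATUS_VAL   (1ULL << 63) /* register holds a valid error     */
#define MCI_STATUS_OVER  (1ULL << 62) /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61) /* error was not corrected          */
#define MCI_STATUS_EN    (1ULL << 60) /* reporting enabled for this error */
#define MCI_STATUS_MISCV (1ULL << 59) /* MCi_MISC register is valid       */
#define MCI_STATUS_ADDRV (1ULL << 58) /* MCi_ADDR register is valid       */
#define MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt        */

/* Flag bits in IA32_MCG_STATUS (the global status register). */
#define MCG_STATUS_RIPV  (1ULL << 0)  /* restart IP valid          */
#define MCG_STATUS_EIPV  (1ULL << 1)  /* error IP valid            */
#define MCG_STATUS_MCIP  (1ULL << 2)  /* machine check in progress */

/* Hypothetical helper type: the decoded fields of one bank status value. */
struct mci_fields {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t model_code; /* bits 31:16, model-specific error code */
	uint16_t mca_code;   /* bits 15:0, MCA error code */
};

/* Split a 64-bit bank status value into its flag bits and error codes. */
static struct mci_fields decode_mci_status(uint64_t status)
{
	struct mci_fields f = {
		.val   = (status & MCI_STATUS_VAL) != 0,
		.over  = (status & MCI_STATUS_OVER) != 0,
		.uc    = (status & MCI_STATUS_UC) != 0,
		.en    = (status & MCI_STATUS_EN) != 0,
		.miscv = (status & MCI_STATUS_MISCV) != 0,
		.addrv = (status & MCI_STATUS_ADDRV) != 0,
		.pcc   = (status & MCI_STATUS_PCC) != 0,
		.model_code = (uint16_t)(status >> 16),
		.mca_code   = (uint16_t)status,
	};
	return f;
}

/* Keep only the three MCG_STATUS flag bits of a value. */
static unsigned int mcg_flags(uint64_t value)
{
	return (unsigned int)(value & (MCG_STATUS_RIPV | MCG_STATUS_EIPV |
				       MCG_STATUS_MCIP));
}
```

Read as a bank status, f2000000000001c5 has VAL, OVER, UC, EN and PCC set and
an MCA error code of 0x01c5. Read as a global status, its low three bits are
RIPV and MCIP, which matches the two lines parsemce printed. That may mean the
value given with -e was decoded as the global status rather than as the bank
status, but this has not been checked against parsemce itself.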

Is this really hardware (maybe a bug in the BIOS?) or are false
positives possible with 2.6 MCE code?

-Niel


* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
  2004-01-18  1:44 Niel Lambrechts
@ 2004-01-18  2:03 ` Dave Jones
  2004-01-21 12:13   ` Stephen Rothwell
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Jones @ 2004-01-18  2:03 UTC
  To: Niel Lambrechts; +Cc: linux-kernel

On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:

 > I consistently get the following problem with 2.6.1 after resuming from APM:
 > "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
 > incident occurred on CPU 0.
 > 
 > Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
 > ksyrium kernel: Bank 1: f2000000000001c5"

As it only happens when you resume from APM, I'm inclined to believe
it's a BIOS bug.  With the output of dmidecode, we could blacklist this
box to not do the nonfatal checking.

 > It does not happen on any other kernels I use (vanilla 2.4.24, SuSE 9
 > 2.4.21-166) - even though CONFIG_X86_MCE=y for both. The equipment is
 > brand-new - an IBM Thinkpad R50P - and it passes all of IBM's software
 > diagnostics.

None of the other kernels you mention have this; it's a new feature of 2.6.

		Dave



* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
@ 2004-01-18 13:30 Pedro Larroy
  2004-01-18 14:23 ` glee
  0 siblings, 1 reply; 7+ messages in thread
From: Pedro Larroy @ 2004-01-18 13:30 UTC
  To: linux-kernel

I have also been getting apparently false MCEs since 2.5.xx.
I even had kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx
and 2.6.1, I just get them from time to time, but none are fatal, and
most of the time they are on CPU 0.

request_module: failed /sbin/modprobe -- char-major-6-0. error = 256
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: e606200000000833
request_module: failed /sbin/modprobe --

The box is a dual Athlon MP with the AMD 760MP chipset.

nebula:/home/piotr# ./parsemce b 1 -a 0 -e e606200000000833
Status: (e606200000000833) Error IP valid
Restart IP valid.
nebula:/home/piotr#

-- 
  Pedro Larroy Tovar  |  piotr%member.fsf.org 

Software patents are a threat to innovation in Europe please check: 
	http://www.eurolinux.org/     


* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
  2004-01-18 13:30 [2.6.1 MCE falseness?] Hardware reports non-fatal error Pedro Larroy
@ 2004-01-18 14:23 ` glee
  2004-01-18 20:04   ` Dave Jones
  0 siblings, 1 reply; 7+ messages in thread
From: glee @ 2004-01-18 14:23 UTC
  To: Pedro Larroy; +Cc: linux-kernel

On Sun, Jan 18, 2004 at 02:30:48PM +0100, Pedro Larroy wrote:
> I have also been getting apparently false MCEs since 2.5.xx.
> I even had kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx
> and 2.6.1, I just get them from time to time, but none are fatal, and
> most of the time they are on CPU 0.
> 


I get them too, so I applied this patch.


	- g.


[-- Attachment #2: mce-fix.patch --]
[-- Type: text/plain, Size: 1061 bytes --]

--- linux-2.6.0-test9/arch/i386/kernel/cpu/mcheck/non-fatal.c.orig	2003-11-02 13:31:43.000000000 +0800
+++ linux-2.6.0-test9/arch/i386/kernel/cpu/mcheck/non-fatal.c	2003-11-02 21:50:36.000000000 +0800
@@ -21,6 +21,7 @@
 
 static struct timer_list mce_timer;
 static int timerset;
+static int startbank;
 
 #define MCE_RATE	15*HZ	/* timer rate is 15s */
 
@@ -30,7 +31,7 @@
 	int i;
 
 	preempt_disable(); 
-	for (i=0; i<nr_mce_banks; i++) {
+	for (i=startbank; i<nr_mce_banks; i++) {
 		rdmsr (MSR_IA32_MC0_STATUS+i*4, low, high);
 
 		if (high & (1<<31)) {
@@ -68,6 +69,19 @@
 
 static int __init init_nonfatal_mce_checker(void)
 {
+	/*
+	   Certain Athlons would cause spurious MCE non-fatal
+	   exception check to be reported, if we poke at bank 0.
+	   Avoid this if the running machine is an Athlon.
+
+	   -- geoff.
+	*/
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD && 
+	    boot_cpu_data.x86 == 6)
+		startbank = 1;
+	else
+		startbank = 0;
+
 	if (timerset == 0) {
 		/* Set the timer to check for non-fatal
 		   errors every MCE_RATE seconds */
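
To make the effect of the patch concrete, here is a user-space sketch of the
polling loop it modifies. It is an illustration only: the rdmsr call is
replaced by an array lookup, and `scan_banks` and `mc_status_msr` are
hypothetical names. The MSR number 0x401 for bank 0 and the stride of 4 per
bank follow the `MSR_IA32_MC0_STATUS+i*4` expression in the patch context.

```c
#include <stdint.h>

#define MSR_IA32_MC0_STATUS 0x401u /* bank 0 status; banks are 4 MSRs apart */

/* MSR number of the status register for a given bank. */
static uint32_t mc_status_msr(unsigned int bank)
{
	return MSR_IA32_MC0_STATUS + bank * 4;
}

/*
 * Hypothetical stand-in for the kernel loop: status[] plays the role of
 * the values rdmsr would return for each bank. Returns how many banks in
 * [startbank, nr_banks) have the VAL bit set, i.e. would be reported.
 */
static unsigned int scan_banks(const uint64_t *status, unsigned int nr_banks,
			       unsigned int startbank)
{
	unsigned int reported = 0;

	for (unsigned int i = startbank; i < nr_banks; i++) {
		uint32_t high = (uint32_t)(status[i] >> 32);

		/* Bit 31 of the high word is bit 63 (VAL) of the full value. */
		if (high & (1u << 31))
			reported++;
	}
	return reported;
}
```

With the Bank 0 value reported earlier in the thread (e606200000000833) in
slot 0 and zeros elsewhere, `scan_banks(status, 4, 0)` counts one report and
`scan_banks(status, 4, 1)` counts none, which is the behaviour the patch aims
for on Athlons. The sketch uses `1u << 31` rather than `1<<31` to avoid
shifting into the sign bit of a signed int.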


* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
  2004-01-18 14:23 ` glee
@ 2004-01-18 20:04   ` Dave Jones
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Jones @ 2004-01-18 20:04 UTC
  To: glee; +Cc: Pedro Larroy, linux-kernel

On Sun, Jan 18, 2004 at 10:23:38PM +0800, glee@gnupilgrims.org wrote:
 > 
 > I get them too, so I applied this patch.

Gah, that still didn't get applied?

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk


* RE: [2.6.1 MCE falseness?] Hardware reports non-fatal error
@ 2004-01-20 19:43 Niel Lambrechts
  0 siblings, 0 replies; 7+ messages in thread
From: Niel Lambrechts @ 2004-01-20 19:43 UTC
  To: linux-kernel

I tried the mentioned patch, with a modification for my CPU type, but
still get the problem:

"Jan 20 21:30:23 ksyrium kernel: MCE: The hardware reports a non fatal,
correctable incident occurred on CPU 0.
Jan 20 21:30:23 ksyrium kernel: MCE: startbank = 1, vendor : 0, x86 = 6,
model = 9, mask = 5.
Jan 20 21:30:23 ksyrium kernel: Bank 1: f200000000000185"

As you can see, I added a little extra debugging info. Here is the
relevant portion of the code:

	if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
	     boot_cpu_data.x86 == 6) ||
	    (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	     boot_cpu_data.x86 == 6 &&
	     boot_cpu_data.x86_model == 9 &&
	     boot_cpu_data.x86_mask == 5))
		startbank = 1;

Comments would be appreciated.
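
One observation on the log above: the report is for Bank 1, and the debug
line shows startbank = 1. The loop in the patch only skips banks below
startbank, so a change that sets startbank to 1 would not be expected to
suppress a Bank 1 report. The sketch below restates the modified condition in
user space to show this. The struct, the enum and both function names are
hypothetical, and the vendor values are local stand-ins chosen to agree with
the "vendor : 0" line in the log, not taken from the kernel headers.

```c
#include <stdbool.h>

/* Local stand-ins for the kernel's vendor constants (values are assumptions). */
enum cpu_vendor { VENDOR_INTEL = 0, VENDOR_AMD = 2 };

/* Hypothetical summary of the boot_cpu_data fields used in the condition. */
struct cpu_id {
	enum cpu_vendor vendor;
	int family;   /* boot_cpu_data.x86       */
	int model;    /* boot_cpu_data.x86_model */
	int stepping; /* boot_cpu_data.x86_mask  */
};

/* Mirrors the modified condition: which bank does polling start at? */
static int pick_startbank(struct cpu_id c)
{
	bool athlon = c.vendor == VENDOR_AMD && c.family == 6;
	bool this_intel = c.vendor == VENDOR_INTEL && c.family == 6 &&
			  c.model == 9 && c.stepping == 5;

	return (athlon || this_intel) ? 1 : 0;
}

/* A bank is still polled if it is at or above the starting bank. */
static bool bank_is_polled(int bank, int startbank)
{
	return bank >= startbank;
}
```

For the CPU in the log (vendor 0, family 6, model 9, mask 5),
`pick_startbank` returns 1 and `bank_is_polled(1, 1)` is still true, so the
Bank 1 status keeps being read and reported. Suppressing it this way would
need the loop to skip bank 1 as well, which is a larger change than the
original patch intended.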

-Niel





* Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error
  2004-01-18  2:03 ` Dave Jones
@ 2004-01-21 12:13   ` Stephen Rothwell
  0 siblings, 0 replies; 7+ messages in thread
From: Stephen Rothwell @ 2004-01-21 12:13 UTC
  To: Dave Jones; +Cc: antispam, linux-kernel

On Sun, 18 Jan 2004 02:03:01 +0000 Dave Jones <davej@redhat.com> wrote:
>
> On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:
> 
>  > I consistently get the following problem with 2.6.1 after resuming from APM:
>  > "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
>  > incident occurred on CPU 0.
>  > 
>  > Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
>  > ksyrium kernel: Bank 1: f2000000000001c5"
> 
> As it only happens when you resume from APM, I'm inclined to believe
> it's a BIOS bug.  With the output of dmidecode, we could blacklist this
> box to not do the nonfatal checking.

My Thinkpad T22 produces a similar warning on resume using APM:

kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
kernel: Bank 1: f200000000000104

dmidecode output starts with:

# dmidecode 2.3
SMBIOS 2.3 present.
46 structures occupying 1585 bytes.
Table at 0x1FFF0000.
Handle 0x0000
        DMI type 0, 20 bytes.
        BIOS Information
                Vendor: IBM
                Version: 16ET31WW (1.11 )
                Release Date: 03/20/2003
	.
	.
Handle 0x0001
        DMI type 1, 25 bytes.
        System Information
                Manufacturer: IBM
                Product Name: 26475EA
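
As a rough illustration of the blacklisting idea, a table keyed on the two
DMI strings shown above could look like the sketch below. This is plain
user-space C, not the kernel's DMI code: `struct dmi_ident`, the table name
and `nonfatal_mce_blacklisted` are all hypothetical, and the single entry is
simply the manufacturer and product name from this dmidecode output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identity record built from two dmidecode fields. */
struct dmi_ident {
	const char *manufacturer; /* System Information: Manufacturer */
	const char *product;      /* System Information: Product Name */
};

/* Machines on which the non-fatal check would be skipped (illustrative). */
static const struct dmi_ident nonfatal_mce_blacklist[] = {
	{ "IBM", "26475EA" }, /* the Thinkpad T22 reported in this message */
};

/* Returns true if both strings exactly match a blacklist entry. */
static bool nonfatal_mce_blacklisted(const char *manufacturer,
				     const char *product)
{
	size_t n = sizeof(nonfatal_mce_blacklist) /
		   sizeof(nonfatal_mce_blacklist[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(manufacturer, nonfatal_mce_blacklist[i].manufacturer) == 0 &&
		    strcmp(product, nonfatal_mce_blacklist[i].product) == 0)
			return true;
	}
	return false;
}
```

An init routine could consult such a table before arming the polling timer.
Exact string matching keeps the entry narrow, at the cost of needing one entry
per affected product name.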

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
