2.6.0-test4 and hardware reports a non fatal incident

public inbox for linux-kernel@vger.kernel.org

* 2.6.0-test4 and hardware reports a non fatal incident
@ 2003-08-28 13:48 Tomasz Czaus
  2003-08-28 15:46 ` Randy.Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Czaus @ 2003-08-28 13:48 UTC (permalink / raw)
  To: linux-kernel

Hello,

when my system is booting I can see such a message:

kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
kernel: Bank 0: e664000000000185

What does it mean? My 2.6.0-test4 kernel has the "Nick's scheduler 
policy v8" patch applied.

When I boot 2.4.x kernel I can't see this message.

Thanks,
Tomasz Czaus


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-28 13:48 2.6.0-test4 and hardware reports a non fatal incident Tomasz Czaus
@ 2003-08-28 15:46 ` Randy.Dunlap
  2003-08-28 17:28   ` Matt Gibson
  2003-08-28 19:02   ` Matt Gibson
  0 siblings, 2 replies; 11+ messages in thread
From: Randy.Dunlap @ 2003-08-28 15:46 UTC (permalink / raw)
  To: Tomasz Czaus; +Cc: linux-kernel

On Thu, 28 Aug 2003 15:48:44 +0200 Tomasz Czaus <tomasz_czaus@go2.pl> wrote:

| Hello,
| 
| when my system is booting I can see such a message:
| 
| kernel: MCE: The hardware reports a non fatal, correctable incident occurred 
| on CPU 0.
| kernel: Bank 0: e664000000000185
| 
| What does it mean? My 2.6.0-test4 kernel has the "Nick's scheduler 
| policy v8" patch applied.

Use "parsemce" from here:
  http://www.codemonkey.org.uk/projects/parsemce/
to decode it.

| When I boot 2.4.x kernel I can't see this message.

So 2.6 has more/better/different processor error checking.

--
~Randy


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-28 15:46 ` Randy.Dunlap
@ 2003-08-28 17:28   ` Matt Gibson
  2003-08-28 19:02   ` Matt Gibson
  1 sibling, 0 replies; 11+ messages in thread
From: Matt Gibson @ 2003-08-28 17:28 UTC (permalink / raw)
  To: linux-kernel

On Thursday 28 Aug 2003 16:46, Randy.Dunlap wrote:
> On Thu, 28 Aug 2003 15:48:44 +0200 Tomasz Czaus <tomasz_czaus@go2.pl> wrote:
> | Hello,
> |
> | when my system is booting I can see such a message:
> |
> | kernel: MCE: The hardware reports a non fatal, correctable incident
> | occurred on CPU 0.
> | kernel: Bank 0: e664000000000185

Yeah, I get one of those on boot, too.  Or at least I did.  I was going to 
turn the processor checking stuff back on to see if it happened 
consistently.  What processor is it, Tomasz?  Mine's an Athlon.  Output of 
"cat /proc/cpuinfo" at the end, if anyone's remotely interested...

> Use "parsemce" from here:
>   http://www.codemonkey.org.uk/projects/parsemce/
> to decode it.
>
> So 2.6 has more/better/different processor error checking.

Thanks for the link, Randy, I'll give it a go tonight.  Although with my 
knowledge of current processor architecture, I'm guessing it'll parse it 
from one format I don't have a clue about into a more verbose format I don't 
have a clue about ;-)

Cheers,

M

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 4
cpu MHz         : 1195.130
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2367.48


-- 
"It's the small gaps between the rain that count,
 and learning how to live amongst them."
	      -- Jeff Noon


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-28 15:46 ` Randy.Dunlap
  2003-08-28 17:28   ` Matt Gibson
@ 2003-08-28 19:02   ` Matt Gibson
  2003-08-28 22:17     ` Randy.Dunlap
  1 sibling, 1 reply; 11+ messages in thread
From: Matt Gibson @ 2003-08-28 19:02 UTC (permalink / raw)
  To: linux-kernel

On Thursday 28 Aug 2003 16:46, Randy.Dunlap wrote:
> Use "parsemce" from here:
>   http://www.codemonkey.org.uk/projects/parsemce/
> to decode it.

Hi Randy,

The format seems to have changed rather a lot since that was written.  All I 
get is:

Aug 17 11:25:13 codewave kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Aug 17 11:25:13 codewave kernel: Bank 0: dc0000000000050b

...but what parsemce seems to be expecting is:

 Sample kernel output..
Sep  4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception: 0000000000000004
Sep  4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152
Sep  4 21:43:41 hamlet kernel: Bank 2: d40040000000017a at 540040000000017a
Sep  4 21:43:41 hamlet kernel: Kernel panic: CPU context corrupt

As a result, I'm still no more enlightened.  I can't quite figure out from 
reading the parser what values to put where, as it seems to expect a few 
more than I have.  Any tips?

Ta,

Matt

-- 
"It's the small gaps between the rain that count,
 and learning how to live amongst them."
	      -- Jeff Noon
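
The two log formats quoted above differ mainly in whether a "Bank" line carries a trailing "at <address>" field. Here is a minimal sketch, in user-space C, of pulling the bank number and status word out of either form. The struct and function names are hypothetical, and reading the value after "at" as the bank's address register is an assumption based on the sample output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed "Bank N: <status> [at <addr>]" line (hypothetical type). */
struct mce_log_entry {
	int bank;                   /* bank number printed by the kernel */
	unsigned long long status;  /* 64-bit status word, printed in hex */
	unsigned long long addr;    /* value after "at", or 0 when absent */
	int has_addr;               /* 1 if the line carried an "at" field */
};

/* Parse one syslog line. Returns 1 on success, 0 if the line has no
 * recognisable "Bank N: <hex>" part. */
int mce_parse_bank_line(const char *line, struct mce_log_entry *out)
{
	const char *p;
	int n;

	assert(line != NULL && out != NULL);

	/* Skip the timestamp/host prefix by searching for the keyword. */
	p = strstr(line, "Bank ");
	if (p == NULL)
		return 0;

	out->addr = 0;
	n = sscanf(p, "Bank %d: %llx at %llx",
		   &out->bank, &out->status, &out->addr);
	if (n < 2)
		return 0;

	/* Only the older format supplies the third field. */
	out->has_addr = (n == 3);
	if (!out->has_addr)
		out->addr = 0;
	return 1;
}
```

For the newer single-bank report the address simply stays 0, so the same parsed record can be handed to a decoder in both cases.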


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-28 19:02   ` Matt Gibson
@ 2003-08-28 22:17     ` Randy.Dunlap
  2003-08-30 10:49       ` Matt Gibson
  2003-08-30 13:10       ` Dave Jones
  0 siblings, 2 replies; 11+ messages in thread
From: Randy.Dunlap @ 2003-08-28 22:17 UTC (permalink / raw)
  To: Matt Gibson; +Cc: linux-kernel

On Thu, 28 Aug 2003 20:02:00 +0100 Matt Gibson <gothick@gothick.org.uk> wrote:

| On Thursday 28 Aug 2003 16:46, Randy.Dunlap wrote:
| > Use "parsemce" from here:
| >   http://www.codemonkey.org.uk/projects/parsemce/
| > to decode it.
| 
| Hi Randy,

| I'm guessing it'll parse it 
| from one format I don't have a clue about into a more verbose format I don't 
| have a clue about ;-)

That was insightful.  :(

| The format seems to have changed rather a lot since that was written.  All I 
| get is:
| 
| Aug 17 11:25:13 codewave kernel: MCE: The hardware reports a non fatal, 
| correctable incident occurred on CPU 0.
| Aug 17 11:25:13 codewave kernel: Bank 0: dc0000000000050b
| 
| ...but what parsemce seems to be expecting is:
| 
|  Sample kernel output..
|  Sep  4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception: 
| 0000000000000004
| Sep  4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152
| Sep  4 21:43:41 hamlet kernel: Bank 2: d40040000000017a at 540040000000017a
| Sep  4 21:43:41 hamlet kernel: Kernel panic: CPU context corrupt
| 
| As a result, I'm still no more enlightened.  I can't quite figure out from 
| reading the parser what values to put where, as it seems to expect a few 
| more than I have.  Any tips?

Yes, the kernel has decided that your processor only has 1 Bank of
MCE register data to report.  I don't know how/why.  Sorry.

--
~Randy


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-28 22:17     ` Randy.Dunlap
@ 2003-08-30 10:49       ` Matt Gibson
  2003-08-30 12:44         ` Matt Gibson
  2003-08-30 13:10       ` Dave Jones
  1 sibling, 1 reply; 11+ messages in thread
From: Matt Gibson @ 2003-08-30 10:49 UTC (permalink / raw)
  To: Randy.Dunlap, linux-kernel

On Thursday 28 Aug 2003 23:17, Randy.Dunlap wrote:
> Yes, the kernel has decided that your processor only has 1 Bank of
> MCE register data to report.  I don't know how/why.  Sorry.

Could it be something to do with this (in arch/i386/kernel/cpu/mcheck/k7.c)?

	if (l & (1<<8))	/* Control register present ? */
		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
	nr_mce_banks = l & 0xff;

	for (i=1; i<nr_mce_banks; i++) {

Check out the "for".  Or am I reading this wrong?

M

-- 
"It's the small gaps between the rain that count,
 and learning how to live amongst them."
	      -- Jeff Noon


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-30 10:49       ` Matt Gibson
@ 2003-08-30 12:44         ` Matt Gibson
  2003-08-30 13:35           ` Dave Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Gibson @ 2003-08-30 12:44 UTC (permalink / raw)
  To: linux-kernel

On Saturday 30 Aug 2003 11:49, Matt Gibson wrote:
> On Thursday 28 Aug 2003 23:17, Randy.Dunlap wrote:
> > Yes, the kernel has decided that your processor only has 1 Bank of
> > MCE register data to report.  I don't know how/why.  Sorry.
>
> Could it be something to do with this (in
> arch/i386/kernel/cpu/mcheck/k7.c)?
>
> 	if (l & (1<<8))	/* Control register present ? */
> 		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
> 	nr_mce_banks = l & 0xff;
>
> 	for (i=1; i<nr_mce_banks; i++) {
>
> Check out the "for".  Or am I reading this wrong?

Having checked back, this was changed between test-2 and test-3.  The 
checking code in k7_machine_check() still loops from 0 rather than 1.  I 
think this may be leading to false reporting of problems, which may be why I 
and Tomasz are seeing these MCE messages on our Athlons.

Anyone who knows more about this stuff care to comment?  Is someone looking 
after MCE at the moment?  I couldn't find out much info on it.

Thanks,

Matt

-- 
"It's the small gaps between the rain that count,
 and learning how to live amongst them."
	      -- Jeff Noon
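
The mismatch described in this message (a setup loop starting at bank 1, a checking loop starting at bank 0) can be illustrated outside the kernel. The sketch below is not the k7.c code. It only mirrors the two quoted lines (`l & (1<<8)` and `l & 0xff`) and lists which banks each loop would visit. The helper names are hypothetical, the MSR numbers are the architectural ones (bank 0 control at 0x400, status at 0x401, four registers per bank), and the capability value used in the note below is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MSR numbers for bank 0; each further bank is 4 higher. */
#define MSR_IA32_MC0_CTL    0x400u
#define MSR_IA32_MC0_STATUS 0x401u

/* Low byte of the capability word: number of machine-check banks. */
unsigned mcg_cap_bank_count(uint32_t cap_low)
{
	return cap_low & 0xffu;
}

/* Bit 8 of the capability word: global control register present. */
int mcg_cap_has_ctl(uint32_t cap_low)
{
	return (int)((cap_low >> 8) & 1u);
}

uint32_t mce_bank_ctl_msr(unsigned bank)
{
	return MSR_IA32_MC0_CTL + 4u * bank;
}

uint32_t mce_bank_status_msr(unsigned bank)
{
	return MSR_IA32_MC0_STATUS + 4u * bank;
}

/* Record the banks a loop "for (i = first_bank; i < nr_banks; i++)"
 * would visit. Returns how many entries were written to banks[]. */
size_t mce_list_banks(unsigned first_bank, unsigned nr_banks,
		      unsigned *banks, size_t max)
{
	size_t n = 0;
	unsigned i;

	assert(banks != NULL);
	for (i = first_bank; i < nr_banks && n < max; i++)
		banks[n++] = i;
	return n;
}
```

With a made-up capability word of 0x104 (control register present, four banks), a loop from 1 visits banks 1 to 3 while a loop from 0 also visits bank 0, so bank 0 would be scanned by the checker without having been touched by the setup loop.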


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-28 22:17     ` Randy.Dunlap
  2003-08-30 10:49       ` Matt Gibson
@ 2003-08-30 13:10       ` Dave Jones
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Jones @ 2003-08-30 13:10 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Matt Gibson, linux-kernel

On Thu, Aug 28, 2003 at 03:17:08PM -0700, Randy.Dunlap wrote:
 > | As a result, I'm still no more enlightened.  I can't quite figure out from 
 > | reading the parser what values to put where, as it seems to expect a few 
 > | more than I have.  Any tips?
 > 
 > Yes, the kernel has decided that your processor only has 1 Bank of
 > MCE register data to report.  I don't know how/why.  Sorry.

The non-fatal checker dumps the single bank that is reporting failures.
parsemce should still have enough info there to decode into something
useful however.  (just use 0 for the address).

	Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk
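
For anyone who wants to read a "Bank N:" value by hand rather than through parsemce, here is a minimal sketch of splitting the 64-bit status word into its main fields. It is not parsemce's code. The bit positions are the architectural ones described in the Intel manuals, and applying them unchanged to an Athlon is an assumption. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded view of one machine-check bank status word.
 * Bit positions follow the architectural layout in the Intel manuals;
 * treating an Athlon's word the same way is an assumption. */
struct mce_status {
	int valid;            /* bit 63: the bank holds a logged error */
	int overflow;         /* bit 62: another error hit before this one was cleared */
	int uncorrected;      /* bit 61: hardware did not correct the error */
	int enabled;          /* bit 60: reporting was enabled for this error */
	int misc_valid;       /* bit 59: the MISC register holds extra data */
	int addr_valid;       /* bit 58: the ADDR register holds an address */
	int context_corrupt;  /* bit 57: processor context may be corrupt */
	uint16_t model_code;  /* bits 31..16: model-specific error code */
	uint16_t mca_code;    /* bits 15..0: architectural error code */
};

static int status_bit(uint64_t status, unsigned bit)
{
	return (int)((status >> bit) & 1u);
}

void mce_decode_status(uint64_t status, struct mce_status *out)
{
	assert(out != NULL);

	out->valid           = status_bit(status, 63);
	out->overflow        = status_bit(status, 62);
	out->uncorrected     = status_bit(status, 61);
	out->enabled         = status_bit(status, 60);
	out->misc_valid      = status_bit(status, 59);
	out->addr_valid      = status_bit(status, 58);
	out->context_corrupt = status_bit(status, 57);
	out->model_code      = (uint16_t)((status >> 16) & 0xffffu);
	out->mca_code        = (uint16_t)(status & 0xffffu);
}

/* Print the decoded fields in a form similar to the kernel's log line. */
void mce_print_status(unsigned bank, uint64_t status)
{
	struct mce_status s;

	mce_decode_status(status, &s);
	printf("Bank %u: %016" PRIx64 "\n", bank, status);
	printf("  valid=%d overflow=%d uncorrected=%d enabled=%d\n",
	       s.valid, s.overflow, s.uncorrected, s.enabled);
	printf("  misc_valid=%d addr_valid=%d context_corrupt=%d\n",
	       s.misc_valid, s.addr_valid, s.context_corrupt);
	printf("  model_code=0x%04x mca_code=0x%04x\n",
	       (unsigned)s.model_code, (unsigned)s.mca_code);
}
```

Calling `mce_print_status(0, 0xe664000000000185ULL)` or `mce_print_status(0, 0xdc0000000000050bULL)` decodes the two values reported in this thread. What the error codes mean for a particular CPU still has to be looked up in the vendor documentation.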


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-30 12:44         ` Matt Gibson
@ 2003-08-30 13:35           ` Dave Jones
  2003-08-30 13:48             ` Matt Gibson
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Jones @ 2003-08-30 13:35 UTC (permalink / raw)
  To: Matt Gibson; +Cc: linux-kernel

On Sat, Aug 30, 2003 at 01:44:56PM +0100, Matt Gibson wrote:
 > > 	for (i=1; i<nr_mce_banks; i++) {
 > >
 > > Check out the "for".  Or am I reading this wrong?
 > 
 > Having checked back, this was changed between test-2 and test-3.  The 
 > checking code in k7_machine_check() still loops from 0 rather than 1.  I 
 > think this may be leading to false reporting of problems, which may be why I 
 > and Tomasz are seeing these MCE messages on our Athlons.

When it was i=0 people were seeing false positives. Starting from 1
reduces that.

 > Anyone who knows more about this stuff care to comment?  Is someone looking 
 > after MCE at the moment?  I couldn't find out much info on it.

in the past, Alan and myself took care of i386, Andi Kleen did AMD64.

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-30 13:35           ` Dave Jones
@ 2003-08-30 13:48             ` Matt Gibson
  2003-08-30 13:51               ` Dave Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Gibson @ 2003-08-30 13:48 UTC (permalink / raw)
  To: Dave Jones, linux-kernel

On Saturday 30 Aug 2003 14:35, you wrote:
> When it was i=0 people were seeing false positives. Starting from 1
> reduces that.

Cool.  Can you point me towards any background-reading on MCE?  This's got me 
interested.

Rather ironically, since I changed my kernel back to starting from 0, I 
haven't seen any errors.  Having said that, I was only getting a couple each 
day anyway, so I'll leave it a few days and see what develops.  I think it's 
happening only once on boot, every now and again, but I've not had time to 
analyse the logs properly yet.  Maybe there's a problem when my machine's 
cold...

> in the past, Alan and myself took care of i386, Andi Kleen did AMD64.

Thanks for responding; it was fairly clear to me that I was out of my depth, 
and it's nice to know that there's someone out there that isn't *grin*

Cheers,

Matt

-- 
"It's the small gaps between the rain that count,
 and learning how to live amongst them."
	      -- Jeff Noon


* Re: 2.6.0-test4 and hardware reports a non fatal incident
  2003-08-30 13:48             ` Matt Gibson
@ 2003-08-30 13:51               ` Dave Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Jones @ 2003-08-30 13:51 UTC (permalink / raw)
  To: Matt Gibson; +Cc: linux-kernel

On Sat, Aug 30, 2003 at 02:48:30PM +0100, Matt Gibson wrote:
 > On Saturday 30 Aug 2003 14:35, you wrote:
 > > When it was i=0 people were seeing false positives. Starting from 1
 > > reduces that.
 > Cool.  Can you point me towards any background-reading on MCE?  This's got me 
 > interested.

Not sure if any of the public AMD docs have info on the MCE registers,
but the material in the Intel system architecture manuals on
developer.intel.com is largely relevant.

 > Rather ironically, since I changed my kernel back to starting from 0, I 
 > haven't seen any errors.

Coincidence. Seeing fewer errors after enabling more error checking
doesn't really make sense.

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk
