public inbox for linux-kernel@vger.kernel.org
* [x86_64] how worried should I be about MCEs?
@ 2005-04-30 20:16 Andy Lutomirski
  2005-04-30 20:37 ` bert hubert
  2005-05-02 16:49 ` [x86_64] how worried should I be about MCEs? Andi Kleen
  0 siblings, 2 replies; 5+ messages in thread
From: Andy Lutomirski @ 2005-04-30 20:16 UTC
  To: linux-kernel

Every now and then, after rebooting, the kernel notices some MCEs. 
Should I be worried about this?

(mcelog attached)

Thanks,
Andy


MCE 0
CPU 0 0 data cache from boot or resume
ADDR 480b0c84df48
   Data cache ECC error (syndrome c8)
        bit46 = corrected ecc error
        bit57 = processor context corrupt
        bit61 = error uncorrected
        bit62 = error overflow (multiple errors)
STATUS f66440000000438d MCGSTATUS 0
MCE 1
CPU 0 1 instruction cache from boot or resume
ADDR 75e2bb87ec57f8e0
   Instruction cache ECC error
        bit32 = err cpu0
        bit33 = err cpu1
        bit35 = res3
        bit43 = res11
        bit45 = uncorrected ecc error
        bit46 = corrected ecc error
        bit55 = res23
        bit56 = res24
        bit57 = processor context corrupt
        bit59 = misc error valid
        bit61 = error uncorrected
        bit62 = error overflow (multiple errors)
STATUS ffe4681bd0e45d81 MCGSTATUS 0
MCE 2
CPU 0 3 load/store unit from boot or resume
MISC 8005003b8005003b
        bit57 = processor context corrupt
        bit59 = misc error valid
        bit61 = error uncorrected
        bit62 = error overflow (multiple errors)
STATUS fa0000000000d0c5 MCGSTATUS 0
MCE 3
CPU 0 4 northbridge from boot or resume
ADDR 102000020
   Northbridge ECC error
   ECC syndrome = 0
        bit32 = err cpu0
        bit33 = err cpu1
        bit40 = error found by scrub
        bit45 = uncorrected ecc error
        bit57 = processor context corrupt
        bit61 = error uncorrected
STATUS b600215300001e0f MCGSTATUS 0


* Re: [x86_64] how worried should I be about MCEs?
  2005-04-30 20:16 [x86_64] how worried should I be about MCEs? Andy Lutomirski
@ 2005-04-30 20:37 ` bert hubert
  2005-04-30 21:02   ` Andy Lutomirski
  2005-05-02 16:49 ` [x86_64] how worried should I be about MCEs? Andi Kleen
  1 sibling, 1 reply; 5+ messages in thread
From: bert hubert @ 2005-04-30 20:37 UTC
  To: Andy Lutomirski; +Cc: linux-kernel

On Sat, Apr 30, 2005 at 01:16:49PM -0700, Andy Lutomirski wrote:
> Every now and then, after rebooting, the kernel notices some MCEs. 
> Should I be worried about this?

If these reports are true, they would be worrying. But I find them a bit
hard to believe - the bit combinations don't appear to make sense.

I have an AMD64 machine which logs 'MCE reported' every once in a while but
otherwise functions perfectly and I haven't yet coaxed it into telling me
the content of the errors.

Might there be a bug here? How did you create this log?

> STATUS f66440000000438d MCGSTATUS 0
> MCE 1
> CPU 0 1 instruction cache from boot or resume
> ADDR 75e2bb87ec57f8e0
>   Instruction cache ECC error
>        bit32 = err cpu0
>        bit33 = err cpu1
>        bit35 = res3
>        bit43 = res11
>        bit45 = uncorrected ecc error
>        bit46 = corrected ecc error
>        bit55 = res23
>        bit56 = res24
>        bit57 = processor context corrupt
>        bit59 = misc error valid
>        bit61 = error uncorrected
>        bit62 = error overflow (multiple errors)

This would be one hell of an error - both corrected and uncorrected.
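A minimal sketch of how the flag bits in a STATUS word can be checked for this contradiction. The bit positions and names are copied from the mcelog output in this thread and cover only a subset of what mcelog prints; the names `decode_status` and `is_contradictory` are made up for this illustration and are not part of mcelog or the kernel.

```python
# Bit positions and names exactly as printed by mcelog in this thread.
# This is only a subset; bank-specific bits (e.g. bit40) are left out.
STATUS_BITS = {
    45: "uncorrected ecc error",
    46: "corrected ecc error",
    57: "processor context corrupt",
    59: "misc error valid",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def decode_status(status):
    """Return the names of the known flag bits set in a 64-bit STATUS word."""
    return [name for bit, name in sorted(STATUS_BITS.items())
            if status & (1 << bit)]


def is_contradictory(status):
    """True if the word claims the error was both corrected and uncorrected."""
    corrected = bool(status & (1 << 46))
    uncorrected = bool(status & ((1 << 45) | (1 << 61)))
    return corrected and uncorrected


if __name__ == "__main__":
    # The four STATUS values from the original mcelog dump.
    for raw in ("f66440000000438d", "ffe4681bd0e45d81",
                "fa0000000000d0c5", "b600215300001e0f"):
        status = int(raw, 16)
        print(raw, is_contradictory(status), decode_status(status))
```

By this check, the first two STATUS words in the dump are flagged as contradictory and the last two are not, so a consistency test alone would not explain every record.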

Regards,

bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services


* Re: [x86_64] how worried should I be about MCEs?
  2005-04-30 20:37 ` bert hubert
@ 2005-04-30 21:02   ` Andy Lutomirski
  2005-04-30 21:41     ` possibly bogus AMD64 MCE reporting bert hubert
  0 siblings, 1 reply; 5+ messages in thread
From: Andy Lutomirski @ 2005-04-30 21:02 UTC
  To: bert hubert; +Cc: Andy Lutomirski, linux-kernel

bert hubert wrote:
> On Sat, Apr 30, 2005 at 01:16:49PM -0700, Andy Lutomirski wrote:
> 
>>Every now and then, after rebooting, the kernel notices some MCEs. 
>>Should I be worried about this?
> 
> 
> If these reports are true, they would be worrying. But I find them a bit
> hard to believe - the bit combinations don't appear to make sense.

True.

> 
> I have an AMD64 machine which logs 'MCE reported' every once in a while but
> otherwise functions perfectly and I haven't yet coaxed it into telling me
> the content of the errors.
> 
> Might there be a bug here? How did you create this log?

This is from mcelog 0.3, dumped with a daily cron job to 
/var/log/mcelog.  I think it came from 2.6.11-gentoo-r6 (which should be 
essentially 2.6.11.7).

The machine is Athlon 64 3200+ (754), on an MSI K8T Neo-FIS2R, running a 
moderately old BIOS but one that has erratum #93 (or whatever it was) fixed.

Anything I should attach to provide more info?

I just upgraded to mcelog-0.4, but at this rate I don't expect a new 
dump for a while.

Thanks,
Andy


* possibly bogus AMD64 MCE reporting.
  2005-04-30 21:02   ` Andy Lutomirski
@ 2005-04-30 21:41     ` bert hubert
  0 siblings, 0 replies; 5+ messages in thread
From: bert hubert @ 2005-04-30 21:41 UTC
  To: Andy Lutomirski; +Cc: linux-kernel

On Sat, Apr 30, 2005 at 02:02:04PM -0700, Andy Lutomirski wrote:

> Anything I should attach to provide more info?
> 
> I just upgraded to mcelog-0.4, but at this rate I don't expect a new 
> dump for a while.

I'll investigate the MCE reports from my Opteron machine in the coming days
and report back if they are bogus as well.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services


* Re: [x86_64] how worried should I be about MCEs?
  2005-04-30 20:16 [x86_64] how worried should I be about MCEs? Andy Lutomirski
  2005-04-30 20:37 ` bert hubert
@ 2005-05-02 16:49 ` Andi Kleen
  1 sibling, 0 replies; 5+ messages in thread
From: Andi Kleen @ 2005-05-02 16:49 UTC
  To: Andy Lutomirski; +Cc: linux-kernel

Andy Lutomirski <luto@myrealbox.com> writes:

> Every now and then, after rebooting, the kernel notices some
> MCEs. Should I be worried about this?


>
> (mcelog attached)
>
> Thanks,
> Andy
>
>
> MCE 0
> CPU 0 0 data cache from boot or resume
> ADDR 480b0c84df48
>    Data cache ECC error (syndrome c8)

These are harmless. I have one machine that generates them too.
I think they happen because the BIOS either does something
incorrectly while booting and POSTing the CPU, or they are
expected and the BIOS forgets to clear them. Only a few BIOSes
seem to do it, so it is probably a BIOS bug.

You see them because the MCE code logs boot MCEs now.
That is because the only way to log MCEs that cause the
system to reboot is to log them after the reboot.

Some of the bit combinations are clearly nonsensical, like
corrected ECC error together with error uncorrected, and the
address is bogus.

I have been pondering adding a filter to remove
these bogus MCEs, but I have not come up with
a good heuristic yet. One option would be to ignore all MCEs
at resume with addresses that are beyond the physical memory,
but that would not have caught the last one.
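The address heuristic could be sketched roughly as follows. Everything here is illustrative: `filter_boot_mces`, the record layout, and the `PHYS_LIMIT` value are hypothetical, and the thread does not state how much memory the reporting machine has. This is not kernel code.

```python
def filter_boot_mces(records, phys_limit):
    """Keep records whose address is absent or lies below phys_limit.

    records: list of (label, addr) tuples; addr is None when the record
    carried no ADDR line. phys_limit: assumed top of physical memory.
    """
    kept = []
    for label, addr in records:
        if addr is not None and addr >= phys_limit:
            continue  # address beyond physical memory: treat as bogus
        kept.append((label, addr))
    return kept


# ADDR values from the mcelog dump in this thread (MCE 2 has no ADDR line).
RECORDS = [
    ("MCE 0", 0x480b0c84df48),
    ("MCE 1", 0x75e2bb87ec57f8e0),
    ("MCE 2", None),
    ("MCE 3", 0x102000020),
]

# Hypothetical limit of 8 GiB, chosen only for illustration.
PHYS_LIMIT = 8 << 30

if __name__ == "__main__":
    for label, addr in filter_boot_mces(RECORDS, PHYS_LIMIT):
        print(label, "kept", hex(addr) if addr is not None else "no ADDR")
```

With this assumed limit, MCE 0 and MCE 1 are dropped, while MCE 3 (ADDR 102000020) survives because its address falls below the limit. A record without an ADDR line, like MCE 2, is not touched by an address-based filter at all.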

-Andi

[intentional full quote for Mark]

>         bit46 = corrected ecc error
>         bit57 = processor context corrupt
>         bit61 = error uncorrected
>         bit62 = error overflow (multiple errors)
> STATUS f66440000000438d MCGSTATUS 0
> MCE 1
> CPU 0 1 instruction cache from boot or resume
> ADDR 75e2bb87ec57f8e0
>    Instruction cache ECC error
>         bit32 = err cpu0
>         bit33 = err cpu1
>         bit35 = res3
>         bit43 = res11
>         bit45 = uncorrected ecc error
>         bit46 = corrected ecc error
>         bit55 = res23
>         bit56 = res24
>         bit57 = processor context corrupt
>         bit59 = misc error valid
>         bit61 = error uncorrected
>         bit62 = error overflow (multiple errors)
> STATUS ffe4681bd0e45d81 MCGSTATUS 0
> MCE 2
> CPU 0 3 load/store unit from boot or resume
> MISC 8005003b8005003b
>         bit57 = processor context corrupt
>         bit59 = misc error valid
>         bit61 = error uncorrected
>         bit62 = error overflow (multiple errors)
> STATUS fa0000000000d0c5 MCGSTATUS 0
> MCE 3
> CPU 0 4 northbridge from boot or resume
> ADDR 102000020
>    Northbridge ECC error
>    ECC syndrome = 0
>         bit32 = err cpu0
>         bit33 = err cpu1
>         bit40 = error found by scrub
>         bit45 = uncorrected ecc error
>         bit57 = processor context corrupt
>         bit61 = error uncorrected
> STATUS b600215300001e0f MCGSTATUS 0
