AMD A10: MCE Instruction Cache Error

public inbox for linux-kernel@vger.kernel.org

* AMD A10: MCE Instruction Cache Error
@ 2012-11-02 10:50 Alexander Holler
  2012-11-02 13:53 ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-02 10:50 UTC (permalink / raw)
  To: linux-kernel

Hello,

I've just got the following on an AMD A10 5800K:

------
[ 8395.999581] [Hardware Error]: CPU:0 MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
[ 8395.999586] [Hardware Error]:        MC1_ADDR: 0x0000ffffa00e1203
[ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error during data load from IC.
[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
------

Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).

Can someone enlighten me about what might be wrong with my (new) system
(memtest didn't show any errors)?

Which IC is meant? As far as I know, this processor doesn't support ECC,
so I wonder where that parity error comes from.

Regards,

Alexander

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-02 10:50 AMD A10: MCE Instruction Cache Error Alexander Holler
@ 2012-11-02 13:53 ` Alexander Holler
  2012-11-03  4:49   ` Borislav Petkov
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-02 13:53 UTC (permalink / raw)
  To: linux-kernel

On 02.11.2012 11:50, Alexander Holler wrote:
> Hello,
>
> I've just got the following on an AMD A10 5800K:
>
> ------
> [ 8395.999581] [Hardware Error]: CPU:0
> MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
> [ 8395.999586] [Hardware Error]:        MC1_ADDR: 0x0000ffffa00e1203
> [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error
> during data load from IC.
> [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
> ------
>
> Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
>
> Can someone enlighten me about what might be wrong with my (new) system
> (memtest didn't show any errors)?
>
> Which IC is meant? As far as I know, this processor doesn't support ECC,
> so I wonder where that parity error comes from.

I assume IC means Instruction Cache. ;)

As the kernel didn't reboot or halt, this seems to have been a 
correctable error.

Which leads me to another question. I have mcelog running, but it
doesn't seem to receive the error. With my previous Intel HW and an
older kernel, mcelog received MCE errors (trip temperature). But since
the kernel now decodes those messages itself, that doesn't seem to
happen anymore. mcelog is silent, but now I've seen the above message on
all my consoles.

So now I have two questions:

- First, whether the error is something I should ask AMD about,

- Second, whether the kernel could mention that it is a recoverable
error. And if so, and if such errors are nothing to panic about (e.g. it
isn't unusual to receive them), whether the kernel could output that
message with another priority.

Regards,

Alexander

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-02 13:53 ` Alexander Holler
@ 2012-11-03  4:49   ` Borislav Petkov
  2012-11-03 10:45     ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Borislav Petkov @ 2012-11-03  4:49 UTC (permalink / raw)
  To: Alexander Holler; +Cc: linux-kernel

On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote:
> On 02.11.2012 11:50, Alexander Holler wrote:
> >Hello,
> >
> >I've just got the following on an AMD A10 5800K:
> >
> >------
> >[ 8395.999581] [Hardware Error]: CPU:0
> >MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
> >[ 8395.999586] [Hardware Error]:        MC1_ADDR: 0x0000ffffa00e1203
> >[ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error
> >during data load from IC.
> >[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
> >------
> >
> >Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
> >
> >Can someone enlighten me about what might be wrong with my (new) system
> >(memtest didn't show any errors)?
> >
> >Which IC is meant? As far as I know, this processor doesn't support ECC,
> >so I wonder where that parity error comes from.
> 
> I assume IC means Instruction Cache. ;)

It says so earlier in the sentence: "Instruction Cache Error" :)

> As the kernel didn't reboot or halt, this seems to have been a
> correctable error.

Yes, it is (the "CE" bit in MC1_STATUS). Btw, I have reworked this code
to spit human-readable information first. It also says what the error
severity is now.
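
For readers following along, the bracketed flag list can be reproduced from the raw status value. What follows is only a minimal sketch in Python, not the kernel's decoder. It assumes the architectural MCi_STATUS bit positions (Over 62, UC 61, MiscV 59, AddrV 58, PCC 57) and, as assumptions about the AMD family 15h layout, Deferred at bit 44, Poison at bit 43 and the two ECC bits at 46:45. The helper names `bit` and `status_flags` are made up for illustration.

```python
def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)

def status_flags(status):
    """Build the flag list shown in brackets after MCx_STATUS.

    Bit positions are assumptions stated in the text above,
    not taken from this thread.
    """
    fields = [
        "Over" if bit(status, 62) else "-",      # earlier error overwritten
        "UE" if bit(status, 61) else "CE",       # uncorrected vs. corrected
        "MiscV" if bit(status, 59) else "-",     # MCx_MISC is valid
        "PCC" if bit(status, 57) else "-",       # processor context corrupt
        "AddrV" if bit(status, 58) else "-",     # MCx_ADDR is valid
        "Deferred" if bit(status, 44) else "-",  # assumed family 15h bit
        "Poison" if bit(status, 43) else "-",    # assumed family 15h bit
    ]
    ecc = (status >> 45) & 0x3                   # bits 46:45
    if ecc:
        fields.append("CECC" if ecc == 2 else "UECC")
    return "|".join(fields)

# Status value from the report at the top of the thread.
print(status_flags(0x8c00002000010151))  # -|CE|MiscV|-|AddrV|-|-
```

With bit 61 clear the second field reads "CE", which is the corrected-error indication referred to above.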

> Which leads me to another question. I have mcelog running, but it
> doesn't seem to receive the error. With my previous Intel HW and an
> older kernel, mcelog received MCE errors (trip temperature). But
> since the kernel now decodes those messages itself, that doesn't
> seem to happen anymore. mcelog is silent, but now I've seen the
> above message on all my consoles.

Yes, AMD doesn't use mcelog.

> So now I have two questions:
> 
> - First, whether the error is something I should ask AMD about,

Not really, it is a single bit flip which got corrected, simply watch
out if you get more of those.

> - Second, whether the kernel could mention that it is a recoverable
> error. And if so, and if such errors are nothing to panic about
> (e.g. it isn't unusual to receive them), whether the kernel could
> output that message with another priority.

As I said above, it got corrected. If it were critical, it would've
either panicked or you wouldn't have seen it at all (probably only after
reboot).

HTH.

-- 
Regards/Gruss,
    Boris.

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-03  4:49   ` Borislav Petkov
@ 2012-11-03 10:45     ` Alexander Holler
  2012-11-04 15:21       ` Borislav Petkov
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-03 10:45 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel

On 03.11.2012 05:49, Borislav Petkov wrote:
> On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote:
>> On 02.11.2012 11:50, Alexander Holler wrote:
>>> Hello,
>>>
>>> I've just got the following on an AMD A10 5800K:
>>>
>>> ------
>>> [ 8395.999581] [Hardware Error]: CPU:0
>>> MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
>>> [ 8395.999586] [Hardware Error]:        MC1_ADDR: 0x0000ffffa00e1203
>>> [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error
>>> during data load from IC.
>>> [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
>>> ------
>>>
>>> Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
>>>
...
>> So now I have two questions:
>>
>> - First, whether the error is something I should ask AMD about,
>
> Not really, it is a single bit flip which got corrected, simply watch
> out if you get more of those.
>
>> - Second, whether the kernel could mention that it is a recoverable
>> error. And if so, and if such errors are nothing to panic about
>> (e.g. it isn't unusual to receive them), whether the kernel could
>> output that message with another priority.
>
> As I said above, it got corrected. If it were critical, it would've
> either panicked or you wouldn't have seen it at all (probably only after
> reboot).

Hmm, exactly that just happened twice in a row. Unfortunately the screen 
was already disabled (screen saving mode), so I couldn't see any 
message, if there was any. Just a dead box (not overclocked, I don't do 
such, I even had enabled the power saving mode in the BIOS, which seems 
to mean max. 3800 MHz). I think I should start getting nervous. :(

What I meant by another priority is using something other than
pr_emerg(), because pr_emerg() causes the message to be displayed
on every console, at least on my F17 with default settings.

Of course, I'm happy it was displayed using pr_emerg() so I didn't
miss it. Now I know that even if ECC isn't available for users who
don't want or need power-hungry and loud servers, at least some parity
is used to verify the operations with the internal memory (cache).
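
As a rough illustration of the parity point: one extra bit per stored word is enough to detect any single flipped bit, although parity alone cannot tell which bit flipped. This is a minimal sketch assuming even parity over a 32-bit word; the helper names are invented, and real cache hardware does this in logic rather than software.

```python
def parity_bit(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1

def store(word):
    """Model a write: keep the word together with its parity bit."""
    return word, parity_bit(word)

def check(word, stored_parity):
    """Model a read: recompute parity and compare with the stored bit."""
    return parity_bit(word) == stored_parity

word, parity = store(0xDEADBEEF)
flipped = word ^ (1 << 7)      # simulate a single bit flip in the array

print(check(word, parity))     # True: data is intact
print(check(flipped, parity))  # False: the flip is detected
```

Detection is presumably sufficient for an instruction cache, because its contents are copies that can be discarded and fetched again; that is an assumption about the hardware, not something stated in this thread.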

But on the other hand, if that message isn't really critical, something
other than pr_emerg() should be used.

Thanks for the answer.

Regards,

Alexander

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-03 10:45     ` Alexander Holler
@ 2012-11-04 15:21       ` Borislav Petkov
  2012-11-04 17:19         ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Borislav Petkov @ 2012-11-04 15:21 UTC (permalink / raw)
  To: Alexander Holler; +Cc: linux-kernel

On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote:
> Hmm, exactly that just happened twice in a row. Unfortunately the
> screen was already disabled (screen saving mode), so I couldn't see
> any message, if there was any. Just a dead box (not overclocked, I
> don't do such, I even had enabled the power saving mode in the BIOS,
> which seems to mean max. 3800 MHz). I think I should start getting
> nervous. :(

How do you know this happened twice if you couldn't see any message?

Also, can you enable netconsole or serial console, if possible, and try
to catch full dmesg from the boot and up until it happens.
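
For the netconsole route, the receiving machine only needs something that listens for UDP datagrams and appends them to a file. This is a minimal sketch in Python, assuming the sender targets UDP port 6666 (netconsole's documented default target port); `open_listener`, `receive_once`, `run_listener` and the file name `netconsole.log` are illustrative names only.

```python
import socket

def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on which netconsole datagrams can arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def receive_once(sock, bufsize=65535):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender, _port) = sock.recvfrom(bufsize)
    return sender, data.decode("utf-8", errors="replace")

def run_listener(path="netconsole.log", port=6666):
    """Append every received message to a log file until interrupted."""
    sock = open_listener(port=port)
    with open(path, "a") as log:
        while True:
            _sender, text = receive_once(sock)
            log.write(text)
            log.flush()  # keep the file current if the sender dies
```

Calling `run_listener()` on a second machine before starting the workload would keep whatever the failing box managed to send. The sending side is configured through the `netconsole=` kernel parameter or module option; the kernel's netconsole documentation has the exact syntax.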

Also, catch the dmesg of the box the next time it reboots after the
freeze (btw, try doing a warm reboot because this is the only way we can
preserve valid error information). If all works ok, it should decode
the MCE from before the freeze (if there was an MCE and it actually is
the reason for the freeze).

> What I meant by another priority is using something other than
> pr_emerg(), because pr_emerg() causes the message to be
> displayed on every console, at least on my F17 with default
> settings.

This is needed because we want those errors to actually be seen.

Thanks.

-- 
Regards/Gruss,
Boris.

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-04 15:21       ` Borislav Petkov
@ 2012-11-04 17:19         ` Alexander Holler
  2012-11-06  9:10           ` Borislav Petkov
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-04 17:19 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel

On 04.11.2012 16:21, Borislav Petkov wrote:
> On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote:
>> Hmm, exactly that just happened twice in a row. Unfortunately the
>> screen was already disabled (screen saving mode), so I couldn't see
>> any message, if there was any. Just a dead box (not overclocked, I
>> don't do such, I even had enabled the power saving mode in the BIOS,
>> which seems to mean max. 3800 MHz). I think I should start getting
>> nervous. :(
>
> How do you know this happened twice if you couldn't see any message?

I was remotely logged in and there aren't that many faults which lead to
a complete standstill of the hw (no reset).

But as you said, I can't know. The only thing I know is that a box with
new mb, memory and apu came to a complete standstill, and did so shortly
after I had received an emergency message which told me that a bit inside
the cpu flipped unexpectedly. Adding to that, the box was doing the same
thing as when it received the MCE: a backup from a SATA-attached SSD
to a USB3 HD which includes compression and encryption and keeps all
cores at work most of the time for several hours.

> Also, can you enable netconsole or serial console, if possible, and try
> to catch full dmesg from the boot and up until it happens.

As I was logged in remotely over the network, I know it wasn't the same
MCE as before (just a disconnect and dead hw). But I don't know what else
it was. And as I recently got hit by a broken RAM module, which was a pain
to find, I'm not very happy that I have to go through similar pain
again with new HW.

The probability of getting a working HW and SW combination has just become
too low in recent years. All the (IT) companies had better spend the
money they now give their lawyers on their QA and engineering departments
instead.

Sorry for the rant. Although I'm used to living with hw and sw errors (as
a sw-dev), I'm currently just a bit annoyed. ;)

I will set up something to monitor the box through the serial console and
will let it back itself up all the time, trying to catch some useful
information.

Regards,

Alexander

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-04 17:19         ` Alexander Holler
@ 2012-11-06  9:10           ` Borislav Petkov
  2012-11-06 11:18             ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Borislav Petkov @ 2012-11-06  9:10 UTC (permalink / raw)
  To: Alexander Holler; +Cc: linux-kernel

On Sun, Nov 04, 2012 at 06:19:32PM +0100, Alexander Holler wrote:
> I was remotely logged in and there aren't that many faults which
> lead to a complete standstill of the hw (no reset).

Right, can you retry triggering the freeze without the fglrx driver?
Simply remove it completely so that even the possibility to load it is
not there.

> But as you said, I can't know. The only thing I know is that a box
> with new mb, memory and apu came to a complete standstill, and did
> so shortly after I had received an emergency message which told me
> that a bit inside the cpu flipped unexpectedly. Adding to that, the
> box was doing the same thing as when it received the MCE: a backup
> from a SATA-attached SSD to a USB3 HD which includes compression and
> encryption and keeps all cores at work most of the time for several
> hours.

So do you get that MCE each time you execute that same workload?

Thanks.

-- 
Regards/Gruss,
Boris.

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-06  9:10           ` Borislav Petkov
@ 2012-11-06 11:18             ` Alexander Holler
  2012-11-06 11:44               ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-06 11:18 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel

On 06.11.2012 10:10, Borislav Petkov wrote:
> On Sun, Nov 04, 2012 at 06:19:32PM +0100, Alexander Holler wrote:
>> I was remotely logged in and there aren't that many faults which
>> lead to a complete standstill of the hw (no reset).
>
> Right, can you retry triggering the freeze without the fglrx driver?
> Simply remove it completely so that even the possibility to load it is
> not there.

Will do. But I don't think it is fglrx. I've been using it for several
years (just with an external graphics card before) and never had a problem
with it. Besides that, nothing happened on the display during the hangs;
I was logged out and just had a remote ssh session open.

>> But as you said, I can't know. The only thing I know is that a box
>> with new mb, memory and apu came to a complete standstill, and did
>> so shortly after I had received an emergency message which told me
>> that a bit inside the cpu flipped unexpectedly. Adding to that, the
>> box was doing the same thing as when it received the MCE: a backup
>> from a SATA-attached SSD to a USB3 HD which includes compression and
>> encryption and keeps all cores at work most of the time for several
>> hours.
>
> So do you get that MCE each time you execute that same workload?

No, up to now the MCE was only visible once. But stressing the box
yesterday (with loads of 3 for several hours and such) revealed some
other serious failures which all look like what happens when the cache
(or memory) is broken (I don't know how many bits of the cache can be
corrected before something else happens, or what happens then). E.g. the
checksum of a backup was wrong, or bzip2 failed with an error which it
suggests is caused by a HW failure like bad RAM (I've never seen that
error from bzip2 before).

I've just done a memory test using memtest86+-4.20 for about 7h (3
complete passes of all 16GB), no errors, so the new memory itself seems
to be ok.

I will now to tests with leaving fglrx off.

Regards,

Alexander

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-06 11:18             ` Alexander Holler
@ 2012-11-06 11:44               ` Alexander Holler
  2012-11-06 13:14                 ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-06 11:44 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel

On 06.11.2012 12:18, Alexander Holler wrote:
> I will now to tests with leaving fglrx off.

s/to/do/ ;)

That went fast. Disabled fglrx, started tests, full halt without anything
visible on the serial console (I needed to press the reset button):

------------------------
[   77.360180] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc
[  461.503107] EXT4-fs (sdd3): mounted filesystem with ordered data mode. Opts: (null)
[  473.869770] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.6.6-00008-gd13f937 (root@krabat.ahsoftware) (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC) ) #274 SMP Mon Nov 5 17:26:01 CET 2012
[    0.000000] Command line: ro root=/dev/sdb2 rootfstype=ext4 enforcing=0 cgroup_disable=memory vga=0x346 video=vesafb:mtrr:3,ywrap radeon.modeset=0 earlycon=uart8250,io,0x3f8,115200n8 console=ttyS0,115200n8 console=tty0
------------------------

Regards,

Alexander


* Re: AMD A10: MCE Instruction Cache Error
  2012-11-06 11:44               ` Alexander Holler
@ 2012-11-06 13:14                 ` Alexander Holler
  2012-11-06 14:47                   ` Borislav Petkov
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Holler @ 2012-11-06 13:14 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel

On 06.11.2012 12:44, Alexander Holler wrote:
> On 06.11.2012 12:18, Alexander Holler wrote:
>> I will now to tests with leaving fglrx off.
>
> s/to/do/ ;)
>
> That went fast. Disabled fglrx, started tests, full halt without anything
> visible on the serial console (I needed to press the reset button):

One after another, now I've got this:

[ 5698.640830] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136
[ 5698.649866] [Hardware Error]:        MC2_ADDR: 0x0000000002299678
[ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error on data fills.
[ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
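
The last line of both reports ("cache level", "tx", "mem-tx") can be read directly from the low 16 bits of the status value. This is a minimal sketch, assuming those bits hold a memory-hierarchy error code laid out as `0000 0001 RRRR TTLL`. The function name is hypothetical and the string tables are an assumption; only the entries that appear in this thread (L1, L2, INSN, DATA, IRD, DRD) are confirmed by the logs.

```python
# Assumed string tables for the LL, TT and RRRR fields.
LEVELS = ["RESV", "L1", "L2", "L3/GEN"]
TX_TYPES = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]

def decode_mem_error(status):
    """Decode the low 16 bits of an MCx_STATUS value.

    Returns None if the code is not of the form 0000 0001 RRRR TTLL.
    """
    code = status & 0xFFFF
    if (code & 0xFF00) != 0x0100:
        return None
    rrrr = (code >> 4) & 0xF   # memory transaction type
    tt = (code >> 2) & 0x3     # transaction type
    ll = code & 0x3            # cache level
    mem_tx = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "?"
    return f"cache level: {LEVELS[ll]}, tx: {TX_TYPES[tt]}, mem-tx: {mem_tx}"

# Status values from the two reports in this thread.
print(decode_mem_error(0x8c00002000010151))  # cache level: L1, tx: INSN, mem-tx: IRD
print(decode_mem_error(0xdc2540c000040136))  # cache level: L2, tx: DATA, mem-tx: DRD
```

The two reports therefore come from different cache levels and different kinds of access: an instruction fetch from L1 first, then a data read from L2.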

I think it's now really an RMA and I can stop doing further tests.

Regards,

Alexander

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-06 13:14                 ` Alexander Holler
@ 2012-11-06 14:47                   ` Borislav Petkov
  2012-11-06 16:02                     ` Alexander Holler
  0 siblings, 1 reply; 12+ messages in thread
From: Borislav Petkov @ 2012-11-06 14:47 UTC (permalink / raw)
  To: Alexander Holler; +Cc: linux-kernel

On Tue, Nov 06, 2012 at 02:14:46PM +0100, Alexander Holler wrote:
> On 06.11.2012 12:44, Alexander Holler wrote:
> >On 06.11.2012 12:18, Alexander Holler wrote:
> >>I will now to tests with leaving fglrx off.
> >
> >s/to/do/ ;)
> >
> >That went fast. Disabled fglrx, started tests, full halt without anything
> >visible on the serial console (I needed to press the reset button):
> 
> One after another, now I've got this:
> 
> [ 5698.640830] [Hardware Error]: CPU:0
> MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136
> [ 5698.649866] [Hardware Error]:        MC2_ADDR: 0x0000000002299678
> [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error
> on data fills.
> [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
> 
> I think it's now really an RMA and I can stop doing further tests.

Are you sure the temperature conditions of the box are optimal? IOW,
there's nothing overheating in there?

Thanks.

-- 
Regards/Gruss,
Boris.

* Re: AMD A10: MCE Instruction Cache Error
  2012-11-06 14:47                   ` Borislav Petkov
@ 2012-11-06 16:02                     ` Alexander Holler
  0 siblings, 0 replies; 12+ messages in thread
From: Alexander Holler @ 2012-11-06 16:02 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel

On 06.11.2012 15:47, Borislav Petkov wrote:
> On Tue, Nov 06, 2012 at 02:14:46PM +0100, Alexander Holler wrote:
>> On 06.11.2012 12:44, Alexander Holler wrote:
>>> On 06.11.2012 12:18, Alexander Holler wrote:
>>>> I will now to tests with leaving fglrx off.
>>>
>>> s/to/do/ ;)
>>>
>>> That went fast. Disabled fglrx, started tests, full halt without anything
>>> visible on the serial console (I needed to press the reset button):
>>
>> One after another, now I've got this:
>>
>> [ 5698.640830] [Hardware Error]: CPU:0
>> MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136
>> [ 5698.649866] [Hardware Error]:        MC2_ADDR: 0x0000000002299678
>> [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error
>> on data fills.
>> [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
>>
>> I think it's now really an RMA and I can stop doing further tests.
>
> Are you sure the temperature conditions of the box are optimal? IOW,
> there's nothing overheating in there?

Yes. At least if the boxed fan is sufficient, which I have to assume.
Ambient temperature is around 18 °C or even colder.

Regards,

Alexander

