* AMD A10: MCE Instruction Cache Error @ 2012-11-02 10:50 Alexander Holler 2012-11-02 13:53 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-02 10:50 UTC (permalink / raw) To: linux-kernel Hello, I've just got the following on an AMD A10 5800K: ------ [ 8395.999581] [Hardware Error]: CPU:0 MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151 [ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203 [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error during data load from IC. [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD ------ Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest). Can someone enlighten me about what might be wrong with my (new) system (memtest didn't show any errors)? Which IC is meant? As far as I know, this processor doesn't support ECC, so I wonder where that parity error comes from. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-02 10:50 AMD A10: MCE Instruction Cache Error Alexander Holler @ 2012-11-02 13:53 ` Alexander Holler 2012-11-03 4:49 ` Borislav Petkov 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-02 13:53 UTC (permalink / raw) To: linux-kernel On 02.11.2012 11:50, Alexander Holler wrote: > Hello, > > I've just got the following on an AMD A10 5800K: > > ------ > [ 8395.999581] [Hardware Error]: CPU:0 > MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151 > [ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203 > [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error > during data load from IC. > [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD > ------ > > Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest). > > Can someone enlight me about what might be wrong with my (new) system > (memtest didn't show an errors)? > > What IC is meant? As far as I know, this processor doesn't support ECC, > so I wonder where that parity error does come from. I assume IC means Instruction Cache. ;) As the kernel didn't reboot or halt, this seems to have been a correctable error. Which leads me to another question. I have mcelog running, but it doesn't seem to receive the error. With my previous Intel HW and an older kernel, mcelog received MCE errors (trip temperature). But since the kernel now decodes those messages itself, that doesn't seem to happen anymore. mcelog is silent, but now I've seen the above message on all my consoles. So now I have two questions: - First, whether the error is something I should ask AMD about, - Second, whether the kernel could mention that it is a recoverable error. And if so, and if such errors aren't something to panic about (e.g. it isn't unusual to receive them), whether the kernel could output that message with another priority. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-02 13:53 ` Alexander Holler @ 2012-11-03 4:49 ` Borislav Petkov 2012-11-03 10:45 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2012-11-03 4:49 UTC (permalink / raw) To: Alexander Holler; +Cc: linux-kernel On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote: > Am 02.11.2012 11:50, schrieb Alexander Holler: > >Hello, > > > >I've just got the following on an AMD A10 5800K: > > > >------ > >[ 8395.999581] [Hardware Error]: CPU:0 > >MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151 > >[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203 > >[ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error > >during data load from IC. > >[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD > >------ > > > >Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest). > > > >Can someone enlight me about what might be wrong with my (new) system > >(memtest didn't show an errors)? > > > >What IC is meant? As far as I know, this processor doesn't support ECC, > >so I wonder where that parity error does come from. > > I assume IC means Instruction Cache. ;) It says so earlier in the sentence: "Instruction Cache Error" :) > As the kernel didn't reboot or halt, this seems to have been a > correctable error. Yes, it is (the "CE" bit in MC1_STATUS). Btw, I have reworked this code to spit human-readable information first. It also says what the error severity is now. > Which leads me to another question. I have mcelog running, but it > doesn't seem to receive the error. With my previous Intel-HW and an > older kernel, mcelog received MCE errors (trip temperatur). But > since the kernel now decodes those message themself, that doesn't > seem to happen anymore. mcelog is silent, but now I've seen the > above message on all my consoles. Yes, AMD doesn't use mcelog. 
> So now I have two question: > > - First, if the error is something I should ask AMD about, Not really, it is a single bit flip which got corrected, simply watch out if you get more of those. > - Second, if the kernel could mention that it is an recoverable > error. And if so and if such errors aren't something to get panic > (e.g. it isn't unusual to receive such), if the kernel could output > that message with another priority. As I said above, it got corrected. If it were critical, it would've either panicked or you wouldnt've seen it at all (probably only after reboot). HTH. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 12+ messages in thread
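For readers following along, the bracketed flags in the MC1_STATUS line can be reproduced from the raw register value. The following is a minimal Python sketch, not the kernel's code: the function name `decode_mc_status` is made up for illustration, and the bit positions (Over = bit 62, UC = bit 61, MiscV = bit 59, AddrV = bit 58, PCC = bit 57, plus Deferred = bit 44, Poison = bit 43 and a two-bit ECC field in bits 46:45 on this AMD family) are assumptions based on the x86 machine-check architecture and on how the AMD decoder printed them at the time.

```python
def decode_mc_status(status: int) -> str:
    """Render the flag summary of an MCi_STATUS value (illustrative sketch).

    Bit positions are assumptions, see the lead-in text above.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    flags = [
        "Over" if bit(62) else "-",      # an earlier error was overwritten
        "UE" if bit(61) else "CE",       # uncorrected vs. corrected error
        "MiscV" if bit(59) else "-",     # MISC register holds valid data
        "PCC" if bit(57) else "-",       # processor context corrupt
        "AddrV" if bit(58) else "-",     # ADDR register holds a valid address
        "Deferred" if bit(44) else "-",  # assumed AMD-specific bit
        "Poison" if bit(43) else "-",    # assumed AMD-specific bit
    ]
    ecc = (status >> 45) & 0x3           # assumed two-bit ECC field
    if ecc:
        flags.append("CECC" if ecc == 2 else "UECC")
    return "|".join(flags)


# The value reported at the start of this thread:
print(decode_mc_status(0x8C00002000010151))
```

With the value from the first report, bit 61 is clear, which is why the line shows "CE" (corrected error) and why the machine kept running.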
* Re: AMD A10: MCE Instruction Cache Error 2012-11-03 4:49 ` Borislav Petkov @ 2012-11-03 10:45 ` Alexander Holler 2012-11-04 15:21 ` Borislav Petkov 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-03 10:45 UTC (permalink / raw) To: Borislav Petkov, linux-kernel Am 03.11.2012 05:49, schrieb Borislav Petkov: > On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote: >> Am 02.11.2012 11:50, schrieb Alexander Holler: >>> Hello, >>> >>> I've just got the following on an AMD A10 5800K: >>> >>> ------ >>> [ 8395.999581] [Hardware Error]: CPU:0 >>> MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151 >>> [ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203 >>> [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error >>> during data load from IC. >>> [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD >>> ------ >>> >>> Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest). >>> ... >> So now I have two question: >> >> - First, if the error is something I should ask AMD about, > > Not really, it is a single bit flip which got corrected, simply watch > out if you get more of those. > >> - Second, if the kernel could mention that it is an recoverable >> error. And if so and if such errors aren't something to get panic >> (e.g. it isn't unusual to receive such), if the kernel could output >> that message with another priority. > > As I said above, it got corrected. If it were critical, it would've > either panicked or you wouldnt've seen it at all (probably only after > reboot). Hmm, exactly that just happened twice in a row. Unfortunately the screen was already disabled (screen saving mode), so I couldn't see any message, if there was any. Just a dead box (not overclocked, I don't do such, I even had enabled the power saving mode in the BIOS, which seems to mean max. 3800 MHz). I think I should start getting nervous. 
:( What I meant by another priority is using something other than pr_emerg(), because pr_emerg() causes the message to be displayed on every console, at least on my F17 with default settings. Of course, I'm happy it was displayed using pr_emerg(), so I didn't miss it. Now I know that even if ECC isn't available for users who don't want or need power-hungry and loud servers, at least some parity is used to verify operations on the internal memory (cache). On the other hand, if that message isn't really critical, something other than pr_emerg() should be used. Thanks for the answer. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-03 10:45 ` Alexander Holler @ 2012-11-04 15:21 ` Borislav Petkov 2012-11-04 17:19 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2012-11-04 15:21 UTC (permalink / raw) To: Alexander Holler; +Cc: linux-kernel On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote: > Hmm, exactly that just happened twice in a row. Unfortunately the > screen was already disabled (screen saving mode), so I couldn't see > any message, if there was any. Just a dead box (not overclocked, I > don't do such, I even had enabled the power saving mode in the BIOS, > which seems to mean max. 3800 MHz). I think I should start getting > nervous. :( How do you know this happened twice if you couldn't see any message? Also, can you enable netconsole or serial console, if possible, and try to catch full dmesg from the boot and up until it happens. Also, catch the dmesg of the box on the next time it reboots after the freeze (btw, try doing a warm reboot because this is the only way we can preserve valid error information) - if all works ok, it should decode the MCE before the freeze (if the MCE caused it and it actually is the reason for the freeze). > What I meant with another priority is using something else than > pr_emerg(), because pr_emerge() causes the message to become > displayed on every console, at least on my F17 with default > settings. This is needed because we want those errors to actually be seen. Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-04 15:21 ` Borislav Petkov @ 2012-11-04 17:19 ` Alexander Holler 2012-11-06 9:10 ` Borislav Petkov 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-04 17:19 UTC (permalink / raw) To: Borislav Petkov, linux-kernel On 04.11.2012 16:21, Borislav Petkov wrote: > On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote: >> Hmm, exactly that just happened twice in a row. Unfortunately the >> screen was already disabled (screen saving mode), so I couldn't see >> any message, if there was any. Just a dead box (not overclocked, I >> don't do such, I even had enabled the power saving mode in the BIOS, >> which seems to mean max. 3800 MHz). I think I should start getting >> nervous. :( > > How do you know this happened twice if you couldn't see any message? I was logged in remotely, and there aren't that many faults which lead to a complete standstill of the hardware (no reset). But as you said, I can't know. The only thing I know is that a box with a new mainboard, memory and APU came to a complete standstill, shortly after I had received an emergency message telling me that a bit inside the CPU had flipped unexpectedly. In addition, the box was doing the same thing it did when it received the MCE: a backup from a SATA-attached SSD to a USB3 HD, including compression and encryption, which keeps all cores busy most of the time for several hours. > Also, can you enable netconsole or serial console, if possible, and try > to catch full dmesg from the boot and up until it happens. As I was logged in remotely over the network, I know it wasn't the same MCE as before (just a disconnect and dead hardware). But I don't know what else it was. And as I recently got hit by a broken RAM module, which was a pain to find, I'm not very happy that I have to go through similar pain again with new HW. The probability of getting a working HW and SW combination has just become too low in recent years. 
All the (IT) companies had better spend the money they now give to their lawyers on their QA and engineering departments instead. Sorry for the rant; although I'm used to living with hw and sw errors (as a sw-dev), I'm currently just a bit annoyed. ;) I will set up something to monitor the box through the serial console and will let it back itself up all the time, trying to catch some useful information. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
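Once a serial or netconsole capture exists, the interesting lines are the ones tagged `[Hardware Error]`. As a rough sketch of how such a capture could be filtered, assuming the log was saved as plain text in the usual `[ seconds.micros] message` dmesg layout, something like the following might do; the function name `hardware_errors` and the sample text are illustrative only and are not part of any existing tool.

```python
import re

# Matches lines such as:
# [ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
HW_ERR = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\[Hardware Error\]:\s*(?P<msg>.*)$"
)


def hardware_errors(log_text: str) -> list[tuple[float, str]]:
    """Return (timestamp, message) pairs for every hardware error line."""
    events = []
    for line in log_text.splitlines():
        match = HW_ERR.match(line.strip())
        if match:
            events.append((float(match.group("ts")), match.group("msg")))
    return events


# Sample input assembled from lines quoted in this thread.
sample = """\
[   77.360180] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc
[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
"""

for timestamp, message in hardware_errors(sample):
    print(f"{timestamp:.6f}  {message}")
```

Keeping the timestamps makes it easier to see whether errors cluster around a particular workload, which is the question raised later in the thread.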
* Re: AMD A10: MCE Instruction Cache Error 2012-11-04 17:19 ` Alexander Holler @ 2012-11-06 9:10 ` Borislav Petkov 2012-11-06 11:18 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2012-11-06 9:10 UTC (permalink / raw) To: Alexander Holler; +Cc: linux-kernel On Sun, Nov 04, 2012 at 06:19:32PM +0100, Alexander Holler wrote: > I was remotely logged in and there aren't that many faults which > lead to complete stand still of hw (no reset). Right, can you retry triggering the freeze without the fglrx driver? Simply remove it completely so that even the possibility to load it is not there. > But as you said I can't know, the only thing I know is that a box > with new mb, memory and apu come to a complete stand still, and > such shortly after I've received an emergency message which told me > that a bit inside the cpu switched unexpected. Adding to that, the > box did the same as what it did while it received the MCE, a backup > from a sata-atached ssd to an usb3-hd which includes compression and > encryption which keeps all cores at work most of the time for several > hours. So do you get that MCE each time you execute that same workload? Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-06 9:10 ` Borislav Petkov @ 2012-11-06 11:18 ` Alexander Holler 2012-11-06 11:44 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-06 11:18 UTC (permalink / raw) To: Borislav Petkov, linux-kernel On 06.11.2012 10:10, Borislav Petkov wrote: > On Sun, Nov 04, 2012 at 06:19:32PM +0100, Alexander Holler wrote: >> I was remotely logged in and there aren't that many faults which >> lead to complete stand still of hw (no reset). > > Right, can you retry triggering the freeze without the fglrx driver? > Simply remove it completely so that even the possibility to load it is > not there. Will do. But I don't think it is fglrx. I've been using it for several years (just with an external graphics card before) and never had a problem with it. Besides that, nothing was happening on the display during the hangs; I was logged out and just had a remote ssh session open. >> But as you said I can't know, the only thing I know is that a box >> with new mb, memory and apu come to a complete stand still, and >> such shortly after I've received an emergency message which told me >> that a bit inside the cpu switched unexpected. Adding to that, the >> box did the same as what it did while it received the MCE, a backup >> from a sata-atached ssd to an usb3-hd which includes compression and >> encryption which keeps all cores at work most of the time for several >> hours. > > So do you get that MCE each time you execute that same workload? No, so far the MCE has only shown up once. But stressing the box yesterday (with a load of 3 for several hours and the like) revealed some other serious failures, which all look like what happens when the cache (or memory) is broken (I don't know how many bits of the cache can be corrected before something else happens, or what happens then). E.g. 
the checksum of a backup is wrong, or bzip2 failed with an error which it suggests is because of an HW failure like bad RAM (I've never seen that error from bzip2 before). I've just done a memory test using memtest86+-4.20 for about 7h (3 complete passes of all 16GB), no errors, so the new memory itself seems to be ok. I will now to tests with leaving fglrx off. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-06 11:18 ` Alexander Holler @ 2012-11-06 11:44 ` Alexander Holler 2012-11-06 13:14 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-06 11:44 UTC (permalink / raw) To: Borislav Petkov, linux-kernel On 06.11.2012 12:18, Alexander Holler wrote: > I will now to tests with leaving fglrx off. s/to/do/ ;) That went fast. Disabled fglrx, started tests, full halt without anything visible on the serial console (I needed to press the reset button): ------------------------ [ 77.360180] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc [ 461.503107] EXT4-fs (sdd3): mounted filesystem with ordered data mode. Opts: (null) [ 473.869770] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Linux version 3.6.6-00008-gd13f937 (root@krabat.ahsoftware) (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC) ) #274 SMP Mon Nov 5 17:26:01 CET 2012 [ 0.000000] Command line: ro root=/dev/sdb2 rootfstype=ext4 enforcing=0 cgroup_disable=memory vga=0x346 video=vesafb:mtrr:3,ywrap radeon.modeset=0 earlycon=uart8250,io,0x3f8,115200n8 console=ttyS0,115200n8 console=tty0 ------------------------ Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-06 11:44 ` Alexander Holler @ 2012-11-06 13:14 ` Alexander Holler 2012-11-06 14:47 ` Borislav Petkov 0 siblings, 1 reply; 12+ messages in thread From: Alexander Holler @ 2012-11-06 13:14 UTC (permalink / raw) To: Borislav Petkov, linux-kernel Am 06.11.2012 12:44, schrieb Alexander Holler: > Am 06.11.2012 12:18, schrieb Alexander Holler: >> I will now to tests with leaving fglrx off. > > s/to/do/ ;) > > That was gone fast. Disabled fglrx, started tests, full halt without any > visible on the serial (I needed to press the reset button): One after another, now I've got this: [ 5698.640830] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136 [ 5698.649866] [Hardware Error]: MC2_ADDR: 0x0000000002299678 [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error on data fills. [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD I think it's now really an RMA and I can stop doing further tests. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread
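The "cache level / tx / mem-tx" line in both reports appears to come from the low 16 bits of the status value. The sketch below assumes the memory-error layout `0000 0001 RRRR TTLL` and lookup tables in the spirit of the kernel's AMD MCE decoder; the table contents, the helper name `decode_mem_error_code` and the layout itself are assumptions to be checked against the processor documentation, not an authoritative decoder.

```python
# Assumed lookup tables for the memory-error format 0000 0001 RRRR TTLL.
LL_MSGS = ["resv", "L1", "L2", "L3/GEN"]        # cache level
TT_MSGS = ["INSN", "DATA", "GEN", "RESV"]       # transaction type
RRRR_MSGS = ["GEN", "RD", "WR", "DRD", "DWR",   # memory transaction
             "IRD", "PRF", "EV", "SNP"]


def decode_mem_error_code(status: int):
    """Decode the low 16 bits of MCi_STATUS if they use the memory format.

    Returns a summary string, or None when the error code has another format.
    """
    error_code = status & 0xFFFF
    if (error_code & 0xFF00) != 0x0100:
        return None  # not a memory/cache error code
    rrrr = (error_code >> 4) & 0xF
    tt = (error_code >> 2) & 0x3
    ll = error_code & 0x3
    mem_tx = RRRR_MSGS[rrrr] if rrrr < len(RRRR_MSGS) else "?"
    return f"cache level: {LL_MSGS[ll]}, tx: {TT_MSGS[tt]}, mem-tx: {mem_tx}"


# Both status values reported in this thread:
print(decode_mem_error_code(0x8C00002000010151))
print(decode_mem_error_code(0xDC2540C000040136))
```

Under these assumptions the first report decodes to an L1 instruction read and the second to an L2 data read, matching the decoded lines the kernel printed.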
* Re: AMD A10: MCE Instruction Cache Error 2012-11-06 13:14 ` Alexander Holler @ 2012-11-06 14:47 ` Borislav Petkov 2012-11-06 16:02 ` Alexander Holler 0 siblings, 1 reply; 12+ messages in thread From: Borislav Petkov @ 2012-11-06 14:47 UTC (permalink / raw) To: Alexander Holler; +Cc: linux-kernel On Tue, Nov 06, 2012 at 02:14:46PM +0100, Alexander Holler wrote: > Am 06.11.2012 12:44, schrieb Alexander Holler: > >Am 06.11.2012 12:18, schrieb Alexander Holler: > >>I will now to tests with leaving fglrx off. > > > >s/to/do/ ;) > > > >That was gone fast. Disabled fglrx, started tests, full halt without any > >visible on the serial (I needed to press the reset button): > > One after another, now I've got this: > > [ 5698.640830] [Hardware Error]: CPU:0 > MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136 > [ 5698.649866] [Hardware Error]: MC2_ADDR: 0x0000000002299678 > [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error > on data fills. > [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD > > I think it's now really an RMA and I can stop doing further tests. Are you sure the temperature conditions of the box are optimal? IOW, there's nothing overheating in there? Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AMD A10: MCE Instruction Cache Error 2012-11-06 14:47 ` Borislav Petkov @ 2012-11-06 16:02 ` Alexander Holler 0 siblings, 0 replies; 12+ messages in thread From: Alexander Holler @ 2012-11-06 16:02 UTC (permalink / raw) To: Borislav Petkov, linux-kernel Am 06.11.2012 15:47, schrieb Borislav Petkov: > On Tue, Nov 06, 2012 at 02:14:46PM +0100, Alexander Holler wrote: >> Am 06.11.2012 12:44, schrieb Alexander Holler: >>> Am 06.11.2012 12:18, schrieb Alexander Holler: >>>> I will now to tests with leaving fglrx off. >>> >>> s/to/do/ ;) >>> >>> That was gone fast. Disabled fglrx, started tests, full halt without any >>> visible on the serial (I needed to press the reset button): >> >> One after another, now I've got this: >> >> [ 5698.640830] [Hardware Error]: CPU:0 >> MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136 >> [ 5698.649866] [Hardware Error]: MC2_ADDR: 0x0000000002299678 >> [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error >> on data fills. >> [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD >> >> I think it's now really an RMA and I can stop doing further tests. > > Are you sure the temperature conditions of the box are optimal? IOW, > there's nothing overheating in there? Yes. At least if the boxed fan is enough, which I have to assume. Environment temperature is around 18° C or even colder. Regards, Alexander ^ permalink raw reply [flat|nested] 12+ messages in thread