* Interpreting EDAC errors
From: Kevin Bowling @ 2011-06-20 4:57 UTC
To: bluesmoke-devel
Hello,
I've been seeing the following errors from the EDAC system. I'm not
quite sure how to associate the output from edac-util to physical
DIMMs. How do we account for multi-rank DIMMs, interleaving, NUMA,
etc?
Any help would be greatly appreciated.
Regards,
Kevin
root@PM-LAS-PROD-0:~# edac-util
mc0: csrow3: ch0: 1 Corrected Errors
mc1: csrow2: ch0: 1 Corrected Errors
mc2: csrow3: ch0: 1 Corrected Errors
mc2: csrow3: ch1: 1 Corrected Errors
root@PM-LAS-PROD-0:~# edac-ctl --mainboard
edac-ctl: mainboard: Supermicro H8DGU
root@PM-LAS-PROD-0:~# dmidecode -t memory
# dmidecode 2.9
SMBIOS 2.6 present.
Handle 0x0011, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: Unknown
Error Information Handle: Not Provided
Number Of Devices: 8
Handle 0x0013, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P1_DIMM1B
Bank Locator: BANK0
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer00
Serial Number: SerNum00
Asset Tag: AssetTagNum0
Part Number: ModulePartNumber00
Handle 0x0015, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P1_DIMM1A
Bank Locator: BANK1
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: C0073434
Asset Tag: AssetTagNum1
Part Number: HMT31GR7BFR4C-H9
Handle 0x0017, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P1_DIMM2B
Bank Locator: BANK2
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer02
Serial Number: SerNum02
Asset Tag: AssetTagNum2
Part Number: ModulePartNumber02
Handle 0x0019, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P1_DIMM2A
Bank Locator: BANK3
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: 5A1F440F
Asset Tag: AssetTagNum3
Part Number: HMT31GR7BFR4C-H9
Handle 0x001B, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P1_DIMM3B
Bank Locator: BANK4
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer04
Serial Number: SerNum04
Asset Tag: AssetTagNum4
Part Number: ModulePartNumber04
Handle 0x001D, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P1_DIMM3A
Bank Locator: BANK5
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: 7BEB1017
Asset Tag: AssetTagNum5
Part Number: HMT31GR7BFR4C-H9
Handle 0x001F, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P1_DIMM4B
Bank Locator: BANK6
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer06
Serial Number: SerNum06
Asset Tag: AssetTagNum6
Part Number: ModulePartNumber06
Handle 0x0021, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0011
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P1_DIMM4A
Bank Locator: BANK7
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: E6CCC22E
Asset Tag: AssetTagNum7
Part Number: HMT31GR7BFR4C-H9
Handle 0x0023, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: Unknown
Error Information Handle: Not Provided
Number Of Devices: 8
Handle 0x0025, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P2_DIMM1B
Bank Locator: BANK8
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer08
Serial Number: SerNum08
Asset Tag: AssetTagNum8
Part Number: ModulePartNumber08
Handle 0x0027, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P2_DIMM1A
Bank Locator: BANK9
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: 651FC40F
Asset Tag: AssetTagNum9
Part Number: HMT31GR7BFR4C-H9
Handle 0x0029, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P2_DIMM2B
Bank Locator: BANK10
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer10
Serial Number: SerNum10
Asset Tag: AssetTagNum10
Part Number: ModulePartNumber10
Handle 0x002B, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P2_DIMM2A
Bank Locator: BANK11
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: 841F440F
Asset Tag: AssetTagNum11
Part Number: HMT31GR7BFR4C-H9
Handle 0x002D, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P2_DIMM3B
Bank Locator: BANK12
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer12
Serial Number: SerNum12
Asset Tag: AssetTagNum12
Part Number: ModulePartNumber12
Handle 0x002F, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P2_DIMM3A
Bank Locator: BANK13
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: 771F940F
Asset Tag: AssetTagNum13
Part Number: HMT31GR7BFR4C-H9
Handle 0x0031, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P2_DIMM4B
Bank Locator: BANK14
Type: Unknown
Type Detail: None
Speed: Unknown
Manufacturer: Manufacturer14
Serial Number: SerNum14
Asset Tag: AssetTagNum14
Part Number: ModulePartNumber14
Handle 0x0033, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8192 MB
Form Factor: DIMM
Set: None
Locator: P2_DIMM4A
Bank Locator: BANK15
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 1333 MHz (0.8 ns)
Manufacturer: Hyundai
Serial Number: 881F540F
Asset Tag: AssetTagNum15
Part Number: HMT31GR7BFR4C-H9
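
The `edac-util` report at the top of this message is regular enough to parse mechanically, which helps when comparing counts between boxes. Below is a minimal Python sketch; the names `CorrectedErrorCount` and `parse_edac_util` are made up for illustration, and it assumes the line layout is exactly what is pasted above (other edac-util report modes are not handled).

```python
import re
from typing import List, NamedTuple


class CorrectedErrorCount(NamedTuple):
    """One line of the edac-util report (hypothetical helper type)."""
    mc: int
    csrow: int
    channel: int
    count: int


# Matches lines such as "mc0: csrow3: ch0: 1 Corrected Errors".
_LINE = re.compile(r"^mc(\d+): csrow(\d+): ch(\d+): (\d+) Corrected Errors$")


def parse_edac_util(text: str) -> List[CorrectedErrorCount]:
    """Return one record per corrected-error line; other lines are skipped."""
    records = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            records.append(CorrectedErrorCount(*map(int, match.groups())))
    return records


# The four lines reported in this message.
SAMPLE = """\
mc0: csrow3: ch0: 1 Corrected Errors
mc1: csrow2: ch0: 1 Corrected Errors
mc2: csrow3: ch0: 1 Corrected Errors
mc2: csrow3: ch1: 1 Corrected Errors
"""

for record in parse_edac_util(SAMPLE):
    print(record)
```

The records only identify a memory controller, chip-select row and channel; turning those into a silkscreen label is the open question of this thread.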
------------------------------------------------------------------------------
EditLive Enterprise is the world's most technically advanced content
authoring tool. Experience the power of Track Changes, Inline Image
Editing and ensure content is compliant with Accessibility Checking.
http://p.sf.net/sfu/ephox-dev2dev
* Re: Interpreting EDAC errors
From: Borislav Petkov @ 2011-06-20 7:34 UTC
To: Kevin Bowling; +Cc: bluesmoke-devel@lists.sourceforge.net

Hi,

On Mon, Jun 20, 2011 at 12:57:26AM -0400, Kevin Bowling wrote:
> Hello,
>
> I've been seeing the following errors from the EDAC system. I'm not
> quite sure how to associate the output from edac-util to physical
> DIMMs. How do we account for multi-rank DIMMs, interleaving, NUMA,
> etc?

Judging by the mainboard, this is a dual socket Magny-Cours. A couple of
things:

* interpreting DRAM ECC errors is still suboptimal and we're working on
  it, I'll try to come up with an interim solution to make the decoded
  error info a bit more understandable.

* you have one single-bit error which got corrected by the memory
  controller on 4 DIMMs and over the current system uptime so I wouldn't
  worry too much. I would monitor the DIMMs though and take action only if
  those error rates start to grow over time.

You have 4 8G DIMMs per node but I don't know their rank
count so please take the below with a grain of salt. Wait,
http://www.alldatasheet.com/datasheet-pdf/pdf/332888/HYNIX/HMT31GR7BFR4C-H9.html
says that yours are actually dual-ranked.

Btw, kernel dmesg output of EDAC should help to pinpoint them better.

> root@PM-LAS-PROD-0:~# edac-util
> mc0: csrow3: ch0: 1 Corrected Errors

This should be P1_DIMM1A if your DIMMs are quadranked, P1_DIMM2A if
dual-ranked.

> mc1: csrow2: ch0: 1 Corrected Errors

P1_DIMM3A or P1_DIMM4A as above. Also, I'm assuming that the increasing
nomenclature in the silkscreen labeling is mapping the memory
controllers in the same way, i.e.:

mc0 -> 1A, 2A
mc1 -> 3A, 4A

> mc2: csrow3: ch0: 1 Corrected Errors
> mc2: csrow3: ch1: 1 Corrected Errors

This looks like P2_DIMM3A

So, yeah, it is suboptimal and it needs fixing, I know.

HTH.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
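
The advice above, to watch the DIMMs and act only if the error rate grows, can be automated by sampling the per-channel corrected-error counters that the EDAC core exposes in sysfs. The following is a minimal Python sketch under stated assumptions: it assumes the legacy csrow layout `/sys/devices/system/edac/mc/mcN/csrowM/chK_ce_count` is present on the running kernel, and the helper names `read_ce_counts` and `new_errors` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Tuple

# (memory controller, csrow, channel), e.g. ("mc0", "csrow3", "ch0")
Key = Tuple[str, str, str]


def read_ce_counts(root: str = "/sys/devices/system/edac/mc") -> Dict[Key, int]:
    """Snapshot every ch*_ce_count file found below `root` (assumed layout)."""
    counts: Dict[Key, int] = {}
    for path in Path(root).glob("mc*/csrow*/ch*_ce_count"):
        channel = path.name[: -len("_ce_count")]
        key = (path.parent.parent.name, path.parent.name, channel)
        counts[key] = int(path.read_text())
    return counts


def new_errors(before: Dict[Key, int], after: Dict[Key, int]) -> Dict[Key, int]:
    """Return only the counters that grew between two snapshots."""
    grown: Dict[Key, int] = {}
    for key, count in after.items():
        delta = count - before.get(key, 0)
        if delta > 0:
            grown[key] = delta
    return grown
```

The counters restart from zero on reboot or driver reload, so a snapshot taken before a reboot should not be compared with one taken after it; this sketch silently ignores counters that went down.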
* Re: Interpreting EDAC errors
From: Kevin Bowling @ 2011-06-20 8:31 UTC
To: Borislav Petkov; +Cc: bluesmoke-devel@lists.sourceforge.net

On Mon, Jun 20, 2011 at 12:34 AM, Borislav Petkov <bp@amd64.org> wrote:
> Hi,
>
> On Mon, Jun 20, 2011 at 12:57:26AM -0400, Kevin Bowling wrote:
>> Hello,
>>
>> I've been seeing the following errors from the EDAC system. I'm not
>> quite sure how to associate the output from edac-util to physical
>> DIMMs. How do we account for multi-rank DIMMs, interleaving, NUMA,
>> etc?
>
> Judging by the mainboard, this is a dual socket Magny-Cours. A couple of
> things:

Correct.

> * interpreting DRAM ECC errors is still suboptimal and we're working on
>   it, I'll try to come up with an interim solution to make the decoded
>   error info a bit more understandable.

Ok, glad the confusion wasn't solely my own ignorance :p

> * you have one single-bit error which got corrected by the memory
>   controller on 4 DIMMs and over the current system uptime so I wouldn't
>   worry too much. I would monitor the DIMMs though and take action only if
>   those error rates start to grow over time.

The system is readily throwing these single-bit errors every 1-2 days
across reboots.

This machine has an identical twin at the same site that is not
exhibiting this problem and is even running a bit hotter internally.
The RAM for these machines came in a large tray so I would guess they
are the same batch.

I've run EDAC on many systems and never seen one this chatty. I
recall reading literature in the past that DRAM errors should be a bit
more rare than this. I'm suspecting the motherboard since it's across
so many DIMMs. It does scare me to say the least as this box will be
part of a mission critical system.

> You have 4 8G DIMMs per node but I don't know their rank
> count so please take the below with a grain of salt. Wait,
> http://www.alldatasheet.com/datasheet-pdf/pdf/332888/HYNIX/HMT31GR7BFR4C-H9.html
> says that yours are actually dual-ranked.

Looking at the dmesg output, I agree; dual-ranked.

> Btw, kernel dmesg output of EDAC should help to pinpoint them better.

[ 9.086759] EDAC MC: Ver: 2.1.0 Apr 11 2011
[ 9.100467] EDAC amd64_edac: v3.3.0
[ 9.100576] EDAC amd64: DRAM ECC enabled.
[ 9.100587] EDAC amd64: F10h detected (node 0).
[ 9.100607] EDAC MC: DCT0 chip selects:
[ 9.100608] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.100610] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.100612] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.100614] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.100615] EDAC MC: DCT1 chip selects:
[ 9.100616] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.100618] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.100619] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.100621] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.100622] EDAC amd64: using x8 syndromes.
[ 9.100624] EDAC amd64: MCT channel count: 2
[ 9.100649] EDAC amd64: CS2: Registered DDR3 RAM
[ 9.100651] EDAC amd64: CS3: Registered DDR3 RAM
[ 9.100671] EDAC MC0: Giving out device to 'amd64_edac' 'F10h': DEV 0000:00:18.2
[ 9.100778] EDAC amd64: DRAM ECC enabled.
[ 9.100820] EDAC amd64: F10h detected (node 1).
[ 9.100853] EDAC MC: DCT0 chip selects:
[ 9.100855] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.100857] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.100859] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.100860] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.100862] EDAC MC: DCT1 chip selects:
[ 9.100863] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.100865] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.100866] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.100868] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.100869] EDAC amd64: using x8 syndromes.
[ 9.100871] EDAC amd64: MCT channel count: 2
[ 9.100903] EDAC amd64: CS2: Registered DDR3 RAM
[ 9.100905] EDAC amd64: CS3: Registered DDR3 RAM
[ 9.100932] EDAC MC1: Giving out device to 'amd64_edac' 'F10h': DEV 0000:00:19.2
[ 9.101050] EDAC amd64: DRAM ECC enabled.
[ 9.101091] EDAC amd64: F10h detected (node 2).
[ 9.101125] EDAC MC: DCT0 chip selects:
[ 9.101127] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.101129] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.101131] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.101132] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.101134] EDAC MC: DCT1 chip selects:
[ 9.101135] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.101137] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.101138] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.101140] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.101141] EDAC amd64: using x8 syndromes.
[ 9.101143] EDAC amd64: MCT channel count: 2
[ 9.101180] EDAC amd64: CS2: Registered DDR3 RAM
[ 9.101182] EDAC amd64: CS3: Registered DDR3 RAM
[ 9.101221] EDAC MC2: Giving out device to 'amd64_edac' 'F10h': DEV 0000:00:1a.2
[ 9.101337] EDAC amd64: DRAM ECC enabled.
[ 9.101383] EDAC amd64: F10h detected (node 3).
[ 9.101424] EDAC MC: DCT0 chip selects:
[ 9.101426] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.101428] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.101430] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.101431] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.101433] EDAC MC: DCT1 chip selects:
[ 9.101434] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.101436] EDAC amd64: MC: 2: 4096MB 3: 4096MB
[ 9.101437] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.101439] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.101440] EDAC amd64: using x8 syndromes.
[ 9.101442] EDAC amd64: MCT channel count: 2
[ 9.101495] EDAC amd64: CS2: Registered DDR3 RAM
[ 9.101497] EDAC amd64: CS3: Registered DDR3 RAM
[ 9.101533] EDAC MC3: Giving out device to 'amd64_edac' 'F10h': DEV 0000:00:1b.2
[ 9.101622] EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)

[282601.860098] [Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc7f400097080813
[282601.863567] [Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.
[282601.867178] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x81fc93d00
[282601.869445] EDAC MC1: CE page 0x81fc93, offset 0xd00, grain 0, syndrome 0x97fe, row 2, channel 0, label "": amd64_edac
[282601.869452] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
[282601.873529] Disabling lock debugging due to kernel taint
[282601.873535] [Hardware Error]: Machine check events logged
[321603.500128] [Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c00c10004080a13
[321603.503484] [Hardware Error]: Northbridge Error (node 2): DRAM ECC error detected on the NB.
[321603.507096] EDAC amd64 MC2: CE ERROR_ADDRESS= 0x85a8eec40
[321603.509362] EDAC MC2: CE page 0x85a8ee, offset 0xc40, grain 0, syndrome 0x401, row 3, channel 1, label "": amd64_edac
[321603.509369] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[321603.513450] [Hardware Error]: Machine check events logged
[402125.606309] audit_printk_skb: 36 callbacks suppressed
[402125.606318] type=1400 audit(1308314274.325:23): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/libvirtd" pid=14994 comm="apparmor_parser"
[402125.644372] type=1400 audit(1308314274.365:24): apparmor="STATUS" operation="profile_replace" name="/usr/lib/libvirt/virt-aa-helper" pid=14996 comm="apparmor_parser"
[492000.040077] [Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc7840001e080a13
[492000.043538] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[492000.047150] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x3fcec0040
[492000.049415] EDAC MC0: CE page 0x3fcec0, offset 0x40, grain 0, syndrome 0x1ef0, row 3, channel 0, label "": amd64_edac
[492000.049422] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[492000.053499] [Hardware Error]: Machine check events logged
[547053.500095] [Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc41c10077080a13
[547053.503561] [Hardware Error]: Northbridge Error (node 2): DRAM ECC error detected on the NB.
[547053.507172] EDAC amd64 MC2: CE ERROR_ADDRESS= 0xbff5c9600
[547053.581042] EDAC MC2: CE page 0xbff5c9, offset 0x600, grain 0, syndrome 0x7783, row 3, channel 0, label "": amd64_edac
[547053.581049] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[547053.730417] [Hardware Error]: Machine check events logged

>
>> root@PM-LAS-PROD-0:~# edac-util
>> mc0: csrow3: ch0: 1 Corrected Errors
>
> This should be P1_DIMM1A if your DIMMs are quadranked, P1_DIMM2A if
> dual-ranked.
>
>> mc1: csrow2: ch0: 1 Corrected Errors
>
> P1_DIMM3A or P1_DIMM4A as above. Also, I'm assuming that the increasing
> nomenclature in the silkscreen labeling is mapping the memory
> controllers in the same way, i.e.:
>
> mc0 -> 1A, 2A
> mc1 -> 3A, 4A
>
>> mc2: csrow3: ch0: 1 Corrected Errors
>> mc2: csrow3: ch1: 1 Corrected Errors
>
> This looks like P2_DIMM3A
>
> So, yeah, it is suboptimal and it needs fixing, I know.
>
> HTH.
>
> --
> Regards/Gruss,
> Boris.
>
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> GM: Alberto Bozzo
> Reg: Dornach, Landkreis Muenchen
> HRB Nr. 43632 WEEE Registernr: 129 19551
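
Each corrected error in the dmesg output above produces an `EDAC MCn: CE page ..., offset ..., row ..., channel ...` line, and the page and offset recombine into the `ERROR_ADDRESS` printed just before it. Below is a minimal Python sketch of pulling those fields out; `parse_ce_line` is a hypothetical helper, the regular expression only targets the line format seen in this log, and the 4 KiB page size (`PAGE_SHIFT = 12`) is an assumption that happens to agree with the addresses logged here.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, consistent with the log above

# Targets lines like:
# EDAC MC1: CE page 0x81fc93, offset 0xd00, grain 0, syndrome 0x97fe, row 2, channel 0, ...
_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), grain \d+, "
    r"syndrome 0x(?P<syndrome>[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce_line(line):
    """Return the decoded fields of one CE line, or None if it is not one."""
    match = _CE.search(line)
    if match is None:
        return None
    page = int(match["page"], 16)
    offset = int(match["offset"], 16)
    return {
        "mc": int(match["mc"]),
        "row": int(match["row"]),
        "channel": int(match["channel"]),
        "syndrome": int(match["syndrome"], 16),
        # Page frame number and in-page offset rebuild the physical address.
        "address": (page << PAGE_SHIFT) | offset,
    }


LINE = ('[282601.869445] EDAC MC1: CE page 0x81fc93, offset 0xd00, grain 0, '
        'syndrome 0x97fe, row 2, channel 0, label "": amd64_edac')
event = parse_ce_line(LINE)
print(hex(event["address"]))  # 0x81fc93d00, matching ERROR_ADDRESS in the log
```

The `row` and `channel` fields are the same csrow and channel numbers that edac-util reports, so they are the inputs to any DIMM lookup.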
* Re: Interpreting EDAC errors
From: Borislav Petkov @ 2011-06-20 10:15 UTC
To: Kevin Bowling; +Cc: bluesmoke-devel

On Mon, Jun 20, 2011 at 04:31:43AM -0400, Kevin Bowling wrote:
> > * you have one single-bit error which got corrected by the memory
> > controller on 4 DIMMs and over the current system uptime so I wouldn't
> > worry too much. I would monitor the DIMMs though and take action only if
> > those error rates start to grow over time.
>
> The system is readily throwing these single-bit errors every 1-2 days
> across reboots.
>
> This machine has an identical twin at the same site that is not
> exhibiting this problem and is even running a bit hotter internally.
> The RAM for these machines came in a large tray so I would guess they
> are the same batch.
>
> I've run EDAC on many systems and never seen one this chatty. I
> recall reading literature in the past that DRAM errors should be a bit
> more rare than this. I'm suspecting the motherboard since it's across
> so many DIMMs. It does scare me to say the least as this box will be
> part of a mission critical system.

A very simple test would be to take out the DIMMs and stick them in the
identical twin. I have the funny feeling that this might not be that
easy, logistically :).

> > You have 4 8G DIMMs per node but I don't know their rank
> > count so please take the below with a grain of salt. Wait,
> > http://www.alldatasheet.com/datasheet-pdf/pdf/332888/HYNIX/HMT31GR7BFR4C-H9.html
> > says that yours are actually dual-ranked.
>
> Looking at the dmesg output, I agree; dual-ranked.
>
> > Btw, kernel dmesg output of EDAC should help to pinpoint them better.
>
> [ 9.086759] EDAC MC: Ver: 2.1.0 Apr 11 2011
> [ 9.100467] EDAC amd64_edac: v3.3.0
> [ 9.100576] EDAC amd64: DRAM ECC enabled.
> [ 9.100587] EDAC amd64: F10h detected (node 0).
> [ 9.100607] EDAC MC: DCT0 chip selects:
> [ 9.100608] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.100610] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.100612] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.100614] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.100615] EDAC MC: DCT1 chip selects:
> [ 9.100616] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.100618] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.100619] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.100621] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.100622] EDAC amd64: using x8 syndromes.
> [ 9.100624] EDAC amd64: MCT channel count: 2
> [ 9.100649] EDAC amd64: CS2: Registered DDR3 RAM
> [ 9.100651] EDAC amd64: CS3: Registered DDR3 RAM
> [ 9.100671] EDAC MC0: Giving out device to 'amd64_edac' 'F10h': DEV
> 0000:00:18.2
> [ 9.100778] EDAC amd64: DRAM ECC enabled.
> [ 9.100820] EDAC amd64: F10h detected (node 1).
> [ 9.100853] EDAC MC: DCT0 chip selects:
> [ 9.100855] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.100857] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.100859] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.100860] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.100862] EDAC MC: DCT1 chip selects:
> [ 9.100863] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.100865] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.100866] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.100868] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.100869] EDAC amd64: using x8 syndromes.
> [ 9.100871] EDAC amd64: MCT channel count: 2
> [ 9.100903] EDAC amd64: CS2: Registered DDR3 RAM
> [ 9.100905] EDAC amd64: CS3: Registered DDR3 RAM
> [ 9.100932] EDAC MC1: Giving out device to 'amd64_edac' 'F10h': DEV
> 0000:00:19.2
> [ 9.101050] EDAC amd64: DRAM ECC enabled.
> [ 9.101091] EDAC amd64: F10h detected (node 2).
> [ 9.101125] EDAC MC: DCT0 chip selects:
> [ 9.101127] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.101129] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.101131] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.101132] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.101134] EDAC MC: DCT1 chip selects:
> [ 9.101135] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.101137] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.101138] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.101140] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.101141] EDAC amd64: using x8 syndromes.
> [ 9.101143] EDAC amd64: MCT channel count: 2
> [ 9.101180] EDAC amd64: CS2: Registered DDR3 RAM
> [ 9.101182] EDAC amd64: CS3: Registered DDR3 RAM
> [ 9.101221] EDAC MC2: Giving out device to 'amd64_edac' 'F10h': DEV
> 0000:00:1a.2
> [ 9.101337] EDAC amd64: DRAM ECC enabled.
> [ 9.101383] EDAC amd64: F10h detected (node 3).
> [ 9.101424] EDAC MC: DCT0 chip selects:
> [ 9.101426] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.101428] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.101430] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.101431] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.101433] EDAC MC: DCT1 chip selects:
> [ 9.101434] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 9.101436] EDAC amd64: MC: 2: 4096MB 3: 4096MB
> [ 9.101437] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 9.101439] EDAC amd64: MC: 6: 0MB 7: 0MB
> [ 9.101440] EDAC amd64: using x8 syndromes.
> [ 9.101442] EDAC amd64: MCT channel count: 2
> [ 9.101495] EDAC amd64: CS2: Registered DDR3 RAM
> [ 9.101497] EDAC amd64: CS3: Registered DDR3 RAM
> [ 9.101533] EDAC MC3: Giving out device to 'amd64_edac' 'F10h': DEV
> 0000:00:1b.2
> [ 9.101622] EDAC PCI0: Giving out device to module 'amd64_edac'
> controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
>
> [282601.860098] [Hardware Error]:
> MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc7f400097080813
> [282601.863567] [Hardware Error]: Northbridge Error (node 1): DRAM ECC
> error detected on the NB.
> [282601.867178] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x81fc93d00
> [282601.869445] EDAC MC1: CE page 0x81fc93, offset 0xd00, grain 0,
> syndrome 0x97fe, row 2, channel 0, label "": amd64_edac
> [282601.869452] [Hardware Error]: cache level: L3/GEN, mem/io: MEM,
> mem-tx: RD, part-proc: SRC (no timeout)

right, socket 0, internal node 1, first DIMM, looks probably P1_DIMM3A

> [282601.873529] Disabling lock debugging due to kernel taint
> [282601.873535] [Hardware Error]: Machine check events logged
> [321603.500128] [Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]:
> 0x9c00c10004080a13
> [321603.503484] [Hardware Error]: Northbridge Error (node 2): DRAM ECC
> error detected on the NB.
> [321603.507096] EDAC amd64 MC2: CE ERROR_ADDRESS= 0x85a8eec40
> [321603.509362] EDAC MC2: CE page 0x85a8ee, offset 0xc40, grain 0,
> syndrome 0x401, row 3, channel 1, label "": amd64_edac
> [321603.509369] [Hardware Error]: cache level: L3/GEN, mem/io: MEM,
> mem-tx: RD, part-proc: RES (no timeout)

socket 1 (second socket), internal node 0, second DIMM, probably
P2_DIMM2A.

> [321603.513450] [Hardware Error]: Machine check events logged
> [402125.606309] audit_printk_skb: 36 callbacks suppressed
> [402125.606318] type=1400 audit(1308314274.325:23): apparmor="STATUS"
> operation="profile_replace" name="/usr/sbin/libvirtd" pid=14994
> comm="apparmor_parser"
> [402125.644372] type=1400 audit(1308314274.365:24): apparmor="STATUS"
> operation="profile_replace" name="/usr/lib/libvirt/virt-aa-helper"
> pid=14996 comm="apparmor_parser"
> [492000.040077] [Hardware Error]:
> MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc7840001e080a13
> [492000.043538] [Hardware Error]: Northbridge Error (node 0): DRAM ECC
> error detected on the NB.
> [492000.047150] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x3fcec0040
> [492000.049415] EDAC MC0: CE page 0x3fcec0, offset 0x40, grain 0,
> syndrome 0x1ef0, row 3, channel 0, label "": amd64_edac
> [492000.049422] [Hardware Error]: cache level: L3/GEN, mem/io: MEM,
> mem-tx: RD, part-proc: RES (no timeout)

socket 0, internal node 0, first DIMM, probably P1_DIMM1A

> [492000.053499] [Hardware Error]: Machine check events logged
> [547053.500095] [Hardware Error]:
> MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc41c10077080a13
> [547053.503561] [Hardware Error]: Northbridge Error (node 2): DRAM ECC
> error detected on the NB.
> [547053.507172] EDAC amd64 MC2: CE ERROR_ADDRESS= 0xbff5c9600
> [547053.581042] EDAC MC2: CE page 0xbff5c9, offset 0x600, grain 0,
> syndrome 0x7783, row 3, channel 0, label "": amd64_edac
> [547053.581049] [Hardware Error]: cache level: L3/GEN, mem/io: MEM,
> mem-tx: RD, part-proc: RES (no timeout)
> [547053.730417] [Hardware Error]: Machine check events logged

socket 1, internal node 0, first DIMM, P2_DIMM1A maybe

Again, please digest with a grain of salt. I'll add more helpful printks
to the driver as an interim solution - something similar to the
decodings above - before we start dumping the silkscreen labels
straightaway.

HTH.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
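
The four inline decodings in this reply follow one pattern: the EDAC node number splits into a socket and an internal node, and the channel picks the first or second DIMM on that internal node. The Python sketch below restates that pattern and nothing more. It is a guess under the same caveats given in the reply: it assumes two internal nodes per socket, one populated "A" slot per channel, and silkscreen numbering that follows node and channel order on this particular board. The function name `guess_dimm_label` is hypothetical.

```python
def guess_dimm_label(node: int, channel: int,
                     nodes_per_socket: int = 2,
                     channels_per_node: int = 2) -> str:
    """Tentative silkscreen label for an (EDAC node, channel) pair.

    Mirrors the by-hand decodings in this thread; it is not read from any
    vendor table and may be wrong for other boards or DIMM populations.
    """
    socket, internal_node = divmod(node, nodes_per_socket)
    # Slots are numbered from 1 and counted across the internal nodes.
    slot = internal_node * channels_per_node + channel + 1
    return f"P{socket + 1}_DIMM{slot}A"


# The four corrected errors from the dmesg log: (node, channel).
for node, channel in [(1, 0), (2, 1), (0, 0), (2, 0)]:
    print(f"MC{node} channel {channel} -> {guess_dimm_label(node, channel)}")
```

For the four logged events this prints P1_DIMM3A, P2_DIMM2A, P1_DIMM1A and P2_DIMM1A, the same labels suggested inline above. Swapping DIMMs with the identical twin machine, as proposed earlier in this reply, remains the way to confirm any of them.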