* Re: EDAC chipkill messages [not found] <fa.CZkzgccjpiJW6cRrbtRvg/+4HMg@ifi.uio.no> @ 2007-01-19 0:12 ` Robert Hancock 2007-01-19 16:45 ` Orion Poplawski 0 siblings, 1 reply; 4+ messages in thread From: Robert Hancock @ 2007-01-19 0:12 UTC (permalink / raw) To: linux-kernel; +Cc: Orion Poplawski Orion Poplawski wrote: > Can someone please explain to me what these mean? > > EDAC k8 MC1: general bus error: participating processor(local node > origin), time-out(no timeout) memory transaction type(generic read), mem > or i/o(mem access), cache level(generic) > EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row > 1, channel 0, label "": k8_edac > EDAC MC1: CE - no information available: k8_edac Error Overflow set > EDAC k8 MC1: extended error code: ECC chipkill x4 error > > Thanks! > Sounds like you're having some memory ECC errors.. some Memtest86, etc. runs may be in order. You may be able to figure out from this info what DIMM is having the problem. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from hancockr@nospamshaw.ca Home Page: http://www.roberthancock.com/ ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: EDAC chipkill messages 2007-01-19 0:12 ` EDAC chipkill messages Robert Hancock @ 2007-01-19 16:45 ` Orion Poplawski 2007-01-19 19:34 ` Doug Thompson 0 siblings, 1 reply; 4+ messages in thread From: Orion Poplawski @ 2007-01-19 16:45 UTC (permalink / raw) To: linux-kernel Robert Hancock wrote: > Orion Poplawski wrote: >> Can someone please explain to me what these mean? >> >> EDAC k8 MC1: general bus error: participating processor(local node >> origin), time-out(no timeout) memory transaction type(generic read), >> mem or i/o(mem access), cache level(generic) >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row >> 1, channel 0, label "": k8_edac >> EDAC MC1: CE - no information available: k8_edac Error Overflow set >> EDAC k8 MC1: extended error code: ECC chipkill x4 error >> >> Thanks! >> > > Sounds like you're having some memory ECC errors.. some Memtest86, etc. > runs may be in order. You may be able to figure out from this info what > DIMM is having the problem. > That was my assumption as well, but was hoping someone could decode the above information and point me to the problem chip. I ran Memtest86 overnight but found no problems, but don't know if it needs to run in a particular ECC mode. This is a dual proc 275 system with 4 1GB DIMMs. Guessing that MC1 is the controller on the second CPU. Would row 1 be the second DIMM? ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: EDAC chipkill messages 2007-01-19 16:45 ` Orion Poplawski @ 2007-01-19 19:34 ` Doug Thompson 0 siblings, 0 replies; 4+ messages in thread From: Doug Thompson @ 2007-01-19 19:34 UTC (permalink / raw) To: Orion Poplawski, linux-kernel --- Orion Poplawski <orion@cora.nwra.com> wrote: > Robert Hancock wrote: > > Orion Poplawski wrote: > >> Can someone please explain to me what these mean? > >> > >> EDAC k8 MC1: general bus error: participating processor(local node > > >> origin), time-out(no timeout) memory transaction type(generic > read), > >> mem or i/o(mem access), cache level(generic) > >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, > row > >> 1, channel 0, label "": k8_edac > >> EDAC MC1: CE - no information available: k8_edac Error Overflow > set > >> EDAC k8 MC1: extended error code: ECC chipkill x4 error > >> > >> Thanks! > >> > > > > Sounds like you're having some memory ECC errors.. some Memtest86, > etc. > > runs may be in order. You may be able to figure out from this info > what > > DIMM is having the problem. > > > > That was my assumption as well, but was hoping someone could decode > the > above information and point me to the problem chip. I ran Memtest86 > overnight but found no problems, but don't know if it needs to run in > a > particular ECC mode. > > This is a dual proc 275 system with 4 1GB DIMMs. Guessing that MC1 > is > the controller on the second CPU. Would row 1 be the second DIMM? No that would be the FIRST DIMM, on Channel 0 Each DIMM has 2 ChipSelect Rows (CSROW) Each csrow covers two channels across, therefore on a 4 socket memory array, there are CSROWS 0 and 1 on the first DIMM row and CSROWS 2 and 3 on the second DIMM row. WWWWWWWWW XXXXXXXXXXX YYYYYYYYY ZZZZZZZZZZZ The W and the Y DIMMs are channel 0 The X and the Z DIMMs are channel 1 csrows 0 and 1 would cross over Y and Z DIMMs csrows 2 and 3 would cross over W and X DIMMs The mapping problem occurs in then identifying each of the above goes to which silk screen labeled sockets on the mobo. Usually they are labeled: H0_DIMM2A H0_DIMM2B H0_DIMM1A H0_DIMM1B where A is channel 0 and B is channel 1 and the "DIMM1" would indicate the CSROWs 0 and 1 and "DIMM2" would indicate the CSROWs 2 and 3 The string 'label ""' can be filled in by a userspace script to properly identify the DIMM silk screen according to the motherboard used. The lines with "EDAC MC1:" are EDAC CORE output messages, while the "EDAC K8:" lines are EDAC Memory Controller driver messages. "CE" is correctable error MC1 is memory controller 1 (0 based) ECC ChipKill x4 was what found the error and corrected it. The FRU, (field replaceable unit) is the DIMM located at socket H1_DIMM1A, according to the labeling I mentioned above. caveat: the detector is not 100% perfect but gives a general area to look at, the DIMM specification. Sometimes other errors can cause what looks like a memory error, but usually a bad memory DIMM is the root cause of the vast majority of such errors. In addition, memtest86+ doesn't find all the bad memory in all cases, but it is still a VERY useful tool doug thompson ^ permalink raw reply [flat|nested] 4+ messages in thread
* EDAC chipkill messages @ 2007-01-18 22:55 Orion Poplawski 0 siblings, 0 replies; 4+ messages in thread From: Orion Poplawski @ 2007-01-18 22:55 UTC (permalink / raw) To: linux-kernel Can someone please explain to me what these mean? EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic) EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row 1, channel 0, label "": k8_edac EDAC MC1: CE - no information available: k8_edac Error Overflow set EDAC k8 MC1: extended error code: ECC chipkill x4 error Thanks! -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2007-01-19 19:34 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <fa.CZkzgccjpiJW6cRrbtRvg/+4HMg@ifi.uio.no>
2007-01-19 0:12 ` EDAC chipkill messages Robert Hancock
2007-01-19 16:45 ` Orion Poplawski
2007-01-19 19:34 ` Doug Thompson
2007-01-18 22:55 Orion Poplawski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox