Re: EDAC chipkill messages

public inbox for linux-kernel@vger.kernel.org

* Re: EDAC chipkill messages
       [not found] <fa.CZkzgccjpiJW6cRrbtRvg/+4HMg@ifi.uio.no>
@ 2007-01-19  0:12 ` Robert Hancock
  2007-01-19 16:45   ` Orion Poplawski
  0 siblings, 1 reply; 4+ messages in thread
From: Robert Hancock @ 2007-01-19  0:12 UTC
  To: linux-kernel; +Cc: Orion Poplawski

Orion Poplawski wrote:
> Can someone please explain to me what these mean?
> 
> EDAC k8 MC1: general bus error: participating processor(local node 
> origin), time-out(no timeout) memory transaction type(generic read), mem 
> or i/o(mem access), cache level(generic)
> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row 
> 1, channel 0, label "": k8_edac
> EDAC MC1: CE - no information available: k8_edac Error Overflow set
> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> 
> Thanks!
> 

Sounds like you're having some memory ECC errors.. some Memtest86, etc. 
runs may be in order. You may be able to figure out from this info what 
DIMM is having the problem.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: EDAC chipkill messages
  2007-01-19  0:12 ` EDAC chipkill messages Robert Hancock
@ 2007-01-19 16:45   ` Orion Poplawski
  2007-01-19 19:34     ` Doug Thompson
  0 siblings, 1 reply; 4+ messages in thread
From: Orion Poplawski @ 2007-01-19 16:45 UTC
  To: linux-kernel

Robert Hancock wrote:
> Orion Poplawski wrote:
>> Can someone please explain to me what these mean?
>>
>> EDAC k8 MC1: general bus error: participating processor(local node 
>> origin), time-out(no timeout) memory transaction type(generic read), 
>> mem or i/o(mem access), cache level(generic)
>> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row 
>> 1, channel 0, label "": k8_edac
>> EDAC MC1: CE - no information available: k8_edac Error Overflow set
>> EDAC k8 MC1: extended error code: ECC chipkill x4 error
>>
>> Thanks!
>>
> 
> Sounds like you're having some memory ECC errors.. some Memtest86, etc. 
> runs may be in order. You may be able to figure out from this info what 
> DIMM is having the problem.
> 

That was my assumption as well, but was hoping someone could decode the 
above information and point me to the problem chip.  I ran Memtest86 
overnight but found no problems, but don't know if it needs to run in a 
particular ECC mode.

This is a dual proc 275 system with 4 1GB DIMMs.  Guessing that MC1 is 
the controller on the second CPU.  Would row 1 be the second DIMM?



* Re: EDAC chipkill messages
  2007-01-19 16:45   ` Orion Poplawski
@ 2007-01-19 19:34     ` Doug Thompson
  0 siblings, 0 replies; 4+ messages in thread
From: Doug Thompson @ 2007-01-19 19:34 UTC
  To: Orion Poplawski, linux-kernel

--- Orion Poplawski <orion@cora.nwra.com> wrote:

> Robert Hancock wrote:
> > Orion Poplawski wrote:
> >> Can someone please explain to me what these mean?
> >>
> >> EDAC k8 MC1: general bus error: participating processor(local node
> >> origin), time-out(no timeout) memory transaction type(generic read),
> >> mem or i/o(mem access), cache level(generic)
> >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row
> >> 1, channel 0, label "": k8_edac
> >> EDAC MC1: CE - no information available: k8_edac Error Overflow set
> >> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> >>
> >> Thanks!
> >>
> >
> > Sounds like you're having some memory ECC errors.. some Memtest86, etc.
> > runs may be in order. You may be able to figure out from this info what
> > DIMM is having the problem.
> >
>
> That was my assumption as well, but was hoping someone could decode the
> above information and point me to the problem chip.  I ran Memtest86
> overnight but found no problems, but don't know if it needs to run in a
> particular ECC mode.
>
> This is a dual proc 275 system with 4 1GB DIMMs.  Guessing that MC1 is
> the controller on the second CPU.  Would row 1 be the second DIMM?

No, that would be the FIRST DIMM, on channel 0.

Each DIMM has two chip-select rows (csrows).

Each csrow spans both channels, so on a four-socket memory array
csrows 0 and 1 are on the first DIMM row and csrows 2 and 3 are on
the second DIMM row.

WWWWWWWWW  XXXXXXXXXXX
YYYYYYYYY  ZZZZZZZZZZZ

The W and the Y DIMMs are channel 0
The X and the Z DIMMs are channel 1

csrows 0 and 1 would cross over Y and Z DIMMs
csrows 2 and 3 would cross over W and X DIMMs

The hard part is then working out which silk-screen-labeled socket on
the motherboard each of the above corresponds to.

Usually they are labeled:

H0_DIMM2A   H0_DIMM2B
H0_DIMM1A   H0_DIMM1B

where A is channel 0 and B is channel 1 and
the "DIMM1" would indicate the CSROWs 0 and 1
and "DIMM2" would indicate the CSROWs 2 and 3
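
The mapping above can be sketched in a few lines of Python. This is
only an illustration: the function name is made up, and it assumes the
"usual" H<node>_DIMM<n><A|B> silk-screen scheme with two csrows per
DIMM. Boards that label their sockets differently need their own table.

```python
def silk_screen_label(mc: int, csrow: int, channel: int) -> str:
    """Map an EDAC (mc, csrow, channel) triple to a socket label.

    Assumptions (taken from the usual layout, not from any one board):
    two csrows per DIMM, channel 0 is 'A', channel 1 is 'B', and
    sockets are named H<node>_DIMM<n><channel letter>.
    """
    if mc < 0 or csrow < 0:
        raise ValueError("mc and csrow must be non-negative")
    if channel not in (0, 1):
        raise ValueError("only channels 0 and 1 are handled here")
    dimm_number = csrow // 2 + 1      # csrows 0,1 -> DIMM1; 2,3 -> DIMM2
    channel_letter = "AB"[channel]    # channel 0 -> A, channel 1 -> B
    return f"H{mc}_DIMM{dimm_number}{channel_letter}"


# The error reported in this thread: MC1, row 1, channel 0.
print(silk_screen_label(mc=1, csrow=1, channel=0))  # H1_DIMM1A
```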

The string 'label ""' can be filled in by a userspace script to
properly identify the DIMM silk screen according to the motherboard
used.
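
A rough sketch of such a userspace script follows. It assumes the
per-csrow EDAC sysfs layout, with a writable chN_dimm_label attribute
under /sys/devices/system/edac/mc/mcX/csrowY/; check that path on your
kernel before relying on it. The helper names are hypothetical, and the
demo writes into a scratch directory rather than the real sysfs tree.

```python
import tempfile
from pathlib import Path

# Assumption: legacy per-csrow EDAC sysfs layout; verify on your kernel.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def label_path(mc, csrow, channel, root=EDAC_ROOT):
    """Path of the (assumed) label attribute for one csrow/channel."""
    return Path(root) / f"mc{mc}" / f"csrow{csrow}" / f"ch{channel}_dimm_label"


def set_dimm_label(mc, csrow, channel, label, root=EDAC_ROOT):
    """Write a silk-screen label; needs root when aimed at real sysfs."""
    path = label_path(mc, csrow, channel, root)
    path.write_text(label + "\n")
    return path


# Dry run against a scratch directory standing in for sysfs.
with tempfile.TemporaryDirectory() as scratch:
    attr = label_path(1, 1, 0, root=scratch)
    attr.parent.mkdir(parents=True)
    written = set_dimm_label(1, 1, 0, "H1_DIMM1A", root=scratch)
    stored = written.read_text().strip()
    print(stored)  # H1_DIMM1A
```

Once a label is set this way, it should appear in place of the empty
label "" in later CE messages for that csrow and channel.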

The lines starting with "EDAC MC1:" are EDAC core output messages, while
the "EDAC k8 MC1:" lines come from the EDAC memory controller driver.
"CE" means correctable error.
MC1 is memory controller 1 (numbering is 0-based).
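
For anyone who wants to pull these fields out of a log, here is a
minimal Python sketch that parses the EDAC core CE line quoted in this
thread. The regex is fitted to that one message format and may not
match other kernel versions. Treating "page" as a page frame number and
using 4 KiB pages to rebuild a physical address is an assumption.

```python
import re

# Fitted to the CE line quoted in this thread; other kernels may differ.
CE_PATTERN = re.compile(
    r'EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), '
    r'offset 0x(?P<offset>[0-9a-f]+), grain (?P<grain>\d+), '
    r'syndrome 0x(?P<syndrome>[0-9a-f]+), row (?P<row>\d+), '
    r'channel (?P<channel>\d+), label "(?P<label>[^"]*)"'
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce(line):
    """Return the fields of a CE line as a dict, or None if no match."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    page = int(match["page"], 16)
    offset = int(match["offset"], 16)
    return {
        "mc": int(match["mc"]),
        "row": int(match["row"]),
        "channel": int(match["channel"]),
        "grain": int(match["grain"]),
        "syndrome": int(match["syndrome"], 16),
        "label": match["label"],
        "phys_addr": (page << PAGE_SHIFT) | offset,
    }


line = ('EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, '
        'syndrome 0xc8f4, row 1, channel 0, label "": k8_edac')
info = parse_ce(line)
print(info["mc"], info["row"], info["channel"])  # 1 1 0
print(hex(info["phys_addr"]))                    # 0xfbf6f4d0
```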

The ECC chipkill x4 logic is what detected the error and corrected it.

The FRU (field-replaceable unit) is the DIMM located at socket
H1_DIMM1A, according to the labeling I mentioned above.

Caveat: the detector is not 100% accurate; it points to a general area
to look at, namely a specific DIMM. Other faults can occasionally
produce what looks like a memory error, but a bad DIMM is the root
cause in the vast majority of cases.

In addition, memtest86+ does not catch every bad DIMM, but it is still
a VERY useful tool.

doug thompson


* EDAC chipkill messages
@ 2007-01-18 22:55 Orion Poplawski
  0 siblings, 0 replies; 4+ messages in thread
From: Orion Poplawski @ 2007-01-18 22:55 UTC
  To: linux-kernel

Can someone please explain to me what these mean?

EDAC k8 MC1: general bus error: participating processor(local node 
origin), time-out(no timeout) memory transaction type(generic read), mem 
or i/o(mem access), cache level(generic)
EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row 
1, channel 0, label "": k8_edac
EDAC MC1: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC1: extended error code: ECC chipkill x4 error

Thanks!

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion@cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com


