From: Borislav Petkov <bp@amd64.org>
To: Kevin Bowling <kevin.bowling@kev009.com>
Cc: "bluesmoke-devel@lists.sourceforge.net"
	<bluesmoke-devel@lists.sourceforge.net>
Subject: Re: Interpreting EDAC errors
Date: Mon, 20 Jun 2011 09:34:22 +0200
Message-ID: <20110620073422.GB9070@aftab>
In-Reply-To: <BANLkTin-O8co99AnG+KT3+CUGmgaudRvYA@mail.gmail.com>

Hi,

On Mon, Jun 20, 2011 at 12:57:26AM -0400, Kevin Bowling wrote:
> Hello,
> 
> I've been seeing the following errors from the EDAC system.  I'm not
> quite sure how to associate the output from edac-util to physical
> DIMMs.  How do we account for multi-rank DIMMs, interleaving, NUMA,
> etc?

Judging by the mainboard, this is a dual socket Magny-Cours. A couple of
things:

* interpreting DRAM ECC errors is still suboptimal and we're working on
it. I'll try to come up with an interim solution to make the decoded
error info a bit more understandable.

* you have one single-bit error each on 4 DIMMs, all corrected by the
memory controller over the current system uptime, so I wouldn't worry
too much. I would monitor the DIMMs though and take action only if
those error rates start to grow over time.
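
A minimal sketch of that kind of monitoring, assuming the edac-util line
format quoted further down in this thread: parse two snapshots and
report which locations gained corrected errors. The names
parse_edac_util and ce_growth are hypothetical helpers, not part of
edac-utils.

```python
import re

# Assumed line format, taken from the edac-util output quoted in this thread:
#   mc0: csrow3: ch0: 1 Corrected Errors
LINE_RE = re.compile(r"mc(\d+): csrow(\d+): ch(\d+): (\d+) Corrected Errors")


def parse_edac_util(text):
    """Return {(mc, csrow, channel): corrected_error_count}."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            mc, csrow, ch, n = map(int, m.groups())
            counts[(mc, csrow, ch)] = n
    return counts


def ce_growth(old, new):
    """Return the locations whose count rose between two snapshots."""
    return {
        loc: n - old.get(loc, 0)
        for loc, n in new.items()
        if n > old.get(loc, 0)
    }


before = parse_edac_util("mc0: csrow3: ch0: 1 Corrected Errors\n")
after = parse_edac_util("mc0: csrow3: ch0: 3 Corrected Errors\n")
print(ce_growth(before, after))  # {(0, 3, 0): 2}
```

Both snapshots should come from the same uptime, since the counts above
are per boot.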

You have 4 8G DIMMs per node but I don't know their rank count, so
please take the below with a grain of salt. Wait,
http://www.alldatasheet.com/datasheet-pdf/pdf/332888/HYNIX/HMT31GR7BFR4C-H9.html
says that yours are actually dual-ranked.

Btw, kernel dmesg output of EDAC should help to pinpoint them better.

> root@PM-LAS-PROD-0:~# edac-util
> mc0: csrow3: ch0: 1 Corrected Errors

This should be P1_DIMM1A if your DIMMs are quad-ranked, P1_DIMM2A if
dual-ranked.

> mc1: csrow2: ch0: 1 Corrected Errors

P1_DIMM3A or P1_DIMM4A as above. Also, I'm assuming that the silkscreen
labels are numbered in the same increasing order as the memory
controllers, i.e.:

mc0 -> 1A, 2A
mc1 -> 3A, 4A
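
The lookup for those two controllers can be sketched as below. This is
only an illustration under the assumptions stated in this mail: each
rank shows up as one csrow, csrows are numbered DIMM by DIMM, and the
labels follow the table above. The names MC_LABELS and dimm_label are
hypothetical, and controllers without a table entry are deliberately
left out.

```python
# Silkscreen labels per memory controller, as assumed in this mail.
# Only mc0 and mc1 are listed because no table is given for the others.
MC_LABELS = {
    0: ["P1_DIMM1A", "P1_DIMM2A"],
    1: ["P1_DIMM3A", "P1_DIMM4A"],
}


def dimm_label(mc, csrow, ranks_per_dimm):
    """Map an EDAC (mc, csrow) pair to a silkscreen label.

    Assumes one csrow per rank, numbered DIMM by DIMM, so that
    csrow // ranks_per_dimm is the DIMM index on that controller.
    """
    labels = MC_LABELS[mc]  # raises KeyError for unlisted controllers
    return labels[csrow // ranks_per_dimm]


print(dimm_label(0, 3, 4))  # quad-ranked: P1_DIMM1A
print(dimm_label(0, 3, 2))  # dual-ranked: P1_DIMM2A
print(dimm_label(1, 2, 2))  # dual-ranked: P1_DIMM4A
```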

> mc2: csrow3: ch0: 1 Corrected Errors
> mc2: csrow3: ch1: 1 Corrected Errors

This looks like P2_DIMM3A.

So, yeah, it is suboptimal and it needs fixing, I know.

HTH.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
