From: Borislav Petkov <bp@amd64.org>
To: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: "Mark A. Grondona" <mgrondona@llnl.gov>,
Linux Edac Mailing List <linux-edac@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/6] Add a per-dimm structure
Date: Mon, 12 Mar 2012 17:39:11 +0100
Message-ID: <20120312163911.GC7255@aftab>
In-Reply-To: <4F5C9B6C.2080904@redhat.com>

On Sun, Mar 11, 2012 at 09:32:44AM -0300, Mauro Carvalho Chehab wrote:
> Well, this change can be done, but still we need to decide how to export it ;)
>
> The new edac_mc_handle_error(), which replaces all the legacy edac_mc_handle* calls,
> does what the other calls used to do. I didn't change its behavior. Anyway, what
> it does for UE errors is:
>
>     ...
>     /* Some logic to get the memory DIMM labels */
>     trace_mc_error(type, mci->mc_idx, msg, label, location,
>                    detail, other_detail);
>
>     if (type == HW_EVENT_ERR_CORRECTED) {
>         ...
>     } else {
>         ...
>         if (edac_mc_get_log_ue())
>             edac_mc_printk(mci, KERN_WARNING,
>                            "UE %s on %s (%s%s %s)\n",
>                            msg, label, location, detail, other_detail);
>
>         if (edac_mc_get_panic_on_ue())
>             panic("UE %s on %s (%s%s %s)\n",
>                   msg, label, location, detail, other_detail);
>
>         edac_increment_ue_error(mci, enable_filter, pos);
>     }
>
> So, it basically:
> 1) prints the memory location and the DIMM label(s) of the memory(ies)
> from where the error originates;
> 2) if edac_mc_panic_on_ue is set, it will panic;
> 3) otherwise, it will increment the UE error counters.
>
> It shouldn't be hard to add a patch to disable the sysfs error UE counters if
> edac_mc_panic_on_ue is enabled.

Err, the fact that you have UE counters doesn't have anything to do with
the request that you want to panic on a UE, especially since conservative
systems would panic on the first UE anyway, without asking software.

So what I meant was to make it optional in the core edac code whether
you want to install a UE counter in the ranks or not. So that, for
example, if amd64_edac doesn't want to have UE counters, it simply
says so and the core generates only CE counters per rank. Or, with
positive logic, an edac driver explicitly requests what counters it
wants installed.
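
To make the "positive logic" variant concrete, here is a minimal sketch in
plain userspace C, not kernel code. Every name in it (EDAC_CNT_CE,
EDAC_CNT_UE, struct rank_counters, rank_counters_init(), rank_count_error())
is hypothetical and only illustrates the idea; none of them exist in the
EDAC core:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags: which per-rank counters a driver asks the core for. */
enum edac_counter_flags {
	EDAC_CNT_CE = 1u << 0,	/* corrected errors */
	EDAC_CNT_UE = 1u << 1,	/* uncorrected errors */
};

/* Hypothetical per-rank state kept by the core. */
struct rank_counters {
	unsigned int installed;	/* mask of enum edac_counter_flags */
	unsigned int ce_count;
	unsigned int ue_count;
};

/* Core side: install only the counters the driver explicitly requested. */
static void rank_counters_init(struct rank_counters *rc, unsigned int wanted)
{
	assert(rc != NULL);
	rc->installed = wanted & (EDAC_CNT_CE | EDAC_CNT_UE);
	rc->ce_count = 0;
	rc->ue_count = 0;
}

/*
 * Core side: count one error of the given kind. Returns false, and counts
 * nothing, if the driver never requested that counter.
 */
static bool rank_count_error(struct rank_counters *rc,
			     enum edac_counter_flags which)
{
	assert(rc != NULL);
	if (!(rc->installed & which))
		return false;

	switch (which) {
	case EDAC_CNT_CE:
		rc->ce_count++;
		return true;
	case EDAC_CNT_UE:
		rc->ue_count++;
		return true;
	}
	return false;
}
```

A driver that only wants CE counters would pass EDAC_CNT_CE at init time,
and a later rank_count_error(&rc, EDAC_CNT_UE) would then simply be a no-op.
Whether a real implementation would skip creating the sysfs node as well is
left open here.
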
> Anyway, an UE error with a 128 bits cacheline points to a location that has
> two DIMMs (or 4 DIMMs, on memory controllers with mirror mode enabled). So,
> incrementing a DIMM error counter doesn't seem to be the right thing to do.
>
> Well, it may increment two DIMM error counters (or 4 DIMM error counters), but
> it would change the current behavior.
>
> It should also be noted that the MCA-based Intel memory controllers have the
> (likely limited) capability of recovering from a UE error. So, a UE error
> may not mean a fatal error, and the UE error counter value can actually be
> bigger than 1.

Yes, that's why I'd make it optional - if the hardware can support it,
it can have it. If it doesn't make sense, then there's no need for it -
that simple.

>
> >
> > [..]
> >
> >> One alternative would simply be to remove all those intermediate
> >> counters, letting userspace count the errors via perf (provided
> >> that we have a proper location field).
> >
> > Yes, that would be where we want to go eventually because I too don't
> > see any reason for those counters. Besides, they don't decay over time,
> > for example, say you have a DIMM which experiences a temporary failure
> > and generates k CEs. Then, the source of that error disappears and the
> > DIMM works fine for months.
>
> Userspace applications may reset the error counters. There is a sysfs node
> for it.

No, I'm not talking about resetting but decaying. I.e., each
error counted is only valid for a certain time and gets discarded
after a while - similar to the leaky bucket algorithm:
http://en.wikipedia.org/wiki/Leaky_bucket
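
Roughly, as a userspace C sketch and not a proposal for the actual
interface: the bucket fills by one per counted error and drains by one per
elapsed period. The names (struct leaky_counter, leaky_init(),
leaky_record(), leaky_read()) and the choice of seconds as the time unit are
assumptions made up for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decaying error counter; not an existing EDAC structure. */
struct leaky_counter {
	uint32_t level;		/* errors currently in the bucket */
	uint64_t last_leak;	/* time of the last accounted leak, in seconds */
	uint64_t leak_period;	/* seconds until one counted error expires */
};

static void leaky_init(struct leaky_counter *lc, uint64_t now,
		       uint64_t leak_period)
{
	assert(leak_period > 0);
	lc->level = 0;
	lc->last_leak = now;
	lc->leak_period = leak_period;
}

/* Drain one error per fully elapsed leak_period, never below zero. */
static void leaky_decay(struct leaky_counter *lc, uint64_t now)
{
	uint64_t leaked;

	if (now <= lc->last_leak)
		return;

	leaked = (now - lc->last_leak) / lc->leak_period;
	if (leaked >= lc->level) {
		/* Bucket is empty: restart the clock from now. */
		lc->level = 0;
		lc->last_leak = now;
	} else {
		/* Keep the partial period so it counts towards the next leak. */
		lc->level -= (uint32_t)leaked;
		lc->last_leak += leaked * lc->leak_period;
	}
}

/* Count one new error at time now and return the decayed total. */
static uint32_t leaky_record(struct leaky_counter *lc, uint64_t now)
{
	leaky_decay(lc, now);
	if (lc->level < UINT32_MAX)
		lc->level++;
	return lc->level;
}

/* Return the decayed total at time now without counting a new error. */
static uint32_t leaky_read(struct leaky_counter *lc, uint64_t now)
{
	leaky_decay(lc, now);
	return lc->level;
}
```

With this, a DIMM which generates k CEs during a temporary failure and then
works fine for months reads back as 0 again, without userspace having to
reset anything.
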
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551