From: Borislav Petkov <bp@amd64.org>
To: Mike Waychison <mikew@google.com>
Cc: "mingo@elte.hu" <mingo@elte.hu>,
"rdunlap@xenotime.net" <rdunlap@xenotime.net>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
Mauro Carvalho Chehab <mchehab@redhat.com>,
Duncan Laurie <dlaurie@google.com>
Subject: Re: [PATCH] x86: Add an option to disable decoding of MCE
Date: Tue, 11 Jan 2011 23:48:09 +0100
Message-ID: <20110111224809.GA20839@aftab>
In-Reply-To: <AANLkTimQNnxnoFPWjh6cHp_x9ZayDdcXdV+qbee8XK_r@mail.gmail.com>
On Tue, Jan 11, 2011 at 04:51:57PM -0500, Mike Waychison wrote:
> On Tue, Jan 11, 2011 at 12:49 PM, Borislav Petkov <bp@amd64.org> wrote:
> > Ok, let me preface this with an even easier suggestion: Can you simply
> > not compile EDAC (which includes CONFIG_EDAC_DECODE_MCE) in your kernels
> > and the whole issue with decoding disappears simply because no module
> > registers as a decoder...?
>
> The trouble here is that default_decode_mce() is still getting called
> no matter what :( It didn't really cause problems until you added
> atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m) to
> machine_check_poll().
Yeah, I figured as much... after hitting send :(. Feel free to add my
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
to your patch.
> > Right, and this means that you need to know all the memory controller
> > topologies of all the different architectures and also the SPD accessing
> > based on a board type could be a pain. One of the main reasons for
> > fleshing out MCE decoding in the kernel was to avoid needless trouble
> > like that.
>
> It's not that painful for us. Our firmware guys own this userland
> code :) I can see it being a pain for others however.
Yeah, what we actually need is a chipset-agnostic way (ACPI maybe?)
of mapping chip selects to the actual DIMMs so all that above can be
avoided...
> Doing this all in userland does have its upsides as well fwiw. For
> example, MC4 is usually kept around across a reset, which means
> that the firmware can pick it up when the system goes down due to a
> processor context corruption.
True, although not always reliable due to that context corruption. Tony
Luck is adding code for writing the error information to persistent
storage _before_ you reset, so on platforms with such a device you'll
be able to look at MC4 (or any of the MCE regs, for that matter) after
reset, which, I think, would make error analysis much more convenient.
See http://marc.info/?l=linux-arch&m=129356804427008&w=2.
> We rely on these libraries as well to decode egregious uncorrectable
> memory errors as well as bus errors (like hypertransport sync floods).
Yeah, I don't think error decoding would help, at least on AMD, with
uncorrectable errors causing syncflood. We might do something about it
though :).
> > I've
> > heard similar troubles reported by other big server farm people and
> > what I'm currently working on is a RAS daemon that hooks into perf thus
> > enabling persistent performance events. This way, you could open a
> > debugfs file (this'll move to sysfs someday) and read the same decoded
> > data by mmaping the perf ringbuffer.
>
> I'll definitely keep an eye out for your developments with RAS :)
Cool, I think I have the bulk of your requirements now, from your and
Duncan's mails. I'm always open to other suggestions too.
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632