From: Mike Waychison <mikew@google.com>
To: Borislav Petkov <bp@amd64.org>
Cc: "mingo@elte.hu" <mingo@elte.hu>,
"rdunlap@xenotime.net" <rdunlap@xenotime.net>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
Mauro Carvalho Chehab <mchehab@redhat.com>,
Duncan Laurie <dlaurie@google.com>
Subject: Re: [PATCH] x86: Add an option to disable decoding of MCE
Date: Tue, 11 Jan 2011 13:51:57 -0800
Message-ID: <AANLkTimQNnxnoFPWjh6cHp_x9ZayDdcXdV+qbee8XK_r@mail.gmail.com>
In-Reply-To: <20110111204909.GE17619@aftab>

On Tue, Jan 11, 2011 at 12:49 PM, Borislav Petkov <bp@amd64.org> wrote:
> Ok, let me preface this with an even easier suggestion: Can you simply
> not compile EDAC (which includes CONFIG_EDAC_DECODE_MCE) in your kernels
> and the whole issue with decoding disappears simply because no module
> registers as a decoder...?

The trouble here is that default_decode_mce() is still getting called
no matter what :( It didn't really cause problems until you added
atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m) to
machine_check_poll().

> On Tue, Jan 11, 2011 at 02:56:50PM -0500, Mike Waychison wrote:
>> >> On our systems, we do not want to have any "decoders" called on machine
>> >> check events. These decoders can easily spam our logs and cause space
>> >> problems on machines that have a lot of correctable error events. We
>> >> _do_ however want to get the messages delivered via /dev/mcelog for
>> >> userland processing.
>> >
>> > Ok, question: how do you guys process DRAM ECCs? And more specifically,
>> > with a large number of machines, how do you do the mapping from the DRAM
>> > ECC error address reported by MCA to a DIMM that's failing in userspace
>> > on a particular machine?
>>
>> We process machine checks in userland, using carnal knowledge of the
>> memory controller and the board specific addressing of the SPDs on the
>> various i2c busses to deswizzle and make sense of the addresses and
>> symptoms. We then expose this digested data on the network, which is
>> dealt with at the cluster level.
>
> Right, and this means that you need to know all the memory controller
> topologies of all the different architectures and also the SPD accessing
> based on a board type could be a pain. One of the main reasons for
> fleshing out MCE decoding in the kernel was to avoid needless trouble
> like that.

It's not that painful for us. Our firmware guys own this userland
code :) I can see it being a pain for others, however.

Doing this all in userland does have its upsides as well, fwiw. For
example, MC4 is usually kept around across a reset, which means that
the firmware can pick it up when the system goes down due to a
processor context corruption. We also rely on these libraries to
decode egregious uncorrectable memory errors as well as bus errors
(like HyperTransport sync floods).

*snip*

> I've
> heard similar troubles reported by other big server farm people and
> what I'm currently working on is a RAS daemon that hooks into perf thus
> enabling persistent performance events. This way, you could open a
> debugfs file (this'll move to sysfs someday) and read the same decoded
> data by mmaping the perf ringbuffer.

I'll definitely keep an eye out for your developments with RAS :)

>
> I think this is also easily disabled by not configuring EDAC, as I said
> above. Basically, if you don't enable EDAC, you can drop that patch
> too and run your kernels without any modification, or am I missing
> something..?

See the comment above. It's the spamming from the default decoder that
I wanted to disable specifically, though my approach was a knob to
disable decoders generally.
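
[Editorial sketch, not part of the original message or the patch.] The
behaviour described above can be modelled in a few lines of userspace C.
This is only a toy under stated assumptions: struct mce_event,
default_decoder(), register_decoder(), poll_event() and the
decode_disabled flag are hypothetical names invented for illustration,
not the kernel's actual types, functions or option. It shows a fallback
decoder that logs a line for every polled event, and a knob that skips
the decoder chain while still queuing the record for userland.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, stripped-down stand-in for a machine check record. */
struct mce_event {
    int bank;
    unsigned long long status;
};

typedef void (*decoder_fn)(const struct mce_event *ev);

#define MAX_DECODERS 4

static decoder_fn chain[MAX_DECODERS]; /* toy model of a decoder chain */
static size_t chain_len;
static int decode_disabled;            /* hypothetical knob, off by default */
static int queued;                     /* records handed to userland */
static int lines_logged;               /* lines the decoders wrote to the log */

/* Fallback decoder: cannot interpret the event, but still logs a line. */
static void default_decoder(const struct mce_event *ev)
{
    printf("undecoded MCE: bank %d status 0x%llx\n", ev->bank, ev->status);
    lines_logged++;
}

/* Append a decoder to the chain; returns 0 on success, -1 if full. */
static int register_decoder(decoder_fn fn)
{
    assert(fn != NULL);
    if (chain_len == MAX_DECODERS)
        return -1;
    chain[chain_len++] = fn;
    return 0;
}

/* Poll path: always queue the record, run decoders only if allowed. */
static void poll_event(const struct mce_event *ev)
{
    size_t i;

    queued++;
    if (decode_disabled)
        return;
    for (i = 0; i < chain_len; i++)
        chain[i](ev);
}
```

Registering default_decoder() and calling poll_event() twice, with
decode_disabled set to 1 before the second call, leaves queued at 2 and
lines_logged at 1: the record still reaches userland, but the log stays
quiet.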