From: Borislav Petkov <bp@amd64.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>, Borislav Petkov <bp@amd64.org>,
"Luck, Tony" <tony.luck@intel.com>,
Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
Mauro Carvalho Chehab <mchehab@redhat.com>,
"Young, Brent" <brent.young@intel.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Matt Domsch <Matt_Domsch@dell.com>,
Doug Thompson <dougthompson@xmission.com>,
Joe Perches <joe@perches.com>, Ingo Molnar <mingo@elte.hu>,
"bluesmoke-devel@lists.sourceforge.net"
<bluesmoke-devel@lists.sourceforge.net>,
Linux Edac Mailing List <linux-edac@vger.kernel.org>
Subject: Re: Hardware Error Kernel Mini-Summit
Date: Wed, 19 May 2010 08:46:19 +0200
Message-ID: <20100519064619.GA30320@aftab>
In-Reply-To: <m1wrv0zo9q.fsf@fess.ebiederm.org>
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, May 18, 2010 at 09:14:09PM -0400
> - Errors that occur frequently. That is broken hardware of one time or
> another. I want to know about that so I can schedule down time to replace
> my memory before I get an uncorrected ECC error. Errors of this kind
> are likely happening frequently enough as to impact performance.
This is exactly the reason why we need better error logging and
reporting than a log. How do you want to discover trends and count CECCs
per DIMM if you have to scan the logs all the time and grep for the DRAM
page the error happened in, the CS row it is located in, and whether
that is in the same DIMM as the 115th error back in the log? This gets
especially tricky if you're using one of the gazillion memory
interleaving schemes.
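To make the bookkeeping concrete, here is a minimal sketch of per-DIMM
counting done on already-decoded events instead of by grepping a log. The
record layout, the DIMM labels and the function names are all hypothetical;
it assumes something has already mapped each error to a DIMM, which is
exactly the part that interleaving makes hard.

```python
from collections import Counter, namedtuple

# Hypothetical record layout: a real decoder would derive these fields
# from the memory controller's address map, including interleaving.
CorrectedError = namedtuple("CorrectedError", "timestamp dimm csrow page")


def count_per_dimm(events, window_start):
    """Count corrected ECC errors per DIMM at or after window_start."""
    counts = Counter()
    for ev in events:
        if ev.timestamp >= window_start:
            counts[ev.dimm] += 1
    return counts


def dimms_over_threshold(counts, threshold):
    """Return the sorted DIMM labels whose count reached the threshold."""
    return sorted(d for d, n in counts.items() if n >= threshold)


# Made-up sample data, for illustration only.
events = [
    CorrectedError(100, "DIMM_A1", 2, 0x1F00),
    CorrectedError(130, "DIMM_A1", 2, 0x1F00),
    CorrectedError(140, "DIMM_B0", 5, 0x0420),
    CorrectedError(900, "DIMM_A1", 3, 0x2A10),
]
counts = count_per_dimm(events, window_start=120)
suspects = dimms_over_threshold(counts, threshold=2)
```

With structured records the trend question becomes a lookup in `counts`
rather than a search through text.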
Ok, and what about other errors like L3 cache errors, for example? You
want to count those too and, upon reaching a threshold, disable a cache
index _before_ a correctable ECC turns into an uncorrectable error,
bringing the whole system down with a critical MCE.
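The threshold idea could look roughly like the sketch below. It is only an
illustration: the class name, the threshold value and the `disable_index`
callback are assumptions, and the callback stands in for whatever
hardware-specific mechanism actually disables a cache index.

```python
class CacheIndexMonitor:
    """Count corrected errors per cache index; disable an index at a threshold."""

    def __init__(self, threshold, disable_index):
        self.threshold = threshold
        # Hypothetical callback standing in for the real disable mechanism.
        self.disable_index = disable_index
        self.counts = {}
        self.disabled = set()

    def record_corrected(self, index):
        """Record one corrected error; return True if this call disabled the index."""
        if index in self.disabled:
            return False
        self.counts[index] = self.counts.get(index, 0) + 1
        if self.counts[index] >= self.threshold:
            self.disable_index(index)
            self.disabled.add(index)
            return True
        return False


# Illustration: the callback just records which indices were disabled.
disabled_log = []
monitor = CacheIndexMonitor(threshold=3, disable_index=disabled_log.append)
results = [monitor.record_corrected(0x2B) for _ in range(4)]
```

The index is taken out of service on the third corrected error, while the
errors are still correctable.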
How about error injection? You want to test the hardware/software by
injecting real hardware errors and not by simulating it all in software.
And also you want to be able to schedule different maintenance actions
depending on the severity of the error and in certain cases get away
with a clean shutdown even in the face of an uncorrectable error.
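Severity-dependent handling could be sketched as a small policy function.
The severity classes, the action names and the `context_intact` flag are
assumptions made up for this illustration and do not describe any existing
kernel interface.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity classes, for illustration only.
    CORRECTED = 1
    UNCORRECTED_RECOVERABLE = 2
    FATAL = 3


class Action(Enum):
    LOG_AND_COUNT = "log and count"
    CLEAN_SHUTDOWN = "clean shutdown"
    PANIC = "panic"


def choose_action(severity, context_intact):
    """Pick a maintenance action for an error.

    context_intact is a hypothetical flag meaning the system state is still
    trustworthy enough to run an orderly shutdown.
    """
    if severity is Severity.CORRECTED:
        return Action.LOG_AND_COUNT
    if severity is Severity.UNCORRECTED_RECOVERABLE and context_intact:
        return Action.CLEAN_SHUTDOWN
    return Action.PANIC
```

For example, `choose_action(Severity.UNCORRECTED_RECOVERABLE, True)` yields
`Action.CLEAN_SHUTDOWN`, while the same error with a corrupted context falls
through to `Action.PANIC`.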
So, the whole idea entails much more than reporting errors in the syslog
but rather making the system intelligent enough to prolong its own life
and be able to warn the user that something bad is about to happen.
And we don't have that right now - right now we say that some machine
checks have been logged and with uncorrectable MCEs we freeze cowardly
and hope to be able to make a warm reset so that the MCA MSRs still
contain some valid data which we can decode painstakingly by hand.
I hope this makes our intentions a bit clearer.
--
Regards/Gruss,
Boris.
Operating Systems Research Center
Advanced Micro Devices, Inc.