From: ebiederm@xmission.com (Eric W. Biederman)
To: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@amd64.org>,
	"Luck, Tony" <tony.luck@intel.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	Mauro Carvalho Chehab <mchehab@redhat.com>,
	"Young, Brent" <brent.young@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Matt Domsch <Matt_Domsch@dell.com>,
	Doug Thompson <dougthompson@xmission.com>,
	Joe Perches <joe@perches.com>, Ingo Molnar <mingo@elte.hu>,
	"bluesmoke-devel@lists.sourceforge.net"
	<bluesmoke-devel@lists.sourceforge.net>,
	Linux Edac Mailing List <linux-edac@vger.kernel.org>
Subject: Re: Hardware Error Kernel Mini-Summit
Date: Tue, 18 May 2010 18:14:09 -0700
Message-ID: <m1wrv0zo9q.fsf@fess.ebiederm.org>
In-Reply-To: <20100518222832.GJ22675@basil.fritz.box> (Andi Kleen's message of "Wed\, 19 May 2010 00\:28\:33 +0200")

Andi Kleen <andi@firstfloor.org> writes:

> The original motivation to put them somewhere else
> because I was sick of people reporting them as kernel bugs.

This suggests that to get things reported in dmesg I should
set up a cron job that pulls the latest kernel, checks to see
if things are reported into syslog, and sends you an email
if things are wrong.

I'm not ready to believe the average person who is running Linux
is too stupid to understand the difference between a hardware
error and a software error.

> But there's more to it now:
>
>> If your system isn't broken correctable errors are rare.  People look
>
> Actually the more memory you have the more common they are.
> And the trend is to more and more memory.

The error rate should not be fixed per bit but should be roughly fixed
per DIMM.  If the error rate over time is fixed per bit we are in deep
trouble.
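
To make the distinction concrete, here is a minimal sketch of the two
scaling assumptions. The function names and every rate in it are
hypothetical placeholders, not measured values; the only point is how
the totals move as DIMM capacity grows.

```python
# Hypothetical illustration only: the rates passed in below are made-up
# placeholders, not measurements from any real system.

def total_errors_fixed_per_dimm(num_dimms, rate_per_dimm_per_year):
    """Rate fixed per DIMM: the total depends only on the DIMM count."""
    return num_dimms * rate_per_dimm_per_year


def total_errors_fixed_per_bit(num_dimms, gb_per_dimm, rate_per_gb_per_year):
    """Rate fixed per bit: the total grows with capacity, so denser
    DIMMs produce proportionally more errors."""
    return num_dimms * gb_per_dimm * rate_per_gb_per_year


if __name__ == "__main__":
    # Same 8-slot machine, populated with progressively denser DIMMs.
    for gb in (1, 4, 16):
        per_dimm = total_errors_fixed_per_dimm(8, 1.0)
        per_bit = total_errors_fixed_per_bit(8, gb, 1.0)
        print(f"{gb:>2} GB DIMMs: per-DIMM model {per_dimm:.0f}/yr, "
              f"per-bit model {per_bit:.0f}/yr")
```

Under the per-DIMM assumption the total stays flat as the sticks get
denser; under the per-bit assumption it grows in step with capacity.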

> Really to do anything useful with them you need trends
> and automatic actions (like predictive page offlining)

Not at all, and I don't have a clue where you start thinking
predictive page offlining makes the least bit of sense.  Broken
or even weak bits are rarely the reason for ECC errors.

> A log isn't really a good format for that

A log is a fine format for realizing you have a problem.  A
log doesn't need to be the only place errors are reported
but a log should be the default place ECC errors are reported.
We do that with hard drive errors and other kinds of hardware
errors and we have done it for years without problems.

My experience is that correctable ECC errors come in two kinds of
frequencies.

- The expected single bit correctable error range, which is somewhere
  between once a month and once a year per DIMM.

  On the most unreasonable configuration I ever worked with, 4TB of RAM
  in 1GB sticks up at Los Alamos, at 7000ft in an environment known
  to trigger errors, I saw roughly one correctable ECC error an hour.
  Huge but just barely within the expected range.

  I can live with a log message once a month on a mundane system.

- Errors that occur frequently.  That is broken hardware of one kind or
  another.  I want to know about that so I can schedule down time to replace
  my memory before I get an uncorrected ECC error.  Errors of this kind
  are likely happening frequently enough as to impact performance.
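
As a rough sanity check on the Los Alamos figure, the arithmetic can be
sketched as below.  It assumes 4TB in 1GB sticks means 4096 DIMMs (binary
units), and it treats once a month per DIMM as the cutoff between the two
kinds; both are assumptions for illustration, and the function names are
hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def errors_per_dimm_per_year(total_errors_per_hour, num_dimms):
    """Convert a machine-wide hourly error rate into a per-DIMM yearly rate."""
    return total_errors_per_hour * HOURS_PER_YEAR / num_dimms


def classify(rate_per_dimm_per_year, monthly_cutoff=12.0):
    """Assumed rule: at most about once a month per DIMM is the expected
    kind; anything more frequent is treated as likely broken hardware."""
    return "expected" if rate_per_dimm_per_year <= monthly_cutoff else "frequent"


if __name__ == "__main__":
    num_dimms = 4 * 1024  # assumption: 4TB / 1GB per stick, binary units
    rate = errors_per_dimm_per_year(1, num_dimms)
    print(f"{rate:.2f} errors per DIMM per year -> {classify(rate)}")
```

Under those assumptions, one error an hour across the whole machine works
out to roughly two per DIMM per year, which falls between once a year and
once a month.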

Eric
