public inbox for linux-kernel@vger.kernel.org
From: Andi Kleen <andi@firstfloor.org>
To: Russ Anderson <rja@sgi.com>
Cc: mingo@elte.hu, tglx@linutronix.de,
	Tony Luck <tony.luck@intel.com>,
	linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org
Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
Date: Sat, 19 Jul 2008 12:37:11 +0200
Message-ID: <87prpa88iw.fsf@basil.nowhere.org>
In-Reply-To: <20080718203514.GD29621@sgi.com> (Russ Anderson's message of "Fri, 18 Jul 2008 15:35:14 -0500")

Russ Anderson <rja@sgi.com> writes:

> [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

FWIW I discussed this with some hardware people and the general
opinion was that it is way too aggressive to disable a page on the
first corrected error, as this patchkit currently does.

The corrected bit error could be caused by a temporary condition
e.g. in the DIMM link, and does not necessarily mean that part of the
DIMM is really going bad. Permanently disabling would only be
justified if you saw repeated corrected errors over a long time from
the same DIMM.

There are also some potential scenarios where being so aggressive
could hurt. For example, if you have a low rate of corrected events
spread randomly all over your memory (e.g. from a flaky DIMM
connection), then after a long enough uptime you could lose significant
parts of your memory even though the DIMM is actually still ok.
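
To put rough numbers on that scenario, here is a minimal sketch in C.
It assumes the worst case, where every corrected event lands on a page
that has not been retired yet, so the result is an upper bound. The
function name and all inputs are hypothetical and are not taken from
the patchkit.

```c
#include <stdint.h>

/*
 * Upper bound on the memory retired by a "disable on first corrected
 * error" policy, assuming each event hits a page that is still in use.
 * All three parameters are hypothetical inputs, not measurements.
 * Note: no overflow checking, so keep the inputs to realistic sizes.
 */
static uint64_t retired_bytes(uint64_t events_per_day,
                              uint64_t uptime_days,
                              uint64_t page_size)
{
    /* One retired page per event. */
    uint64_t pages = events_per_day * uptime_days;

    return pages * page_size;
}
```

With purely illustrative inputs of 100 events per day, 730 days of
uptime and 16 KiB pages, this comes to 73,000 retired pages, or about
1.1 GiB, without any single location ever failing twice.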

The other issue is that if the DIMM is going bad, then the affected
area is likely larger than just the lines making up this page. So you
would still risk uncorrected errors anyway, because disabling
the page would only cover a small subset of the affected area.

If you really wanted to do this you should probably hook it up
to mcelog's (or the IA64 equivalent) DIMM database and then
control it from user space with suitably large thresholds
and DIMM-specific knowledge. But it's unlikely that it can really be
done nicely in a way that is isolated from very specific
knowledge about the underlying memory configuration.
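
As a rough illustration of what a threshold-based policy could look
like, here is a minimal sketch in C of a fixed-window counter kept per
DIMM. It is only a sketch under stated assumptions: the struct and
function names are hypothetical, this is not mcelog's actual interface,
and real thresholds would have to come from DIMM-specific knowledge.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-DIMM state; not an mcelog data structure. */
struct dimm_ce_state {
    uint64_t window_start;  /* time (seconds) the current window began */
    uint32_t count;         /* corrected errors seen in this window */
};

/* Hypothetical policy knobs, set from user space. */
struct ce_policy {
    uint64_t window_secs;   /* length of one observation window */
    uint32_t threshold;     /* errors per window before acting */
};

/*
 * Record one corrected error on a DIMM at time 'now' (seconds).
 * Returns true once the count within the current window reaches the
 * threshold; isolated errors separated by more than one window each
 * start a fresh window and never trigger.
 */
static bool dimm_ce_record(struct dimm_ce_state *st,
                           const struct ce_policy *pol, uint64_t now)
{
    if (st->count == 0 || now - st->window_start >= pol->window_secs) {
        /* First error, or the old window expired: start a new one. */
        st->window_start = now;
        st->count = 0;
    }
    st->count++;
    return st->count >= pol->threshold;
}
```

A caller would keep one zero-initialized dimm_ce_state per DIMM and
only consider retiring pages on that DIMM when dimm_ce_record()
returns true. A fixed window is the simplest choice here; a sliding
window or a decaying counter would be reasonable alternatives.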

-Andi

