public inbox for linux-kernel@vger.kernel.org
From: Andi Kleen <andi@firstfloor.org>
To: Matthew Wilcox <matthew@wil.cx>
Cc: Russ Anderson <rja@sgi.com>,
	mingo@elte.hu, tglx@linutronix.de,
	Tony Luck <tony.luck@intel.com>,
	linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org
Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
Date: Sat, 19 Jul 2008 17:06:49 +0200
Message-ID: <87ljzx9aly.fsf@basil.nowhere.org>
In-Reply-To: <20080719121328.GA20138@parisc-linux.org> (Matthew Wilcox's message of "Sat, 19 Jul 2008 06:13:28 -0600")

Matthew Wilcox <matthew@wil.cx> writes:

> On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
>> Russ Anderson <rja@sgi.com> writes:
>> 
>> > [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
>> 
>> FWIW I discussed this with some hardware people and the general
>> opinion was that it was way too aggressive to disable a page on the
>> first corrected error like this patchkit currently does.  
>
> I think it's reasonable to take a page out of service on the first error.
> Then a user program needs to be notified of which bit is suspected.
> It can then subject that page to an intense set of tests (I'd start
> by stealing the ones from memtest86+) and if no more errors are found,
> it could return the page to service.

That would only really help if only parts of that specific page
are corrupted.  But my understanding is that DIMM failures usually
cluster in larger units (channels, DIMMs, memory chips on them, banks
inside the chips, etc., all far larger than a 4K page).

So for your proposal to work, you would need to test in units of
whole DIMMs, or at least all of their pages; otherwise it is somewhat
pointless. Since memory systems typically interleave, this would
likely need to be done across multiple DIMMs, potentially covering a
large memory area.
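
To make the interleaving point concrete, here is a minimal sketch of a toy address mapping. Everything in it is an assumption chosen for illustration: the 64-byte interleave granularity, the four-DIMM interleave set, and the helper names are hypothetical and do not describe any particular memory controller or anything in the patchkit. It only shows why, under such a mapping, a single suspect DIMM ends up implicating every 4K page in the interleaved range.

```python
PAGE_SIZE = 4096          # bytes per page, matching the 4K page discussed above
INTERLEAVE_BYTES = 64     # ASSUMPTION: interleave granularity of one cache line
NUM_DIMMS = 4             # ASSUMPTION: number of DIMMs in the interleave set


def dimm_of(phys_addr):
    """Return the DIMM index backing a physical address in this toy model."""
    return (phys_addr // INTERLEAVE_BYTES) % NUM_DIMMS


def dimms_backing_page(pfn):
    """Return the set of DIMMs that hold some part of page frame `pfn`."""
    start = pfn * PAGE_SIZE
    return {
        dimm_of(addr)
        for addr in range(start, start + PAGE_SIZE, INTERLEAVE_BYTES)
    }


def pages_touching_dimm(dimm, num_pages):
    """Return the page frames in [0, num_pages) that have data on `dimm`."""
    return [pfn for pfn in range(num_pages) if dimm in dimms_backing_page(pfn)]


if __name__ == "__main__":
    # Which DIMMs back page frame 0?
    print(sorted(dimms_backing_page(0)))
    # How many of the first 1024 pages have data on DIMM 2?
    print(len(pages_touching_dimm(2, 1024)))
```

In this model each page is striped across all four DIMMs, so testing or retiring memory per DIMM means touching every page in the interleaved region rather than one page. Real systems may interleave at other granularities (or per channel or bank), which changes the numbers but not the general shape of the problem.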

In the end you'll end up with most of the mess of memory hot unplug,
because the more memory is affected, the more likely it is that
some unmovable kernel data is affected.

-Andi


Thread overview: 10+ messages
2008-07-18 20:35 [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7) Russ Anderson
2008-07-19 10:37 ` Andi Kleen
2008-07-19 12:13   ` Matthew Wilcox
2008-07-19 15:06     ` Andi Kleen [this message]
2008-07-20 17:50     ` Russ Anderson
2008-07-20 17:39   ` Russ Anderson
2008-07-21 19:11     ` Alex Williamson
2008-07-21 19:45       ` Russ Anderson
2008-07-21 19:40     ` Andi Kleen
2008-07-28 21:44       ` Russ Anderson
