public inbox for linux-kernel@vger.kernel.org
From: Nick Piggin <npiggin@suse.de>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Andi Kleen <andi@firstfloor.org>,
	"riel@redhat.com" <riel@redhat.com>,
	"chris.mason@oracle.com" <chris.mason@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
Date: Mon, 15 Jun 2009 09:04:14 +0200
Message-ID: <20090615070414.GD18390@wotan.suse.de>
In-Reply-To: <4A328444.3010301@zytor.com>

On Fri, Jun 12, 2009 at 09:37:24AM -0700, H. Peter Anvin wrote:
> Ingo Molnar wrote:
> > 
> > So i think hwpoison simply does not affect our ability to get log 
> > messages out - but it sure allows crappier hardware to be used.
> > Am i wrong about that for some reason?
> > 
> 
> Crappy hardware isn't the kind of hardware that is likely to have the
> hwpoison features, just like crappy hardware generally doesn't even have
> ECC -- or even basic parity checking (I personally think non-ECC memory
> should be considered a crime against humanity in this day and age.)

What I would find interesting with hwpoison is the difference in
probability between a detected uncorrected error and an undetected error.

 
> These kinds of features are used when extremely high reliability is
> required, think for example a telco core router.  A page error may have
> happened due to stray radiation or through power supply glitches (which
> happen even in the best of systems), but if they are a pattern, a box
> needs to be replaced.  *How quickly* a box can be taken out of service
> and replaced can vary greatly, and its urgency depend on patterns;
> furthermore, in the meantime the device has to work the best it can.

I don't know how much improvement hwpoison will give. A significant
amount of RAM cannot be corrected, so especially on something like a core
router or an embedded system that does not use a lot of disk/pagecache,
it is probably more like a 2x improvement than an order of magnitude
improvement.
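
As a rough sketch of that arithmetic: assume (a simplification, not a
measurement) that uncorrected errors hit RAM uniformly and that only some
fraction of RAM holds contents that can be recovered, such as clean
pagecache. The function name below is made up for illustration.

```c
#include <assert.h>

/*
 * Hypothetical model, not kernel code.
 *
 * `recoverable` is the fraction of RAM (0 <= f < 1) whose contents can be
 * recovered after an uncorrected error. With errors assumed to be spread
 * uniformly over RAM, the share of errors that are still fatal is (1 - f),
 * so the reduction in fatal errors is 1 / (1 - f).
 */
double hwpoison_improvement(double recoverable)
{
	assert(recoverable >= 0.0 && recoverable < 1.0);
	return 1.0 / (1.0 - recoverable);
}
```

Under this model, half of RAM being recoverable gives a 2x improvement,
while an order of magnitude would need about 90% of RAM to be recoverable.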


> Consider, for example, a control computer on the Hubble Space Telescope
> -- the only way to replace it is by space shuttle, and you can safely
> guarantee that *that* won't happen in a heartbeat.  On the new Herschel
> Space Observatory, not even the space shuttle can help: if the computers
> die, *or* if bad data gets fed to its control system, the spacecraft is
> lost.  As such, it's of paramount importance for the computers to (a)
> continue to provide service at the level the hardware is capable of
> doing, (b) as accurately as possible continually assess and report that
> level of service, and (c) not allow a failure to pass undetected.  A lot
> of failures are simple one-time events (especially in space, a high-rad
> environment), others reflect decaying hardware but can be isolated (e.g.
> a RAM cell which has developed a short circuit, or a CPU core which has
> a damaged ALU), while others yet reflect a general ill health of the
> system that cannot be recovered.

I guess most of these examples have to go far beyond this and use
multiply redundant computation and voting systems and quickly
reboot members that are kicked out. :)

Not that this is a detriment to hwpoison. If they used Linux I'm
sure they would like to panic on an uncorrected error too (but would
probably not bother trying to do heuristic recovery).
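
For reference, the redundant computation and voting mentioned above can be
sketched as a simple two-out-of-three majority vote. This is only an
illustration under assumed names (tmr_vote and its parameters are
hypothetical), not how any of the systems named in this thread actually
work.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Majority vote over three redundant results (hypothetical helper).
 *
 * Returns 0 and stores the agreed value in *out when at least two members
 * agree. *faulty is set to the index (0..2) of the dissenting member, or
 * to -1 if all three agree. Returns -1 when no two members agree, in
 * which case *out is left untouched and *faulty is set to -1.
 */
int tmr_vote(const unsigned long r[3], unsigned long *out, int *faulty)
{
	assert(r != NULL && out != NULL && faulty != NULL);

	if (r[0] == r[1]) {
		*out = r[0];
		*faulty = (r[2] == r[0]) ? -1 : 2;
		return 0;
	}
	if (r[0] == r[2]) {
		*out = r[0];
		*faulty = 1;
		return 0;
	}
	if (r[1] == r[2]) {
		*out = r[1];
		*faulty = 0;
		return 0;
	}

	*faulty = -1;
	return -1;
}
```

A caller would restart whichever member is reported in *faulty and carry
on with the agreed value, and treat a -1 return as an unrecoverable
disagreement.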
