linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "H. Peter Anvin" <hpa@zytor.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Nick Piggin <npiggin@suse.de>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Andi Kleen <andi@firstfloor.org>,
	"riel@redhat.com" <riel@redhat.com>,
	"chris.mason@oracle.com" <chris.mason@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when	feature is disabled
Date: Fri, 12 Jun 2009 09:37:24 -0700	[thread overview]
Message-ID: <4A328444.3010301@zytor.com> (raw)
In-Reply-To: <20090612153501.GA5737@elte.hu>

Ingo Molnar wrote:
> 
> So i think hwpoison simply does not affect our ability to get log 
> messages out - but it sure allows crappier hardware to be used.
> Am i wrong about that for some reason?
> 

Crappy hardware isn't the kind of hardware that is likely to have the
hwpoison features, just like crappy hardware generally doesn't even have
ECC -- or even basic parity checking (I personally think non-ECC memory
should be considered a crime against humanity in this day and age.)

You're making the fundamental assumption that failover and hardware
replacement is a relatively cheap and fast operation.  In high
reliability applications, of course, failover is always an option -- it
*HAS* to be an option -- but that doesn't mean that hardware replacement
is cheap, fast or even possible -- and now you've blown your failover
option.

These kinds of features are used when extremely high reliability is
required, think for example a telco core router.  A page error may have
happened due to stray radiation or through power supply glitches (which
happen even in the best of systems), but if they are a pattern, a box
needs to be replaced.  *How quickly* a box can be taken out of service
and replaced can vary greatly, and its urgency depend on patterns;
furthermore, in the meantime the device has to work the best it can.

Consider, for example, a control computer on the Hubble Space Telescope
-- the only way to replace it is by space shuttle, and you can safely
guarantee that *that* won't happen in a heartbeat.  On the new Herschel
Space Observatory, not even the space shuttle can help: if the computers
die, *or* if bad data gets fed to its control system, the spacecraft is
lost.  As such, it's of paramount importance for the computers to (a)
continue to provide service at the level the hardware is capable of
doing, (b) as accurately as possible continually assess and report that
level of service, and (c) not allow a failure to pass undetected.  A lot
of failures are simple one-time events (especially in space, a high-rad
environment), others reflect decaying hardware but can be isolated (e.g.
a RAM cell which has developed a short circuit, or a CPU core which has
a damaged ALU), while others yet reflect a general ill health of the
system that cannot be recovered.

What these kinds of features do is it gives the overall-system designers
and the administrators more options.

(Note: this is an hpa position statement, not necessarily an Intel one.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2009-06-12 16:36 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
2009-06-11 15:44   ` Rik van Riel
2009-06-12 10:00   ` Andi Kleen
2009-06-12 13:15     ` Wu Fengguang
2009-06-12 11:22   ` Ingo Molnar
2009-06-12 12:57     ` Wu Fengguang
2009-06-12 13:17       ` Ingo Molnar
2009-06-12 13:33         ` Wu Fengguang
2009-06-12 15:36           ` Ingo Molnar
2009-06-12 16:14             ` Wu Fengguang
2009-06-12 18:07               ` Alan Cox
2009-06-12 17:55             ` Theodore Tso
2009-06-12 13:58         ` Andi Kleen
2009-06-12 15:28         ` Linus Torvalds
2009-06-12 15:35           ` Ingo Molnar
2009-06-12 16:05             ` Rik van Riel
2009-06-12 16:37             ` H. Peter Anvin [this message]
2009-06-12 16:48               ` Ingo Molnar
2009-06-15  7:04               ` Nick Piggin
2009-06-15  6:52             ` Nick Piggin
2009-06-16 20:27               ` Russ Anderson
2009-06-17  7:51                 ` Nick Piggin
2009-06-12 15:45         ` Ingo Molnar
2009-06-12 16:12           ` Linus Torvalds
2009-06-11 14:22 ` [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order Wu Fengguang
2009-06-11 15:59   ` Rik van Riel
2009-06-12 10:03   ` Andi Kleen
2009-06-12 10:07     ` Nick Piggin
2009-06-12 13:27     ` Wu Fengguang
2009-06-12 14:04       ` Wu Fengguang
2009-06-11 14:22 ` [PATCH 3/5] HWPOISON: remove early kill option for now Wu Fengguang
2009-06-11 16:06   ` Rik van Riel
2009-06-12  9:59   ` Andi Kleen
2009-06-11 14:22 ` [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file Wu Fengguang
2009-06-11 16:31   ` Rik van Riel
2009-06-12 10:07   ` Andi Kleen
2009-06-12 13:41     ` Wu Fengguang
2009-06-11 14:22 ` [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages Wu Fengguang
2009-06-11 16:36   ` Rik van Riel
2009-06-12 10:56 ` [PATCH 0/5] [RFC] HWPOISON incremental fixes Andi Kleen
2009-06-12 13:59   ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A328444.3010301@zytor.com \
    --to=hpa@zytor.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=chris.mason@oracle.com \
    --cc=fengguang.wu@intel.com \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=npiggin@suse.de \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).