From: Ingo Molnar <mingo@elte.hu>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Wu Fengguang <fengguang.wu@intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
Nick Piggin <npiggin@suse.de>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Andi Kleen <andi@firstfloor.org>,
"riel@redhat.com" <riel@redhat.com>,
"chris.mason@oracle.com" <chris.mason@oracle.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
Date: Fri, 12 Jun 2009 18:48:15 +0200
Message-ID: <20090612164815.GA30773@elte.hu>
In-Reply-To: <4A328444.3010301@zytor.com>
* H. Peter Anvin <hpa@zytor.com> wrote:
> Ingo Molnar wrote:
> >
> > So i think hwpoison simply does not affect our ability to get
> > log messages out - but it sure allows crappier hardware to be
> > used. Am i wrong about that for some reason?
>
> Crappy hardware isn't the kind of hardware that is likely to have
> the hwpoison features, just like crappy hardware generally doesn't
> even have ECC -- or even basic parity checking (I personally think
> non-ECC memory should be considered a crime against humanity in
> this day and age.)
>
> You're making the fundamental assumption that failover and
> hardware replacement is a relatively cheap and fast operation. In
> high reliability applications, of course, failover is always an
> option -- it *HAS* to be an option -- but that doesn't mean that
> hardware replacement is cheap, fast or even possible -- and now
> you've blown your failover option.
>
> These kinds of features are used when extremely high reliability
> is required, think for example a telco core router. A page error
> may have happened due to stray radiation or through power supply
> glitches (which happen even in the best of systems), but if they
> are a pattern, a box needs to be replaced. *How quickly* a box
> can be taken out of service and replaced can vary greatly, and its
> urgency depends on patterns; furthermore, in the meantime the
> device has to work the best it can.
>
> Consider, for example, a control computer on the Hubble Space
> Telescope -- the only way to replace it is by space shuttle, and
> you can safely guarantee that *that* won't happen in a heartbeat.
> On the new Herschel Space Observatory, not even the space shuttle
> can help: if the computers die, *or* if bad data gets fed to its
> control system, the spacecraft is lost. As such, it's of
> paramount importance for the computers to (a) continue to provide
> service at the level the hardware is capable of doing, (b) as
> accurately as possible continually assess and report that level of
> service, and (c) not allow a failure to pass undetected. A lot of
> failures are simple one-time events (especially in space, a
> high-rad environment), others reflect decaying hardware but can be
> isolated (e.g. a RAM cell which has developed a short circuit, or
> a CPU core which has a damaged ALU), while others yet reflect a
> general ill health of the system that cannot be recovered.
>
> What these kinds of features do is give the overall-system
> designers and the administrators more options.
Ok, these arguments are pretty convincing - thanks everyone for the
detailed explanation.

	Ingo