From: Nick Piggin <npiggin@suse.de>
To: Russ Anderson <rja@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>,
Linus Torvalds <torvalds@linux-foundation.org>,
Wu Fengguang <fengguang.wu@intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
"H. Peter Anvin" <hpa@zytor.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Andi Kleen <andi@firstfloor.org>,
"riel@redhat.com" <riel@redhat.com>,
"chris.mason@oracle.com" <chris.mason@oracle.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
Date: Wed, 17 Jun 2009 09:51:31 +0200
Message-ID: <20090617075131.GC26664@wotan.suse.de>
In-Reply-To: <20090616202726.GB31443@sgi.com>
On Tue, Jun 16, 2009 at 03:27:26PM -0500, Russ Anderson wrote:
> On Mon, Jun 15, 2009 at 08:52:32AM +0200, Nick Piggin wrote:
> > On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> > > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > > >
> > > > > This seems like trying to handle a failure mode that cannot be
> > > > > > and shouldn't be 'handled' really. If there's an 'already
> > > > > corrupted' page then the box should go down hard and fast, and
> > > > > we should not risk _even more user data corruption_ by trying to
> > > > > 'continue' in the hope of having hit some 'harmless' user
> > > > > process that can be killed ...
> > > >
> > > > No, the box should _not_ go down hard-and-fast. That's the last
> > > > thing we should *ever* do.
> > > >
> > > > We need to log it. Often at a user level (ie we want to make sure
> > > > it actually hits syslog, possibly goes out the network, maybe pops
> > > > up a window, whatever).
> > > >
> > > > Shutting down the machine is the last thing we ever want to do.
> > > >
> > > > The whole "let's panic" mentality is a disease.
> > >
> > > > No doubt about that - and i'm removing BUG_ON()s and panic()s
> > > > wherever i can and haven't added a single new one myself in the past
> > > > 5 years or so, it's a disease.
> >
> > In HA failover systems you often do want to panic ASAP (after logging
> > > to serial console I guess) if anything like this happens so the system
> > can be rebooted with minimal chance of data corruption spreading.
>
> The whole point of hardware data poisoning is to avoid having to
> panic the system due to the potential of undetected data corruption,
> because the corrupt data is always marked bad. This has worked
> well on ia64 where applications that encounter bad data are killed
> and the memory poisoned and not reallocated, avoiding a system panic.
>
> This has been used at customer sites for a few years. The type
> of customers that really check their data. It is nice to see
> the hardware poison feature moving to the x86 "mainstream".
So long as you can get an MCE and panic if the corrupt data
actually gets consumed anywhere, then yes, a "corrupt data
detected but not consumed" exception would not require a
panic.

I don't know enough about the arch details to know what kinds
of exceptions happen when.