From: "Luck, Tony" <tony.luck@intel.com>
To: linux-kernel@vger.kernel.org
Cc: "Ingo Molnar" <mingo@elte.hu>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Andi Kleen" <andi@firstfloor.org>,
	"Borislav Petkov" <bp@alien8.de>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Andrew Morton" <akpm@linux-foundation.org>
Subject: [RFC 0/9] mce recovery for Sandy Bridge server
Date: Mon, 23 May 2011 14:54:27 -0700
Message-ID: <4ddad79317108eb33d@agluck-desktop.sc.intel.com>

Here's a nine-part patch series to implement "AR=1" recovery
that will be available on high-end Sandy Bridge server processors.
In this case the processor detects an uncorrectable memory error
while doing an instruction or data fetch that is about to be
consumed.  This is in contrast to the recoverable errors on
Nehalem and Westmere that were out of immediate execution context
(patrol scrubber and cache line write-back).
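To make the "AR=1" distinction concrete, here is a minimal sketch of how the uncorrected-recoverable classes can be told apart from a bank's status value. The bit positions follow the IA32_MCi_STATUS layout in the Intel SDM; the enum and the classify_ucr() helper are hypothetical names used only for illustration, and the real severity table in mce-severity.c checks more conditions than this (the EN bit and the MCACOD value, for example).

```c
#include <stdint.h>

/* IA32_MCi_STATUS bits, as documented in the Intel SDM. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56)  /* signalled via machine check */
#define MCI_STATUS_AR  (1ULL << 55)  /* action required */

/* Hypothetical names, not taken from the patch series. */
enum ucr_class { UCR_NONE, UCR_UCNA, UCR_SRAO, UCR_SRAR };

/* Simplified classification of an uncorrected recoverable error. */
static enum ucr_class classify_ucr(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
		return UCR_NONE;        /* nothing logged, or corrected */
	if (status & MCI_STATUS_PCC)
		return UCR_NONE;        /* context corrupt: not recoverable */
	if (!(status & MCI_STATUS_S))
		return UCR_UCNA;        /* no action required, not signalled */
	/* S=1: AR=0 is the out-of-context case, AR=1 the in-context case */
	return (status & MCI_STATUS_AR) ? UCR_SRAR : UCR_SRAO;
}
```

In these terms the Nehalem and Westmere cases (patrol scrubber, cache line write-back) are the AR=0 class, and this series is about the AR=1 class.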

The code is based on work done by Andi last year and published in
the "mce/action-required" branch of his mce git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git
Thus he gets author credit on 6 out of 9 patches (but I'll take
the blame for all of them).

The first eight patches are mostly cleanups and minor new bits
that are needed by part 9 where the interesting stuff happens.

For the "in context" case, we must not return from the machine
check handler (in the data fetch case we'd re-execute the fetch
and take another machine check, in the instruction fetch case
we actually don't have a precise IP to return to).  We use the
TIF_MCE_NOTIFY task flag bit to ensure that we don't return to
the user context - but we also need to keep track of the memory
address where the fault occurred. The h/w only gives us the physical
address, so to keep track of it we have added
"mce_error_pfn" to the task structure. This feels odd, but it
is an attribute of the task (e.g. this task may be migrated to
another processor before we get to look at TIF_MCE_NOTIFY and
head to do_notify_resume() to process it).
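As a rough illustration of why the pfn travels with the task rather than with the CPU, here is a small user-space model of the hand-off. It is not the kernel code from part 9: struct task_model, both functions, the flag value and the 4KiB page shift are assumptions made up for this sketch; only the names TIF_MCE_NOTIFY and mce_error_pfn come from the series.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TIF_MCE_NOTIFY (1u << 0)   /* stand-in bit, not the kernel's value */
#define MODEL_PAGE_SHIFT     12          /* assumes 4KiB pages */
#define MODEL_NO_PFN         UINT64_MAX  /* "no error pending" marker */

/* Hypothetical stand-in for the fields of interest in the task structure. */
struct task_model {
	unsigned int flags;      /* models the thread-info flag word */
	uint64_t mce_error_pfn;  /* pfn of the poisoned page, if any */
	int cpu;                 /* CPU the task is currently running on */
};

/* Models the machine check handler: record the address and set the flag. */
static void mce_handler_model(struct task_model *t, uint64_t phys_addr)
{
	t->mce_error_pfn = phys_addr >> MODEL_PAGE_SHIFT;
	t->flags |= MODEL_TIF_MCE_NOTIFY;
}

/*
 * Models the return-to-user check: if the flag is set, consume the
 * saved pfn so the caller can hand it to the memory-failure code.
 * Returns true when a pfn was pending.
 */
static bool notify_resume_model(struct task_model *t, uint64_t *pfn_out)
{
	if (!(t->flags & MODEL_TIF_MCE_NOTIFY))
		return false;
	t->flags &= ~MODEL_TIF_MCE_NOTIFY;
	*pfn_out = t->mce_error_pfn;
	t->mce_error_pfn = MODEL_NO_PFN;
	return true;
}
```

Because the pfn is stored in the task, changing t->cpu between the two calls (a migration) does not lose the address, which a per-CPU variable would not guarantee.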

Andi's recovery code can also handle a few cases where the
error is detected while running kernel code (when copying
data to/from a user process) - but the TIF_MCE_NOTIFY method
doesn't actually ever get to this code (since the entry_64.S code
only checks TIF_MCE_NOTIFY on return to userspace). I'd
appreciate any ideas on how to handle this. Perhaps we could
do good things when CONFIG_PREEMPT=y (it seems probable that
any error in a non-preemptible section of kernel code is going
to be fatal).

-Tony

 arch/x86/include/asm/mce.h                |    3 +-
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   37 +++-
 arch/x86/kernel/cpu/mcheck/mce.c          |  286 ++++++++++++++++++++++++-----
 arch/x86/kernel/signal.c                  |    2 +-
 include/linux/init_task.h                 |    7 +
 include/linux/sched.h                     |    3 +
 mm/memory-failure.c                       |   28 ++--
 7 files changed, 300 insertions(+), 66 deletions(-)

Andi Kleen (6):
      MCE: Always retrieve mce rip before calling no_way_out
      MCE: Move ADDR/MISC reading code into common function
      MCE: Mask out address mask bits below address granuality
      HWPOISON: Handle hwpoison in current process
      MCE: Pass registers to work handlers
      MCE: Add Action-Required support

Tony Luck (3):
      mce: fixes for mce severity table
      mce: save most severe error information
      mce: run through processors with more severe problems first



Thread overview: 29+ messages
2011-05-23 21:54 Luck, Tony [this message]
2011-05-23 22:02 ` [RFC 1/9] mce: fixes for mce severity table Luck, Tony
2011-05-23 22:12 ` [RFC 2/9] mce: save most severe error information Luck, Tony
2011-05-23 22:13 ` [RFC 3/9] MCE: Always retrieve mce rip before calling no_way_out Luck, Tony
2011-05-23 22:13 ` [RFC 4/9] MCE: Move ADDR/MISC reading code into common function Luck, Tony
2011-05-23 22:13 ` [RFC 5/9] MCE: Mask out address mask bits below address granuality Luck, Tony
2011-05-23 22:14 ` [RFC 6/9] HWPOISON: Handle hwpoison in current process Luck, Tony
2011-05-23 22:14 ` [RFC 7/9] MCE: Pass registers to work handlers Luck, Tony
2011-05-23 22:14 ` [RFC 8/9] mce: run through processors with more severe problems first Luck, Tony
2011-05-23 22:15 ` [RFC 9/9] MCE: Add Action-Required support Luck, Tony
2011-05-24  3:40 ` [RFC 0/9] mce recovery for Sandy Bridge server Ingo Molnar
2011-05-24  8:14   ` Borislav Petkov
2011-05-24 16:57   ` Luck, Tony
2011-05-24 17:33     ` Borislav Petkov
2011-05-24 17:56       ` Tony Luck
2011-05-24 21:04         ` Borislav Petkov
2011-05-24 21:24         ` Peter Zijlstra
2011-05-24 21:30           ` Linus Torvalds
2011-05-24 21:37             ` Peter Zijlstra
2011-05-24 21:41               ` Ingo Molnar
2011-05-24 21:48             ` Tony Luck
2011-05-25 10:02               ` Joerg Roedel
2011-05-25 13:44     ` Ingo Molnar
2011-05-25 21:43       ` Tony Luck
2011-05-25 21:47         ` Ingo Molnar
2011-05-25 23:53       ` Tony Luck
2011-05-26 20:16         ` Tony Luck
2011-05-25  6:03 ` Hidetoshi Seto
2011-05-25 16:44   ` Luck, Tony
