public inbox for linux-kernel@vger.kernel.org
* [RFC] reworked machine check recovery patches
From: Luck, Tony @ 2011-06-09 21:25 UTC
  To: Ingo Molnar, Borislav Petkov
  Cc: linux-kernel, Huang, Ying, Hidetoshi Seto, Avi Kivity

This is the in-kernel recovery path for memory errors. Still
no change to the eventual reporting mechanism (mcelog vs. perf).

Lots of changes based on feedback from the previous set of
discussions.  Big stuff:

0) Scaled back on ambitions - just worry about recovery from
   memory errors while executing in user mode. Adding recovery
   for some kernel cases is deferred to some future date.

1) Took Seto-san's "mce_gather_info()" patch in place of the
   one to just move mce_get_rip() around. Added an expanded
   comment to explain that we need to collect all this information
   to make decisions about severity for errors we find when scanning
   the machine check banks.  The rest of Seto-san's patches to
   make the severity table easier to read look interesting.

2) Ingo pointed me to the user-return-notifier code as an alternative
   to TIF_MCE_NOTIFY.  Part 7 uses return notifiers to cover the
   existing use of this TIF bit (Avi: Thanks for the review).

3) Take the newly freed TIF_MCE_NOTIFY bit and use it to implement
   an extension in part 8 to make a "per-task" user notifier.

4) Re-implement the recovery path using the task_return_notifier ...
   this gets rid of the obnoxious "mce_error_pfn" field that the
   previous implementation used.
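
As a rough illustration of the per-task return notifier idea described
above, here is a small userspace model. It is not the code from the
patches: "struct task", "struct task_notify", the "notify_pending" flag
(standing in for the TIF bit) and every function name below are
hypothetical, chosen only to show how the bad page frame number can
travel with the queued notifier instead of living in a separate
per-task field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task notifier; carries the data its callback needs. */
struct task_notify {
	struct task_notify *next;
	void (*on_return)(struct task_notify *tn);
	unsigned long pfn;		/* page frame hit by the error */
};

/* Hypothetical stand-in for the task; notify_pending models the TIF bit. */
struct task {
	int notify_pending;
	struct task_notify *notifiers;
};

/* Queue a notifier on the task and flag it for the return-to-user path. */
static void task_notify_register(struct task *t, struct task_notify *tn)
{
	assert(tn->on_return != NULL);
	tn->next = t->notifiers;
	t->notifiers = tn;
	t->notify_pending = 1;
}

/*
 * Model of the return-to-user hook: if the flag is set, clear it, detach
 * the list and run every callback once. Returns the number run.
 */
static int task_fire_return_notifiers(struct task *t)
{
	struct task_notify *tn, *next;
	int fired = 0;

	if (!t->notify_pending)
		return 0;
	t->notify_pending = 0;
	tn = t->notifiers;
	t->notifiers = NULL;
	for (; tn != NULL; tn = next) {
		next = tn->next;	/* callback may reuse the node */
		tn->on_return(tn);
		fired++;
	}
	return fired;
}

/* Example callback: just records which pfn it was asked to handle. */
static unsigned long last_handled_pfn;

static void record_pfn(struct task_notify *tn)
{
	last_handled_pfn = tn->pfn;
}
```

In this model the machine check side would call task_notify_register()
with record_pfn() replaced by whatever does the real recovery work, and
the callback would run the next time the task heads back to user mode.
Note the plain pointer list here has the same NMI-safety problem as the
ordinary Linux lists mentioned in the N.B. below.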

N.B. both the existing parts and my extensions to the user return
notifiers are not NMI safe because they use ordinary Linux lists.
If this seems to be a worthwhile direction, this deficiency can
be fixed using Ying's NMI safe single linked lists.  Ditto the
hacky "allocator" of task_notify structures in mce_action_required().
Just a proof of concept ... should be replaced with a more
generic NMI safe allocator.
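
For readers who have not seen the technique, a minimal sketch of a
lock-less singly linked list follows, written with C11 atomics so it
can be tried in userspace. It is not Ying's implementation and the
names ("struct lnode", "struct lhead", "lpush", "ltake_all") are made
up for this example. The point is only that the list head is changed
by a single compare-and-swap or exchange, so a push that gets
interrupted partway through never leaves the list half updated.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node and head types for the sketch. */
struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/*
 * Push one node at the front. The head only changes through the
 * compare-and-swap, so concurrent or interrupting pushers simply make
 * this loop retry. Returns 1 if the list was empty before the push.
 */
static int lpush(struct lhead *head, struct lnode *node)
{
	struct lnode *old;

	assert(node != NULL);
	old = atomic_load(&head->first);
	do {
		node->next = old;	/* old is refreshed on CAS failure */
	} while (!atomic_compare_exchange_weak(&head->first, &old, node));

	return old == NULL;
}

/*
 * Detach the entire list with one atomic exchange. The caller then
 * owns the returned chain and can walk it without further atomics.
 */
static struct lnode *ltake_all(struct lhead *head)
{
	return atomic_exchange(&head->first, NULL);
}
```

The producer side (here, the machine check path) would only ever call
lpush(), and the consumer would call ltake_all() from a safe context
and process the nodes at leisure; newest entries come out first.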

-Tony

0001-MCE-fixes-for-mce-severity-table
0002-MCE-save-most-severe-error-information
0003-MCE-introduce-mce_gather_info
0004-MCE-Move-ADDR-MISC-reading-code-into-common-function
0005-MCE-Mask-out-address-mask-bits-below-address-granual
0006-HWPOISON-Handle-hwpoison-in-current-process
0007-MCE-replace-mce.c-use-of-TIF_MCE_NOTIFY-with-user_re
0008-NOTIFIER-Take-over-TIF_MCE_NOTIFY-and-implement-task
0009-MCE-run-through-processors-with-more-severe-problems
0010-MCE-Add-Action-Required-support


 arch/x86/Kconfig                          |    1 +
 arch/x86/include/asm/thread_info.h        |    8 +-
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   37 +++-
 arch/x86/kernel/cpu/mcheck/mce.c          |  339 ++++++++++++++++++++++-------
 arch/x86/kernel/signal.c                  |    7 +-
 include/linux/sched.h                     |    3 +
 include/linux/user-return-notifier.h      |   14 ++
 kernel/fork.c                             |    1 +
 kernel/user-return-notifier.c             |   36 +++
 mm/memory-failure.c                       |   28 ++-
 10 files changed, 369 insertions(+), 105 deletions(-)

Andi Kleen (3):
      MCE: Move ADDR/MISC reading code into common function
      MCE: Mask out address mask bits below address granuality
      HWPOISON: Handle hwpoison in current process

Hidetoshi Seto (1):
      MCE: introduce mce_gather_info()

Tony Luck (6):
      MCE: fixes for mce severity table
      MCE: save most severe error information
      MCE: replace mce.c use of TIF_MCE_NOTIFY with user_return_notifier
      NOTIFIER: Take over TIF_MCE_NOTIFY and implement task return notifier
      MCE: run through processors with more severe problems first
      MCE: Add Action-Required support



Thread overview: 65+ messages
2011-06-09 21:25 [RFC] reworked machine check recovery patches Luck, Tony
2011-06-09 21:29 ` [PATCH 01/10] MCE: fixes for mce severity table Luck, Tony
2011-06-09 21:30 ` [PATCH 02/10] MCE: save most severe error information Luck, Tony
2011-06-10  8:06   ` Hidetoshi Seto
2011-06-10 18:08     ` Tony Luck
2011-06-09 21:31 ` [PATCH 03/10] MCE: introduce mce_gather_info() Luck, Tony
2011-06-09 21:32 ` [PATCH 04/10] MCE: Move ADDR/MISC reading code into common function Luck, Tony
2011-06-10  9:33   ` Borislav Petkov
2011-06-10 18:17     ` Tony Luck
2011-06-09 21:33 ` [PATCH 05/10] MCE: Mask out address mask bits below address granuality Luck, Tony
2011-06-10  8:07   ` Hidetoshi Seto
2011-06-10  9:46     ` Borislav Petkov
2011-06-10 19:06     ` Tony Luck
2011-06-11  0:12       ` Andi Kleen
2011-06-10  9:42   ` Borislav Petkov
2011-06-10 19:09     ` Tony Luck
2011-06-09 21:34 ` [PATCH 06/10] HWPOISON: Handle hwpoison in current process Luck, Tony
2011-06-10  8:07   ` Hidetoshi Seto
2011-06-10 20:36     ` Tony Luck
2011-06-09 21:35 ` [PATCH 07/10] MCE: replace mce.c use of TIF_MCE_NOTIFY with user_return_notifier Luck, Tony
2011-06-10  8:08   ` Hidetoshi Seto
2011-06-10 20:42     ` Tony Luck
2011-06-11 10:24       ` Borislav Petkov
2011-06-12  8:31       ` Avi Kivity
2011-06-12  8:29   ` Avi Kivity
2011-06-12 10:24     ` Borislav Petkov
2011-06-12 10:30       ` Avi Kivity
2011-06-12 13:53         ` Borislav Petkov
2011-06-09 21:36 ` [PATCH 08/10] NOTIFIER: Take over TIF_MCE_NOTIFY and implement task return notifier Luck, Tony
2011-06-12 22:38   ` Borislav Petkov
2011-06-13  5:31     ` Tony Luck
2011-06-13  7:59       ` Avi Kivity
2011-06-13  9:55         ` Borislav Petkov
2011-06-13 11:40           ` Avi Kivity
2011-06-13 12:40             ` Borislav Petkov
2011-06-13 12:47               ` Avi Kivity
2011-06-13 15:12                 ` Borislav Petkov
2011-06-13 16:31                   ` Avi Kivity
2011-06-13 17:13                     ` Tony Luck
2011-06-14  2:50                       ` Hidetoshi Seto
2011-06-14  2:51                         ` [PATCH 1/2] x86, mce: introduce mce_memory_failure_process Hidetoshi Seto
2011-06-14  2:53                         ` [PATCH 2/2] x86, mce: rework use of TIF_MCE_NOTIFY Hidetoshi Seto
2011-06-14 18:02                           ` Tony Luck
2011-06-14 18:28                             ` Tony Luck
2011-06-15  1:29                               ` Hidetoshi Seto
2011-06-15  2:10                                 ` Tony Luck
2011-06-15  3:17                                   ` Hidetoshi Seto
2011-06-14  3:09                         ` [PATCH 08/10] NOTIFIER: Take over TIF_MCE_NOTIFY and implement task return notifier Tony Luck
2011-06-14 11:40                       ` Avi Kivity
2011-06-14 13:33                         ` Borislav Petkov
2011-06-14 13:43                           ` Avi Kivity
2011-06-14 17:13                             ` Luck, Tony
2011-06-15  8:51                               ` Avi Kivity
2011-06-14 16:59                         ` Luck, Tony
2011-06-15  8:52                           ` Avi Kivity
2011-06-13 16:43               ` Tony Luck
2011-06-09 21:37 ` [PATCH 09/10] MCE: run through processors with more severe problems first Luck, Tony
2011-06-10  8:09   ` Hidetoshi Seto
2011-06-10 20:49     ` Tony Luck
2011-06-13 22:03       ` Tony Luck
2011-06-14  1:27         ` Hidetoshi Seto
2011-06-14  3:04           ` Tony Luck
2011-06-09 21:38 ` [PATCH 10/10] MCE: Add Action-Required support Luck, Tony
2011-06-10  8:06   ` Hidetoshi Seto
2011-06-10 21:00     ` Tony Luck
